Abstract:
Successful depth completion from a single RGB-D image requires both extracting plentiful 2D and 3D features and merging these heterogeneous features appropriately. We propose a novel depth completion framework, CostDCNet, based on the cost volume-based depth estimation approach that has been successfully employed for multi-view stereo (MVS). The key to high-quality depth map estimation in the approach is constructing an accurate cost volume. To produce a quality cost volume tailored to single-view depth completion, we present a simple but effective architecture that can fully exploit the 3D information, three options to make an RGB-D feature volume, and per-plane pixel shuffle for efficient volume upsampling. Our CostDCNet framework consists of lightweight deep neural networks (∼1.8M parameters), running in real time (∼30 ms). Nevertheless, thanks to our simple but effective design, CostDCNet demonstrates depth completion results comparable to or better than the state-of-the-art methods.
Paperid:58
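Per-plane pixel shuffle is not spelled out further in the abstract; below is a minimal sketch of one plausible reading, in which each depth plane of the feature volume is upsampled with a standard 2D pixel shuffle. This is an illustration under that assumption, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def per_plane_pixel_shuffle(volume, r):
    """Upsample a feature volume spatially by treating each depth plane
    as a 2D feature map and applying pixel shuffle with factor r.

    volume: (B, C*r*r, D, H, W) -> returns (B, C, D, H*r, W*r)
    """
    B, Crr, D, H, W = volume.shape
    # Fold the depth dimension into the batch so F.pixel_shuffle sees 2D maps.
    planes = volume.permute(0, 2, 1, 3, 4).reshape(B * D, Crr, H, W)
    planes = F.pixel_shuffle(planes, r)                 # (B*D, C, H*r, W*r)
    C = Crr // (r * r)
    return planes.reshape(B, D, C, H * r, W * r).permute(0, 2, 1, 3, 4)

# Example: a coarse quarter-resolution volume upsampled to full resolution.
coarse = torch.randn(2, 16 * 4 * 4, 8, 30, 40)          # C=16, r=4
full = per_plane_pixel_shuffle(coarse, r=4)             # (2, 16, 8, 120, 160)
```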
Authors:Muhammad Zubair Irshad; Sergey Zakharov; Rareș Ambruș; Thomas Kollar; Zsolt Kira; Adrien Gaidon
Title: "ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization"
Abstract:
We study the complex task of object-centric 3D understanding from a single RGB-D observation. As it is an ill-posed problem, existing methods suffer from low performance for both 3D shape and 6D pose and size estimation in complex multi-object scenarios with occlusions. We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, 6D object pose and size estimation. Key to ShAPO is a single-shot pipeline to regress shape, appearance and pose latent codes along with the masks of each object instance, which is then further refined in a sparse-to-dense fashion. A novel disentangled shape and appearance database of priors is first learned to embed objects in their respective shape and appearance space. We also propose a novel, octree-based differentiable optimization step, allowing us to further improve object shape, pose and appearance simultaneously under the learned latent space, in an analysis-by-synthesis fashion. Our novel joint implicit textured object representation allows us to accurately identify and reconstruct novel unseen objects without having access to their 3D meshes. Through extensive experiments, we show that our method, trained on simulated indoor scenes, accurately regresses the shape, appearance and pose of novel objects in the real world with minimal fine-tuning. Our method significantly outperforms all baselines on the NOCS dataset with an 8% absolute improvement in mAP for 6D pose estimation.
Paperid:59
Authors:Le Hui; Lingpeng Wang; Linghua Tang; Kaihao Lan; Jin Xie; Jian Yang
Title: 3D Siamese Transformer Network for Single Object Tracking on Point Clouds
Abstract:
Siamese network based trackers formulate 3D single object tracking as cross-correlation learning between point features of a template and a search area. Due to the large appearance variation between the template and search area during tracking, how to learn the robust cross correlation between them for identifying the potential target in the search area is still a challenging problem. In this paper, we explicitly use Transformer to form a 3D Siamese Transformer network for learning robust cross correlation between the template and the search area of point clouds. Specifically, we develop a Siamese point Transformer network to learn shape context information of the target. Its encoder uses self-attention to capture non-local information of point clouds to characterize the shape information of the object, and the decoder utilizes cross-attention to upsample discriminative point features. After that, we develop an iterative coarse-to-fine correlation network to learn the robust cross correlation between the template and the search area. It formulates the cross-feature augmentation to associate the template with the potential target in the search area via cross attention. To further enhance the potential target, it employs the ego-feature augmentation that applies self-attention to the local k-NN graph of the feature space to aggregate target features. Experiments on the KITTI, nuScenes, and Waymo datasets show that our method achieves state-of-the-art performance on the 3D single object tracking task.Paperid:60
Authors:Ji Yang; Xinxin Zuo; Sen Wang; Zhenbo Yu; Xingyu Li; Bingbing Ni; Minglun Gong; Li Cheng
Title: Object Wake-Up: 3D Object Rigging from a Single Image
Abstract:
Given a single chair image, could we wake it up by reconstructing its 3D shape and skeleton, as well as animating its plausible articulations and motions, similar to that of human modeling? It is a new problem that not only goes beyond image-based object reconstruction but also involves articulated animation of generic objects in 3D, which could give rise to numerous downstream augmented and virtual reality applications. In this paper, we propose an automated approach to tackle the entire process of reconstructing such generic 3D objects, rigging, and animation, all from single images. A two-stage pipeline has thus been proposed, which specifically contains a multi-head structure to utilize the deep implicit functions for skeleton prediction. Two in-house 3D datasets of generic objects with high-fidelity rendering and annotated skeletons have also been constructed. Empirically, our approach demonstrates promising results; when evaluated on the related sub-tasks of 3D reconstruction and skeleton prediction, our results surpass those of the state of the art by a noticeable margin. Our code and datasets are made publicly available at the dedicated project website.
Paperid:61
Authors:Kennard Yanting Chan; Guosheng Lin; Haiyu Zhao; Weisi Lin
Title: IntegratedPIFu: Integrated Pixel Aligned Implicit Function for Single-View Human Reconstruction
Abstract:
We propose IntegratedPIFu, a new pixel-aligned implicit model that builds on the foundation set by PIFuHD. IntegratedPIFu shows how depth and human parsing information can be predicted and capitalized upon in a pixel-aligned implicit model. In addition, IntegratedPIFu introduces depth-oriented sampling, a novel training scheme that improves any pixel-aligned implicit model’s ability to reconstruct important human features without noisy artefacts. Lastly, IntegratedPIFu presents a new architecture that, despite using fewer model parameters than PIFuHD, is able to improve the structural correctness of reconstructed meshes. Our results show that IntegratedPIFu significantly outperforms existing state-of-the-art methods on single-view human reconstruction. We provide the code in our supplementary materials. Our code is available at https://github.com/kcyt/IntegratedPIFu.
Paperid:62
Authors:Taras Khakhulin; Vanessa Sklyarova; Victor Lempitsky; Egor Zakharov
Title: Realistic One-Shot Mesh-Based Head Avatars
Abstract:
We present a system for the creation of realistic one-shot mesh-based (ROME) human head avatars. From a single photograph, our system estimates the head mesh (with person-specific details in both the facial and non-facial head parts) as well as the neural texture encoding local photometric and geometric details. The resulting avatars are rigged and can be rendered using a deep rendering network, which is trained alongside the mesh and texture estimators on a dataset of in-the-wild videos. In the experiments, we observe that our system performs competitively both in terms of head geometry recovery and the quality of renders, especially for strong pose and expression changes.Paperid:63
Authors:Martha Paskin; Daniel Baum; Mason N. Dean; Christoph von Tycowicz
Title: A Kendall Shape Space Approach to 3D Shape Estimation from 2D Landmarks
Abstract:
3D shapes provide substantially more information than 2D images. However, the acquisition of 3D shapes is sometimes very difficult or even impossible in comparison with acquiring 2D images, making it necessary to derive the 3D shape from 2D images. Although this is, in general, a mathematically ill-posed problem, it might be solved by constraining the problem formulation using prior information. Here, we present a new approach based on Kendall’s shape space to reconstruct 3D shapes from single monocular 2D images. The work is motivated by an application to study the feeding behavior of the basking shark, an endangered species whose massive size and mobility render 3D shape data nearly impossible to obtain, hampering understanding of their feeding behaviors and ecology. 2D images of these animals in feeding position, however, are readily available. We compare our approach with state-of-the-art shape-based approaches, both on human stick models and on shark head skeletons. Using a small set of training shapes, we show that the Kendall shape space approach is substantially more robust than previous methods and results in plausible shapes. This is essential for the motivating application in which specimens are rare and therefore only few training shapes are available.Paperid:64
Authors:Zian Wang; Wenzheng Chen; David Acuna; Jan Kautz; Sanja Fidler
Title: Neural Light Field Estimation for Street Scenes with Differentiable Virtual Object Insertion
Abstract:
We consider the challenging problem of outdoor lighting estimation for the goal of photorealistic virtual object insertion into photographs. Existing works on outdoor lighting estimation typically simplify the scene lighting into an environment map which cannot capture the spatially-varying lighting effects in outdoor scenes. In this work, we propose a neural approach that estimates the 5D HDR light field from a single image, and a differentiable object insertion formulation that enables end-to-end training with image-based losses that encourage realism. Specifically, we design a hybrid lighting representation tailored to outdoor scenes, which contains an HDR sky dome that handles the extreme intensity of the sun, and a volumetric lighting representation that models the spatially-varying appearance of the surrounding scene. With the estimated lighting, our shadow-aware object insertion is fully differentiable, which enables adversarial training over the composited image to provide additional supervisory signal to the lighting prediction. We experimentally demonstrate that our hybrid lighting representation is more performant than existing outdoor lighting estimation methods. We further show the benefits of our AR object insertion in an autonomous driving application, where we obtain performance gains for a 3D object detector when trained on our augmented data.Paperid:65
Authors:Guangcheng Chen; Li He; Yisheng Guan; Hong Zhang
Title: Perspective Phase Angle Model for Polarimetric 3D Reconstruction
Abstract:
Current polarimetric 3D reconstruction methods, including those in the well-established shape from polarization literature, are all developed under the orthographic projection assumption. In the case of a large field of view, however, this assumption does not hold and may result in significant reconstruction errors in methods that make this assumption. To address this problem, we present the perspective phase angle (PPA) model that is applicable to perspective cameras. Compared with the orthographic model, the proposed PPA model accurately describes the relationship between polarization phase angle and surface normal under perspective projection. In addition, the PPA model makes it possible to estimate surface normals from only one single-view phase angle map and does not suffer from the so-called π-ambiguity problem. Experiments on real data show that the PPA model is more accurate for surface normal estimation with a perspective camera than the orthographic model.Paperid:66
Authors:Asaf Karnieli; Ohad Fried; Yacov Hel-Or
Title: DeepShadow: Neural Shape from Shadow
Abstract:
This paper presents ‘DeepShadow’, a one-shot method for recovering the depth map and surface normals from photometric stereo shadow maps. Previous works that try to recover the surface normals from photometric stereo images treat cast shadows as a disturbance. We show that the self and cast shadows not only do not disturb 3D reconstruction, but can be used alone, as a strong learning signal, to recover the depth map and surface normals. We demonstrate that 3D reconstruction from shadows can even outperform shape-from-shading in certain cases. To the best of our knowledge, our method is the first to reconstruct 3D shape-from-shadows using neural networks. The method does not require any pre-training or expensive labeled data, and is optimized during inference time.Paperid:67
Authors:Yu Liu; Hui Zhang
Title: Camera Auto-Calibration from the Steiner Conic of the Fundamental Matrix
Abstract:
This paper addresses the problem of camera auto-calibration from the fundamental matrix under general motion. The fundamental matrix can be decomposed into a symmetric part (a Steiner conic) and a skew-symmetric part (a fixed point), which we find useful for fully calibrating camera parameters. We first obtain a fixed line from the image of the symmetric and skew-symmetric parts of the fundamental matrix and the image of the absolute conic. Then the properties of this fixed line are presented and proved, from which new constraints on general eigenvectors between the Steiner conic and the image of the absolute conic are derived. We thus propose a method to fully calibrate the camera. First, the three camera intrinsic parameters, i.e., the two focal lengths and the skew, can be solved from our new constraints on the imaged absolute conic obtained from at least three images. On this basis, we can initialize and then iteratively restore the optimal pair of projection centers of the Steiner conic, thereby obtaining the corresponding vanishing lines and images of circular points. Finally, all five camera parameters are fully calibrated using images of circular points obtained from at least three images. Experimental results on synthetic and real data demonstrate that our method achieves state-of-the-art performance in terms of accuracy.
Paperid:68
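The symmetric/skew-symmetric split mentioned above is plain matrix algebra; a small numpy sketch (illustrative only, using a made-up fundamental matrix) of extracting the Steiner conic and the fixed point might look like this.

```python
import numpy as np

F = np.array([[0.0, -1.2,  0.4],      # hypothetical fundamental matrix
              [1.5,  0.0, -0.9],
              [-0.3, 0.8,  0.0]])

F_sym  = 0.5 * (F + F.T)   # symmetric part: the Steiner conic x^T F_sym x = 0
F_skew = 0.5 * (F - F.T)   # skew-symmetric part: [p]_x for a fixed point p

# Recover the fixed point p (up to scale) from the skew-symmetric part,
# using the standard [p]_x parameterization of a 3x3 skew matrix.
p = np.array([F_skew[2, 1], F_skew[0, 2], F_skew[1, 0]])
print(F_sym, p)
```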
Authors:Marco Pesavento; Marco Volino; Adrian Hilton
Title: Super-Resolution 3D Human Shape from a Single Low-Resolution Image
Abstract:
We propose a novel framework to reconstruct super-resolution human shape from a single low-resolution input image. The approach overcomes limitations of existing approaches that reconstruct 3D human shape from a single image, which require high-resolution images together with auxiliary data such as surface normal or a parametric model to reconstruct high-detail shape. The proposed framework represents the reconstructed shape with a high-detail implicit function. Analogous to the objective of 2D image super-resolution, the approach learns the mapping from a low-resolution shape to its high-resolution counterpart and it is applied to reconstruct 3D shape detail from low-resolution images. The approach is trained end-to-end employing a novel loss function which estimates the information lost between a low and high-resolution representation of the same 3D surface shape. Evaluation for single image reconstruction of clothed people demonstrates that our method achieves high-detail surface reconstruction from low-resolution images without auxiliary data. Extensive experiments show that the proposed approach can estimate super-resolution human geometries with a significantly higher level of detail than that obtained with previous approaches when applied to low-resolution images.Paperid:69
Authors:Weng Fei Low; Gim Hee Lee
Title: Minimal Neural Atlas: Parameterizing Complex Surfaces with Minimal Charts and Distortion
Abstract:
Explicit neural surface representations allow for exact and efficient extraction of the encoded surface at arbitrary precision, as well as analytic derivation of differential geometric properties such as surface normal and curvature. Such desirable properties, which are absent in its implicit counterpart, make it ideal for various applications in computer vision, graphics and robotics. However, SOTA works are limited in terms of the topology they can effectively describe, the distortion they introduce when reconstructing complex surfaces, and model efficiency. In this work, we present Minimal Neural Atlas, a novel atlas-based explicit neural surface representation. At its core is a fully learnable parametric domain, given by an implicit probabilistic occupancy field defined on an open square of the parametric space. In contrast, prior works generally predefine the parametric domain. The added flexibility enables charts to admit arbitrary topology and boundary. Thus, our representation can learn a minimal atlas of 3 charts with distortion-minimal parameterization for surfaces of arbitrary topology, including closed and open surfaces with arbitrary connected components. Our experiments support the hypotheses and show that our reconstructions are more accurate in terms of the overall geometry, due to the separation of concerns on topology and geometry.
Paperid:70
Authors:Daxuan Ren; Jianmin Zheng; Jianfei Cai; Jiatong Li; Junzhe Zhang
Title: ExtrudeNet: Unsupervised Inverse Sketch-and-Extrude for Shape Parsing
Abstract:
Sketch-and-extrude is a common and intuitive modeling process in computer aided design. This paper studies the problem of learning the shape given in the form of point clouds by “inverse” sketch-and-extrude. We present ExtrudeNet, an unsupervised end-to-end network for discovering sketch and extrude from point clouds. Behind ExtrudeNet are two new technical components: 1) the use of a specially-designed rational Bézier representation for sketch and extrude, which can model extrusion with freeform sketches and conventional cylinder and box primitives as well; and 2) a numerical method for computing the signed distance field which is used in the network learning. This is the first attempt that uses machine learning to reverse engineer the sketch-and-extrude modeling process of a shape in an unsupervised fashion. ExtrudeNet not only outputs a compact, editable and interpretable representation of the shape that can be seamlessly integrated into modern CAD software, but also aligns with the standard CAD modeling process facilitating various editing applications, which distinguishes our work from existing shape parsing research. Code will be open-sourced upon acceptance.Paperid:71
Authors:Xingyu Liu; Gu Wang; Yi Li; Xiangyang Ji
Title: CATRE: Iterative Point Clouds Alignment for Category-Level Object Pose Refinement
Abstract:
While category-level 9DoF object pose estimation has emerged recently, previous correspondence-based or direct regression methods are both limited in accuracy due to the huge intra-category variances in object shape and color, etc. Orthogonal to them, this work presents a category-level object pose and size refiner, CATRE, which is able to iteratively enhance pose estimates from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts a relative transformation between the initial pose and the ground truth by means of aligning the partially observed point cloud and an abstract shape prior. Specifically, we propose a novel disentangled architecture that is aware of the inherent distinctions between rotation and translation/size estimation. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on the REAL275, CAMERA25, and LM benchmarks while running at a speed of up to approximately 85.32 Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen categories. Code and trained models are available.
Paperid:72
Authors:Jingyu Gong; Fengqi Liu; Jiachen Xu; Min Wang; Xin Tan; Zhizhong Zhang; Ran Yi; Haichuan Song; Yuan Xie; Lizhuang Ma
Title: Optimization over Disentangled Encoding: Unsupervised Cross-Domain Point Cloud Completion via Occlusion Factor Manipulation
Abstract:
Recently, studies considering domain gaps in shape completion attracted more attention, due to the undesirable performance of supervised methods on real scans. They only noticed the gap in input scans, but ignored the gap in output prediction, which is specific for completion. In this paper, we disentangle partial scans into three (domain, shape, and occlusion) factors to handle the output gap in cross-domain completion. For factor learning, we design view-point prediction and domain classification tasks in a self-supervised manner and bring a factor permutation consistency regularization to ensure factor independence. Thus, scans can be completed by simply manipulating occlusion factors while preserving domain and shape information. To further adapt to instances in the target domain, we introduce an optimization stage to maximize the consistency between completed shapes and input scans. Extensive experiments on real scans and synthetic datasets show that ours outperforms previous methods by a large margin and is encouraging for the following works. Code is available at https://github.com/azuki-miho/OptDE.Paperid:73
Authors:Haocheng Yuan; Chen Zhao; Shichao Fan; Jiaxi Jiang; Jiaqi Yang
Title: Unsupervised Learning of 3D Semantic Keypoints with Mutual Reconstruction
Abstract:
Semantic 3D keypoints are category-level semantic consistent points on 3D objects. Detecting 3D semantic keypoints is a foundation for a number of 3D vision tasks but remains challenging, due to the ambiguity of semantic information, especially when the objects are represented by unordered 3D point clouds. Existing unsupervised methods tend to generate category-level keypoints in implicit manners, making it difficult to extract high-level information, such as semantic labels and topology. From a novel mutual reconstruction perspective, we present an unsupervised method to generate consistent semantic keypoints from point clouds explicitly. To achieve this, we train our unsupervised model to reconstruct both the input object and other objects from the same category based on predicted keypoints. To the best of our knowledge, the proposed method is the first to mine 3D semantic consistent keypoints from a mutual reconstruction view. Experiments under various evaluation metrics as well as comparisons with the state-of-the-arts have verified the efficacy of our new solution to mining semantic consistent keypoints with mutual reconstruction.Paperid:74
Authors:Gopal Sharma; Kangxue Yin; Subhransu Maji; Evangelos Kalogerakis; Or Litany; Sanja Fidler
Title: MvDeCor: Multi-View Dense Correspondence Learning for Fine-Grained 3D Segmentation
Abstract:
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks. This is inspired by the observation that view-based surface representations are more effective at modeling high-resolution surface details and texture than their 3D counterparts based on point clouds or voxel occupancy. Specifically, given a 3D shape, we render it from multiple views, and set up a dense correspondence learning task within the contrastive learning framework. As a result, the learned 2D representations are view-invariant and geometrically consistent, leading to better generalization when trained on a limited number of labeled shapes than alternatives based on self-supervision in 2D or 3D alone. Experiments on textured (RenderPeople) and untextured (PartNet) 3D datasets show that our method outperforms state-of-the-art alternatives in fine-grained part segmentation. The improvements over baselines are greater when only a sparse set of views is available for training or when shapes are textured, indicating that MvDeCor benefits from both 2D processing and 3D geometric reasoning.
Paperid:75
Authors:Ahmed A. A. Osman; Timo Bolkart; Dimitrios Tzionas; Michael J. Black
Title: SUPR: A Sparse Unified Part-Based Human Representation
Abstract:
Statistical 3D shape models of the head, hands, and full body are widely used in computer vision and graphics. Despite their wide use, we show that existing models of the head and hands fail to capture the full range of motion for these parts. Moreover, existing work largely ignores the feet, which are crucial for modeling human movement and have applications in biomechanics, animation, and the footwear industry. The problem is that previous body part models are trained using 3D scans that are isolated to the individual parts. Such data does not capture the full range of motion for such parts, e.g. the motion of head relative to the neck. Our observation is that full-body scans provide important information about the motion of the body parts. Consequently, we propose a new learning scheme that jointly trains a full-body model and specific part models using a federated dataset of full-body and body-part scans. Specifically, we train an expressive human body model called SUPR (Sparse Unified Part-Based Representation), where each joint strictly influences a sparse set of model vertices. The factorized representation enables separating SUPR into an entire suite of body part models: an expressive head (SUPR-Head), an articulated hand (SUPR-Hand), and a novel foot (SUPR-Foot). Note that feet have received little attention and existing 3D body models have highly under-actuated feet. Using novel 4D scans of feet, we train a model with an extended kinematic tree that captures the range of motion of the toes. Additionally, feet deform due to ground contact. To model this, we include a novel non-linear deformation function that predicts foot deformation conditioned on the foot pose, shape, and ground contact. We train SUPR on an unprecedented number of scans: 1.2 million body, head, hand and foot scans. We quantitatively compare SUPR and the separate body parts to existing expressive human body models and body-part models and find that our suite of models generalizes better and captures the body parts’ full range of motion. SUPR is publicly available for research purposes.Paperid:76
Authors:Rolandos Alexandros Potamias; Giorgos Bouritsas; Stefanos Zafeiriou
Title: Revisiting Point Cloud Simplification: A Learnable Feature Preserving Approach
Abstract:
The recent advances in 3D sensing technology have made possible the capture of point clouds in significantly high resolution. However, increased detail usually comes at the expense of high storage, as well as computational costs in terms of processing and visualization operations. Mesh and Point Cloud simplification methods aim to reduce the complexity of 3D models while retaining visual quality and relevant salient features. Traditional simplification techniques usually rely on solving a time-consuming optimization problem, hence they are impractical for large-scale datasets. In an attempt to alleviate this computational burden, we propose a fast point cloud simplification method by learning to sample salient points. The proposed method relies on a graph neural network architecture trained to select an arbitrary, user-defined, number of points according to their latent encodings and re-arrange their positions so as to minimize the visual perception error. The approach is extensively evaluated on various datasets using several perceptual metrics. Importantly, our method is able to generalize to out-of-distribution shapes, hence demonstrating zero-shot capabilities.Paperid:77
Authors:Yatian Pang; Wenxiao Wang; Francis E.H. Tay; Wei Liu; Yonghong Tian; Li Yuan
Title: Masked Autoencoders for Point Cloud Self-Supervised Learning
Abstract:
As a promising scheme of self-supervised learning, masked autoencoding has significantly advanced natural language processing and computer vision. Inspired by this, we propose a neat scheme of masked autoencoders for point cloud self-supervised learning, addressing the challenges posed by point cloud’s properties, including leakage of location information and uneven information density. Concretely, we divide the input point cloud into irregular point patches and randomly mask them at a high ratio. Then, a standard Transformer based autoencoder, with an asymmetric design and a shifting mask tokens operation, learns high-level latent features from unmasked point patches, aiming to reconstruct the masked point patches. Extensive experiments show that our approach is efficient during pre-training and generalizes well on various downstream tasks. The pre-trained models achieve 85.18% accuracy on ScanObjectNN and 94.04% accuracy on ModelNet40, outperforming all the other self-supervised learning methods. We show with our scheme, a simple architecture entirely based on standard Transformers can surpass dedicated Transformer models from supervised learning. Our approach also advances state-of-the-art accuracies by 1.5%-2.3% in the few-shot classification. Furthermore, our work inspires the feasibility of applying unified architectures from languages and images to the point cloud. Codes are available at https://github.com/Pang-Yatian/Point-MAE.Paperid:78
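For readers unfamiliar with masked point modeling, the sketch below illustrates the general patch-and-mask idea described in the abstract (patches via k-NN grouping around centers, a high masking ratio). It is a simplified stand-in, not the Point-MAE code, and uses random centers where the paper would use farthest point sampling.

```python
import numpy as np

def mask_point_patches(points, centers, patch_size=32, mask_ratio=0.6):
    """Group points into patches around the given centers via k-NN, then
    randomly hide a high ratio of patches as reconstruction targets."""
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)
    patches = points[np.argsort(d, axis=1)[:, :patch_size]]  # (n_centers, patch_size, 3)
    hidden = np.random.rand(len(centers)) < mask_ratio        # which patches are masked
    return patches[~hidden], patches[hidden]                  # visible, masked

pts = np.random.randn(2048, 3)
ctr = pts[np.random.choice(len(pts), 64, replace=False)]      # stand-in for FPS centers
visible, masked = mask_point_patches(pts, ctr)
```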
Authors:Lukas Koestler; Daniel Grittner; Michael Moeller; Daniel Cremers; Zorah Lähner
Title: Intrinsic Neural Fields: Learning Functions on Manifolds
Abstract:
Neural fields have gained significant attention in the computer vision community due to their excellent performance in novel view synthesis, geometry reconstruction, and generative modeling. Some of their advantages are a sound theoretic foundation and an easy implementation in current deep learning frameworks. While neural fields have been applied to signals on manifolds, e.g., for texture reconstruction, their representation has been limited to extrinsically embedding the shape into Euclidean space. The extrinsic embedding ignores known intrinsic manifold properties and is inflexible wrt. transfer of the learned function. To overcome these limitations, this work introduces intrinsic neural fields, a novel and versatile representation for neural fields on manifolds. Intrinsic neural fields combine the advantages of neural fields with the spectral properties of the Laplace-Beltrami operator. We show theoretically that intrinsic neural fields inherit many desirable properties of the extrinsic neural field framework but exhibit additional intrinsic qualities, like isometry invariance. In experiments, we show intrinsic neural fields can reconstruct high-fidelity textures from images with state-of-the-art quality and are robust to the discretization of the underlying manifold. We demonstrate the versatility of intrinsic neural fields by tackling various applications: texture transfer between deformed shapes & different shapes, texture reconstruction from real-world images with view dependence, and discretization-agnostic learning on meshes and point clouds.Paperid:79
Authors:Zhouyingcheng Liao; Jimei Yang; Jun Saito; Gerard Pons-Moll; Yang Zhou
Title: Skeleton-Free Pose Transfer for Stylized 3D Characters
Abstract:
We present the first method that automatically transfers poses between stylized 3D characters without skeletal rigging. In contrast to previous attempts to learn pose transformations on fixed or topology-equivalent skeleton templates, our method focuses on a novel scenario to handle skeleton-free characters with diverse shapes, topologies, and mesh connectivities. The key idea of our method is to represent the characters in a unified articulation model so that the pose can be transferred through the correspondent parts. To achieve this, we propose a novel pose transfer network that predicts the character skinning weights and deformation transformations jointly to articulate the target character to match the desired pose. Our method is trained in a semi-supervised manner absorbing all existing character data with paired/unpaired poses and stylized shapes. It generalizes well to unseen stylized characters and inanimate objects. We conduct extensive experiments and demonstrate the effectiveness of our method on this novel task.Paperid:80
Authors:Haotian Liu; Mu Cai; Yong Jae Lee
Title: Masked Discrimination for Self-Supervised Learning on Point Clouds
Abstract:
Masked autoencoding has achieved great success for self-supervised learning in the image and language domains. However, mask-based pretraining has yet to show benefits for point cloud understanding, likely due to standard backbones like PointNet being unable to properly handle the training versus testing distribution mismatch introduced by masking during training. In this paper, we bridge this gap by proposing a discriminative mask pretraining Transformer framework, MaskPoint, for point clouds. Our key idea is to represent the point cloud as discrete occupancy values (1 if part of the point cloud; 0 if not), and perform simple binary classification between masked object points and sampled noise points as the proxy task. In this way, our approach is robust to the point sampling variance in point clouds, and facilitates learning rich representations. We evaluate our pretrained models across several downstream tasks, including 3D shape classification, segmentation, and real-world object detection, and demonstrate state-of-the-art results while achieving a significant pretraining speedup (e.g., 4.1x on ScanNet) compared to the prior state-of-the-art Transformer baseline. Code is available at https://github.com/haotian-liu/MaskPoint.
Paperid:81
Authors:Xuejun Yan; Hongyu Yan; Jingjing Wang; Hang Du; Zhihong Wu; Di Xie; Shiliang Pu; Li Lu
Title: FBNet: Feedback Network for Point Cloud Completion
Abstract:
The rapid development of point cloud learning has driven point cloud completion into a new era. However, the information flows of most existing completion methods are solely feedforward, and high-level information is rarely reused to improve low-level feature learning. To this end, we propose a novel Feedback Network (FBNet) for point cloud completion, in which present features are efficiently refined by rerouting subsequent fine-grained ones. Firstly, partial inputs are fed to a Hierarchical Graph-based Network (HGNet) to generate coarse shapes. Then, we cascade several Feedback-Aware Completion (FBAC) Blocks and unfold them across time recurrently. Feedback connections between two adjacent time steps exploit fine-grained features to improve present shape generations. The main challenge of building feedback connections is the dimension mismatching between present and subsequent features. To address this, the elaborately designed point Cross Transformer exploits efficient information from feedback features via cross attention strategy and then refines present features with the enhanced feedback features. Quantitative and qualitative experiments on several datasets demonstrate the superiority of proposed FBNet compared to state-of-the-art methods on point completion task. The source code and model are available at https://github.com/hikvision-research/3DVision/tree/main/PointCompletion/FBNet.Paperid:82
Authors:Ta-Ying Cheng; Qingyong Hu; Qian Xie; Niki Trigoni; Andrew Markham
Title: Meta-Sampler: Almost-Universal yet Task-Oriented Sampling for Point Clouds
Abstract:
Sampling is a key operation in point cloud tasks and acts to increase computational efficiency and tractability by discarding redundant points. Universal sampling algorithms (e.g., Farthest Point Sampling) work without modification across different tasks, models, and datasets, but by their very nature are agnostic about the downstream task/model. As such, they have no implicit knowledge about which points would be best to keep and which to reject. Recent work has shown how task-specific point cloud sampling (e.g., SampleNet) can be used to outperform traditional sampling approaches by learning which points are more informative. However, these learnable samplers face two inherent issues: i) overfitting to a model rather than a task, and ii) requiring training of the sampling network from scratch, in addition to the task network, somewhat countering the original objective of down-sampling to increase efficiency. In this work, we propose an almost-universal sampler, in our quest for a sampler that can learn to preserve the most useful points for a particular task, yet be inexpensive to adapt to different tasks, models or datasets. We first demonstrate how training over multiple models for the same task (e.g., shape reconstruction) significantly outperforms the vanilla SampleNet in terms of accuracy by not overfitting the sample network to a particular task network. Second, we show how we can train an almost-universal meta-sampler across multiple tasks. This meta-sampler can then be rapidly fine-tuned when applied to different datasets, networks, or even different tasks, thus amortizing the initial cost of training.
Paperid:83
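As a point of reference for the universal baseline named in the abstract, here is a minimal numpy implementation of Farthest Point Sampling (a generic sketch, not code from this paper).

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedily pick n_samples points that are maximally spread out."""
    chosen = [np.random.randint(len(points))]
    dist = np.full(len(points), np.inf)
    for _ in range(n_samples - 1):
        # Distance of every point to the nearest already-chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))   # take the point farthest from the chosen set
    return points[np.array(chosen)]

subset = farthest_point_sampling(np.random.randn(4096, 3), 512)
```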
Authors:Ishit Mehta; Manmohan Chandraker; Ravi Ramamoorthi
Title: A Level Set Theory for Neural Implicit Evolution under Explicit Flows
Abstract:
Coordinate-based neural networks parameterizing implicit surfaces have emerged as efficient representations of geometry. They effectively act as parametric level sets with the zero-level set defining the surface of interest. We present a framework that allows applying deformation operations defined for triangle meshes onto such implicit surfaces. Several of these operations can be viewed as energy-minimization problems that induce an instantaneous flow field on the explicit surface. Our method uses the flow field to deform parametric implicit surfaces by extending the classical theory of level sets. We also derive a consolidated view for existing methods on differentiable surface extraction and rendering, by formalizing connections to the level-set theory. We show that these methods drift from the theory and that our approach exhibits improvements for applications like surface smoothing, mean-curvature flow, inverse rendering and user-defined editing on implicit geometry.Paperid:84
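For context, the classical level-set evolution that the abstract extends can be written as follows (standard textbook form, not a formula specific to this work): a flow field V induced on the explicit surface evolves the implicit function so that its zero level set tracks the deforming surface.

```latex
\frac{\partial \varphi}{\partial t} + V \cdot \nabla \varphi = 0,
\qquad S(t) = \{\, x \mid \varphi(x, t) = 0 \,\}.
```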
Authors:Wanli Chen; Xinge Zhu; Guojin Chen; Bei Yu
Title: Efficient Point Cloud Analysis Using Hilbert Curve
Abstract:
Previous state-of-the-art research on analyzing point clouds mainly relies on voxelization because it better preserves spatial locality and geometry. However, these 3D voxelization methods and the subsequent 3D convolution networks often bring large computational overhead and GPU memory occupation. A straightforward alternative is to flatten the 3D voxels into a 2D structure or to use the pillar representation for dimension reduction, but all of these inevitably alter the spatial locality and 3D geometric information. We therefore propose HilbertNet to maintain the locality advantage of voxel-based methods while significantly reducing the computational cost. The key component is a new flattening mechanism based on the Hilbert curve, a well-known locality- and geometry-preserving function. Namely, if 3D voxels are flattened using Hilbert curve encoding, the resulting structure has a spatial topology similar to that of the original voxels. Through Hilbert flattening, we can not only use 2D convolutions (more lightweight than 3D convolutions) to process voxels, but also incorporate techniques suited to 2D space, such as Transformers, to boost performance. Our proposed HilbertNet achieves state-of-the-art performance on the ShapeNet and ModelNet40 datasets with smaller cost and GPU memory occupation.
Paperid:85
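To make the flattening idea concrete, the sketch below orders voxels along a space-filling curve before flattening. For brevity it uses Morton (Z-order) indexing and a 1D output as a simpler stand-in for the Hilbert-curve, 3D-to-2D flattening actually used by HilbertNet.

```python
import numpy as np

def morton_index(x, y, z, bits=8):
    """Interleave the bits of integer voxel coordinates (x, y, z) into a single
    Z-order index; a weaker locality-preserving curve than the Hilbert curve,
    but it shows the same flattening idea."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (3 * b)
        idx |= ((y >> b) & 1) << (3 * b + 1)
        idx |= ((z >> b) & 1) << (3 * b + 2)
    return idx

# Flatten a 16^3 occupancy grid into a sequence ordered along the curve,
# so voxels close in 3D tend to stay close in the flattened sequence.
grid = (np.random.rand(16, 16, 16) > 0.9).astype(np.float32)
coords = np.stack(np.meshgrid(*[np.arange(16)] * 3, indexing="ij"), -1).reshape(-1, 3)
order = np.argsort([morton_index(x, y, z, bits=4) for x, y, z in coords])
flat = grid.reshape(-1)[order]          # curve-ordered sequence for 2D/1D processing
```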
Authors:Keyang Zhou; Bharat Lal Bhatnagar; Jan Eric Lenssen; Gerard Pons-Moll
Title: TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement
Abstract:
We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous works focus on static grasps and contacts. The core of our method are TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. TOCH fields are a point-wise, object-centric representation, which encode the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D hand-object reconstruction methods and transferring grasps across objects.Paperid:86
Authors:Ashkan Mirzaei; Yash Kant; Jonathan Kelly; Igor Gilitschenski
Title: LaTeRF: Label and Text Driven Object Radiance Fields
Abstract:
Obtaining 3D object representations is important for creating photo-realistic simulators and collecting assets for AR/VR applications. Neural fields have shown their effectiveness in learning a continuous volumetric representation of a scene from 2D images, but acquiring object representations from these models with weak supervision remains an open challenge. In this paper we introduce LaTeRF, a method for extracting an object of interest from a scene given 2D images of the entire scene and known camera poses, a natural language description of the object, and a small number of point-labels of object and non-object points in the input images. To faithfully extract the object from the scene, LaTeRF extends the NeRF formulation with an additional ‘objectness’ probability at each 3D point. Additionally, we leverage the rich latent space of a pre-trained CLIP model combined with our differentiable object renderer, to inpaint the occluded parts of the object. We demonstrate high-fidelity object extraction on both synthetic and real datasets and justify our design choices through an extensive ablation study.Paperid:87
Authors:Yaqian Liang; Shanshan Zhao; Baosheng Yu; Jing Zhang; Fazhi He
Title: MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis
Abstract:
Recently, self-supervised pre-training has advanced Vision Transformers on various tasks w.r.t. different data modalities, e.g., image and 3D point cloud data. In this paper, we explore this learning paradigm for 3D mesh data analysis based on Transformers. Since applying Transformer architectures to new modalities is usually non-trivial, we first adapt Vision Transformer to 3D mesh data processing, i.e., Mesh Transformer. In specific, we divide a mesh into several non-overlapping local patches with each containing the same number of faces and use the 3D position of each patch’s center point to form positional embeddings. Inspired by MAE, we explore how pre-training on 3D mesh data with the Transformer-based structure benefits downstream 3D mesh analysis tasks. We first randomly mask some patches of the mesh and feed the corrupted mesh into Mesh Transformers. Then, through reconstructing the information of masked patches, the network is capable of learning discriminative representations for mesh data. Therefore, we name our method MeshMAE, which can yield state-of-the-art or comparable performance on mesh analysis tasks, i.e., classification and segmentation. In addition, we also conduct comprehensive ablation studies to show the effectiveness of key designs in our method.Paperid:88
Authors:Dongliang Cao; Florian Bernard
Title: Unsupervised Deep Multi-Shape Matching
Abstract:
3D shape matching is a long-standing problem in computer vision and computer graphics. While deep neural networks were shown to lead to state-of-the-art results in shape matching, existing learning-based approaches are limited in the context of multi-shape matching: (i) either they focus on matching pairs of shapes only and thus suffer from cycle-inconsistent multi-matchings, or (ii) they require an explicit template shape to address the matching of a collection of shapes. In this paper, we present a novel approach for deep multi-shape matching that ensures cycle-consistent multi-matchings while not depending on an explicit template shape. To this end, we utilise a shape-to-universe multi-matching representation that we combine with powerful functional map regularisation, so that our multi-shape matching neural network can be trained in a fully unsupervised manner. While the functional map regularisation is only considered during training time, functional maps are not computed for predicting correspondences, thereby allowing for fast inference. We demonstrate that our method achieves state-of-the-art results on several challenging benchmark datasets, and, most remarkably, that our unsupervised method even outperforms recent supervised methods.Paperid:89
Authors:Yawar Siddiqui; Justus Thies; Fangchang Ma; Qi Shan; Matthias Nießner; Angela Dai
Title: Texturify: Generating Textures on 3D Shape Surfaces
Abstract:
Texture cues on 3D objects are key to compelling visual representations, with the possibility to create high visual fidelity with inherent spatial consistency across different views. Since the availability of textured 3D shapes remains very limited, learning a 3D-supervised data-driven method that predicts a texture based on the 3D input is very challenging. We thus propose Texturify, a GAN-based method that leverages a 3D shape dataset of an object class and learns to reproduce the distribution of appearances observed in real images by generating high-quality textures. In particular, our method does not require any 3D color supervision or correspondence between shape geometry and images to learn the texturing of 3D objects. Texturify operates directly on the surface of the 3D objects by introducing face convolutional operators on a hierarchical 4-RoSy parametrization to generate plausible object-specific textures. Employing differentiable rendering and adversarial losses that critique individual views and consistency across views, we effectively learn the high-quality surface texturing distribution from real-world images. Experiments on car and chair shape collections show that our approach outperforms state of the art by an average of 22% in FID score.Paperid:90
Authors:An-Chieh Cheng; Xueting Li; Sifei Liu; Min Sun; Ming-Hsuan Yang
Title: Autoregressive 3D Shape Generation via Canonical Mapping
Abstract:
With the capacity of modeling long-range dependencies in sequential data, transformers have shown remarkable performances in a variety of generative tasks such as image, audio, and text generation. Yet, taming them to generate less structured and voluminous data formats such as high-resolution point clouds has seldom been explored, due to ambiguous sequentialization processes and an infeasible computation burden. In this paper, we aim to further exploit the power of transformers and employ them for the task of 3D point cloud generation. The key idea is to decompose point clouds of one category into semantically aligned sequences of shape compositions, via a learned canonical space. These shape compositions can then be quantized and used to learn a context-rich composition codebook for point cloud generation. Experimental results on point cloud reconstruction and unconditional generation show that our model performs favorably against state-of-the-art approaches. Furthermore, our model can be easily extended to multi-modal shape completion as an application for conditional shape generation.
Paperid:91
Authors:Jun-Kun Chen; Yu-Xiong Wang
Title: PointTree: Transformation-Robust Point Cloud Encoder with Relaxed K-D Trees
Abstract:
Being able to learn an effective semantic representation directly on raw point clouds has become a central topic in 3D understanding. Despite rapid progress, state-of-the-art encoders are restricted to canonicalized point clouds and degrade unnecessarily when encountering geometric transformation distortions. To overcome this challenge, we propose PointTree, a general-purpose point cloud encoder that is robust to transformations based on relaxed K-D trees. Key to our approach is the design of the division rule in K-D trees by using principal component analysis (PCA). We use the structure of the relaxed K-D tree as our computational graph, and model the features as border descriptors which are merged with a pointwise-maximum operation. In addition to this novel architecture design, we further improve the robustness by introducing pre-alignment -- a simple yet effective PCA-based normalization scheme. Our PointTree encoder combined with pre-alignment consistently outperforms state-of-the-art methods by large margins, for applications from object classification to semantic segmentation on various transformed versions of the widely-benchmarked datasets. Code and pre-trained models are available at https://github.com/immortalCO/PointTree.
Paperid:92
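A minimal sketch of a PCA-driven division rule for one tree node is shown below (an illustration of the general idea, not PointTree's exact rule): the node is split at the median of the projections onto the principal axis rather than along a fixed coordinate axis.

```python
import numpy as np

def pca_split(points):
    """One division step of a PCA-based (relaxed) K-D tree node: split the
    points by the median of their projection onto the principal axis, so the
    division rule follows the data rather than a fixed coordinate axis."""
    centered = points - points.mean(axis=0)
    # Principal axis = right singular vector with the largest singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    median = np.median(proj)
    return points[proj <= median], points[proj > median]

left, right = pca_split(np.random.randn(1024, 3))
```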
Authors:Shenhan Qian; Jiale Xu; Ziwei Liu; Liqian Ma; Shenghua Gao
Title: UNIF: United Neural Implicit Functions for Clothed Human Reconstruction and Animation
Abstract:
We propose united implicit functions (UNIF), a part-based method for clothed human reconstruction and animation with raw scans and skeletons as the input. Previous part-based methods for human reconstruction rely on ground-truth part labels from SMPL and thus are limited to minimal-clothed humans. In contrast, our method learns to separate parts from body motions instead of part supervision, and thus can be extended to clothed humans and other articulated objects. Our Partition-from-Motion is achieved by a bone-centered initialization, a bone limit loss, and a section normal loss that ensure stable part division even when the training poses are limited. We also present a minimal perimeter loss for SDF to suppress extra surfaces and part overlapping. Another core of our method is an adjacent part seaming algorithm that produces non-rigid deformations to maintain the connection between parts, which significantly relieves part-based artifacts. Under this algorithm, we further propose "Competing Parts", a method that defines blending weights by the relative position of a point to bones instead of the absolute position, avoiding the generalization problem of neural implicit functions with inverse LBS (linear blend skinning). We demonstrate the effectiveness of our method by clothed human body reconstruction and animation on the CAPE and the ClothSeq datasets.
Paperid:93
Authors:Brandon Y. Feng; Yinda Zhang; Danhang Tang; Ruofei Du; Amitabh Varshney
Title: PRIF: Primary Ray-Based Implicit Function
Abstract:
We introduce a new implicit shape representation called Primary Ray-based Implicit Function (PRIF). In contrast to most existing approaches based on the signed distance function (SDF) which handles spatial locations, our representation operates on oriented rays. Specifically, PRIF is formulated to directly produce the surface hit point of a given input ray, without the expensive sphere-tracing operations, hence enabling efficient shape extraction and differentiable rendering. We demonstrate that neural networks trained to encode PRIF achieve successes in various tasks including single shape representation, category-wise shape generation, shape completion from sparse or noisy observations, inverse rendering for camera pose estimation, and neural rendering with color.Paperid:94
Authors:Hanxue Liang; Hehe Fan; Zhiwen Fan; Yi Wang; Tianlong Chen; Yu Cheng; Zhangyang Wang
Title: Point Cloud Domain Adaptation via Masked Local 3D Structure Prediction
Abstract:
The superiority of deep learning based point cloud representations relies on large-scale labeled datasets, while the annotation of point clouds is notoriously expensive. One of the most effective solutions is to transfer the knowledge from existing labeled source data to unlabeled target data. However, domain bias typically hinders knowledge transfer and leads to accuracy degradation. In this paper, we propose a Masked Local Structure Prediction (MLSP) method to encode target data. Along with the supervised learning on the source domain, our method enables models to embed source and target data in a shared feature space. Specifically, we predict masked local structure via estimating point cardinality, position and normal. Our design philosophies lie in: 1) Point cardinality reflects basic structures (e.g., line, edge and plane) that are invariant to specific domains. 2) Predicting point positions in masked areas generalizes learned representations so that they are robust to incompletion-caused domain bias. 3) Point normal is generated by neighbors and thus robust to noise across domains. We conduct experiments on shape classification and semantic segmentation with different transfer permutations and the results demonstrate the effectiveness of our method. Code is available at https://github.com/VITA-Group/MLSP.Paperid:95
Authors:Kim Youwang; Kim Ji-Yeon; Tae-Hyun Oh
Title: CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes
Abstract:
We propose CLIP-Actor, a text-driven motion recommendation and neural mesh stylization system for human mesh animation. CLIP-Actor animates a 3D human mesh to conform to a text prompt by recommending a motion sequence and optimizing mesh style attributes. We build a text-driven human motion recommendation system by leveraging a large-scale human motion dataset with language labels. Given a natural language prompt, CLIP-Actor suggests a text-conforming human motion in a coarse-to-fine manner. Then, our novel zero-shot neural style optimization detailizes and texturizes the recommended mesh sequence to conform to the prompt in a temporally-consistent and pose-agnostic manner. This is distinctive in that prior work fails to generate plausible results when the pose of an artist-designed mesh does not conform to the text from the beginning. We further propose the spatio-temporal view augmentation and mask-weighted embedding attention, which stabilize the optimization process by leveraging multi-frame human motion and rejecting poorly rendered views. We demonstrate that CLIP-Actor produces plausible and human-recognizable style 3D human mesh in motion with detailed geometry and texture solely from a natural language prompt.Paperid:96
Authors:Samir Agarwala; Linyi Jin; Chris Rockwell; David F. Fouhey
Title: PlaneFormers: From Sparse View Planes to 3D Reconstruction
Abstract:
We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success.Paperid:97
Authors:Siyou Lin; Hongwen Zhang; Zerong Zheng; Ruizhi Shao; Yebin Liu
Title: Learning Implicit Templates for Point-Based Clothed Human Modeling
Abstract:
We present FITE, a First-Implicit-Then-Explicit framework for modeling human avatars in clothing. Our framework first learns implicit surface templates representing the coarse clothing topology, and then employs the templates to guide the generation of point sets which further capture pose-dependent clothing deformations such as wrinkles. Our pipeline incorporates the merits of both implicit and explicit representations, namely, the ability to handle varying topology and the ability to efficiently capture fine details. We also propose diffused skinning to facilitate template training especially for loose clothing, and projection-based pose-encoding to extract pose information from mesh templates without predefined UV map or connectivity. Our code is publicly available at https://github.com/jsnln/fite.Paperid:98
Authors:Qianjiang Hu; Daizong Liu; Wei Hu
Title: Exploring the Devil in Graph Spectral Domain for 3D Point Cloud Attacks
Abstract:
With the maturity of depth sensors, point clouds have received increasing attention in various applications such as autonomous driving, robotics, surveillance, etc., while deep point cloud learning models have been shown to be vulnerable to adversarial attacks. Existing attack methods generally add/delete points or perform point-wise perturbation over point clouds to generate adversarial examples in the data space, which may neglect the geometric characteristics of point clouds. Instead, we propose point cloud attacks from a new perspective---Graph Spectral Domain Attack (GSDA), aiming to perturb transform coefficients in the graph spectral domain, which corresponds to varying certain geometric structures. In particular, we naturally represent a point cloud over a graph, and adaptively transform the coordinates of points into the graph spectral domain via the graph Fourier transform (GFT) for compact representation. We then analyze the influence of different spectral bands on the geometric structure of the point cloud, based on which we propose to perturb the GFT coefficients in a learnable manner guided by an energy constraint loss function. Finally, the adversarial point cloud is generated by transforming the perturbed spectral representation back to the data domain via the inverse GFT (IGFT). Experimental results demonstrate the effectiveness of the proposed GSDA in terms of both imperceptibility and attack success rates under a variety of defense strategies.
Paperid:99
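The graph Fourier transform step described above can be illustrated with a toy numpy sketch (not the GSDA code): build a k-NN graph over the points, use the Laplacian eigenvectors as the spectral basis, and project the coordinates onto it. An attack in this spirit would perturb the resulting coefficients in selected bands before inverting the transform.

```python
import numpy as np

def graph_fourier_transform(points, k=10):
    """Toy GFT of point coordinates: k-NN graph, combinatorial Laplacian,
    eigenbasis projection, and exact reconstruction via the inverse GFT."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self at distance 0
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, knn.reshape(-1)] = 1.0
    W = np.maximum(W, W.T)                           # symmetrize adjacency
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    eigvals, U = np.linalg.eigh(L)                   # GFT basis, low to high frequency
    coeffs = U.T @ points                            # spectral representation (n, 3)
    recon = U @ coeffs                               # inverse GFT recovers the points
    return eigvals, coeffs, recon

vals, coeffs, recon = graph_fourier_transform(np.random.randn(256, 3), k=10)
```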
Authors:Jingwang Ling; Zhibo Wang; Ming Lu; Quan Wang; Chen Qian; Feng Xu
Title: Structure-Aware Editable Morphable Model for 3D Facial Detail Animation and Manipulation
Abstract:
Morphable models are essential for the statistical modeling of 3D faces. Previous works on morphable models mostly focus on large-scale facial geometry but ignore facial details. This paper augments morphable models in representing facial details by learning a Structure-aware Editable Morphable Model (SEMM). SEMM introduces a detail structure representation based on the distance field of wrinkle lines, jointly modeled with detail displacements to establish better correspondences and enable intuitive manipulation of wrinkle structure. Besides, SEMM introduces two transformation modules to translate expression blendshape weights and age values into changes in latent space, allowing effective semantic detail editing while maintaining identity. Extensive experiments demonstrate that the proposed model compactly represents facial details, outperforms previous methods in expression animation qualitatively and quantitatively, and achieves effective age editing and wrinkle line editing of facial details. Code and model are available at https://github.com/gerwang/facial-detail-manipulation.Paperid:100
Authors:Yiyu Zhuang; Hao Zhu; Xusen Sun; Xun Cao
Title: MoFaNeRF: Morphable Facial Neural Radiance Field
Abstract:
We propose a parametric model that maps free-view images into a vector space of coded facial shape, expression and appearance with a neural radiance field, namely Morphable Facial NeRF. Specifically, MoFaNeRF takes the coded facial shape, expression and appearance along with the space coordinate and view direction as input to an MLP, and outputs the radiance of the space point for photo-realistic image synthesis. Compared with conventional 3D morphable models (3DMM), MoFaNeRF shows superiority in directly synthesizing photo-realistic facial details, even for eyes, mouths, and beards. Also, continuous face morphing can be easily achieved by interpolating the input shape, expression and appearance codes. By introducing identity-specific modulation and a texture encoder, our model synthesizes accurate photometric details and shows strong representation ability. Our model shows strong ability in multiple applications including image-based fitting, random generation, face rigging, face editing, and novel view synthesis. Experiments show that our method achieves higher representation ability than previous parametric models, and achieves competitive performance in several applications. To the best of our knowledge, our work is the first facial parametric model built upon a neural radiance field that can be used in fitting, generation and manipulation. The code and data are available at https://github.com/zhuhao-nju/mofanerf.Paperid:101
Authors:Tong He; Wei Yin; Chunhua Shen; Anton van den Hengel
Title: PointInst3D: Segmenting 3D Instances by Points
Abstract:
The current state-of-the-art methods in 3D instance segmentation typically involve a clustering step, despite their tendency towards heuristics and greedy algorithms and their lack of robustness to changes in data statistics. In contrast, we propose a fully convolutional 3D point cloud instance segmentation method that works in a per-point prediction fashion. In doing so, it avoids the challenge that clustering-based methods face: introducing dependencies among the different tasks of the model. We find the key to its success is assigning a suitable target to each sampled point. Instead of the commonly used static or distance-based assignment strategies, we propose to use an Optimal Transport approach to optimally assign target masks to the sampled points according to the dynamic matching costs. Our approach achieves promising results on both the ScanNet and S3DIS benchmarks. The proposed approach removes inter-task dependencies and thus represents a simpler and more flexible 3D instance segmentation framework than other competing methods, while achieving improved segmentation accuracy.Paperid:102
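As a rough illustration of assigning targets to sampled points with Optimal Transport, here is a minimal Sinkhorn sketch under assumed uniform marginals and a random toy cost matrix; the actual matching costs and assignment details of PointInst3D are not reproduced.

```python
# Entropy-regularized optimal transport between sampled points and instance masks (toy setup).
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Sinkhorn iterations returning a soft transport plan for uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # rows ~ sampled points, columns ~ target masks

cost = np.random.rand(32, 5)                # assumed cost matrix: 32 sampled points x 5 masks
plan = sinkhorn(cost)
assignment = plan.argmax(axis=1)            # hard assignment of each sampled point to a mask
print(assignment)
```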
Authors:Zezhou Cheng; Menglei Chai; Jian Ren; Hsin-Ying Lee; Kyle Olszewski; Zeng Huang; Subhransu Maji; Sergey Tulyakov
Title: Cross-Modal 3D Shape Generation and Manipulation
Abstract:
Creating and editing the shape and color of 3D objects require tremendous human effort and expertise. Compared to direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually much more natural and intuitive for the users. In this paper, we propose a generic multi-modal generative model that couples the 2D modalities and implicit 3D representations through shared latent spaces. With the proposed model, versatile 3D generation and manipulation are enabled by simply propagating the editing from a specific 2D controlling modality through the latent spaces. For example, editing the 3D shape by drawing a sketch, re-colorizing the 3D surface via painting color scribbles on the 2D rendering, or generating 3D shapes of a certain category given one or a few reference images. Unlike prior works, our model does not require re-training or fine-tuning per editing task and is also conceptually simple, easy to implement, robust to input domain shifts, and flexible to diverse reconstruction on partial 2D inputs. We evaluate our framework on two representative 2D modalities of grayscale line sketches and rendered color images, and demonstrate that our method enables various shape manipulation and generation tasks with these 2D modalities.Paperid:103
Authors:Chao Chen; Yu-Shen Liu; Zhizhong Han
Title: Latent Partition Implicit with Surface Codes for 3D Representation
Abstract:
Deep implicit functions have shown remarkable shape modeling ability in various 3D computer vision tasks. One drawback is that it is hard for them to represent a 3D shape as multiple parts. Current solutions learn various primitives and blend the primitives directly in the spatial space, which still struggle to approximate the 3D shape accurately. To resolve this problem, we introduce a novel implicit representation to represent a single 3D shape as a set of parts in the latent space, towards both highly accurate and plausibly interpretable shape modeling. Our insight here is that both the part learning and the part blending can be conducted much more easily in the latent space than in the spatial space. We name our method Latent Partition Implicit (LPI) because of its ability to cast global shape modeling into multiple local part modelings, which partition the global shape. LPI represents a shape as Signed Distance Functions (SDFs) using surface codes. Each surface code is a latent code representing a part whose center is on the surface, which enables us to flexibly employ intrinsic attributes of shapes or additional surface properties. Eventually, LPI can reconstruct both the shape and the parts on the shape, both of which are plausible meshes. LPI is a multi-level representation, which can partition a shape into different numbers of parts after training. LPI can be learned without ground truth signed distances, point normals or any supervision for part partition. LPI outperforms the state-of-the-art methods under the widely used benchmarks in terms of reconstruction accuracy and modeling interpretability.Paperid:104
Authors:Ramana Sundararaman; Gautam Pai; Maks Ovsjanikov
Title: Implicit Field Supervision for Robust Non-rigid Shape Matching
Abstract:
Establishing a correspondence between two non-rigidly deforming shapes is one of the most fundamental problems in visual computing. Existing methods often show weak resilience when presented with challenges innate to real-world data such as noise, outliers, and self-occlusion. On the other hand, auto-decoders have demonstrated strong expressive power in learning geometrically meaningful latent embeddings. However, their use in shape analysis has been limited. In this paper, we introduce an approach based on an auto-decoder framework that learns a continuous shape-wise deformation field over a fixed template. By supervising the deformation field for points on-surface and regularizing for points off-surface through a novel Signed Distance Regularization (SDR), we learn an alignment between the template and shape volumes. Trained on clean water-tight meshes, without any data augmentation, we demonstrate compelling performance on compromised data and real-world scans.Paperid:105
Authors:Shota Hattori; Tatsuya Yatagawa; Yutaka Ohtake; Hiromasa Suzuki
Title: Learning Self-Prior for Mesh Denoising Using Dual Graph Convolutional Networks
Abstract:
This study proposes a deep-learning framework for mesh denoising from a single noisy input, where two graph convolutional networks are trained jointly to filter vertex positions and facet normals separately. The prior obtained from only a single input is referred to as a self-prior. The proposed method leverages the framework of the deep image prior (DIP), which obtains the self-prior for image restoration using a convolutional neural network (CNN). Thus, we obtain a denoised mesh without any ground-truth noise-free meshes. Whereas the original DIP transforms a fixed random code into a noise-free image with a neural network, we reproduce vertex displacements from a fixed random code and reproduce facet normals from feature vectors that summarize local triangle arrangements. After tuning several hyperparameters with a few validation samples, our method achieved significantly higher performance than traditional approaches working with a single noisy input mesh. Moreover, its performance is better than that of other methods using deep neural networks trained with a large-scale shape dataset. Because our method depends on neither large-scale datasets nor ground-truth noise-free meshes, it can easily denoise meshes whose shapes are rarely included in shape datasets.Paperid:106
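The self-prior idea can be illustrated with a deliberately simplified deep-image-prior-style loop; the sketch below fits only the vertex positions of a stand-in point set from a fixed random code and omits the dual normal-filtering network and the graph convolutions described above.

```python
# DIP-style self-prior loop on vertex positions only (toy data, not the paper's dual-network method).
import torch

noisy_vertices = torch.randn(500, 3)                      # stand-in for noisy mesh vertices
code = torch.randn(500, 16)                               # fixed random input code

net = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 3),                               # predicted vertex positions
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):
    pred = net(code)
    loss = torch.nn.functional.mse_loss(pred, noisy_vertices)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # In the paper the jointly trained normal branch regularizes the fit;
    # this sketch simply caps the number of iterations instead.

denoised = net(code).detach()
print(denoised.shape)
```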
Authors:Manxi Lin; Aasa Feragen
Title: diffConv: Analyzing Irregular Point Clouds with an Irregular View
Abstract:
Standard spatial convolutions assume input data with a regular neighborhood structure. Existing methods typically generalize convolution to the irregular point cloud domain by fixing a regular "view" through, e.g., a fixed neighborhood size, where the convolution kernel size remains the same for each point. However, since point clouds are not as structured as images, the fixed neighbor number gives an unfortunate inductive bias. We present a novel graph convolution named Difference Graph Convolution (diffConv), which does not rely on a regular view. diffConv operates on spatially-varying and density-dilated neighborhoods, which are further adapted by a learned masked attention mechanism. Experiments show that our model is very robust to noise, obtaining state-of-the-art performance in 3D shape classification and scene understanding tasks, along with a faster inference speed.Paperid:107
Authors:Aihua Mao; Zihui Du; Yu-Hui Wen; Jun Xuan; Yong-Jin Liu
Title: PD-Flow: A Point Cloud Denoising Framework with Normalizing Flows
Abstract:
Point cloud denoising aims to restore clean point clouds from raw observations corrupted by noise and outliers while preserving the fine-grained details. We present a novel deep learning-based denoising model that incorporates normalizing flows and noise disentanglement techniques to achieve high denoising accuracy. Unlike existing works that extract features of point clouds for point-wise correction, we formulate the denoising process from the perspective of distribution learning and feature disentanglement. By considering noisy point clouds as a joint distribution of clean points and noise, the denoised results can be derived by disentangling the noise counterpart from the latent point representation, whereas the mapping between Euclidean and latent spaces is modeled by normalizing flows. We evaluate our method on synthesized 3D models and real-world datasets with various noise settings. Qualitative and quantitative results show that our method surpasses previous state-of-the-art deep learning-based approaches in terms of detail preservation and distribution uniformity. The source code is available at https://github.com/unknownue/pdflow.Paperid:108
Authors:Haoran Zhou; Yun Cao; Wenqing Chu; Junwei Zhu; Tong Lu; Ying Tai; Chengjie Wang
Title: SeedFormer: Patch Seeds Based Point Cloud Completion with Upsample Transformer
Abstract:
Point cloud completion has become increasingly popular among generation tasks of 3D point clouds, as it is a challenging yet indispensable problem to recover the complete shape of a 3D object from its partial observation. In this paper, we propose a novel SeedFormer to improve the ability of detail preservation and recovery in point cloud completion. Unlike previous methods based on a global feature vector, we introduce a new shape representation, namely Patch Seeds, which not only captures general structures from partial inputs but also preserves regional information of local patterns. Then, by integrating seed features into the generation process, we can recover faithful details for complete point clouds in a coarse-to-fine manner. Moreover, we devise an Upsample Transformer by extending the transformer structure into basic operations of point generators, which effectively incorporates spatial and semantic relationships between neighboring points. Qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art completion networks on several benchmark datasets. Our code is available at https://github.com/hrzhou2/seedformer.Paperid:109
Authors:Nikolas Lamb; Sean Banerjee; Natasha Kholgade Banerjee
Title: DeepMend: Learning Occupancy Functions to Represent Shape for Repair
Abstract:
We present DeepMend, a novel approach to reconstruct restorations to fractured shapes using learned occupancy functions. Existing shape repair approaches predict low-resolution voxelized restorations or smooth restorations, or require symmetries or access to a pre-existing complete oracle. We represent the occupancy of a fractured shape as the conjunction of the occupancy of an underlying complete shape and a break surface, which we model as functions of latent codes using neural networks. Given occupancy samples from a fractured shape, we estimate latent codes using an inference loss augmented with novel penalties to avoid empty or voluminous restorations. We use the estimated codes to reconstruct a restoration shape. We show results with simulated fractures on synthetic and real-world scanned objects, and with scanned real fractured mugs. Compared to existing approaches and to two baseline methods, our work shows state-of-the-art results in accuracy and avoiding restoration artifacts over non-fracture regions of the fractured shape.Paperid:110
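The occupancy conjunction described above can be illustrated with analytic toy functions; the sphere, the planar break surface, and the chosen sides below are assumptions standing in for the learned neural occupancies and their latent codes.

```python
# Toy composition of fractured-shape and restoration occupancies from a complete
# shape and a break surface (analytic stand-ins, not DeepMend's learned functions).
import numpy as np

def occ_complete(x):                 # inside a unit sphere (stand-in complete shape)
    return np.linalg.norm(x, axis=-1) < 1.0

def occ_break(x):                    # one side of a planar break surface (assumed convention)
    return x[..., 0] > 0.2

pts = np.random.uniform(-1.2, 1.2, size=(10000, 3))
occ_fractured   = occ_complete(pts) & ~occ_break(pts)   # the observed fractured part
occ_restoration = occ_complete(pts) &  occ_break(pts)   # the missing part to reconstruct
print(occ_fractured.sum(), occ_restoration.sum())
```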
Authors:Qingyang Tan; Yi Zhou; Tuanfeng Wang; Duygu Ceylan; Xin Sun; Dinesh Manocha
Title: A Repulsive Force Unit for Garment Collision Handling in Neural Networks
Abstract:
Despite recent success, deep learning-based methods for predicting 3D garment deformation under body motion suffer from interpenetration problems between the garment and the body. To address this problem, we propose a novel collision handling neural network layer called Repulsive Force Unit (ReFU). Based on the signed distance function (SDF) of the underlying body and the current garment vertex positions, ReFU predicts the per-vertex offsets that push any interpenetrating vertex to a collision-free configuration while preserving the fine geometric details. We show that ReFU is differentiable with trainable parameters and can be integrated into different network backbones that predict 3D garment deformations. Our experiments show that ReFU significantly reduces the number of collisions between the body and the garment and better preserves geometric details compared to prior methods based on collision loss or post-processing optimization.Paperid:111
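A minimal sketch of the underlying geometric operation, pushing interpenetrating vertices out of the body along the SDF gradient, is given below; the analytic sphere SDF and the use of the raw penetration depth in place of ReFU's learned per-vertex prediction are assumptions.

```python
# Push garment vertices that penetrate the body back to (approximately) the body surface.
import torch

def body_sdf(x):                                      # signed distance to a unit-sphere "body"
    return x.norm(dim=-1) - 1.0

verts = torch.randn(1000, 3, requires_grad=True)      # stand-in garment vertices
sdf = body_sdf(verts)
grad = torch.autograd.grad(sdf.sum(), verts)[0]       # SDF gradient (outward direction)
normal = torch.nn.functional.normalize(grad, dim=-1)

penetration = torch.clamp(-sdf, min=0.0)              # positive only for vertices inside the body
verts_fixed = verts.detach() + penetration.unsqueeze(-1) * normal
print((body_sdf(verts_fixed) < -1e-5).sum().item())   # ~0 vertices remain inside (up to float error)
```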
Authors:Oren Katzir; Dani Lischinski; Daniel Cohen-Or
Title: Shape-Pose Disentanglement Using SE(3)-Equivariant Vector Neurons
Abstract:
We introduce an unsupervised technique for encoding point clouds into a canonical shape representation by disentangling shape and pose. Our encoder is stable and consistent, meaning that the shape encoding is purely pose-invariant, while the extracted rotation and translation are able to semantically align different input shapes of the same class to a common canonical pose. Specifically, we design an auto-encoder based on Vector Neuron Networks, a rotation-equivariant neural network, whose layers we extend to provide translation-equivariance in addition to rotation-equivariance. The resulting encoder produces a pose-invariant shape encoding by construction, enabling our approach to focus on learning a consistent canonical pose for a class of objects. Quantitative and qualitative experiments validate the superior stability and consistency of our approach.Paperid:112
Authors:Yunlu Chen; Basura Fernando; Hakan Bilen; Matthias Nießner; Efstratios Gavves
Title: 3D Equivariant Graph Implicit Functions
Abstract:
In recent years, neural implicit representations have made remarkable progress in modeling 3D shapes with arbitrary topology. In this work, we address two key limitations of such representations: their failure to capture fine local 3D geometric details, and to learn from and generalize to shapes with unseen 3D transformations. To this end, we introduce a novel family of graph implicit functions with equivariant layers that facilitates modeling fine local details and guarantees robustness to various groups of geometric transformations, through local k-NN graph embeddings with sparse point set observations at multiple resolutions. Our method improves over the existing rotation-equivariant implicit function from 0.69 to 0.89 (IoU) on the ShapeNet reconstruction task. We also show that our equivariant implicit function can be extended to other types of similarity transformations and generalizes to unseen translations and scaling.Paperid:113
Authors:Bo Sun; Vladimir G. Kim; Noam Aigerman; Qixing Huang; Siddhartha Chaudhuri
Title: PatchRD: Detail-Preserving Shape Completion by Learning Patch Retrieval and Deformation
Abstract:
This paper introduces a data-driven shape completion approach that focuses on completing geometric details of missing regions of 3D shapes. We observe that existing generative methods do not have enough training data and representation capacity to synthesize plausible, fine-grained details with complex geometry and topology. Thus, our key insight is to copy and deform the patches from the partial input to complete the missing regions. This enables us to preserve the style of local geometric features, even if it is drastically different from the training data. Our fully automatic approach proceeds in two stages. First, we learn to retrieve candidate patches from the input shape. Second, we select and deform some of the retrieved candidates to seamlessly blend them into the complete shape. This method combines the advantages of the two most common completion methods: similarity-based single-instance completion, and completion by learning a shape space. We leverage repeating patterns by retrieving patches from the partial input, and learn global structural priors by using a neural network to guide the retrieval and deformation steps. Experimental results show that our approach considerably outperforms baseline approaches across multiple datasets and shape categories.Paperid:114
Authors:Emery Pierson; Mohamed Daoudi; Sylvain Arguillere
Title: 3D Shape Sequence of Human Comparison and Classification Using Current and Varifolds
Abstract:
In this paper we address the task of comparing and classifying 3D shape sequences of humans. The non-linear dynamics of human motion and the change of the surface parametrization over time make this task very challenging. To tackle this issue, we propose to embed the 3D shape sequences in an infinite-dimensional space, the space of varifolds, endowed with an inner product that comes from a given positive definite kernel. More specifically, our approach involves two steps: 1) the surfaces are represented as varifolds, and this representation induces metrics equivariant to rigid motions and invariant to parametrization; 2) the sequences of 3D shapes are represented by Gram matrices derived from their infinite-dimensional Hankel matrices, and we use the Frobenius distance between two Symmetric Positive Definite (SPD) matrices to compare two sequences. Extensive experiments show that our method is competitive with the state of the art in 3D sequence motion retrieval.Paperid:115
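Step 2 can be illustrated with a small sketch that builds Hankel-style matrices from per-frame features, forms normalized Gram matrices, and compares them with the Frobenius distance; the random features standing in for the varifold representations and the window length are assumptions.

```python
# Compare two motion sequences via Gram matrices of their (finite) Hankel matrices.
import numpy as np

def gram_from_sequence(feats, k=4):
    """Stack k consecutive frames into Hankel columns, then form a normalized, regularized Gram matrix."""
    T, d = feats.shape
    cols = [feats[t:t + k].reshape(-1) for t in range(T - k + 1)]
    H = np.stack(cols, axis=1)                         # (k*d, T-k+1) Hankel-style matrix
    G = H.T @ H                                        # Gram matrix of the Hankel columns
    G /= np.linalg.norm(G)                             # scale normalization (assumed)
    return G + 1e-6 * np.eye(G.shape[0])               # small ridge keeps it SPD

seq_a = np.random.randn(20, 8)                         # 20 frames, 8-dim per-frame features (toy)
seq_b = np.random.randn(20, 8)
dist = np.linalg.norm(gram_from_sequence(seq_a) - gram_from_sequence(seq_b))   # Frobenius distance
print(dist)
```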
Authors:Jianxiong Shen; Antonio Agudo; Francesc Moreno-Noguer; Adria Ruiz
Title: Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification
Abstract:
A critical limitation of current methods based on Neural Radiance Fields (NeRF) is that they are unable to quantify the uncertainty associated with the learned appearance and geometry of the scene. This information is paramount in real applications such as medical diagnosis or autonomous driving where, to reduce potentially catastrophic failures, the confidence in the model outputs must be included in the decision-making process. In this context, we introduce Conditional-Flow NeRF (CF-NeRF), a novel probabilistic framework to incorporate uncertainty quantification into NeRF-based approaches. For this purpose, our method learns a distribution over all possible radiance fields modelling the scene, which is used to quantify the uncertainty associated with the modelled scene. In contrast to previous approaches enforcing strong constraints on the radiance field distribution, CF-NeRF learns it in a flexible and fully data-driven manner by coupling Latent Variable Modelling and Conditional Normalizing Flows. This strategy allows us to obtain reliable uncertainty estimation while preserving model expressivity. Compared to previous state-of-the-art methods proposed for uncertainty quantification in NeRF, our experiments show that the proposed method achieves significantly lower prediction errors and more reliable uncertainty values for synthetic novel view and depth-map estimation.Paperid:116
Authors:Yuki Kawana; Yusuke Mukuta; Tatsuya Harada
Title: Unsupervised Pose-Aware Part Decomposition for Man-Made Articulated Objects
Abstract:
Man-made articulated objects exist widely in the real world. However, previous methods for unsupervised part decomposition are unsuitable for such objects because they assume a spatially fixed part location, resulting in inconsistent part parsing. In this paper, we propose PPD (unsupervised Pose-aware Part Decomposition) to address a novel setting that explicitly targets man-made articulated objects with mechanical joints, considering the part poses in part parsing. As an analysis-by-synthesis approach, we show that category-common prior learning for both part shapes and poses facilitates the unsupervised learning of (1) part parsing with abstracted part shapes, and (2) part poses as joint parameters under single-frame shape supervision. We evaluate our method on synthetic and real datasets, and we show that it outperforms previous works in consistent part parsing of articulated objects, based on part pose estimation performance comparable to the supervised baseline.Paperid:117
Authors:Benoît Guillard; Federico Stella; Pascal Fua
Title: MeshUDF: Fast and Differentiable Meshing of Unsigned Distance Field Networks
Abstract:
Unsigned Distance Fields (UDFs) can be used to represent non-watertight surfaces. However, current approaches to converting them into explicit meshes tend to either be expensive or to degrade the accuracy. Here, we extend the marching cubes algorithm to handle UDFs both quickly and accurately. Moreover, our approach to surface extraction is differentiable, which is key to using pretrained UDF networks to fit sparse data.Paperid:118
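One way to see how a marching-cubes-style extraction can work on an unsigned field is the gradient test sketched below: if the UDF gradients at two neighboring samples point in opposite directions, the surface passes between them and the samples can receive opposite pseudo-signs. The analytic plane UDF is an assumption, and the full voxel traversal and triangulation are omitted.

```python
# Detect a surface crossing between two UDF samples from the direction of their gradients (toy UDF).
import numpy as np

def udf(x):                       # unsigned distance to the plane z = 0
    return np.abs(x[..., 2])

def udf_grad(x, h=1e-4):          # central finite-difference gradient of the UDF
    g = np.zeros_like(x, dtype=float)
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        g[..., i] = (udf(x + e) - udf(x - e)) / (2 * h)
    return g

p = np.array([0.3, 0.1, -0.2])    # two adjacent grid corners straddling the surface
q = np.array([0.3, 0.1, 0.4])
crosses = np.dot(udf_grad(p), udf_grad(q)) < 0   # opposite gradients => surface in between
print(crosses)                     # True: the corners would get opposite pseudo-signs
```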
Authors:Zhaofan Qiu; Yehao Li; Yu Wang; Yingwei Pan; Ting Yao; Tao Mei
Title: SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement
Abstract:
In this paper, we propose a novel deep architecture tailored for 3D point cloud applications, named SPE-Net. The embedded "Selective Position Encoding (SPE)" procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input. Such an encoded rotation condition then determines which part of the network parameters to focus on, and is shown to efficiently help reduce the degrees of freedom of the optimization during training. This mechanism can hence better leverage the rotation augmentations through reduced training difficulty, making SPE-Net robust against rotated data both during training and testing. The new findings in our paper also urge us to rethink the relationship between the extracted rotation information and the actual test accuracy. Intriguingly, we reveal evidence that by locally encoding the rotation information through SPE-Net, rotation-invariant features are still of critical importance in benefiting the test samples without any actual global rotation. We empirically demonstrate the merits of SPE-Net and the associated hypothesis on four benchmarks, showing evident improvements on both rotated and unrotated test data over SOTA methods. Source code is available at https://github.com/ZhaofanQiu/SPE-Net.Paperid:119
Authors:Kai Wang; Paul Guerrero; Vladimir G. Kim; Siddhartha Chaudhuri; Minhyuk Sung; Daniel Ritchie
Title: The Shape Part Slot Machine: Contact-Based Reasoning for Generating 3D Shapes from Parts
Abstract:
We present the Shape Part Slot Machine, a new method for assembling novel 3D shapes from existing parts by performing contact-based reasoning. Our method represents each shape as a graph of "slots," where each slot is a region of contact between two shape parts. Based on this representation, we design a graph-neural-network-based model for generating new slot graphs and retrieving compatible parts, as well as a gradient-descent-based optimization scheme for assembling the retrieved parts into a complete shape that respects the generated slot graph. This approach does not require any semantic part labels; interestingly, it also does not require complete part geometries: reasoning about the regions where parts connect proves sufficient to generate novel, high-quality 3D shapes. We demonstrate that our method generates shapes that outperform existing modeling-by-assembly approaches in terms of quality, diversity, and structural complexity.Paperid:120
Authors:Wangmeng Xiang; Chao Li; Biao Wang; Xihan Wei; Xian-Sheng Hua; Lei Zhang
Title: Spatiotemporal Self-Attention Modeling with Temporal Patch Shift for Action Recognition
Abstract:
Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of self-attention computation. How to efficiently and effectively model the 3D self-attention of video data has been a great challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of the patches with a specific mosaic pattern in the temporal dimension, thus converting a vanilla spatial self-attention operation to a spatiotemporal one with little additional cost. As a result, we can compute 3D self-attention using nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module and can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves performance competitive with the state of the art on Something-Something V1 & V2, Diving-48, and Kinetics-400 while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/MartinXM/TPS.Paperid:121
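The core shift operation can be sketched in a few lines; the interleaved pattern below is an illustrative assumption rather than one of the mosaic patterns studied in the paper.

```python
# Shift a subset of patch tokens along the temporal axis before spatial self-attention.
import torch

def temporal_patch_shift(x):
    """x: (B, T, N, C) patch tokens. Move some patches to the previous frame, some to the next."""
    out = x.clone()
    out[:, :, 0::4] = torch.roll(x[:, :, 0::4], shifts=1,  dims=1)   # every 4th patch comes from t-1
    out[:, :, 1::4] = torch.roll(x[:, :, 1::4], shifts=-1, dims=1)   # the next group comes from t+1
    return out                                                        # remaining patches stay in place

tokens = torch.randn(2, 8, 196, 768)        # batch, frames, patches, channels
shifted = temporal_patch_shift(tokens)
print(shifted.shape)
# Plain spatial self-attention on `shifted` now mixes information across frames at little extra cost.
```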
Authors:Sauradip Nag; Xiatian Zhu; Yi-Zhe Song; Tao Xiang
Title: Proposal-Free Temporal Action Detection via Global Segmentation Mask Learning
Abstract:
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is 20x faster to train and 1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS.Paperid:122
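A toy decoding of such a global mask into start and end points, by thresholding per-snippet scores and grouping contiguous foreground runs, might look as follows; this is not the TAGS head itself, and the threshold and scores are assumptions.

```python
# Convert a 1D per-class segmentation mask into (start, end) snippet intervals.
import numpy as np

def mask_to_segments(mask, thresh=0.5):
    """mask: (T,) per-snippet foreground scores -> list of (start, end) pairs, end exclusive."""
    fg = (mask > thresh).astype(int)
    edges = np.diff(np.concatenate([[0], fg, [0]]))   # +1 at segment starts, -1 just after ends
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    return list(zip(starts, ends))

scores = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.9, 0.2])
print(mask_to_segments(scores))    # [(2, 5), (6, 8)] in snippet indices
```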
Authors:Sauradip Nag; Xiatian Zhu; Yi-Zhe Song; Tao Xiang
Title: Semi-Supervised Temporal Action Detection with Proposal-Free Masking
Abstract:
Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD, and consequently much under-studied. Prior SS-TAD methods directly combine an existing proposal-based TAD method and an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT) with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation in between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/SPOT.Paperid:123
Authors:Sauradip Nag; Xiatian Zhu; Yi-Zhe Song; Tao Xiang
Title: Zero-Shot Temporal Action Detection via Vision-Language Prompting
Abstract:
Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, and are limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging and significantly less investigated. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.Paperid:124
Authors:Wei Lin; Anna Kukleva; Kunyang Sun; Horst Possegger; Hilde Kuehne; Horst Bischof
Title: CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video
Abstract:
Although action recognition has achieved impressive results over recent years, both the collection and annotation of video training data are still time-consuming and cost-intensive. Therefore, image-to-video adaptation has been proposed to exploit label-free web image sources for adaptation to unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and video frames; (2) modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation. We leverage the joint spatial information in images and videos on the one hand and, on the other hand, train an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation.Paperid:125
Authors:Zengyu Wan; Yang Wang; Ganchao Tan; Yang Cao; Zheng-Jun Zha
Title: S2N: Suppression-Strengthen Network for Event-Based Recognition under Variant Illuminations
Abstract:
The emerging event-based sensors have demonstrated outstanding potential in visual tasks thanks to their high speed and high dynamic range. However, the event degradation due to imaging under low illumination obscures the correlation between event signals and brings uncertainty into the event representation. Targeting this issue, we present a novel suppression-strengthen network (S2N) to augment the event feature representation after suppressing the influence of degradation. Specifically, a suppression sub-network is devised to obtain the intensity mapping between the degraded and denoised enhancement frames by unsupervised learning. To further restrain the degradation's influence, a strengthen sub-network is presented to generate robust event representations by adaptively perceiving the local variations between the center and surrounding regions. After being trained on a single illumination condition, our S2N can be directly generalized to other illuminations to boost recognition performance. Experimental results on three challenging recognition tasks demonstrate the superiority of our method. The code and datasets are available at https://github.com/wanzengy/S2N-Suppression-Strengthen-Network.Paperid:126
Authors:Yunyao Mao; Wengang Zhou; Zhenbo Lu; Jiajun Deng; Houqiang Li
Title: CMD: Self-Supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation
Abstract:
In 3D action recognition, there exists rich complementary information between skeleton modalities. Nevertheless, how to model and utilize this information remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate the cross-modal interaction as a bidirectional knowledge distillation problem. Different from classic distillation solutions that transfer the knowledge of a fixed and pre-trained teacher to the student, in this work, the knowledge is continuously updated and bidirectionally distilled between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. On the one hand, the neighboring similarity distribution is introduced to model the knowledge learned in each modality, where the relational information is naturally suitable for the contrastive frameworks. On the other hand, asymmetrical configurations are used for teacher and student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining in previous works can be regarded as a degenerated version of our CMD. We perform extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at https://github.com/maoyunyao/CMD.Paperid:127
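The neighboring-similarity distillation can be sketched as follows; the random embeddings, the shared neighbor bank, and the particular temperature values are assumptions, and only one direction of the bidirectional scheme is shown.

```python
# Distill a neighboring-similarity distribution from one skeleton modality to another (toy setup).
import torch
import torch.nn.functional as F

def neighbor_distribution(query, bank, tau):
    """Softmax over cosine similarities between query embeddings and a bank of neighbor embeddings."""
    sims = F.normalize(query, dim=-1) @ F.normalize(bank, dim=-1).t()
    return F.softmax(sims / tau, dim=-1)

q_joint, q_motion = torch.randn(16, 128), torch.randn(16, 128)   # two modality embeddings (toy)
bank = torch.randn(1024, 128)                                    # shared neighbor bank (assumed)

p_teacher = neighbor_distribution(q_joint, bank, tau=0.05).detach()   # sharper teacher distribution
p_student = neighbor_distribution(q_motion, bank, tau=0.1)            # softer student distribution
loss = F.kl_div(p_student.log(), p_teacher, reduction="batchmean")    # distillation loss (one direction)
print(loss.item())
```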
Authors:Bolin Ni; Houwen Peng; Minghao Chen; Songyang Zhang; Gaofeng Meng; Jianlong Fu; Shiming Xiang; Haibin Ling
Title: Expanding Language-Image Pretrained Models for General Video Recognition
Abstract:
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://github.com/microsoft/VideoX/tree/master/X-CLIP.Paperid:
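A minimal sketch of attention across per-frame tokens is given below; it uses a standard multi-head attention layer and does not reproduce the paper's message-token design or its video-specific prompting.

```python
# Cross-frame attention over per-frame [CLS]-style tokens (illustrative module, not the paper's code).
import torch

class CrossFrameAttention(torch.nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim)
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens):           # (B, T, C): one token per frame
        x = self.norm(frame_tokens)
        out, _ = self.attn(x, x, x)            # each frame attends to every other frame
        return frame_tokens + out              # residual connection

tokens = torch.randn(4, 8, 512)                # 4 clips, 8 frames, assumed CLIP feature width
video_tokens = CrossFrameAttention()(tokens)
print(video_tokens.shape)                      # torch.Size([4, 8, 512])
```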