TPAMI2024

Abstract:
We propose a framework that combines traditional, hand-crafted algorithms and recent advances in deep learning to obtain high-quality, high-resolution disparity maps from stereo images. By casting the refinement process as a continuous feature sampling strategy, our neural disparity refinement network can estimate an enhanced disparity map at any output resolution. Our solution can process any disparity map produced by classical stereo algorithms, as well as those predicted by modern stereo networks or even different depth-from-images approaches, such as the COLMAP structure-from-motion pipeline. Nonetheless, when deployed in the former configuration, our framework performs at its best in terms of zero-shot generalization from synthetic to real images. Moreover, its continuous formulation allows for easily handling the unbalanced stereo setup that is widespread in mobile phones.
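To illustrate what sampling at an arbitrary output resolution can look like, here is a minimal sketch that bilinearly samples a 2D map at continuous coordinates. It is not the refinement network described above: the function names `bilinear_sample` and `resample` are hypothetical, and plain bilinear interpolation stands in for the learned continuous sampling.

```python
import numpy as np

def bilinear_sample(fmap, ys, xs):
    """Sample a 2D map (at least 2x2) at continuous (y, x) coordinates."""
    h, w = fmap.shape
    # Top-left integer neighbour, clipped so that the +1 neighbour stays in bounds.
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy = np.clip(ys - y0, 0.0, 1.0)
    dx = np.clip(xs - x0, 0.0, 1.0)
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x0 + 1] * dx
    bottom = fmap[y0 + 1, x0] * (1 - dx) + fmap[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bottom * dy

def resample(fmap, out_h, out_w):
    """Query the map on a regular grid of any requested output size."""
    h, w = fmap.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return bilinear_sample(fmap, yy, xx)

coarse = np.array([[0.0, 1.0], [2.0, 3.0]])
fine = resample(coarse, 3, 3)
```

Because the query coordinates are real-valued, the output grid size is a free parameter rather than a fixed upsampling factor.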

Abstract:
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying and disentangling the underlying factors hidden in the observable data in representation form. Separating the underlying factors of variation into variables with semantic meaning benefits the learning of explainable representations of data, which imitates the meaningful understanding process of humans when observing an object or relation. As a general learning strategy, DRL has demonstrated its power in improving model explainability, controllability, robustness, and generalization capacity in a wide range of scenarios such as computer vision, natural language processing, and data mining. In this article, we comprehensively investigate DRL from various aspects including motivations, definitions, methodologies, evaluations, applications, and model designs. We first present two well-recognized definitions of disentangled representation learning, i.e., the Intuitive Definition and the Group Theory Definition. We further categorize the methodologies for DRL into four groups from the following perspectives: the model type, representation structure, supervision signal, and independence assumption. We also analyze principles for designing different DRL models that may benefit different tasks in practical applications. Finally, we point out challenges in DRL as well as potential research directions deserving future investigation. We believe this work may provide insights for promoting DRL research in the community.

Abstract:
Removing redundant parameters and computations before model training has attracted great interest, as it can effectively reduce the storage space of the model, speed up its training and inference, and save energy while it runs. In addition, the simplification of deep neural network models can enable high-performance network models to be deployed to resource-constrained edge devices, thus promoting the development of the intelligent world. However, current pruning-at-initialization methods exhibit poor performance at extreme sparsity. In order to improve the performance of the model under extreme sparsity, this paper proposes a dual-grained lightweight strategy, TEDEPR. TEDEPR is the first to use tensor theory in a pruning-at-initialization method to optimize the structure of a sparse sub-network model and improve its performance. Specifically, first, at the coarse-grained level, we represent the weight matrix or weight tensor of the model in a low-rank tensor decomposition form and use multi-step chain operations to enhance the feature extraction capability of the base module, constructing a low-rank compact network model. Second, at the fine-grained level, unimportant weights are pruned before training based on the trainability of the weights in the low-rank model, resulting in the final compressed model. To evaluate TEDEPR, we conducted extensive experiments on the MNIST, UCF11, CIFAR-10, CIFAR-100, Tiny-ImageNet and ImageNet datasets with LeNet, LSTM, VGGNet, ResNet and Transformer architectures, and compared it with state-of-the-art methods. The experimental results show that TEDEPR has higher accuracy, faster training and inference, and a smaller storage footprint than other pruning-at-initialization methods under extreme sparsity.

Abstract:
Object detection, a fundamental and challenging problem in computer vision, has experienced rapid development due to the effectiveness of deep learning. The objects currently targeted for detection are mostly rigid solid substances with apparent and distinct visual characteristics. In this paper, we take on a scarcely explored task named Gaseous Object Detection (GOD), which explores whether object detection techniques can be extended from solid substances to gaseous substances. However, gases exhibit significantly different visual characteristics: 1) saliency deficiency, 2) arbitrary and ever-changing shapes, 3) lack of distinct boundaries. To facilitate the study of this challenging task, we construct a GOD-Video dataset comprising 600 videos (141,017 frames) that cover various attributes with multiple types of gases. A comprehensive benchmark is established based on this dataset, allowing for a rigorous evaluation of frame-level and video-level detectors. Deduced from the Gaussian dispersion model, the physics-inspired Voxel Shift Field (VSF) is designed to model geometric irregularities and ever-changing shapes in potential 3D space. By integrating VSF into Faster RCNN, the VSF RCNN serves as a simple but strong baseline for gaseous object detection. Our work aims to attract further research into this valuable albeit challenging area.

Abstract:
Removing raindrops from images has been addressed as a significant task for various computer vision applications. In this paper, we propose the first method using a dual-pixel (DP) sensor to better address raindrop removal. Our key observation is that raindrops attached to a glass window yield noticeable disparities between the DP's left-half and right-half images, while almost no disparity exists for in-focus backgrounds. Therefore, the DP disparities can be utilized for robust raindrop detection. The DP disparities also bring the advantage that the background regions occluded by raindrops are slightly shifted between the left-half and the right-half images. Therefore, fusing the information from the left-half and the right-half images can lead to more accurate background texture recovery. Based on the above motivation, we propose a DP Raindrop Removal Network (DPRRN) consisting of DP raindrop detection and DP fused raindrop removal. To efficiently generate a large amount of training data, we also propose a novel pipeline to add synthetic raindrops to real-world background DP images. Experimental results on the constructed synthetic and real-world datasets demonstrate that our DPRRN outperforms existing state-of-the-art methods, especially showing better robustness to real-world situations.

Abstract:
Deep learning based semantic segmentation solutions have yielded compelling results over the preceding decade. They encompass diverse network architectures (FCN based or attention based), along with various mask decoding schemes (parametric softmax based or pixel-query based). Despite the divergence, they can be grouped within a unified framework by interpreting the softmax weights or query vectors as learnable class prototypes. In light of this prototype view, we reveal inherent limitations within the parametric segmentation regime, and accordingly develop a nonparametric alternative based on non-learnable prototypes. In contrast to previous approaches that entail the learning of a single weight/query vector per class in a fully parametric manner, our approach represents each class as a set of non-learnable prototypes, relying solely upon the mean features of training pixels within that class. The pixel-wise prediction is thus achieved by nonparametric nearest prototype retrieving. This allows our model to directly shape the pixel embedding space by optimizing the arrangement between embedded pixels and anchored prototypes. It is able to accommodate an arbitrary number of classes with a constant number of learnable parameters. Through empirical evaluation with FCN based and Transformer based segmentation models (i.e., HRNet, Swin, SegFormer, Mask2Former) and backbones (i.e., ResNet, HRNet, Swin, MiT), our nonparametric framework shows superior performance on standard segmentation datasets (i.e., ADE20K, Cityscapes, COCO-Stuff), as well as in large-vocabulary semantic segmentation scenarios. We expect that this study will provoke a rethink of the current de facto semantic segmentation model design.
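The nearest-prototype idea can be sketched in a few lines. This is a simplified illustration, not the paper's model: it assumes a single prototype per class (the abstract describes a set of prototypes per class), cosine similarity as the retrieval measure, and the hypothetical function names `class_prototypes` and `nearest_prototype`.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Non-learnable prototypes: the mean feature of the training pixels of each class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def nearest_prototype(features, prototypes):
    """Assign each embedded pixel to the class of its most similar prototype."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (f @ p.T).argmax(axis=1)

# Toy pixel embeddings (4 training pixels, 2 classes, 2-D features).
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
protos = class_prototypes(train_feats, train_labels, num_classes=2)

queries = np.array([[0.8, 0.2], [0.2, 0.7]])
pred = nearest_prototype(queries, protos)
```

Adding a class in this sketch only adds a row to `protos`; no classifier weights are learned per class, which is the property the abstract highlights.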

Abstract:
Helmholtz stereopsis (HS) exploits the reciprocity principle of light propagation (i.e., the Helmholtz reciprocity) for 3D reconstruction of surfaces with arbitrary reflectance. In this paper, we present polarimetric Helmholtz stereopsis (polar-HS), which extends classical HS by considering the polarization state of light in the reciprocal paths. With the additional phase information from polarization, polar-HS requires only one reciprocal image pair. We derive the reciprocity relationship of the Mueller matrix and formulate a new reciprocity constraint that takes the polarization state into account. We also utilize polarimetric constraints and extend them to the case of perspective projection. For the recovery of surface depths and normals, we combine the reciprocity constraint with diffuse/specular polarimetric constraints in a unified optimization framework. For depth estimation, we further propose to utilize the consistency of the diffuse angle of polarization. For normal estimation, we develop a normal refinement strategy based on the degree of linear polarization. Using a hardware prototype, we show that our approach produces high-quality 3D reconstruction for different types of surfaces, ranging from diffuse to highly specular.

Abstract:
The query-oriented micro-video summarization task aims to generate a concise sentence with two properties: (a) summarizing the main semantics of the micro-video and (b) being expressed in the form of search queries to facilitate retrieval. Despite its enormous application value in the retrieval area, this direction has barely been explored. Previous studies of summarization mostly focus on content summarization for traditional long videos. Directly applying these studies tends to yield unsatisfactory results because of the unique features of micro-videos and queries: diverse entities and complex scenes within a short time, semantic gaps between modalities, and various queries in distinct expressions. To specifically adapt to these characteristics, we propose a query-oriented micro-video summarization model, dubbed QMS. It employs an encoder-decoder-based transformer architecture as the skeleton. The multi-modal (visual and textual) signals are passed through two modal-specific encoders to obtain their representations, followed by an entity-aware representation learning module to identify and highlight critical entity information. As for the optimization, given the large semantic gaps between modalities, we assign different confidence scores according to their semantic relevance in the optimization process. Additionally, we develop a novel strategy to sample the effective target query among the diverse query set with various expressions. Extensive experiments demonstrate the superiority of the QMS scheme, on both the summarization and retrieval tasks, over several state-of-the-art methods.

Abstract:
For many inverse problems, the data on which the solution is based is acquired sequentially. We present an approach to the solution of such inverse problems where a sensor can be directed (or otherwise reconfigured on the fly) to acquire a particular measurement. An example problem is magnetic resonance image reconstruction. We use an estimate of mutual information derived from an empirical conditional distribution provided by a generative model to guide our measurement acquisition given measurements acquired so far. The conditionally generated data is a set of samples which are representative of the plausible solutions that satisfy the acquired measurements. We present experiments on toy and real world data sets. We focus on image data but we demonstrate that the method is applicable to a broader class of problems. We also show how a learned model such as a deep neural network can be leveraged to allow generalisation to unseen data. Our informed adaptive sensing method outperforms random sampling, variance based sampling, sparsity based methods, and compressed sensing.

Abstract:
The isomorphism problem, crucial in network analysis, involves analyzing both low-order and high-order structural information. Graph isomorphism algorithms focus on structural equivalence to simplify solver space, aiding applications like protein design, chemical pathways, and community detection. However, they fall short in capturing complex high-order relationships, unlike hypergraph isomorphism methods. Traditional hypergraph methods face challenges like high memory use and inaccurate identification, leading to poor performance. To overcome these, we introduce a hypergraph Weisfeiler-Lehman (WL) test algorithm, extending the WL test from graphs to hypergraphs, and develop a hypergraph WL kernel framework with two variants: the Hypergraph WL Subtree Kernel and the Hypergraph WL Hyperedge Kernel. The Hypergraph WL Subtree Kernel counts different types of rooted subtrees and generates the final feature vector for a given hypergraph by comparing the number of different types of rooted subtrees. The Subtree Kernel identifies different rooted subtrees, while the Hyperedge Kernel focuses on hyperedges' vertex labels, enhancing feature vector generation. We designed a comprehensive set of experiments covering seven graph classification datasets and 12 hypergraph classification datasets. Results on the graph classification datasets indicate that the Hypergraph WL Subtree Kernel achieves the same performance as the classical Graph Weisfeiler-Lehman Subtree Kernel. Results on the hypergraph classification datasets show significant improvements over other typical kernel-based methods, which demonstrates the effectiveness of the proposed methods. In our evaluation, our proposed methods outperform the second-best method in terms of runtime, running over 80 times faster when handling complex hypergraph structures. This significant speed advantage highlights the great potential of our methods in real-world applications.
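To give a feel for how a WL-style refinement might carry over to hypergraphs, here is a small sketch. The update rule is an assumption made for illustration (hyperedges aggregate the labels of their vertices, then vertices aggregate the labels of their incident hyperedges) and is not claimed to be the paper's algorithm; the name `hypergraph_wl` is hypothetical. Labels are kept as nested tuples instead of compressed hashes so that histograms from different hypergraphs can be compared directly.

```python
from collections import Counter

def hypergraph_wl(num_vertices, hyperedges, iterations=2):
    """Return a histogram of refined vertex labels after a few WL-style rounds."""
    labels = [()] * num_vertices  # every vertex starts with the same label
    for _ in range(iterations):
        # Step 1: each hyperedge collects the sorted multiset of its vertex labels.
        edge_labels = [tuple(sorted(labels[v] for v in e)) for e in hyperedges]
        # Step 2: each vertex combines its old label with its incident hyperedge labels.
        new_labels = []
        for v in range(num_vertices):
            incident = tuple(sorted(
                edge_labels[i] for i, e in enumerate(hyperedges) if v in e
            ))
            new_labels.append((labels[v], incident))
        labels = new_labels
    return Counter(labels)

h1 = [(0, 1, 2), (2, 3)]
h2 = [(3, 2, 1), (1, 0)]   # h1 with vertices renamed 0->3, 1->2, 2->1, 3->0
h3 = [(0, 1), (2, 3)]      # a structurally different hypergraph

same = hypergraph_wl(4, h1) == hypergraph_wl(4, h2)
different = hypergraph_wl(4, h1) != hypergraph_wl(4, h3)
```

As with the graph WL test, equal histograms are only a necessary condition for isomorphism; differing histograms prove that two hypergraphs are not isomorphic.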

Abstract:
Schlieren imaging is an optical technique to observe the flow of transparent media, such as air or water, without any particle seeding. However, conventional frame-based techniques require cameras with both high spatial and high temporal resolution, which impose the limitations of bright illumination and expensive computation. Event cameras offer potential advantages (high dynamic range, high temporal resolution, and data efficiency) to overcome such limitations due to their bio-inspired sensing principle. This article presents a novel technique for perceiving air convection using events and frames by providing the first theoretical analysis that connects event data and schlieren. We formulate the problem as a variational optimization that combines the linearized event generation model with a physically-motivated parameterization that estimates the temporal derivative of the air density. The experiments with accurately aligned frame- and event-camera data reveal that the proposed method enables event cameras to obtain results on par with existing frame-based optical flow techniques. Moreover, the proposed method works under dark conditions where frame-based schlieren fails, and also enables slow-motion analysis by leveraging the event camera's advantages. Our work pioneers and opens a new stack of event camera applications, as we publish the source code as well as the first schlieren dataset with high-quality frame and event data.

Abstract:
Humans are able to recognize structured relations in observation, allowing us to decompose complex scenes into simpler parts and abstract the visual world at multiple levels. However, such hierarchical reasoning ability of human perception remains largely unexplored in the current literature on semantic segmentation. Existing works are often aware only of flat labels and distinguish all the semantic categories exclusively for each pixel. In this work, we instead address hierarchical semantic segmentation (HSS), with the aim of providing a structured, pixel-wise description of visual observation in terms of a class hierarchy. We devise Hssn, a general HSS framework that tackles two critical issues in this task: i) how to efficiently adapt existing hierarchy-agnostic segmentation networks to the HSS setting, and ii) how to leverage the class hierarchy to regularize HSS network learning. To address i), Hssn directly casts HSS as a pixel-wise multi-label classification task, bringing only minimal architecture change to current segmentation models. To solve ii), Hssn first explores inherent properties of the hierarchy as a training objective, which enforces segmentation predictions to obey the hierarchy structure. Furthermore, with a set of hierarchy-induced margin constraints, Hssn efficiently reshapes the learned pixel embedding space, so as to generate hierarchy-aware pixel representations and eventually facilitate structured segmentation. Building upon Hssn, we further exploit the mutual exclusion relation between semantic labels and strengthen the margin based regularization strategy with more meaningful constraints, leading to Hssn+, a more effective framework for HSS. We conduct extensive experiments on six semantic segmentation datasets (i.e., Mapillary Vistas 2.0, Cityscapes, LIP, PASCAL-Person-Part, PASCAL-Part-58, and PASCAL-Part-108), with different class hierarchies, network architectures, and backbones, and the results confirm the generalization and superiority of our algorithms.
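Casting hierarchical segmentation as pixel-wise multi-label classification can be pictured with a toy target encoder. The hierarchy below, and the names `PARENT`, `multi_hot` and `hierarchy_violations`, are hypothetical and chosen only for illustration; the sketch shows the label encoding and a simple consistency check, not the training objective of Hssn.

```python
import numpy as np

# Hypothetical two-level hierarchy: child -> parent (None marks a root class).
PARENT = {
    "human": None,
    "vehicle": None,
    "rider": "human",
    "pedestrian": "human",
    "car": "vehicle",
}
CLASSES = list(PARENT)  # fixed class order used for the target vectors

def multi_hot(leaf):
    """Multi-label target for one pixel: the leaf class and all of its ancestors."""
    target = np.zeros(len(CLASSES), dtype=np.float32)
    node = leaf
    while node is not None:
        target[CLASSES.index(node)] = 1.0
        node = PARENT[node]
    return target

def hierarchy_violations(scores):
    """Classes whose predicted score exceeds the score of their parent."""
    return [
        c for c in CLASSES
        if PARENT[c] is not None
        and scores[CLASSES.index(c)] > scores[CLASSES.index(PARENT[c])]
    ]

target = multi_hot("rider")
bad = hierarchy_violations([0.2, 0.9, 0.8, 0.1, 0.7])
```

A prediction that obeys the hierarchy never scores a child above its parent, which is the kind of structure the hierarchy-aware training described above is meant to enforce.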

Abstract:
A number of advanced image editing technologies have demonstrated impressive performance in synthesizing visually pleasing results in accordance with user instructions. In this paper, we further extend the practicality of image editing technology by proposing the conditional image repainting (CIR) task, which requires the model to synthesize realistic visual content based on multiple cross-modality conditions provided by the user. We first define condition inputs and formulate two-phased CIR models as the baseline. After that, we further design unified CIR models with novel condition fusion modules to improve the performance. To allow users to express their intent more freely, our CIR models support both attributes and language to represent the colors of repainted visual content. We demonstrate the effectiveness of CIR models by collecting and processing four datasets. Finally, we present a number of practical application scenarios of CIR models to demonstrate their usability.

Abstract:
In this paper, we propose Generalized Parametric Contrastive Learning (GPaCo/PaCo), which works well on both imbalanced and balanced data. Based on theoretical analysis, we observe that the supervised contrastive loss tends to be biased toward high-frequency classes and thus increases the difficulty of imbalanced learning. We introduce a set of parametric class-wise learnable centers to rebalance from an optimization perspective. Further, we analyze our GPaCo/PaCo loss under a balanced setting. Our analysis demonstrates that GPaCo/PaCo can adaptively enhance the intensity of pushing samples of the same class close as more samples are pulled together with their corresponding centers, which benefits hard example learning. Experiments on long-tailed benchmarks establish a new state of the art for long-tailed recognition. On full ImageNet, models from CNNs to vision transformers trained with the GPaCo loss show better generalization performance and stronger robustness compared with MAE models. Moreover, GPaCo can be applied to the semantic segmentation task, and clear improvements are observed on the 4 most popular benchmarks.

Abstract:
In this work, we propose a novel approach called Operational Support Estimator Networks (OSENs) for the support estimation task. Support Estimation (SE) is defined as finding the locations of non-zero elements in sparse signals. By its very nature, the mapping between the measurement and the sparse signal is a non-linear operation. Traditional support estimators rely on computationally expensive iterative signal recovery techniques to achieve such non-linearity. Unlike convolutional layers, the proposed OSEN approach consists of operational layers that can learn such complex non-linearities without the need for deep networks. In this way, the performance of non-iterative support estimation is greatly improved. Moreover, the operational layers comprise so-called generative super neurons with non-local kernels. The kernel location for each neuron/feature map is optimized jointly for the SE task during training. We evaluate the OSENs in three different applications: i. support estimation from Compressive Sensing (CS) measurements, ii. representation-based classification, and iii. learning-aided CS reconstruction where the output of OSENs is used as prior knowledge to the CS algorithm for enhanced reconstruction. Experimental results show that the proposed approach achieves computational efficiency and outperforms competing methods by significant margins, especially at low measurement rates.
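The support estimation task itself is easy to state in code. The sketch below defines the support of a sparse signal and a naive non-iterative estimate that keeps the `k` largest entries of the back-projection `A.T @ y`. This baseline is an assumption chosen for illustration and is unrelated to the operational layers of OSENs; the matrix `A` and the function names are hypothetical.

```python
import numpy as np

def true_support(x, tol=1e-12):
    """Indices of the non-zero entries of a sparse signal."""
    return np.flatnonzero(np.abs(x) > tol)

def proxy_support(A, y, k):
    """Naive non-iterative estimate: the k largest-magnitude entries of A^T y."""
    proxy = A.T @ y
    top_k = np.argsort(-np.abs(proxy))[:k]
    return np.sort(top_k)

# Toy compressive measurement: 2 measurements of a length-4 signal.
A = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])
x = np.array([3.0, -2.0, 0.0, 0.0])
y = A @ x

estimate = proxy_support(A, y, k=2)
```

Such a linear proxy can fail when columns of `A` are strongly correlated, which is one way to see why the mapping from measurement to support is described above as non-linear.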

Abstract:
Low-rank tensor completion (LRTC) aims to recover missing data of high-dimensional structures from a limited set of observed entries. Despite recent significant successes, the original structures of data tensors are still not effectively preserved in LRTC algorithms, yielding less accurate restoration results. Moreover, LRTC algorithms often incur high computational costs, which hinder their applicability. In this work, we propose an attention-guided low-rank tensor completion (AGTC) algorithm, which can faithfully restore the original structures of data tensors using deep unfolding attention-guided tensor factorization. First, we formulate the LRTC task as a robust factorization problem based on low-rank and sparse error assumptions. Low-rank tensor recovery is guided by an attention mechanism to better preserve the structures of the original data. We also develop implicit regularizers to compensate for modeling inaccuracies. Then, we solve the optimization problem by employing an iterative technique. Finally, we design a multistage deep network by unfolding the iterative algorithm, where each stage corresponds to an iteration of the algorithm; at each stage, the optimization variables and regularizers are updated by closed-form solutions and learned deep networks, respectively. Experimental results for high dynamic range imaging and hyperspectral image restoration show that the proposed algorithm outperforms state-of-the-art algorithms.

Abstract:
Deep models, e.g., CNNs and Vision Transformers, have achieved impressive results in many vision tasks in the closed world. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when directly training the model with new class instances, a fatal problem occurs: the model tends to catastrophically forget the characteristics of former ones, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we comprehensively survey recent advances in class-incremental learning and summarize these methods from several aspects. We also provide a rigorous and unified evaluation of 17 methods in benchmark image classification tasks to find out the characteristics of different algorithms empirically. Furthermore, we notice that the current comparison protocol ignores the influence of memory budget in model storage, which may result in unfair comparison and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, as well as several memory-agnostic performance measures.

Abstract:
Adversarial training is effective in improving the robustness of deep neural networks. However, existing studies still exhibit significant drawbacks in terms of the robustness, generalization, and fairness of models. In this study, we validate the importance of different perturbation directions (i.e., adversarial and anti-adversarial) and bounds from both theoretical and practical perspectives. The influence of adversarial training on deep learning models in terms of fairness, robustness, and generalization is theoretically investigated under a more general perturbation scope that different samples can have different perturbation directions and varied perturbation bounds. Our theoretical explorations suggest that combining adversaries and anti-adversaries with varied bounds in training can be more effective in achieving better fairness among classes and a better tradeoff among robustness, accuracy, and fairness in some typical learning scenarios compared with standard adversarial training. Inspired by our theoretical findings, a more general learning objective that combines adversaries and anti-adversaries with varied bounds on each training sample is presented. To solve this objective, two adversarial training frameworks based on meta-learning and reinforcement learning are proposed, in which the perturbation direction and bound for each sample are determined by its training characteristics. Furthermore, the role of the combination strategy with varied bounds is explained from a regularization perspective. Extensive experiments under different learning scenarios verify our theoretical findings and the effectiveness of the proposed methodology.
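The two perturbation directions can be made concrete on a tiny logistic-regression model. This is a sketch under stated assumptions, not the proposed training frameworks: a single FGSM-style sign step is used, the bound `eps` is set by hand per call, and the model, data and function names are hypothetical.

```python
import numpy as np

def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a linear model on one sample (y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def input_gradient(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return (p - y) * w

def perturb(w, b, x, y, eps, direction):
    """Move x within an L-infinity bound eps, up or down the loss surface."""
    step = eps * np.sign(input_gradient(w, b, x, y))
    # Adversarial: increase the loss. Anti-adversarial: decrease it.
    return x + step if direction == "adversarial" else x - step

w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, 0.5]), 1

clean = logistic_loss(w, b, x, y)
adv = logistic_loss(w, b, perturb(w, b, x, y, 0.1, "adversarial"), y)
anti = logistic_loss(w, b, perturb(w, b, x, y, 0.1, "anti-adversarial"), y)
```

In this sketch the direction and `eps` are arguments, which mirrors the idea that different samples may receive different perturbation directions and bounds.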

Abstract:
In artificial intelligence, it is crucial for pattern recognition systems to process data with uncertain information, necessitating uncertainty reasoning approaches such as evidence theory. As an orderable extension of evidence theory, random permutation set (RPS) theory has received increasing attention. However, RPS theory lacks a suitable generation method for the element order of permutation mass function (PMF) and an efficient determination method for the fusion order of permutation orthogonal sum (POS). To solve these two issues, this paper proposes a reasoning model for RPS theory, called random permutation set reasoning (RPSR). RPSR consists of three techniques, including RPS generation method (RPSGM), RPSR rule of combination, and ordered probability transformation (OPT). Specifically, RPSGM can construct RPS based on Gaussian discriminant model and weight analysis; RPSR rule incorporates POS with reliability vector, which can combine RPS sources with reliability in fusion order; OPT is used to convert RPS into a probability distribution for the final decision. Besides, numerical examples are provided to illustrate the proposed RPSR. Moreover, the proposed RPSR is applied to classification problems. An RPSR-based classification algorithm (RPSRCA) and its hyperparameter tuning method are presented. The results demonstrate the efficiency and stability of RPSRCA compared to existing classifiers.

Abstract:
Previous animation techniques mainly focus on leveraging explicit structure representations (e.g., meshes or keypoints) for transferring motion from driving videos to source images. However, such methods are challenged by large appearance variations between source and driving data, and they require complex additional modules to separately model appearance and motion. To address these issues, we introduce the Latent Image Animator (LIA), streamlined to animate high-resolution images. LIA is designed as a simple autoencoder that does not rely on explicit representations. Motion transfer in the pixel space is modeled as linear navigation of motion codes in the latent space. Specifically, such navigation is represented as an orthogonal motion dictionary learned in a self-supervised manner based on the proposed Linear Motion Decomposition (LMD). Extensive experimental results demonstrate that LIA outperforms the state of the art on the VoxCeleb, TaichiHD, and TED-talk datasets with respect to video quality and spatio-temporal consistency. In addition, LIA is well equipped for zero-shot high-resolution image animation. Code, models, and demo video are available at https://github.com/wyhsirius/LIA.

Abstract:
Geometric Deep Learning has recently made striking progress with the advent of continuous deep implicit fields. They allow for detailed modeling of watertight surfaces of arbitrary topology while not relying on a 3D Euclidean grid, resulting in a learnable parameterization that is unlimited in resolution. Unfortunately, these methods are often unsuitable for applications that require an explicit mesh-based surface representation because converting an implicit field to such a representation relies on the Marching Cubes algorithm, which cannot be differentiated with respect to the underlying implicit field. In this work, we remove this limitation and introduce a differentiable way to produce explicit surface mesh representations from Deep Implicit Fields. Our key insight is that by reasoning on how implicit field perturbations impact local surface geometry, one can ultimately differentiate the 3D location of surface samples with respect to the underlying deep implicit field. We exploit this to define DeepMesh, an end-to-end differentiable mesh representation that can vary its topology. We validate our theoretical insight through several applications: Single view 3D Reconstruction via Differentiable Rendering, Physically-Driven Shape Optimization, Full Scene 3D Reconstruction from Scans and End-to-End Training. In all cases our end-to-end differentiable parameterization gives us an edge over state-of-the-art algorithms.

Abstract:
Urban safety plays an essential role in the quality of citizens’ lives and in the sustainable development of cities. In recent years, researchers have attempted to apply machine learning techniques to identify the role of location-specific attributes in the development of urban safety. However, existing studies have mainly relied on limited images (e.g., map images, single- or four-directional images) of areas based on a relatively large geographical unit and have narrowly focused on severe crime rates, which limits their predictive performance and implications for urban safety. In this work, we propose a novel method that predicts “deviance,” which includes formal deviant crimes (e.g., murders) and informal deviant behaviors (e.g., loud parties at night). To do this, we first collect a large-scale geo-tagged dataset consisting of incident report data for seven metropolitan cities, along with corresponding sequential images around incident sites obtained from Google Street View. We then design a convolutional neural network that learns spatio-temporal visual attributes of deviant streets. Experimental results show that our framework is able to reliably recognize real-world deviance in various cities. Furthermore, we analyze which visual attribute is important for deviance identification and severity estimation with respect to social science as well as activated feature maps in the neural network.

Abstract:
Implicit neural representation (INR) characterizes the attributes of a signal as a function of the corresponding coordinates and has emerged as a powerful tool for solving inverse problems. However, the expressive power of INR is limited by the spectral bias in network training. In this paper, we find that such a frequency-related problem can be largely alleviated by re-arranging the coordinates of the input signal, for which we propose the disorder-invariant implicit neural representation (DINER), which augments a traditional INR backbone with a hash-table. Given discrete signals sharing the same histogram of attributes but different arrangement orders, the hash-table can project the coordinates into the same distribution, for which the mapped signal can be better modeled using the subsequent INR network, leading to significantly alleviated spectral bias. Furthermore, the expressive power of the DINER is determined by the width of the hash-table. Different widths correspond to different geometrical elements in the attribute space, e.g., a 1D curve, 2D curved-plane and 3D curved-volume when the width is set as 1, 2 and 3, respectively. More covered areas of the geometrical elements result in stronger expressive power. Experiments not only reveal the generalization of the DINER for different INR backbones (MLP versus SIREN) and various tasks (image/video representation, phase retrieval, refractive index recovery, and neural radiance field optimization) but also show its superiority over state-of-the-art algorithms in both quality and speed.

Abstract:
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training. In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments. Additionally, we numerically study two implications of the implicit regularization, which intuitively rationalizes why dropout helps generalization. First, we find that input weights of hidden neurons tend to condense on isolated orientations trained with dropout. Condensation is a feature in the non-linear learning process, which makes the network less complex. Second, we find that the training with dropout leads to the neural network with a flatter minimum compared with standard gradient descent training, and the implicit regularization is the key to finding flat solutions. Although our theory mainly focuses on dropout used in the last hidden layer, our experiments apply to general dropout in training neural networks. This work points out a distinct characteristic of dropout compared with stochastic gradient descent and serves as an important basis for fully understanding dropout.

Abstract:
There are two mainstream approaches for object detection: top-down and bottom-up. The state-of-the-art approaches are mainly top-down methods. In this paper, we demonstrate that bottom-up approaches show competitive performance compared with top-down approaches and have higher recall rates. Our approach, named CenterNet, detects each object as a triplet of keypoints (top-left and bottom-right corners and the center keypoint). We first group the corners according to some designed cues and confirm the object locations based on the center keypoints. The corner keypoints allow the approach to detect objects of various scales and shapes and the center keypoint reduces the confusion introduced by a large number of false-positive proposals. Our approach is an anchor-free detector because it does not need to define explicit anchor boxes. We adapt our approach to backbones with different structures, including ‘hourglass’-like networks and ‘pyramid’-like networks, which detect objects in single-resolution and multi-resolution feature maps, respectively. On the MS-COCO dataset, CenterNet with Res2Net-101 and Swin-Transformer achieve average precisions (APs) of 53.7% and 57.1%, respectively, outperforming all existing bottom-up detectors and achieving state-of-the-art performance. We also design a real-time CenterNet model, which achieves a good trade-off between accuracy and speed, with an AP of 43.6% at 30.5 frames per second (FPS).
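
The center-keypoint check can be sketched as follows. This sketch assumes that a corner pair is kept only if some detected center keypoint falls inside the central region of the candidate box, and that this region is the middle `1/n` of the box in each dimension. The choice `n=3`, the function names, and the omission of class matching and scores are simplifications made for illustration, not details stated in the abstract.

```python
def central_region(box, n=3):
    # Middle 1/n of the box along each axis; box is (x1, y1, x2, y2).
    x1, y1, x2, y2 = box
    margin_x = (x2 - x1) * (n - 1) / (2 * n)
    margin_y = (y2 - y1) * (n - 1) / (2 * n)
    return (x1 + margin_x, y1 + margin_y, x2 - margin_x, y2 - margin_y)

def filter_by_center(boxes, centers, n=3):
    # Keep a candidate box only if a center keypoint lies in its central region.
    kept = []
    for box in boxes:
        cx1, cy1, cx2, cy2 = central_region(box, n)
        if any(cx1 <= x <= cx2 and cy1 <= y <= cy2 for x, y in centers):
            kept.append(box)
    return kept

candidates = [(0, 0, 90, 90), (100, 100, 160, 160)]
center_keypoints = [(45, 45)]
kept_boxes = filter_by_center(candidates, center_keypoints)
```

Only the first candidate survives, because no center keypoint supports the second corner pair.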

Abstract:
We propose a novel end-to-end method for cross-view pose estimation. Given a ground-level query image and an aerial image that covers the query's local neighborhood, the 3 Degrees-of-Freedom camera pose of the query is estimated by matching its image descriptor to descriptors of local regions within the aerial image. The orientation-aware descriptors are obtained by using a translationally equivariant convolutional ground image encoder and contrastive learning. The Localization Decoder produces a dense probability distribution in a coarse-to-fine manner with a novel Localization Matching Upsampling module. A smaller Orientation Decoder produces a vector field to condition the orientation estimate on the localization. Our method is validated on the VIGOR and KITTI datasets, where it surpasses the state-of-the-art baseline by 72% and 36% in median localization error for comparable orientation estimation accuracy. The predicted probability distribution can represent localization ambiguity, and enables rejecting possibly erroneous predictions. Without re-training, the model can infer on ground images with different fields of view and utilize orientation priors if available. On the Oxford RobotCar dataset, our method can reliably estimate the ego-vehicle's pose over time, achieving a median localization error under 1 m and a median orientation error of around 1° at 14 FPS.

Abstract:
This paper proposes an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images. Conventional deep metric learning methods focus on learning a discriminative embedding to describe the semantic features of images, which ignore the existence of uncertainty in each image resulting from noise or semantic ambiguity. Training without awareness of these uncertainties causes the model to overfit the annotated labels during training and produce overconfident judgments during inference. Motivated by this, we argue that a good similarity model should consider the semantic discrepancies with awareness of the uncertainty to better deal with ambiguous images for more robust training. To achieve this, we propose to represent an image using not only a semantic embedding but also an accompanying uncertainty embedding, which describes the semantic characteristics and ambiguity of an image, respectively. We further propose an introspective similarity metric to make similarity judgments between images considering both their semantic differences and ambiguities. The gradient analysis of the proposed metric shows that it enables the model to learn at an adaptive and slower pace to deal with the uncertainty during training. Our framework attains state-of-the-art performance on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets for image retrieval. We further evaluate our framework for image classification on the ImageNet-1K, CIFAR-10, and CIFAR-100 datasets, which shows that equipping existing data mixing methods with the proposed introspective metric consistently achieves better results (e.g., +0.44% for CutMix on ImageNet-1K).

Abstract:
Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Although early diagnosis and treatment can greatly improve the chances of survival, it remains a major challenge, especially in developing countries. Recently, computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. To address this, we establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD. Furthermore, we propose a strong baseline, SymFormer, for simultaneous CXR image classification and TB infection area detection. SymFormer incorporates Symmetric Search Attention (SymAttention) to tackle the bilateral symmetry property of CXR images for learning discriminative features. Since CXR images may not strictly adhere to the bilateral symmetry property, we also propose Symmetric Positional Encoding (SPE) to facilitate SymAttention through feature recalibration. To promote future research on CTD, we build a benchmark by introducing evaluation metrics, evaluating baseline models reformed from existing detectors, and running an online challenge. Experiments show that SymFormer achieves state-of-the-art performance on the TBX11K dataset.

Abstract:
Satellites are capable of capturing high-resolution videos, which makes vehicle perception from satellites possible. Compared to street surveillance, dashboard cameras, or other equipment, satellite videos provide a much broader city-scale view, so that the global dynamic scene of the traffic is captured and displayed. Traffic monitoring from satellites is a new task with great potential applications, including traffic jam prediction, path planning, vehicle dispatching, etc. In practice, limited by the resolution and view, the captured vehicles are very tiny (a few pixels) and move slowly. Worse still, these satellites are in Low Earth Orbit (LEO) to capture such high-resolution videos, so the background is also moving. Under these circumstances, traffic monitoring from the satellite view is an extremely challenging task. To attract more researchers to this field, we build a large-scale benchmark for traffic monitoring from satellites. It supports several tasks, including tiny object detection, counting, and density estimation. The dataset is constructed from 12 satellite videos and 14 synthetic videos recorded from GTA-V. They are separated into 408 video clips, which contain 7,336 real satellite images and 1,960 synthetic images. In total, 128,801 vehicles are annotated, and the number of vehicles in each image varies from 0 to 101. Several classic and state-of-the-art approaches in traditional computer vision are evaluated on the dataset, so as to compare the performance of different approaches, analyze the challenges in this task, and discuss future prospects.

Abstract:
MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, by migrating our focus away from the token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their gratifying performance. We summarize our observations as follows: 1) MetaFormer ensures solid lower bound of performance: By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. 2) MetaFormer works well with arbitrary token mixers: When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. 3) MetaFormer effortlessly offers state-of-the-art results: With just conventional token mixers dated back five years ago, the models instantiated from MetaFormer already beat state of the art. a) ConvFormer outperforms ConvNeXt: Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as pure CNNs, outperforms the strong CNN model ConvNeXt. b) CAFormer sets new record on ImageNet-1K: By simply applying depthwise separable convolutions as token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model CAFormer sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224 × 224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces 71% FLOPs of activation compared with commonly-used GELU yet achieves better performance. Specifically, StarReLU is a variant of Squared ReLU dedicated to alleviating distribution shift. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
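
One way to read "a variant of Squared ReLU dedicated to alleviating distribution shift" is a squared ReLU with a scale and a bias chosen so that the output has zero mean and unit variance when the input is standard normal. The sketch below follows that reading. The functional form `scale * relu(x)**2 + bias` and the constants are assumptions derived here, not values quoted from the abstract. For a standard normal input, `relu(x)**2` has mean 0.5 and variance 1.5 - 0.25 = 1.25, which gives the constants used.

```python
import math

# Constants that normalize relu(x)**2 for standard normal x (assumption, see lead-in):
# mean 0.5 and variance 1.25 give scale = 1/sqrt(1.25) and bias = -0.5 * scale.
SCALE = 1.0 / math.sqrt(1.25)
BIAS = -0.5 * SCALE

def star_relu(x, scale=SCALE, bias=BIAS):
    # Squared ReLU followed by an affine correction.
    r = max(0.0, x)
    return scale * r * r + bias

values = [star_relu(v) for v in (-3.0, 0.0, 2.0)]
```

Compared with GELU, this form needs only a comparison, two multiplications, and one addition per element, which is consistent with the lower activation cost the abstract reports.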

Abstract:
The training and testing data for deep-neural-network-based classifiers are usually assumed to be sampled from the same distribution. When part of the testing samples are drawn from a distribution that is sufficiently far away from that of the training samples (a.k.a. out-of-distribution (OOD) samples), the trained neural network has a tendency to make high-confidence predictions for these OOD samples. Detection of the OOD samples is critical when training a neural network used for image classification, object detection, etc. It can enhance the classifier's robustness to irrelevant inputs, and improve the system's resilience and security under different forms of attacks. Detection of OOD samples has three main challenges: (i) the proposed OOD detection method should be compatible with various architectures of classifiers (e.g., DenseNet, ResNet) without significantly increasing the model complexity and requirements on computational resources; (ii) the OOD samples may come from multiple distributions, whose class labels are commonly unavailable; (iii) a score function needs to be defined to effectively separate OOD samples from in-distribution (InD) samples. To overcome these challenges, we propose a Wasserstein-based out-of-distribution detection (WOOD) method. The basic idea is to define a Wasserstein-based score that evaluates the dissimilarity between a test sample and the distribution of InD samples. An optimization problem is then formulated and solved based on the proposed score function. The statistical learning bound of the proposed method is investigated to guarantee that the loss value achieved by the empirical optimizer approximates the global optimum. The comparison study results demonstrate that the proposed WOOD consistently outperforms other existing OOD detection methods.
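
A Wasserstein-based score of this kind can be sketched under explicit assumptions. The sketch assumes that the score compares the classifier's softmax output with the one-hot label distributions of the in-distribution classes and takes the smallest distance. It also uses the fact that transporting any distribution onto a point mass has only one feasible plan, so the distance to a one-hot label reduces to a cost-weighted sum. The function names and the 0-1 cost matrix are hypothetical choices for illustration.

```python
def wasserstein_to_onehot(probs, k, cost):
    # All mass must move to class k, so the optimal transport cost is a weighted sum.
    return sum(probs[j] * cost[j][k] for j in range(len(probs)))

def ood_score(probs, cost):
    # Distance to the closest one-hot label; larger means less like any known class.
    return min(wasserstein_to_onehot(probs, k, cost) for k in range(len(probs)))

num_classes = 3
zero_one_cost = [[0.0 if i == j else 1.0 for j in range(num_classes)]
                 for i in range(num_classes)]

confident_score = ood_score([0.9, 0.05, 0.05], zero_one_cost)
uniform_score = ood_score([1 / 3, 1 / 3, 1 / 3], zero_one_cost)
```

With the 0-1 cost the score equals one minus the largest class probability, so a near-uniform prediction scores higher than a confident one.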

Abstract:
Graph-structured data, where nodes exhibit either pair-wise or high-order relations, are ubiquitous and essential in graph learning. Despite the great achievement made by existing graph learning models, these models use the direct information (edges or hyperedges) from graphs and do not adopt the underlying indirect information (hidden pair-wise or high-order relations). To address this issue, in this paper, we propose a general framework named Simplicial Complex Neural (SCN) network, in which we construct a simplicial complex based on the direct and indirect graph information from a graph so that all information can be employed in the complex network learning. Specifically, we learn representations of simplices by aggregating and integrating information from all the simplices together via layer-by-layer simplicial complex propagation. In consequence, the representations of nodes, edges, and other high-order simplices are obtained simultaneously and can be used for learning purposes. By making use of block matrix properties, we derive the theoretical bound of the simplicial complex filter learnt by the propagation and establish the generalization error bound of the proposed simplicial complex network. We perform extensive experiments on node (0-simplex), edge (1-simplex), and triangle (2-simplex) classifications, and promising results demonstrate the performance of the proposed method is better than that of existing graph and hypergraph network approaches.

Abstract:
Graph data collected from the real world often contains noise, making it imperative to develop robust representation learning tools for graphs. While existing research has primarily focused on feature smoothing, the robustness of the underlying geometric structure is frequently overlooked. In addition, the prevalent use of the L2-norm for achieving global smoothness in graph neural networks shrinks many local characteristics, limiting their expressivity on a node's neighboring information. This article introduces novel regularizers designed to address noise in both feature and structural aspects of graph data. We employ the alternating direction method of multipliers (ADMM) to optimize the objective function. Our proposed approach effectively prevents oversmoothing graph signal representations when applying multiple layers and ensures convergence to optimal solutions. Empirical results from our study demonstrate the superior performance of our proposed DoT over popular graph convolutions, especially in scenarios where the graph is heavily contaminated.

Abstract:
Video activity anticipation aims to predict what will happen in the future, with broad application prospects ranging from robot vision to autonomous driving. Despite the recent progress, the data uncertainty issue, reflected in the content evolution process and the dynamic correlation in event labels, has been largely ignored. This reduces the model's generalization ability and its deep understanding of video content, leading to serious error accumulation and degraded performance. In this paper, we address the uncertainty learning problem and propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results. The uncertainty value is used to derive a temperature parameter in the softmax function to modulate the predicted target activity distribution. To guarantee the distribution adjustment, we construct a reasonable target activity label representation by incorporating the activity evolution from the temporal class correlation and the semantic relationship. Moreover, we quantify the uncertainty into relative values by comparing the uncertainty among sample pairs and their temporal lengths. This relative strategy provides a more accessible way of modeling uncertainty than quantifying the absolute uncertainty values on the whole dataset. Experiments on multiple backbones and benchmarks show that our framework achieves promising performance and better robustness and interpretability.
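
The temperature modulation can be sketched as follows. The mapping from uncertainty to temperature used here, `1.0 + uncertainty`, is a placeholder assumption, since the abstract does not state how the temperature is derived. The sketch shows only the qualitative effect: a higher uncertainty flattens the predicted activity distribution.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with a temperature divisor.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def modulated_distribution(logits, uncertainty):
    # Hypothetical mapping: temperature grows linearly with a non-negative uncertainty.
    return softmax(logits, temperature=1.0 + uncertainty)

logits = [2.0, 1.0, 0.0]
confident = modulated_distribution(logits, uncertainty=0.0)
hedged = modulated_distribution(logits, uncertainty=3.0)
```

The class ranking is unchanged between `confident` and `hedged`; only the sharpness of the distribution differs.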

Abstract:
Inspired by the masked language modeling (MLM) in natural language processing tasks, the masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. However, the high random mask ratio of MIM results in two serious problems: 1) the inadequate data utilization of images within each iteration brings prolonged pre-training, and 2) the high inconsistency of predictions results in unreliable generations, i.e., the prediction of the identical patch may be inconsistent in different mask rounds, leading to divergent semantics in the ultimately generated outcomes. To tackle these problems, we propose the efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency and increase the consistency of MIM. In particular, we present a parallel mask strategy that divides the image into K non-overlapping parts, each of which is generated by a random mask with the same mask ratio. Then the MIM task is conducted parallelly on all parts in an iteration and the model minimizes the loss between the predictions and the masked patches. Besides, we design the self-consistency learning to further maintain the consistency of predictions of overlapping masked patches among parts. Overall, our method is able to exploit the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves the best performance on ViT-Large with only 13% of MAE pre-training time using NVIDIA A100 GPUs. After pre-training on diverse datasets, EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
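
The parallel mask strategy can be sketched under one interpretation: the patch indices are shuffled once and split into K disjoint visible sets, and each part masks everything outside its own visible set. Under this reading every patch is visible in exactly one part, and each part has mask ratio 1 - 1/K. The function name, the seed handling, and the requirement that the patch count be divisible by K are assumptions made for this illustration.

```python
import random

def parallel_masks(num_patches, k, seed=0):
    # Returns k (visible, masked) index lists with pairwise disjoint visible sets.
    if num_patches % k != 0:
        raise ValueError("this sketch assumes num_patches is divisible by k")
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    size = num_patches // k
    parts = []
    for i in range(k):
        visible = sorted(order[i * size:(i + 1) * size])
        visible_set = set(visible)
        masked = [p for p in range(num_patches) if p not in visible_set]
        parts.append((visible, masked))
    return parts

# 14 x 14 = 196 patches split into 4 parts: each part sees 49 patches and masks 147.
parts = parallel_masks(196, 4)
```

Patches masked in more than one part are the "overlapping masked patches" whose predictions a self-consistency term could compare.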

Abstract:
Sequential learning using transformers has achieved state-of-the-art performance in natural language tasks and many others. The key to this success is multi-head self-attention, which encodes and gathers the features from individual tokens of an input sequence. The mapping or decoding is performed to produce an output sequence via cross-attention. Such an attention framework has three weaknesses. First, since the attention mixes up the features of different tokens in input and output sequences, it is likely that redundant information exists in the sequence data representation. Second, the patterns of attention weights among different heads tend to be similar, so the model capacity is bounded. Third, the robustness of an encoder-decoder network against model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention to learn latent disentanglement in multi-head attention, where the redundant features in the transformer are compensated with the latent topic information. The attention weights are filtered by a mask which is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. The experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.

Abstract:
Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous convolutional or recurrent approaches in various vision processing tasks. Specifically, vision transformers offer robust, unified, and even simpler solutions for various segmentation tasks. This survey provides a thorough overview of transformer-based visual segmentation, summarizing recent advancements. We first review the background, encompassing problem definitions, datasets, and prior convolutional methods. Next, we summarize a meta-architecture that unifies all recent transformer-based approaches. Based on this meta-architecture, we examine various method designs, including modifications to the meta-architecture and associated applications. We also present several specific subfields, including 3D point cloud segmentation, foundation model tuning, domain-aware segmentation, efficient segmentation, and medical segmentation. Additionally, we compile and re-evaluate the reviewed methods on several well-established datasets. Finally, we identify open challenges in this field and propose directions for future research.

Abstract:
Image- and video-based 3D human recovery (i.e., pose and shape estimation) have achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to the real data that is typically collected indoor. We highlight that our investigation into domain gap provides explanations for our data mixture strategies that are simple yet useful, which offers new insights to the research community. Third, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study on multiple key factors (such as camera angle and body pose) reveals that the model performance is sensitive to data density. Fourth, the effectiveness of GTA-Human is also attributed to the rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work could pave the way for scaling up 3D human recovery to the real world.

Affiliations: MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration, Shanghai Frontiers Science Center of Human-centered Artificial Intelligence, ShanghaiTech University, Shanghai, China; Shanghai AI Lab, Shanghai, China; Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; University of Queensland, St Lucia, QLD, Australia; University of Maryland, College Park, MD, USA; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Beijing, China; Shanghai Jiao Tong University, Shanghai, China; Chinese University of Hong Kong, Ma Liu Shui, Hong Kong

Abstract:
In recent years, vision-centric Bird’s Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion. The rapid advancements in deep learning have led to the proposal of numerous methods for addressing vision-centric BEV perception challenges. However, there has been no recent survey encompassing this novel and burgeoning research field. To catalyze future research, this paper presents a comprehensive survey of the latest developments in vision-centric BEV perception and its extensions. It compiles and organizes up-to-date knowledge, offering a systematic review and summary of prevalent algorithms. Additionally, the paper provides in-depth analyses and comparative results on various BEV perception tasks, facilitating the evaluation of future works and sparking new research directions. Furthermore, the paper discusses and shares valuable empirical implementation details to aid in the advancement of related algorithms.

Abstract:
Federated learning is a distributed paradigm that allows multiple parties to collaboratively train deep learning models without direct exchange of raw data. Nevertheless, the inherent non-independent and identically distributed (non-i.i.d.) nature of data distribution among clients results in significant degradation of the acquired model. The primary goal of this study is to develop a robust federated learning algorithm to address feature shift in clients’ samples, potentially arising from a range of factors such as acquisition discrepancies in medical imaging. To reach this goal, we first propose federated feature augmentation (FedFA^l), a novel feature augmentation technique tailored for federated learning. FedFA^l is based on a crucial insight that each client's data distribution can be characterized by first-/second-order statistics (a.k.a., mean and standard deviation) of latent features; and it is feasible to manipulate these local statistics globally, i.e., based on information in the entire federation, to let clients have a better sense of the global distribution across clients. Grounded on this insight, we propose to augment each local feature statistic based on a normal distribution, wherein the mean corresponds to the original statistic, and the variance defines the augmentation scope. Central to FedFA^l is the determination of a meaningful Gaussian variance, which is accomplished by taking into account not only biased data of each individual client, but also underlying feature statistics represented by all participating clients. Beyond consideration of low-order statistics in FedFA^l, we propose a federated feature alignment component (FedFA^h) that exploits higher-order feature statistics to gain a more detailed understanding of local feature distribution and enables explicit alignment of augmented features in different clients to promote more consistent feature learning. Combining FedFA^l and FedFA^h yields our full approach FedFA++. FedFA++ is non-parametric, incurs negligible additional communication costs, and can be seamlessly incorporated into popular CNN and Transformer architectures. We offer rigorous theoretical analysis, as well as extensive empirical justifications to demonstrate the effectiveness of the algorithm.
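
The statistic-level augmentation can be sketched for a single feature channel. In this sketch the two Gaussian variances are passed in as plain numbers; how they are actually determined from client and federation statistics is not reproduced here. The function name and the re-normalization step (standardize with the original statistics, then re-scale with the sampled ones) are assumptions made for illustration.

```python
import random
import statistics

def augment_channel(features, var_mean, var_std, rng):
    # Original first- and second-order statistics of the channel.
    mean = statistics.mean(features)
    std = statistics.pstdev(features) + 1e-6  # small constant avoids division by zero
    # Sample new statistics from Gaussians centered at the original ones.
    new_mean = rng.gauss(mean, var_mean ** 0.5)
    new_std = rng.gauss(std, var_std ** 0.5)
    # Standardize with the original statistics, then apply the sampled ones.
    return [new_std * (x - mean) / std + new_mean for x in features]

channel = [1.0, 2.0, 3.0, 4.0]
unchanged = augment_channel(channel, 0.0, 0.0, random.Random(0))
augmented = augment_channel(channel, 0.01, 0.01, random.Random(0))
```

With zero variances the channel is returned unchanged up to rounding. With positive variances the values shift and re-scale, and their relative order is preserved as long as the sampled standard deviation stays positive.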

Abstract:
Detecting out-of-distribution (OOD) samples is essential for ensuring the reliability of deep neural networks (DNNs) in real-world scenarios. While previous research has predominantly investigated the disparity between in-distribution (ID) and OOD data through forward information analysis, the discrepancy in parameter gradients during the backward process of DNNs has received insufficient attention. Existing studies on gradient disparities mainly focus on the utilization of gradient norms, neglecting the wealth of information embedded in gradient directions. To bridge this gap, in this paper, we conduct a comprehensive investigation into leveraging the entirety of gradient information for OOD detection. The primary challenge arises from the high dimensionality of gradients due to the large number of network parameters. To solve this problem, we propose performing linear dimension reduction on the gradient using a designated subspace that comprises principal components. This innovative technique enables us to obtain a low-dimensional representation of the gradient with minimal information loss. Subsequently, by integrating the reduced gradient with various existing detection score functions, our approach demonstrates superior performance across a wide range of detection tasks. For instance, on the ImageNet benchmark with ResNet50 model, our method achieves an average reduction of 11.15% in the false positive rate at 95% recall (FPR95) compared to the current state-of-the-art approach.

Abstract:
The mean shift (MS) algorithm seeks a mode of the kernel density estimate (KDE). This study presents a convergence guarantee of the mode estimate sequence generated by the MS algorithm and an evaluation of the convergence rate, under fairly mild conditions, with the help of the argument concerning the Łojasiewicz inequality. Our findings extend existing ones covering analytic kernels and the Epanechnikov kernel. Those are significant in that they cover the biweight kernel, which is optimal among non-negative kernels in terms of the asymptotic statistical efficiency for the KDE-based mode estimation.
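
The mean shift iteration for the biweight kernel can be sketched in one dimension. For a kernel with profile k, the update uses weights proportional to -k'. The biweight profile k(s) = (1 - s)^2 on 0 <= s <= 1 gives weights proportional to (1 - s), where s is the squared scaled distance. The function name, the stopping rule, and the sample data are illustrative assumptions.

```python
def mean_shift_mode(data, start, bandwidth, tol=1e-10, max_iter=1000):
    # Iterates x <- sum(w_i * x_i) / sum(w_i) with w_i = max(0, 1 - ((x - x_i)/h)^2).
    x = start
    for _ in range(max_iter):
        weights = [max(0.0, 1.0 - ((x - xi) / bandwidth) ** 2) for xi in data]
        total = sum(weights)
        if total == 0.0:
            return x  # no sample within the bandwidth; the update is undefined
        new_x = sum(w * xi for w, xi in zip(weights, data)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

samples = [0.9, 1.0, 1.1, 5.0, 5.1]
left_mode = mean_shift_mode(samples, start=0.5, bandwidth=1.0)
right_mode = mean_shift_mode(samples, start=4.6, bandwidth=1.0)
```

Different starting points converge to different modes of the kernel density estimate, here one near each cluster of samples.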

Abstract:
A long-standing topic in artificial intelligence is the effective recognition of patterns from noisy images. In this regard, the recent data-driven paradigm considers 1) improving the representation robustness by adding noisy samples in training phase (i.e., data augmentation) or 2) pre-processing the noisy image by learning to solve the inverse problem (i.e., image denoising). However, such methods generally exhibit inefficient process and unstable result, limiting their practical applications. In this paper, we explore a non-learning paradigm that aims to derive robust representation directly from noisy images, without the denoising as pre-processing. Here, the noise-robust representation is designed as Fractional-order Moments in Radon space (FMR), with also beneficial properties of orthogonality and rotation invariance. Unlike earlier integer-order methods, our work is a more generic design taking such classical methods as special cases, and the introduced fractional-order parameter offers time-frequency analysis capability that is not available in classical methods. Formally, both implicit and explicit paths for constructing the FMR are discussed in detail. Extensive simulation experiments and robust visual applications are provided to demonstrate the uniqueness and usefulness of our FMR, especially for noise robustness, rotation invariance, and time-frequency discriminability.

Abstract:
Multi-Source-Free Unsupervised Domain Adaptation (MSFUDA) requires aggregating knowledge from multiple source models and adapting it to the target domain. Two challenges remain: 1) suboptimal coarse-grained (domain-level) aggregation of multiple source models, and 2) risky semantics propagation based on local structures. In this article, we propose an evidential learning method for MSFUDA, where we formulate two uncertainties, i.e. Evidential Prediction Uncertainty (EPU) and Evidential Adjacency-Consistent Uncertainty (EAU), respectively for addressing the two challenges. The former, EPU, captures the uncertainty of a sample fitted to a source model, which can suggest the preferences of target samples for different source models. Based on this, we develop an EPU-Based Multi-Source Aggregation module to achieve fine-grained, instance-level source knowledge aggregation. The latter, EAU, provides a robust measure of consistency among adjacent samples in the target domain. Utilizing this, we develop an EAU-Guided Local Structure Mining module to ensure the trustworthy propagation of semantics. The two modules are integrated into the Evidential Aggregation and Adaptation Framework (EAAF), and we demonstrated that this framework achieves state-of-the-art performances on three MSFUDA benchmarks.

Abstract:
This survey is in remembrance of one of the creators of the information bottleneck theory, Prof. Naftali Tishby, who passed away at the age of 68 in August 2021. Information bottleneck (IB), a novel information theoretic approach for pattern analysis and representation learning, has gained widespread popularity since its birth in 1999. It provides an elegant balance between data compression and information preservation, and improves its prediction or representation ability accordingly. This survey summarizes both the theoretical progress and practical applications of IB over the past 20-plus years, where its basic theory, optimization, extensive models and task-oriented algorithms are systematically explored. Existing IB methods are roughly divided into two parts: traditional and deep IB, where the former contains the IBs optimized by traditional machine learning analysis techniques without involving any neural networks, and the latter includes the IBs involving the interpretation, optimization and improvement of deep neural networks (DNNs). Specifically, based on the technique taxonomy, traditional IBs are further classified into three categories: Basic, Informative and Propagating IB; while the deep IBs, based on the taxonomy of problem settings, contain Debate: Understanding DNNs with IB, Optimizing DNNs Using IB, and DNN-based IB methods. Furthermore, some potential issues deserving future research are discussed. This survey attempts to draw a more complete picture of IB, from which the subsequent studies can benefit.

Abstract:
Federated human activity recognition (FHAR) has attracted much attention due to its great potential in privacy protection. Existing FHAR methods can collaboratively learn a global activity recognition model based on unimodal or multimodal data distributed on different local clients. However, it is still questionable whether existing methods can work well in a more common scenario where local data are from different modalities, e.g., some local clients may provide motion signals while others can only provide visual data. In this article, we study a new problem of cross-modal federated human activity recognition (CM-FHAR), which is conducive to promoting the large-scale use of the HAR model on more local devices. CM-FHAR has at least three dedicated challenges: 1) distributive common cross-modal feature learning, 2) modality-dependent discriminative feature learning, and 3) the modality imbalance issue. To address these challenges, we propose a modality-collaborative activity recognition network (MCARN), which can comprehensively learn a global activity classifier shared across all clients and multiple modality-dependent private activity classifiers. To produce modality-agnostic and modality-specific features, we learn an altruistic encoder and an egocentric encoder under the constraint of a separation loss and an adversarial modality discriminator collaboratively learned in hyper-sphere. To address the modality imbalance issue, we propose an angular margin adjustment scheme to improve the modality discriminator on modality-imbalanced data by enhancing the intra-modality compactness of the dominant modality and increasing the inter-modality discrepancy. Moreover, we propose a relation-aware global-local calibration mechanism to constrain class-level pairwise relationships for the parameters of the private classifier. Finally, through decentralized optimization with alternating steps of adversarial local updating and modality-aware global aggregation, the proposed MCARN obtains state-of-the-art performance on both modality-balanced and modality-imbalanced data.

Abstract:
Point clouds have garnered increasing research attention and found numerous practical applications. However, many of these applications, such as autonomous driving and robotic manipulation, rely on sequential point clouds, essentially adding a temporal dimension to the data (i.e., four dimensions), because the information that static point cloud data can provide is still limited. Recent research efforts have been directed towards enhancing the understanding and utilization of sequential point clouds. This paper offers a comprehensive review of deep learning methods applied to sequential point cloud research, encompassing dynamic flow estimation, object detection & tracking, point cloud segmentation, and point cloud forecasting. This paper further summarizes and compares the quantitative results of the reviewed methods over the public benchmark datasets. Ultimately, the paper concludes by addressing the challenges in current sequential point cloud research and pointing towards promising avenues for future research.

Abstract:
In recent years, the security of deep learning models has attracted more and more attention with the rapid development of neural networks, which are vulnerable to adversarial examples. Almost all existing gradient-based attack methods use the sign function in the generation to meet the requirement of the perturbation budget on the $L_\infty$ norm. However, we find that the sign function may be improper for generating adversarial examples since it modifies the exact gradient direction. Instead of using the sign function, we propose to directly utilize the exact gradient direction with a scaling factor for generating adversarial perturbations, which improves the attack success rates of adversarial examples even with fewer perturbations. At the same time, we also theoretically prove that this method can achieve better black-box transferability. Moreover, considering that the best scaling factor varies across different images, we propose an adaptive scaling factor generator to seek an appropriate scaling factor for each image, which avoids the computational cost of manually searching for the scaling factor. Our method can be integrated with almost all existing gradient-based attack methods to further improve their attack success rates. Extensive experiments on the CIFAR10 and ImageNet datasets show that our method exhibits higher transferability and outperforms the state-of-the-art methods.
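The contrast between a sign-based step and a step along the exact gradient direction can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the normalization by the mean absolute gradient, and the fixed `scale` value are assumptions made only for illustration (the abstract describes an adaptive, per-image scaling factor generator, which is not reproduced here).

```python
import numpy as np

def sign_step(grad, eps):
    """Sign-based step: every coordinate moves by exactly eps."""
    return eps * np.sign(grad)

def scaled_gradient_step(grad, eps, scale):
    """Step along the exact gradient direction, clipped to the L_inf budget.

    `scale` is a hypothetical scaling factor; normalizing by the mean
    absolute gradient is an assumption of this sketch.
    """
    direction = grad / (np.mean(np.abs(grad)) + 1e-12)
    return np.clip(scale * eps * direction, -eps, eps)

def cosine(a, b):
    """Cosine similarity between two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
grad = rng.normal(size=(3, 8, 8))   # stand-in for a loss gradient w.r.t. the image
eps = 8 / 255                       # L_inf perturbation budget

delta_sign = sign_step(grad, eps)
delta_scaled = scaled_gradient_step(grad, eps, scale=0.5)

print("cosine(sign step, grad)  :", cosine(delta_sign, grad))
print("cosine(scaled step, grad):", cosine(delta_scaled, grad))
print("L2 norm sign / scaled    :", np.linalg.norm(delta_sign), np.linalg.norm(delta_scaled))
```

Both perturbations respect the same $L_\infty$ budget, but the scaled step stays closer to the gradient direction and has a smaller overall magnitude, which is the intuition the abstract appeals to.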

Affiliations: School of Computer Science and Technology, East China Normal University, Shanghai, China; Distributed and Parallel Software Laboratory, Labs, Huawei Technologies, Hangzhou, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Information Science and Technology, Xiamen University, Fujian, China; School of Computer Science and Technology, East China Normal University, Shanghai, China; College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, China; JD Explore Academy, China and The University of Sydney, Camperdown, NSW, Australia

Abstract:
Information Bottleneck (IB) provides an information-theoretic principle for multi-view learning by revealing the various components contained in each viewpoint. This highlights the necessity to capture their distinct roles to achieve view-invariance and predictive representations but remains under-explored due to the technical intractability of modeling and organizing innumerable mutual information (MI) terms. Recent studies show that sufficiency and consistency play such key roles in multi-view representation learning, and could be preserved via a variational distillation framework. But when it generalizes to arbitrary viewpoints, such strategy fails as the mutual information terms of consistency become complicated. This paper presents Multi-View Variational Distillation (MV$^2$D), tackling the above limitations for generalized multi-view learning. Uniquely, MV$^2$D can recognize useful consistent information and prioritize diverse components by their generalization ability. This guides an analytical and scalable solution to achieving both sufficiency and consistency. Additionally, by rigorously reformulating the IB objective, MV$^2$D tackles the difficulties in MI optimization and fully realizes the theoretical advantages of the information bottleneck principle. We extensively evaluate our model on diverse tasks to verify its effectiveness, where the considerable gains provide key insights into achieving generalized multi-view representations under a rigorous information-theoretic principle.

Abstract:
Existing deep learning-based video super-resolution (SR) methods usually depend on the supervised learning approach, where the training data is usually generated by the blurring operation with known or predefined kernels (e.g., Bicubic kernel) followed by a decimation operation. However, this does not hold for real applications as the degradation process is complex and cannot be approximated well by these ideal cases. Moreover, obtaining high-resolution (HR) videos and the corresponding low-resolution (LR) ones in real-world scenarios is difficult. To overcome these problems, we propose a self-supervised learning method to solve the blind video SR problem, which simultaneously estimates blur kernels and HR videos from the LR videos. As directly using LR videos as supervision usually leads to trivial solutions, we develop a simple and effective method to generate auxiliary paired data from original LR videos according to the image formation of video SR, so that the networks can be better constrained by the generated paired data for both blur kernel estimation and latent HR video restoration. In addition, we introduce an optical flow estimation module to exploit the information from adjacent frames for HR video restoration. Experiments show that our method performs favorably against state-of-the-art ones on benchmarks and real-world videos.

Abstract:
Current point cloud denoising (PCD) models optimize single networks, trying to make their parameters adaptive to each point in a large pool of point clouds. Such a denoising network paradigm neglects that different points are often corrupted by different levels of noise and that they may convey different geometric structures. Thus, the intricacy of both noise and geometry poses side effects including remnant noise, wrongly-smoothed edges, and distorted shape after denoising. We propose PathNet, a path-selective PCD paradigm based on reinforcement learning (RL). Unlike existing efforts, PathNet enables dynamic selection of the most appropriate denoising path for each point, best moving it onto its underlying surface. Beyond proposing the first path-selective PCD framework, we make two further contributions. First, to leverage geometry expertise and benefit from training data, we propose a noise- and geometry-aware reward function to train the routing agent in RL. Second, the routing agent and the denoising network are trained jointly to avoid under- and over-smoothing. Extensive experiments show promising improvements of PathNet over its competitors, in terms of the effectiveness for removing different levels of noise and preserving multi-scale surface geometries. Furthermore, PathNet generalizes itself more smoothly to real scans than cutting-edge models.

Abstract:
Counterfactuals can explain classification decisions of neural networks in a human interpretable way. We propose a simple but effective method to generate such counterfactuals. More specifically, we perform a suitable diffeomorphic coordinate transformation and then perform gradient ascent in these coordinates to find counterfactuals which are classified with great confidence as a specified target class. We propose two methods to leverage generative models to construct such suitable coordinate systems that are either exactly or approximately diffeomorphic. We analyze the generation process theoretically using Riemannian differential geometry and validate the quality of the generated counterfactuals using various qualitative and quantitative measures.

Abstract:
Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and smaller memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and incurs negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at QFormer.

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey, Guildford, U.K.; School of Control Science and Engineering, Shandong University, Jinan, China; School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, QLD, Australia; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
The new generation of organic light-emitting diode displays is designed to enable high dynamic range (HDR), going beyond the standard dynamic range (SDR) supported by traditional display devices. However, a large number of videos are still in SDR format. Further, most pre-existing videos are compressed to varying degrees to minimize storage and traffic demands. To enable a movie-going experience on new generation devices, converting the compressed SDR videos to the HDR format (i.e., compressed-SDR to HDR conversion) is in great demand. The key challenge with this new problem is how to solve the intrinsic many-to-many mapping issue. However, without constraining the solution space or simply imitating the inverse camera imaging pipeline in stages, existing SDR-to-HDR methods cannot formulate the HDR video generation process explicitly. Besides, they ignore the fact that videos are often compressed. To address these challenges, in this work we propose novel imaging knowledge-inspired parallel networks (termed KPNet) for compressed-SDR to HDR (CSDR-to-HDR) video reconstruction. KPNet has two key designs: Knowledge-Inspired Block (KIB) and Information Fusion Module (IFM). Concretely, mathematically formulated using some priors with compressed videos, our CSDR-to-HDR video reconstruction is conceptually divided into four synergistic parts: reducing compression artifacts, recovering missing details, adjusting imaging parameters, and reducing image noise. We approximate this process by a compact KIB. To capture richer details, we learn HDR representations with a set of KIBs connected in parallel and fused with the IFM. Extensive evaluations show that our KPNet achieves superior performance over the state-of-the-art methods.

Abstract:
Clustering aims to partition a set of objects into different groups through the internal nature of these objects. Most existing methods face intractable hyper-parameter problems triggered by various regularization terms, which degrades the applicability of models. Moreover, traditional graph clustering methods always incur expensive time overhead. To this end, we propose a Fast Clustering model with Anchor Guidance (FCAG). The proposed model not only avoids trivial solutions without extra regularization terms, but is also suitable for large-scale problems by utilizing the prior knowledge of the bipartite graph. Moreover, the proposed FCAG can cope with out-of-sample extension problems. Three optimization methods, namely the Projected Gradient Descent (PGD) method, the Iteratively Re-Weighted (IRW) algorithm and the Coordinate Descent (CD) algorithm, are proposed to solve FCAG. Extensive experiments verify the superiority of the optimization method CD. Besides, compared with other bipartite graph models, FCAG achieves better performance at a lower time cost. In addition, we prove through theory and experiment that when the learning rate of PGD tends to infinity, PGD is equivalent to IRW.

Abstract:
Converging evidence indicates that deep neural network models that are trained on large datasets are biased toward color and texture information. Humans, on the other hand, can easily recognize objects and scenes from images as well as from bounding contours. Mid-level vision is characterized by the recombination and organization of simple primary features into more complex ones by a set of so-called Gestalt grouping rules. While described qualitatively in the human literature, a computational implementation of these perceptual grouping rules is so far missing. In this article, we contribute a novel set of algorithms for the detection of contour-based cues in complex scenes. We use the medial axis transform (MAT) to locally score contours according to these grouping rules. We demonstrate the benefit of these cues for scene categorization in two ways: (i) Both human observers and CNN models categorize scenes most accurately when perceptual grouping information is emphasized. (ii) Weighting the contours with these measures boosts performance of a CNN model significantly compared to the use of unweighted contours. Our work suggests that, even though these measures are computed directly from contours in the image, current CNN models do not appear to extract or utilize these grouping cues.

Abstract:
This paper introduces a simple yet powerful channel augmentation for visible-infrared re-identification. Most existing augmentation operations designed for single-modality visible images do not fully consider the imagery properties in visible to infrared matching. Our basic idea is to homogeneously generate color-irrelevant images by randomly exchanging the color channels. It can be seamlessly integrated into existing augmentation operations, consistently improving the robustness against color variations. For cross-modality metric learning, we design an enhanced channel-mixed learning strategy to simultaneously handle the intra- and cross-modality variations with squared difference for stronger discriminability. Besides, a weak-and-strong augmentation joint learning strategy is further developed to explicitly optimize the outputs of augmented images, which mutually integrates the channel augmented images (strong) and the general augmentation operations (weak) with consistency regularization. Furthermore, by conducting the label association between the channel augmented images and infrared modalities with modality-specific clustering, a simple yet effective unsupervised learning baseline is designed, which significantly outperforms existing unsupervised single-modality solutions. Extensive experiments with insightful analysis on two visible-infrared recognition tasks show that the proposed strategies consistently improve the accuracy. Without auxiliary information, the Rank-1/mAP achieves 71.48%/68.15% on the large-scale SYSU-MM01 dataset.
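The channel augmentation idea can be sketched in a few lines of NumPy. The abstract only states that color channels are randomly exchanged to produce color-irrelevant images; the specific reading below (copying one randomly chosen channel into all three) and the function name `random_channel_exchange` are assumptions of this sketch, not a reproduction of the authors' code.

```python
import numpy as np

def random_channel_exchange(image, rng):
    """Return an H x W x 3 image whose three channels all equal one randomly
    chosen channel of the input (one possible reading of "exchanging" channels)."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 image")
    k = int(rng.integers(0, 3))                      # pick R, G, or B
    return np.repeat(image[:, :, k:k + 1], 3, axis=2)

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # toy visible image
aug = random_channel_exchange(rgb, rng)

print(aug.shape, aug.dtype)
```

Because the output no longer carries color differences between channels, it can be mixed with the usual augmentation operations when training on visible images that must be matched against single-channel infrared images.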

Abstract:
Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications. Substantial progress has been made recently in motion data collection technologies and generation methods, laying the foundation for increasing interest in human motion generation. Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts. While significant advancements have been made in recent years, the task continues to pose challenges due to the intricate nature of human motion and its implicit relationship with conditional signals. In this survey, we present a comprehensive literature review of human motion generation, which, to the best of our knowledge, is the first of its kind in this field. We begin by introducing the background of human motion and generative models, followed by an examination of representative methods for three mainstream sub-tasks: text-conditioned, audio-conditioned, and scene-conditioned human motion generation. Additionally, we provide an overview of common datasets and evaluation metrics. Lastly, we discuss open problems and outline potential future research directions. We hope that this survey could provide the community with a comprehensive glimpse of this rapidly evolving field and inspire novel ideas that address the outstanding challenges.

Abstract:
Image restoration aims to reconstruct the latent sharp image from its corrupted counterpart. Besides dealing with this long-standing task in the spatial domain, a few approaches seek solutions in the frequency domain by considering the large discrepancy between spectra of sharp/degraded image pairs. However, these algorithms commonly utilize transformation tools, e.g., wavelet transform, to split features into several frequency parts, which is not flexible enough to select the most informative frequency component to recover. In this paper, we exploit a multi-branch and content-aware module to decompose features into separate frequency subbands dynamically and locally, and then accentuate the useful ones via channel-wise attention weights. In addition, to handle large-scale degradation blurs, we propose an extremely simple decoupling and modulation module to enlarge the receptive field via global and window-based average pooling. Furthermore, we merge the paradigm of multi-stage networks into a single U-shaped network to pursue multi-scale receptive fields and improve efficiency. Finally, integrating the above designs into a convolutional backbone, the proposed Frequency Selection Network (FSNet) performs favorably against state-of-the-art algorithms on 20 different benchmark datasets for 6 representative image restoration tasks, including single-image defocus deblurring, image dehazing, image motion deblurring, image desnowing, image deraining, and image denoising.

Abstract:
According to the Complementary Learning Systems (CLS) theory (McClelland et al. 1995) in neuroscience, humans do effective continual learning through two complementary systems: a fast learning system centered on the hippocampus for rapid learning of the specifics of individual experiences; and a slow learning system located in the neocortex for the gradual acquisition of structured knowledge about the environment. Motivated by this theory, we propose DualNets (for Dual Networks), a general continual learning framework comprising a fast learning system for supervised learning of pattern-separated representation from specific tasks and a slow learning system for representation learning of task-agnostic general representation via Self-Supervised Learning (SSL). DualNets can seamlessly incorporate both representation types into a holistic framework to facilitate better continual learning in deep neural networks. Via extensive experiments, we demonstrate the promising results of DualNets on a wide range of continual learning protocols, ranging from the standard offline, task-aware setting to the challenging online, task-free scenario. Notably, on the CTrL (Veniat et al. 2020) benchmark that has unrelated tasks with vastly different visual images, DualNets can achieve competitive performance with existing state-of-the-art dynamic architecture strategies (Ostapenko et al. 2021). Furthermore, we conduct comprehensive ablation studies to validate DualNets' efficacy, robustness, and scalability.

Abstract:
Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future works.

Abstract:
Graph Attention (GA), which aims to learn the attention coefficients for graph edges, has achieved impressive performance in GNNs on many graph learning tasks. However, existing GAs are usually learned based on edges’ (or connected nodes’) features, which fail to fully capture the rich structural information of edges. Some recent research attempts to incorporate the structural information into GA learning, but how to fully exploit it in GA learning is still a challenging problem. To address this challenge, in this work, we propose to leverage a new Replicator Dynamics model for graph attention learning, termed Graph Replicator Attention (GRA). The core of GRA is our derivation of replicator dynamics based sparse attention diffusion, which can explicitly learn context-aware and sparsity-preserving graph attentions in a simple self-supervised way. Moreover, GRA can be theoretically explained from an energy minimization model. This provides further theoretical justification for the proposed GRA method. Experiments on several graph learning tasks demonstrate the effectiveness and advantages of the proposed GRA method on ten benchmark datasets.

Abstract:
Self-supervised Learning (SSL), including the mainstream contrastive learning, has achieved great success in learning visual representations without data annotations. However, most methods mainly focus on the instance level information (i.e., the different augmented images of the same instance should have the same feature or cluster into the same class), and pay little attention to the relationships between different instances. In this paper, we introduce a novel SSL paradigm, which we term the relational self-supervised learning (ReSSL) framework, that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric, which is then utilized to match the feature embeddings of different augmentations. To boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage a momentum strategy for practical efficiency. The designed asymmetric predictor head and an InfoNCE warm-up strategy enhance the robustness to hyper-parameters and benefit the resulting performance. Experimental results show that our proposed ReSSL substantially outperforms the state-of-the-art methods across different network architectures, including various lightweight networks (e.g., EfficientNet and MobileNet).
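The relation-matching idea can be sketched as a cross-entropy between two similarity distributions over a set of other instances. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the memory `bank` of other-instance embeddings, the function names, and the temperature values `tau_query` and `tau_target` are all hypothetical, and the momentum encoder, predictor head, and InfoNCE warm-up mentioned in the abstract are omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_loss(z_query, z_target, bank, tau_query=0.1, tau_target=0.04):
    """Cross-entropy between the target relation (sharpened by a lower
    temperature) and the query relation, both measured against `bank`."""
    zq, zt, bk = l2_normalize(z_query), l2_normalize(z_target), l2_normalize(bank)
    p_target = softmax(zt @ bk.T / tau_target)             # sharpened relation
    log_p_query = np.log(softmax(zq @ bk.T / tau_query) + 1e-12)
    return float(-(p_target * log_p_query).sum(axis=1).mean())

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))                  # embeddings of one augmentation
z2 = z1 + 0.1 * rng.normal(size=(8, 16))       # embeddings of another augmentation
bank = rng.normal(size=(32, 16))               # embeddings of other instances

loss = relation_loss(z1, z2, bank)
print("relation loss:", loss)
```

Using a lower temperature on the target side makes its similarity distribution more peaked, which is one simple way to realize the "sharpened distribution" described above.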

Abstract:
This paper provides developments in statistical shape analysis of shape graphs, and demonstrates them using such complex objects as Retinal Blood Vessel (RBV) networks and neurons. The shape graphs are represented by sets of nodes and edges (articulated curves) connecting some nodes. The goals are to utilize nodes (locations, connectivity) and edges (edge weights and shapes) to: (1) characterize shapes, (2) quantify shape differences, and (3) model statistical variability. We develop a mathematical representation, elastic Riemannian metrics, and associated tools for shape graphs. Specifically, we derive tools for shape graph registration, geodesics, statistical summaries, shape modeling, and shape synthesis. Geodesics are convenient for visualizing optimal deformations, and PCA helps in dimension reduction and statistical modeling. One key challenge lies in comparing shape graphs with vastly different complexities (in number of nodes and edges). This paper introduces a novel multi-scale representation to handle this challenge. Using the notions of (1) “effective resistance” to cluster nodes and (2) elastic shape averaging of edge curves, it reduces graph complexity while retaining overall structures. This allows shape comparisons by bringing graphs to similar complexities. We demonstrate these ideas on 2D RBV networks from the STARE and DRIVE databases and 3D neurons from the NeuroMorpho database.

Abstract:
Image restoration aims to reconstruct a high-quality image from its corrupted version, playing essential roles in many scenarios. Recent years have witnessed a paradigm shift in image restoration from convolutional neural networks (CNNs) to Transformer-based models due to their powerful ability to model long-range pixel interactions. In this paper, we explore the potential of CNNs for image restoration and show that the proposed simple convolutional network architecture, termed ConvIR, can perform on par with or better than the Transformer counterparts. By re-examining the characteristics of advanced image restoration algorithms, we discover several key factors leading to the performance improvement of restoration models. This motivates us to develop a novel network for image restoration based on cheap convolution operators. Comprehensive experiments demonstrate that our ConvIR delivers state-of-the-art performance with low computation complexity on 20 benchmark datasets for five representative image restoration tasks, including image dehazing, image motion/defocus deblurring, image deraining, and image desnowing.

Abstract:
A desirable objective in self-supervised learning (SSL) is to avoid feature collapse. Whitening loss guarantees collapse avoidance by minimizing the distance between embeddings of positive pairs under the condition that the embeddings from different views are whitened. In this paper, we propose a framework with an informative indicator to analyze whitening loss, which provides a clue to demystify several interesting phenomena and a pivoting point connecting to other SSL methods. We show that batch whitening (BW) based methods do not impose whitening constraints on the embedding but only require the embedding to be full-rank. This full-rank constraint is also sufficient to avoid dimensional collapse. We further demonstrate that the stable rank of the embedding is invariant during training by gradient descent, given the assumption that the embedding is updated with an infinitely small learning rate. Based on our analysis, we propose channel whitening with random group partition (CW-RGP), which exploits the advantages of BW-based methods in preventing collapse and avoids their disadvantage of requiring a large batch size. Experimental results on ImageNet classification and COCO object detection reveal that the proposed CW-RGP possesses a promising potential for learning good representations.
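The two quantities discussed above, a whitened embedding and its stable rank, can be made concrete with a short NumPy sketch. It uses standard textbook definitions rather than anything taken from the paper: ZCA whitening of a batch of embeddings, and the stable rank defined as the squared Frobenius norm divided by the squared spectral norm. The function names and the toy data are assumptions, and the paper's CW-RGP scheme (channel whitening with random group partition) is not reproduced here.

```python
import numpy as np

def zca_whiten(z, eps=1e-5):
    """Batch whitening: center the N x D embeddings and decorrelate them so
    that their covariance is (approximately) the identity."""
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / z.shape[0]
    w, v = np.linalg.eigh(cov)                       # eigen-decomposition of covariance
    whitening = v @ np.diag(1.0 / np.sqrt(w + eps)) @ v.T
    return z @ whitening

def stable_rank(z):
    """Stable rank: ||Z||_F^2 / ||Z||_2^2 (standard definition)."""
    s = np.linalg.svd(z, compute_uv=False)           # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

rng = np.random.default_rng(0)
raw = rng.normal(size=(256, 8)) * np.arange(1, 9)    # dimensions with unequal scales
raw[:, 1] += raw[:, 0]                               # add a correlation between two dims
white = zca_whiten(raw)
cov_white = white.T @ white / white.shape[0]

print("stable rank raw     :", stable_rank(raw))
print("stable rank whitened:", stable_rank(white))
```

A whitened embedding has equal singular values, so its stable rank equals its dimension; a full-rank but unwhitened embedding can have a much lower stable rank, which is why the two constraints are worth distinguishing.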

Abstract:
The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction. End-to-end systems, in comparison to modular pipelines, benefit from joint feature optimization for perception and planning. This field has flourished due to the availability of large-scale datasets, closed-loop evaluation, and the increasing need for autonomous driving algorithms to perform effectively in challenging scenarios. In this survey, we provide a comprehensive analysis of more than 270 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving. We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others. Additionally, we discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework.

Abstract:
The Self-Attention Mechanism (SAM) excels at distilling important information from the interior of data to improve the computational efficiency of models. Nevertheless, many Quantum Machine Learning (QML) models lack the ability to distinguish the intrinsic connections of information like SAM, which limits their effectiveness on massive high-dimensional quantum data. To tackle the above issue, a Quantum Kernel Self-Attention Mechanism (QKSAM) is introduced to combine the data representation merit of Quantum Kernel Methods (QKM) with the efficient information extraction capability of SAM. Further, a Quantum Kernel Self-Attention Network (QKSAN) framework is proposed based on QKSAM, which ingeniously incorporates the Deferred Measurement Principle (DMP) and conditional measurement techniques to release half of quantum resources by mid-circuit measurement, thereby bolstering both feasibility and adaptability. Simultaneously, the Quantum Kernel Self-Attention Score (QKSAS) with an exponentially large characterization space is spawned to accommodate more information and determine the measurement conditions. Eventually, four QKSAN sub-models are deployed on PennyLane and IBM Qiskit platforms to perform binary classification on MNIST and Fashion MNIST, where the QKSAS tests and correlation assessments between noise immunity and learning ability are executed on the best-performing sub-model. The paramount experimental finding is that the QKSAN subclasses possess the potential learning advantage of acquiring impressive accuracies exceeding 98.05% with far fewer parameters than classical machine learning models. Predictably, QKSAN lays the foundation for future quantum computers to perform machine learning on massive amounts of data while driving advances in areas such as quantum computer vision.

Abstract:
Unsupervised domain adaptation (UDA) intends to transfer knowledge from a labeled source domain to an unlabeled target domain. Many current methods focus on learning feature representations that are both discriminative for classification and invariant across domains by simultaneously optimizing domain alignment and classification tasks. However, these methods often overlook a crucial challenge: the inherent conflict between these two tasks during gradient-based optimization. In this paper, we delve into this issue and introduce two effective solutions known as Gradient Harmonization, including GH and GH++, to mitigate the conflict between domain alignment and classification tasks. GH operates by altering the gradient angle between different tasks from an obtuse angle to an acute angle, thus resolving the conflict and trading off the two tasks in a coordinated manner. Yet, this would cause both tasks to deviate from their original optimization directions. We thus further propose an improved version, GH++, which adjusts the gradient angle between tasks from an obtuse angle to a right angle (i.e., the two gradients become orthogonal). This not only eliminates the conflict but also minimizes deviation from the original gradient directions. Finally, for optimization convenience and efficiency, we evolve the gradient harmonization strategies into a dynamically weighted loss function using an integral operator on the harmonized gradient. Notably, GH/GH++ are orthogonal to UDA and can be seamlessly integrated into most existing UDA models. Theoretical insights and experimental analyses demonstrate that the proposed approaches not only enhance popular UDA baselines but also improve recent state-of-the-art models.
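
The abstract above does not give the GH or GH++ update rules, so the following is only a generic gradient-projection sketch of the underlying idea: when two task gradients form an obtuse angle, the conflicting component is removed. The function names and the projection rule are assumptions for illustration, not the authors' formulas.

```python
def dot(u, v):
    """Inner product of two equally sized gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(g, other):
    """Remove from g its component along `other` (assumed non-zero)."""
    scale = dot(g, other) / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]


def resolve_conflict(g_align, g_cls):
    """Hypothetical helper: if the alignment and classification gradients
    conflict (negative inner product), project each one onto the plane
    orthogonal to the other's original direction; otherwise return them
    unchanged."""
    if dot(g_align, g_cls) >= 0:
        return list(g_align), list(g_cls)
    return project_out(g_align, g_cls), project_out(g_cls, g_align)


# Toy gradients at an obtuse angle (inner product is -1).
g_align = [1.0, 0.0]
g_cls = [-1.0, 1.0]
new_align, new_cls = resolve_conflict(g_align, g_cls)
```

After the projection, each adjusted gradient is orthogonal to the other task's original gradient, and the two adjusted gradients no longer point against each other.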

Abstract:
This paper presents a 3D registration method with maximal cliques (MAC) for 3D point cloud registration (PCR). The key insight is to loosen the previous maximum clique constraint and mine more local consensus information in a graph for accurate pose hypotheses generation: 1) A compatibility graph is constructed to render the affinity relationship between initial correspondences. 2) We search for maximal cliques in the graph, each representing a consensus set. 3) Transformation hypotheses are computed for the selected cliques by the SVD algorithm and the best hypothesis is used to perform registration. In addition, we present a variant of MAC for the case where an overlap prior is given, called MAC-OP. The overlap prior further enhances MAC in several technical aspects, such as graph construction with re-weighted nodes, hypotheses generation from cliques with additional constraints, and hypothesis evaluation with overlap-aware weights. Extensive experiments demonstrate that both MAC and MAC-OP effectively increase registration recall, outperform various state-of-the-art methods, and boost the performance of deep-learned methods. For instance, MAC combined with GeoTransformer achieves a state-of-the-art registration recall of 95.7% / 78.9% on 3DMatch / 3DLoMatch. We perform synthetic experiments on 3DMatch-LIR / 3DLoMatch-LIR, a dataset with extremely low inlier ratios for 3D registration in ultra-challenging cases.
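
Steps 1) and 2) above can be sketched roughly as follows. The abstract does not say how compatibility is measured, so the distance-preservation test and its tolerance `tol` are assumptions, and the clique search is a plain Bron-Kerbosch enumeration rather than the authors' implementation. Step 3), SVD-based pose estimation, is omitted.

```python
import math
from itertools import combinations


def compatibility_graph(src, dst, tol=0.1):
    """Link correspondences i and j when the distance between their source
    points matches the distance between their target points (assumed test:
    a rigid motion preserves pairwise distances)."""
    adj = {i: set() for i in range(len(src))}
    for i, j in combinations(range(len(src)), 2):
        if abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j])) < tol:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Three correspondences related by a pure translation, plus one outlier.
src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
dst = [(1, 1, 1), (2, 1, 1), (1, 2, 1), (5, 5, 5)]
cliques = maximal_cliques(compatibility_graph(src, dst))
```

Each clique is a candidate consensus set. In this toy case the inliers form one clique and the outlier is left as a singleton, which a real pipeline would presumably discard before estimating a transformation.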

Abstract:
Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, thereby hindering them from being robust and discriminative simultaneously. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used to predict target locations without additional correlation steps. Thus, we reformulate the two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, integrating temporal modeling, and equipping it with dedicated prediction heads. As a result, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC on LaSOT, TrackingNet, and GOT-10K, respectively. Notably, SuperSBT greatly raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.

Abstract:
Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or break down safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversaries remains a challenging open problem. The difficulty is how to tackle two intertwined aspects in the worst cases: feasibility and optimality. The optimality is only valid inside a feasible region (i.e., robust invariant set), while the identification of maximal feasible region must rely on how to learn the optimal policy. To address this issue, we propose a systematic framework to unify safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside the maximal robust invariant set, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy which maximizes the twofold objective in the worst cases, and the optimal safety policy which stays as far away from the safety boundary as possible. The convergence of safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of task policy depends on the transformation of safety constraints into state-dependent action spaces. By adding two adversarial networks (one is for safety guarantee and the other is for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). The evaluations with safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.

Abstract:
Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. To date, the dominant methods are CNN-based, leaving plenty of room for improvement. In this work, we propose TransFlow, a transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent frames to effectively capture global dependencies; Second, it recovers more compromised information (e.g., occlusion and motion blur) in flow estimation through long-range temporal association in dynamic scenes; Third, it introduces a concise self-learning paradigm, eliminating the need for complex and laborious multi-stage pre-training procedures. The versatility and superiority of TransFlow extend seamlessly to 3D scene motion, yielding competitive outcomes in 3D scene flow estimation. Our approach attains state-of-the-art results on benchmark datasets such as Sintel and KITTI-15, while also exhibiting exceptional performance on downstream tasks, including video object detection using the ImageNet VID dataset, video frame interpolation using the GoPro dataset, and video stabilization using the DeepStab dataset. We believe that the effectiveness of TransFlow positions it as a flexible baseline for both optical flow and scene flow estimation, offering promising avenues for future research and development.

Abstract:
We formulate an optimization problem to estimate probability densities in the context of multidimensional problems that are sampled with uneven probability. It considers detector sensitivity as a heterogeneous density and takes advantage of the computational speed and flexible boundary conditions offered by splines on a grid. We choose to regularize the Hessian of the spline via the nuclear norm to promote sparsity. As a result, the method is spatially adaptive and stable against the choice of the regularization parameter, which plays the role of the bandwidth. We test our computational pipeline on standard densities and provide software. We also present a new approach to PET rebinning as an application of our framework.

Abstract:
Out-of-distribution (OoD) generalization is critical but challenging in real applications such as unmanned aerial vehicle (UAV) flight control. Previous machine learning-based control has shown promise in dealing with complex real-world environments but suffers huge performance degradation when facing OoD scenarios, posing risks to the stability and safety of UAVs. In this paper, we find that random noise introduced during training surprisingly yields theoretically guaranteed performance via a proposed functional optimization framework. More encouragingly, this framework does not involve common Lyapunov assumptions used in this field, making it more widely applicable. With this framework, an upper bound on the control error is derived. We also prove that the introduced random noise can lead to lower OoD control errors. Based on our theoretical analysis, we further propose OoD-Control to generalize control in unseen environments. Numerical experiments demonstrate the superiority of the proposed algorithm, surpassing the previous state of the art by 65% under challenging unseen environments. We further extend to outdoor real-world experiments and find that the control error is reduced by approximately 50%.

Abstract:
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Extensive experiments demonstrate that the resulting model, DeepNet, has superior performance across various benchmarks, including machine translation, language modeling (i.e., BERT, GPT) and vision pre-training (i.e., BEiT). Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
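
The abstract does not spell out the DeepNorm formula. A minimal sketch, assuming the modified residual takes the form LayerNorm(alpha * x + sublayer(x)) with a constant alpha that up-weights the skip path, could look like this. The function names, the omission of learnable gain and bias, and the choice of alpha are all illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over one feature vector (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer, alpha):
    """Assumed DeepNorm-style residual: scale the skip connection by alpha
    before adding the sublayer output, then normalize (Post-LN ordering)."""
    fx = sublayer(x)
    return layer_norm([alpha * a + b for a, b in zip(x, fx)])


# Toy sublayer standing in for attention or a feed-forward block.
def toy_sublayer(x):
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_ln = deepnorm_residual(x, toy_sublayer, alpha=1.0)  # ordinary Post-LN
scaled = deepnorm_residual(x, toy_sublayer, alpha=3.0)   # skip path up-weighted
```

With `alpha=1.0` the function reduces to a standard Post-LN residual block, which is the design point of this sketch: only the weighting of the skip path changes.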

Abstract:
Graph-based multi-view clustering encodes multi-view data into sample affinities to find consensus representation, effectively overcoming heterogeneity across different views. However, traditional affinity measures tend to collapse as the feature dimension expands, posing challenges in estimating a unified alignment that reveals both cross-view and inner relationships. To tackle this challenge, we propose to achieve multi-view uniform clustering via consensus representation co-regularization. First, the sample affinities are encoded by both the popular dyadic affinity and recent high-order affinities to comprehensively characterize spatial distributions of high-dimension, low-sample-size (HDLSS) data. Second, a fused consensus representation is learned through aligning the multi-view low-dimensional representation by co-regularization. The learning of the fused representation is modeled by a high-order eigenvalue problem within manifold space to preserve the intrinsic connections and complementary correlations of original data. A numerical scheme via manifold minimization is designed to solve the high-order eigenvalue problem efficiently. Experiments on eight HDLSS datasets demonstrate the effectiveness of our proposed method in comparison with thirteen recent benchmark methods.

Abstract:
Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for action recognition. This detection paradigm requires multi-stage training and inference, and the feature sampling is only constrained inside the box, failing to effectively leverage richer context information outside. Recently, several query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain the state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.

Abstract:
Insufficient lighting poses challenges to both human and machine visual analytics. While existing low-light enhancement methods prioritize human visual perception, they often neglect machine vision and high-level semantics. In this paper, we make pioneering efforts to build an illumination enhancement model for high-level vision. Drawing inspiration from camera response functions, our model could enhance images from the machine vision perspective despite being lightweight in architecture and simple in formulation. We also introduce two approaches that leverage knowledge from base enhancement curves and self-supervised pretext tasks to train for different downstream normal-to-low-light adaptation scenarios. Our proposed framework overcomes the limitations of existing algorithms without requiring access to labeled data in low-light conditions. It facilitates more effective illumination restoration and feature alignment, significantly improving the performance of downstream tasks in a plug-and-play manner. This research advances the field of low-light machine analytics and broadly applies to various high-level vision tasks, including classification, face detection, optical flow estimation, and video action recognition.

Abstract:
Texture synthesis is a fundamental problem in computer graphics that would benefit various applications. Existing methods are effective in handling 2D image textures. In contrast, many real-world textures contain meso-structure in the 3D geometry space, such as grass, leaves, and fabrics, which cannot be effectively modeled using only 2D image textures. We propose a novel texture synthesis method with Neural Radiance Fields (NeRF) to capture and synthesize textures from given multi-view images. In the proposed NeRF texture representation, a scene with fine geometric details is disentangled into the meso-structure textures and the underlying base shape. This allows textures with meso-structure to be effectively learned as latent features situated on the base shape, which are fed into a NeRF decoder trained simultaneously to represent the rich view-dependent appearance. Using this implicit representation, we can synthesize NeRF-based textures through patch matching of latent features. However, inconsistencies between the metrics of the reconstructed content space and the latent feature space may compromise the synthesis quality. To enhance matching performance, we further regularize the distribution of latent features by incorporating a clustering constraint. In addition to generating NeRF textures over a planar domain, our method can also synthesize NeRF textures over curved surfaces, which are practically useful. Experimental results and evaluations demonstrate the effectiveness of our approach.

Abstract:
Although modern generative models achieve excellent quality in a variety of tasks, they often lack the essential ability to generate examples with requested properties, such as the age of the person in the photo or the weight of the generated molecule. To overcome these limitations we propose PluGeN (Plugin Generative Network), a simple yet effective generative technique that can be used as a plugin for pre-trained generative models. The idea behind our approach is to transform the entangled latent representation using a flow-based module into a multi-dimensional space where the values of each attribute are modeled as an independent one-dimensional distribution. In consequence, PluGeN can generate new samples with desired attributes as well as manipulate labeled attributes of existing examples. Due to the disentangling of the latent representation, we are even able to generate samples with rare or unseen combinations of attributes in the dataset, such as a young person with gray hair, men with make-up, or women with beards. In contrast to competitive approaches, PluGeN can be trained on partially labeled data. We combined PluGeN with GAN and VAE models and applied it to conditional generation and manipulation of images, chemical molecule modeling and 3D point clouds generation.

Abstract:
In this article, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs). Unlike other mixtures of experts with gating models linear in the input, our model employs gating functions built with Gaussian processes (GPs). These processes are based on random features that are non-linear functions of the inputs. Furthermore, the experts in our model are also constructed with GPs. The optimization of the GPHMEs is performed by variational inference. The proposed GPHMEs have several advantages. They outperform tree-based HME benchmarks that partition the data in the input space, and they achieve good performance with reduced complexity. Another advantage is the interpretability they provide for deep GPs, and more generally, for deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets, even with quite modest sizes.

Affiliations: School of Automation Science and Engineering, South China University of Technology, Guangzhou, China; State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; School of Computer Science and Engineering, Nanjing University of Science and Technology, Jiangsu, China; Department of Computer Science, Tulane University, New Orleans, LA, USA

Abstract:
Text-to-image generative models can produce diverse high-quality images of concepts from a text prompt, and have demonstrated excellent ability in image generation, image translation, etc. In this work, we study the problem of synthesizing instantiations of a user's own concepts in a never-ending manner, i.e., create your world, where the new concepts from the user are quickly learned with a few examples. To achieve this goal, we propose a Lifelong text-to-image Diffusion Model (L²DM), which intends to overcome knowledge “catastrophic forgetting” for the past encountered concepts, and semantic “catastrophic neglecting” for one or more concepts in the text prompt. With respect to knowledge “catastrophic forgetting”, our L²DM framework devises a task-aware memory enhancement module and an elastic-concept distillation module, which respectively safeguard the knowledge of prior concepts and of each past personalized concept. When generating images with a user text prompt, the solution to semantic “catastrophic neglecting” is that a concept attention artist module can alleviate the semantic neglecting from the concept aspect, and an orthogonal attention module can reduce the semantic binding from the attribute aspect. In the end, our model can generate more faithful images across a range of continual text prompts in terms of both qualitative and quantitative metrics, when compared with the related state-of-the-art models.

Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Big Geospatial Data Management, Technical University of Munich, Munich, Germany; Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany; School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia; Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, Escuela Politécnica, University of Extremadura, Cáceres, Spain; Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; Faculty of Electrical and Computer Engineering, University of Iceland, Reykjavik, Iceland; Inria, CNRS, Grenoble INP, LJK, Univ. Grenoble Alpes, Grenoble, France

Abstract:
The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which offers valuable information for scene understanding, especially in remote sensing (RS) applications. To fill this gap, we created for the first time a universal RS foundation model, named SpectralGPT, which is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT). Compared to existing foundation models, SpectralGPT 1) accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS Big Data; 2) leverages 3D token generation for spatial-spectral coupling; 3) captures spectrally sequential patterns via multi-target reconstruction; and 4) trains on one million spectral RS images, yielding models with over 600 million parameters. Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS Big Data applications within the field of geoscience across four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection.

Abstract:
This work studies the problem of image semantic segmentation. Current approaches focus mainly on mining “local” context, i.e., dependencies between pixels within individual images, by specifically-designed context aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization objectives (e.g., IoU-like loss). However, they ignore “global” context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by recent advances in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm, dubbed PiCo, for semantic segmentation in the fully supervised learning setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely studied before. Our training algorithm is compatible with modern segmentation solutions without extra overhead during testing. We experimentally show that, with popular segmentation models (i.e., DeepLabV3, HRNet, OCRNet, SegFormer, Segmenter, MaskFormer) and backbones (i.e., MobileNet, ResNet, HRNet, MiT, ViT), our algorithm brings consistent performance improvements across diverse datasets (i.e., Cityscapes, ADE20K, PASCAL-Context, COCO-Stuff, CamVid). We expect that this work will encourage our community to rethink the current de facto training paradigm in semantic segmentation.
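
The core idea stated above, pulling same-class pixel embeddings together and pushing different-class embeddings apart, is commonly expressed as a supervised InfoNCE-style loss. The sketch below is one such formulation under assumed choices (cosine similarity, a temperature of 0.1, averaging over all positive pairs); it is not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pixel_contrast_loss(embeddings, labels, temperature=0.1):
    """Assumed supervised contrastive loss over labeled pixel embeddings:
    for each anchor pixel, every same-class pixel is a positive and every
    other-class pixel is a negative."""
    total, pairs = 0.0, 0
    for i, anchor in enumerate(embeddings):
        pos = [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(len(labels)) if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # anchor has nothing to contrast against
        neg_sum = sum(math.exp(cosine(anchor, embeddings[j]) / temperature) for j in neg)
        for j in pos:
            pos_term = math.exp(cosine(anchor, embeddings[j]) / temperature)
            total += -math.log(pos_term / (pos_term + neg_sum))
            pairs += 1
    return total / pairs if pairs else 0.0


# Four toy pixel embeddings forming two tight clusters.
pixels = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
loss_matching = pixel_contrast_loss(pixels, [0, 0, 1, 1])  # labels follow clusters
loss_mixed = pixel_contrast_loss(pixels, [0, 1, 0, 1])     # labels cut across clusters
```

The loss is small when class labels agree with the embedding clusters and large when they cut across them. In this formulation the pixels may come from different images, which matches the cross-image "global" context the abstract describes.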

Abstract:
Multi-modal homography estimation aims to spatially align the images from different modalities, which is quite challenging since both the image content and resolution are variant across modalities. In this paper, we introduce a novel framework namely CrossHomo to tackle this challenging problem. Our framework is motivated by two interesting findings which demonstrate the mutual benefits between image super-resolution and homography estimation. Based on these findings, we design a flexible multi-level homography estimation network to align the multi-modal images in a coarse-to-fine manner. Each level is composed of a multi-modal image super-resolution (MISR) module to shrink the resolution gap between different modalities, followed by a multi-modal homography estimation (MHE) module to predict the homography matrix. To the best of our knowledge, CrossHomo is the first attempt to address the homography estimation problem with both modality and resolution discrepancy. Extensive experimental results show that our CrossHomo can achieve high registration accuracy on various multi-modal datasets with different resolution gaps. In addition, the network has high efficiency in terms of both model complexity and running speed.

Abstract:
Near-eye gaze estimation is a task that maps the recording of an eye captured by an adjacent camera to the direction of a person's gaze in space. In contrast to frame-based cameras, event cameras are characterized by high sensing rates, low latency, sparse asynchronous data outputs, and high dynamic range, which are well suited for recording the fast eye movements. However, algorithms and system designs that operate on frame-based cameras are not applicable to event-based data, due to the natural differences in the data characteristics. In this work, we study the pattern of near-eye event-based data streams and extract eye features to estimate gaze. First, by analyzing eye parts and movements, and harnessing the polar, spatial, and temporal distribution of the events, we introduce a real-time pipeline to extract pupil features. Second, we present a recurrent neural network with a proposed coordinate-to-angle loss function to accurately estimate gaze from pupil feature sequence. We demonstrated that our system achieves accurate real-time estimation with angular accuracy of 0.46° and update rates of 950 Hz, thus opening up avenues for novel applications. To our knowledge, this is the first system that operates only on event-based data to perform gaze estimation.

Abstract:
In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the close-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective than weakly supervised and zero-shot settings. This paper thoroughly reviews open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by juxtaposing open vocabulary learning with analogous concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Subsequently, we examine several pertinent tasks within the realms of segmentation and detection, encompassing long-tail problems, few-shot, and zero-shot settings. As a foundation for our method survey, we first elucidate the fundamental principles of detection and segmentation in close-set scenarios. Next, we examine various contexts where open vocabulary learning is employed, pinpointing recurring design elements and central themes. This is followed by a comparative analysis of recent detection and segmentation methodologies in commonly used datasets and benchmarks. Our review culminates with a synthesis of insights, challenges, and discourse on prospective research trajectories. To our knowledge, this constitutes the inaugural exhaustive literature review on open vocabulary learning.

Abstract:
Consistent correspondences between point clouds are vital to 3D vision tasks such as registration and recognition. In this paper, we present a mutual voting method (MV) for ranking 3D correspondences. The key insight is to achieve reliable scoring results for correspondences by refining both voters and candidates in a mutual voting scheme. First, a graph is constructed for the initial correspondence set with the pairwise compatibility constraint. Second, nodal clustering coefficients are introduced to preliminarily remove a portion of outliers and speed up the following voting process. Third, we model nodes and edges in the graph as candidates and voters, respectively. Mutual voting is then performed in the graph to score correspondences. Finally, the correspondences are ranked based on the voting scores and top-ranked ones are identified as inliers. Feature matching, 3D point cloud registration, and 3D object recognition experiments on various datasets with different nuisances and modalities verify that MV is robust to heavy outliers under different challenging settings, and can significantly boost 3D point cloud registration and 3D object recognition performance.
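The first two steps of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common rigid-transform test (two correspondences are compatible when they preserve pairwise distance up to a hypothetical threshold `tau`) and uses the standard local clustering coefficient. The abstract specifies neither the compatibility measure nor how the coefficients are thresholded, so both function names and `tau` are illustrative.

```python
import numpy as np

def compatibility_graph(src, dst, tau):
    """Adjacency matrix over correspondences (src[i] <-> dst[i]).

    Assumption: correspondences i and j are compatible when the distance
    between their source points matches the distance between their
    target points to within tau (distance preservation under rigid motion).
    """
    d_src = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    adj = np.abs(d_src - d_dst) < tau
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def clustering_coefficients(adj):
    """Local clustering coefficient of every node of an undirected graph."""
    a = adj.astype(float)
    deg = a.sum(axis=1)
    # diag(A^3) counts closed 3-walks; each triangle at a node is counted twice.
    triangles = np.diag(a @ a @ a) / 2.0
    possible = deg * (deg - 1) / 2.0
    return np.divide(triangles, possible,
                     out=np.zeros_like(triangles), where=possible > 0)
```

In such a sketch, nodes with a low coefficient would be candidates for early removal before any voting takes place.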

Abstract:
Applying machine learning to combinatorial optimization problems has the potential to improve both efficiency and accuracy. However, existing learning-based solvers often struggle with generalization when faced with changes in problem distributions and scales. In this paper, we propose a new approach called ASP: Adaptive Staircase Policy Space Response Oracle to address these generalization issues and learn a universal neural solver. ASP consists of two components: Distributional Exploration, which enhances the solver's ability to handle unknown distributions using Policy Space Response Oracles, and Persistent Scale Adaption, which improves scalability through curriculum learning. We have tested ASP on several challenging COPs, including the traveling salesman problem, the vehicle routing problem, and the prize collecting TSP, as well as the real-world instances from TSPLib and CVRPLib. Our results show that even with the same model size and weak training signal, ASP can help neural solvers explore and adapt to unseen distributions and varying scales, achieving superior performance. In particular, compared with the same neural solvers under a standard training pipeline, ASP produces a remarkable decrease in terms of the optimality gap with 90.9% and 47.43% on generated instances and real-world instances for TSP, and a decrease of 19% and 45.57% for CVRP.
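For readers unfamiliar with the metric, the following minimal sketch shows how an optimality gap is usually computed for a TSP tour. It assumes the conventional definition (relative excess cost over a reference solution); the abstract does not state the exact formula used, and the function names are illustrative.

```python
import math

def tour_length(points, tour):
    """Length of the closed tour that visits `points` in the order `tour`."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def optimality_gap(cost, reference_cost):
    """Relative excess cost over a reference (e.g. optimal) solution."""
    return (cost - reference_cost) / reference_cost
```

For example, on the unit square a solver that returns a self-crossing tour would be compared against the perimeter tour as the reference.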

Abstract:
Effective modeling of human interactions is of utmost importance when forecasting behaviors such as future trajectories. Each individual's motion influences surrounding agents, since everyone obeys unwritten social rules such as collision avoidance or group following. In this paper we model such interactions, which constantly evolve through time, by looking at the problem from an algorithmic point of view, i.e., as a data manipulation task. We present a neural network based on an end-to-end trainable working memory, which acts as an external storage where information about each agent can be continuously written, updated and recalled. We show that our method is capable of learning explainable cause-effect relationships between motions of different agents, obtaining state-of-the-art results on multiple trajectory forecasting datasets.

Abstract:
Factorization machines (FMs) are widely used in recommender systems due to their adaptability and ability to learn from sparse data. However, for the ubiquitous non-interactive features in sparse data, existing FMs can only estimate the parameters corresponding to these features via the inner product of their embeddings. Undeniably, they cannot learn the direct interactions of these features, which limits the model's expressive power. To this end, we first present MixFM, inspired by Mixup, to generate auxiliary training data to boost FMs. Unlike existing augmentation strategies that require labor costs and expertise to collect additional information such as position and fields, these augmented data are generated solely by convex combination of the raw samples, without any expert knowledge. More importantly, if non-interactive features exist separately in the parent samples to be mixed, MixFM will establish their direct interactions. Second, considering that MixFM may generate redundant or even detrimental instances, we further put forward a novel Factorization Machine powered by Saliency-guided Mixup (denoted as SMFM). Guided by the customized saliency, SMFM can generate more informative neighbor data. Through theoretical analysis, we prove that the proposed methods minimize the upper bound of the generalization error, which positively enhances FMs. Finally, extensive experiments on seven datasets confirm that our approaches are superior to baselines. Notably, the results also show that “poisoning” mixed data benefits the FM variants.
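The idea that mixing creates direct interactions can be made concrete with a small sketch. It uses plain Mixup (a convex combination with weight `lam`) and the standard second-order FM term; it is not the MixFM or SMFM training procedure, and the function names and the embedding matrix `V` are illustrative assumptions.

```python
import numpy as np

def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two (feature vector, label) pairs, 0 <= lam <= 1."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix

def fm_pairwise_term(x, V):
    """Second-order FM term sum_{i<j} <v_i, v_j> x_i x_j in O(n k) time.

    V has shape (n_features, k); row i is the embedding of feature i.
    """
    s = x @ V                  # per-factor sum of v_if * x_i
    s2 = (x ** 2) @ (V ** 2)   # per-factor sum of v_if^2 * x_i^2
    return 0.5 * float(np.sum(s ** 2 - s2))
```

If feature 0 appears only in one parent and feature 1 only in the other, the product x_0 * x_1 is zero in both parents but nonzero in the mixed sample, so the inner product of their embeddings receives a direct gradient signal.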

Abstract:
Edge Artificial Intelligence (AI) relies on the integration of Machine Learning (ML) into even the smallest embedded devices, thus enabling local intelligence in real-world applications, e.g. for image or speech processing. Traditional Edge AI frameworks lack important aspects required to keep up with recent and upcoming ML innovations. These aspects include low flexibility concerning the target hardware and limited support for custom hardware accelerator integration. Artificial Intelligence for Embedded Systems Framework (AIfES) has the goal to overcome these challenges faced by traditional edge AI frameworks. In this paper, we give a detailed overview of the architecture of AIfES and the applied design principles. Finally, we compare AIfES with TensorFlow Lite for Microcontrollers (TFLM) on an ARM Cortex-M4-based System-on-Chip (SoC) using fully connected neural networks (FCNNs) and convolutional neural networks (CNNs). AIfES outperforms TFLM in both execution time and memory consumption for the FCNNs. Additionally, using AIfES reduces memory consumption by up to 54% when using CNNs. Furthermore, we show the performance of AIfES during the training of FCNN as well as CNN and demonstrate the feasibility of training a CNN on a resource-constrained device with a memory usage of slightly more than 100 kB of RAM.

Abstract:
Person re-identification (Re-ID) is a fundamental task in visual surveillance. Given a query image of the target person, conventional Re-ID focuses on the pairwise similarities between the candidate images and the query. However, conventional Re-ID does not evaluate the consistency of the retrieval results, i.e., whether the most similar images ranked in each place contain the same person. This is risky in some applications: for example, missing a place that a patient passed through will hinder an epidemiological investigation. In this work, we investigate a more challenging task: consistently and successfully retrieving the target person in all camera views. We define the task as continuous person Re-ID and propose a corresponding evaluation metric termed overall Rank-K accuracy. Different from the conventional Re-ID, any incorrect retrieval under an individual camera view that raises an inconsistency will fail the continuous Re-ID. Consequently, the defective cameras, whose images are hard to associate automatically with the images from other views, strongly degrade the performance of continuous person Re-ID. Since the camera deployment is crucial for continuous tracking across camera views, we rethink person Re-ID from the perspective of camera deployment and assess the quality of a camera network by performing continuous Re-ID. Moreover, we propose to automatically detect the defective cameras that greatly hamper the continuous Re-ID. Because brute-force search is costly when the camera network becomes complicated, we explicitly model the visual relations as well as the spatial relations among cameras and develop a relational deep Q-network to select the properly deployed cameras; the unselected cameras are regarded as the defective cameras. Since most existing datasets do not provide topology information about the camera network, they are unsuitable for investigating the importance of spatial relations on camera selection. Thus, we collect a new dataset including 20 cameras with topology information. Compared with randomly removing cameras, the experimental results show that our method can effectively detect the defective cameras, so that operators can take further action on these cameras in practice (https://www.isee-ai.cn/∼yixing/MCCPD.html).
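One plausible reading of the overall Rank-K criterion is sketched below: a query counts as a success only if the target identity appears among the top K results in every camera view. This is an interpretation of the abstract, not the paper's formal definition, and the data layout (a dict from camera name to a ranked list of identity labels) is a hypothetical choice.

```python
def retrieved_in_all_views(rankings, target_id, k):
    """True only if target_id is within the top-k list of every camera view."""
    return all(target_id in ranked[:k] for ranked in rankings.values())

def overall_rank_k_accuracy(queries, k):
    """Fraction of queries that succeed in all camera views.

    queries: list of (rankings, target_id) pairs, where rankings maps a
    camera name to identity labels sorted by decreasing similarity.
    """
    hits = sum(retrieved_in_all_views(r, t, k) for r, t in queries)
    return hits / len(queries)
```

Under this reading, a single camera that consistently ranks the target poorly caps the achievable accuracy, which is why detecting defective cameras matters.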

Abstract:
World models learn the consequences of actions in vision-based interactive systems. However, in practical scenarios like autonomous driving, noncontrollable dynamics that are independent or sparsely dependent on action signals often exist, making it challenging to learn effective world models. To address this issue, we propose Iso-Dream++, a model-based reinforcement learning approach that has two main contributions. First, we optimize the inverse dynamics to encourage the world model to isolate controllable state transitions from the mixed spatiotemporal variations of the environment. Second, we perform policy optimization based on the decoupled latent imaginations, where we roll out noncontrollable states into the future and adaptively associate them with the current controllable state. This enables long-horizon visuomotor control tasks to benefit from isolating mixed dynamics sources in the wild, such as self-driving cars that can anticipate the movement of other vehicles, thereby avoiding potential risks. On top of our previous work (Pan et al. 2022), we further consider the sparse dependencies between controllable and noncontrollable states, address the training collapse problem of state decoupling, and validate our approach in transfer learning setups. Our empirical study demonstrates that Iso-Dream++ outperforms existing reinforcement learning models significantly on CARLA and DeepMind Control.

Abstract:
Achieving human-level dexterity in robotics remains a critical open problem. Even simple dexterous manipulation tasks pose significant difficulties due to the high number of degrees of freedom and the need for cooperation among heterogeneous agents (e.g., finger joints). While some researchers have utilized reinforcement learning (RL) to control a single hand in manipulating objects, tasks that require coordinated bimanual cooperation are still under-explored due to the scarcity of suitable environments, which can result in difficulties and sub-optimal performance. To address these challenges, we introduce Bi-DexHands, a simulator with two dexterous hands featuring 20 bimanual manipulation tasks and thousands of target objects, designed to match various levels of human motor skills based on cognitive science research. We developed Bi-DexHands in Isaac Gym, enabling highly efficient RL training at over 30,000 frames per second using a single NVIDIA RTX 3090. Based on Bi-DexHands, we present a comprehensive evaluation of popular RL algorithms in different settings, including single-agent/multi-agent RL, offline RL, multi-task RL, and meta RL. Our findings show that on-policy algorithms, such as PPO, can master simple manipulation tasks that correspond to those of 48-month-old babies, such as catching a flying object or opening a bottle. Furthermore, multi-agent RL can improve the ability to perform manipulations that require skilled bimanual cooperation, such as lifting a pot or stacking blocks. Despite achieving success in individual tasks, current RL algorithms struggle to learn multiple manipulation skills in most multi-task and few-shot learning scenarios. This highlights the need for further research and development within the RL community.

Abstract:
Recent graph-based models for multi-intent SLU have obtained promising results through modeling the guidance from the prediction of intents to the decoding of slot filling. However, existing methods (1) only model the unidirectional guidance from intent to slot, while there are bidirectional inter-correlations between intent and slot; (2) adopt homogeneous graphs to model the interactions between the slot semantics nodes and intent label nodes, which limit the performance. In this paper, we propose a novel model termed Co-guiding Net, which implements a two-stage framework achieving the mutual guidances between the two tasks. In the first stage, the initial estimated labels of both tasks are produced, and then they are leveraged in the second stage to model the mutual guidances. Specifically, we propose two heterogeneous graph attention networks working on the proposed two heterogeneous semantics-label graphs, which effectively represent the relations among the semantics nodes and label nodes. Besides, we further propose Co-guiding-SCL Net, which exploits the single-task and dual-task semantics contrastive relations. For the first stage, we propose single-task supervised contrastive learning, and for the second stage, we propose co-guiding supervised contrastive learning, which considers the two tasks’ mutual guidances in the contrastive learning procedure. Experiment results on multi-intent SLU show that our model outperforms existing models by a large margin, obtaining a relative improvement of 21.3% over the previous best model on MixATIS dataset in overall accuracy. We also evaluate our model on the zero-shot cross-lingual scenario and the results show that our model can relatively improve the state-of-the-art model by 33.5% on average in terms of overall accuracy for the total 9 languages.

Abstract:
In this paper, we propose a dynamic 3D object detector named HyperDet3D, which is adaptively adjusted based on the hyper scene-level knowledge on the fly. Existing methods strive for object-level representations of local elements and their relations without scene-level priors, which suffer from ambiguity between similarly-structured objects only based on the understanding of individual points and object candidates. Instead, we design scene-conditioned hypernetworks to simultaneously learn scene-agnostic embeddings to exploit sharable abstracts from various 3D scenes, and scene-specific knowledge which adapts the 3D detector to the given scene at test time. As a result, the lower-level ambiguity in object representations can be addressed by hierarchical context in scene priors. However, since the upstream hypernetwork in HyperDet3D takes raw scenes as input, which contain noise and redundancy, it produces sub-optimal parameters for the 3D detector when constrained only by the downstream detection losses. Based on the fact that the downstream 3D detection task can be factorized into object-level semantic classification and bounding box regression, we further propose HyperFormer3D by correspondingly designing their scene-level prior tasks in upstream hypernetworks, namely Semantic Occurrence and Objectness Localization. To this end, we design a transformer-based hypernetwork that translates the task-oriented scene priors into parameters of the downstream detector, which avoids the noise and redundancy of the raw scenes. Extensive experimental results on the ScanNet, SUN RGB-D and MatterPort3D datasets demonstrate the effectiveness of the proposed methods.

Abstract:
Generating graph-structured data is a challenging problem, which requires learning the underlying distribution of graphs. Various models such as graph VAE, graph GANs, and graph diffusion models have been proposed to generate meaningful and reliable graphs, among which the diffusion models have achieved state-of-the-art performance. In this paper, we argue that running full-rank diffusion SDEs on the whole graph adjacency matrix space hinders diffusion models from learning graph topology generation, and hence significantly deteriorates the quality of generated graph data. To address this limitation, we propose an efficient yet effective Graph Spectral Diffusion Model (GSDM), which is driven by low-rank diffusion SDEs on the graph spectrum space. Our spectral diffusion model is further proven to enjoy a substantially stronger theoretical guarantee than standard diffusion models. Extensive experiments across various datasets demonstrate that our proposed GSDM turns out to be the SOTA model, by exhibiting both significantly higher generation quality and much less computational consumption than the baselines.

Abstract:
In recent years, multiple-choice Visual Question Answering (VQA) has become topical and achieved remarkable progress. However, most pioneer multiple-choice VQA models are heavily driven by statistical correlations in datasets, which cannot perform well on multimodal understanding and suffer from poor generalization. In this paper, we identify two kinds of spurious correlations, i.e., a Vision-Answer bias (VA bias) and a Question-Answer bias (QA bias). To systematically and scientifically study these biases, we construct a new video question answering (videoQA) benchmark NExT-OOD in OOD setting and propose a graph-based cross-sample method for bias reduction. Specifically, the NExT-OOD is designed to quantify models’ generalizability and measure their reasoning ability comprehensively. It contains three sub-datasets including NExT-OOD-VA, NExT-OOD-QA, and NExT-OOD-VQA, which are designed for the VA bias, QA bias, and VA&QA bias, respectively. We evaluate several existing multiple-choice VQA models on our NExT-OOD, and illustrate that their performance degrades significantly compared with the results obtained on the original multiple-choice VQA dataset. Besides, to mitigate the VA bias and QA bias, we explicitly consider the cross-sample information and design a contrastive graph matching loss in our approach, which provides adequate debiasing guidance from the perspective of whole dataset, and encourages the model to focus on multimodal contents instead of spurious statistical regularities. Extensive experimental results illustrate that our method significantly outperforms other bias reduction strategies, demonstrating the effectiveness and generalizability of the proposed approach.

Abstract:
Estimation of depth in two-dimensional images is among the challenging topics in Computer Vision. This is a well-studied but also an ill-posed problem, which has long been the focus of intense research. This paper is an in-depth review of the topic, presenting two aspects, one that considers the mechanisms of human depth perception, and another that includes the various Deep Learning approaches. The methods are presented in a compact and structured way that outlines the topic and categorizes the approaches according to the line of research followed in the recent decade. Although there has been significant advancement in the topic, it has come without any connection to human depth perception and the potential benefits of that field.

Abstract:
Few-shot learning (FSL) aims to generate a classifier using limited labeled examples. Many existing works take the meta-learning approach, constructing a few-shot learner (a meta-model) that can learn from few-shot examples to generate a classifier. Typically, the few-shot learner is constructed or meta-trained by sampling multiple few-shot tasks in turn and optimizing the few-shot learner's performance in generating classifiers for those tasks. The performance is measured by how well the resulting classifiers classify the test (i.e., query) examples of those tasks. In this paper, we point out two potential weaknesses of this approach. First, the sampled query examples may not provide sufficient supervision for meta-training the few-shot learner. Second, the effectiveness of meta-learning diminishes sharply with the increasing number of shots (i.e., the number of training examples per class). To resolve these issues, we propose a novel meta-training objective for the few-shot learner, which is to encourage the few-shot learner to generate classifiers that perform like strong classifiers. Concretely, we associate each sampled few-shot task with a strong classifier, which is trained with ample labeled examples. The strong classifiers can be seen as the target classifiers that we hope the few-shot learner will generate given few-shot examples, and we use the strong classifiers to supervise the few-shot learner. We present an efficient way to construct the strong classifier, making our proposed objective an easy plug-and-play term for existing meta-learning based FSL methods. We validate our approach, LastShot (Learning with A Strong Teacher for few-SHOT learning), in combination with many representative meta-learning methods. On several benchmark datasets including miniImageNet and tieredImageNet, our approach leads to a notable improvement across a variety of tasks. More importantly, with our approach, meta-learning based FSL methods can consistently outperform non-meta-learning based methods at different numbers of shots, even in many-shot settings, greatly strengthening their applicability.

Abstract:
Starting from the seminal work of Fully Convolutional Networks (FCN), there has been significant progress on semantic segmentation. However, deep learning models often require large amounts of pixelwise annotations to train accurate and robust models. Given the prohibitively expensive annotation cost of segmentation masks, we introduce a self-training framework in this paper to leverage pseudo labels generated from unlabeled data. In order to handle the data imbalance problem of semantic segmentation, we propose a centroid sampling strategy to uniformly select training samples from every class within each epoch. We also introduce a fast training schedule to alleviate the computational burden. This enables us to explore the usage of large amounts of pseudo labels. Our Centroid Sampling based Self-Training framework (CSST) achieves state-of-the-art results on Cityscapes and CamVid datasets. On PASCAL VOC 2012 test set, our models trained with the original train set even outperform the same models trained on the much bigger augmented train set. This indicates the effectiveness of CSST when there are fewer annotations. We also demonstrate promising few-shot generalization capability from Cityscapes to BDD100K and from Cityscapes to Mapillary datasets.

Abstract:
Multi-Source Domain Adaptation (MSDA) focuses on transferring the knowledge from multiple source domains to the target domain, which is a more practical and challenging problem compared to the conventional single-source domain adaptation. In this problem, it is essential to model multiple source domains and target domain jointly, and an effective domain combination scheme is also highly required. The graphical structure among different domains is useful to tackle these challenges, in which the interdependency among various instances/categories can be effectively modeled. In this work, we propose two types of graphical models, i.e. Conditional Random Field for MSDA (CRF-MSDA) and Markov Random Field for MSDA (MRF-MSDA), for cross-domain joint modeling and learnable domain combination. In a nutshell, given an observation set composed of a query sample and the semantic prototypes (i.e. representative category embeddings) on various domains, the CRF-MSDA model seeks to learn the joint distribution of labels conditioned on the observations. We attain this goal by constructing a relational graph over all observations and conducting local message passing on it. By comparison, MRF-MSDA aims to model the joint distribution of observations over different Markov networks via an energy-based formulation, and it can naturally perform label prediction by summing the joint likelihoods over several specific networks. Compared to the CRF-MSDA counterpart, the MRF-MSDA model is more expressive and possesses lower computational cost. We evaluate these two models on four standard benchmark data sets of MSDA with distinct domain shift and data complexity, and both models achieve superior performance over existing methods on all benchmarks. In addition, the analytical studies illustrate the effect of different model components and provide insights about how the cross-domain joint modeling performs.

Abstract:
This paper starts by revealing a surprising finding: without any learning, a randomly initialized CNN can localize objects surprisingly well. That is, a CNN has an inductive bias to naturally focus on objects, named as Tobias (“The object is at sight”) in this paper. This empirical inductive bias is further theoretically analyzed and empirically verified, and successfully applied to self-supervised learning as well as supervised learning. For self-supervised learning, a CNN is encouraged to learn representations that focus on the foreground object, by transforming every image into various versions with different backgrounds, where the foreground and background separation is guided by Tobias. Experimental results show that the proposed Tobias significantly improves downstream tasks, especially for object detection. This paper also shows that Tobias has consistent improvements on training sets of different sizes, and is more resilient to changes in image augmentations. Furthermore, we apply Tobias to supervised image classification by letting the average pooling layer focus on foreground regions, which achieves improved performance on various benchmarks.

Abstract:
3D morphable model (3DMM) fitting on 2D data is traditionally done via unconstrained optimization with regularization terms to ensure that the result is a plausible face shape and is consistent with a set of 2D landmarks. This paper presents inequality-constrained 3DMM fitting as the first alternative to regularization in optimization-based 3DMM fitting. Inequality constraints on the 3DMM's shape coefficients ensure face-like shapes without modifying the objective function for smoothness, thus allowing for more flexibility to capture person-specific shape details. Moreover, inequality constraints on landmarks increase robustness in a way that does not require per-image tuning. We show that the proposed method stands out with its ability to estimate person-specific face shapes by jointly fitting a 3DMM to multiple frames of a person. Further, when used with a robust objective function, namely gradient correlation, the method can work “in-the-wild” even with a 3DMM constructed from controlled data. Lastly, we show how to use the log-barrier method to efficiently implement the method. To our knowledge, we present the first 3DMM fitting framework that requires no learning yet is accurate, robust, and efficient. The absence of learning enables a generic solution that allows flexibility in the input image size, interchangeable morphable models, and incorporation of camera matrix.
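For reference, the generic log-barrier construction mentioned above can be written as follows. This is the textbook form for a problem with inequality constraints, not the paper's specific objective: the symbols f, g_i, m and the barrier parameter t are placeholders, and how they map to the 3DMM shape coefficients and landmarks is not given in the abstract.

```latex
% Inequality-constrained problem (generic placeholders f and g_i)
\begin{equation}
  \min_{x} \; f(x)
  \quad \text{s.t.} \quad g_i(x) \le 0, \qquad i = 1, \dots, m
\end{equation}

% Log-barrier approximation: the constraints are folded into the objective
% and the problem is solved for an increasing sequence of t > 0
\begin{equation}
  \min_{x} \; f(x) \;-\; \frac{1}{t} \sum_{i=1}^{m} \log\bigl(-g_i(x)\bigr)
\end{equation}
```

The barrier term is finite only for strictly feasible x, so iterates stay inside the constraint set; for convex problems the minimizer of the second problem is at most m/t suboptimal for the first.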

Abstract:
Deep learning technology has developed unprecedentedly in the last decade and has become the primary choice in many application domains. This progress is mainly attributed to a systematic collaboration in which rapidly growing computing resources encourage advanced algorithms to deal with massive data. However, it has gradually become challenging to handle the unlimited growth of data with limited computing power. To this end, diverse approaches are proposed to improve data processing efficiency. Dataset distillation, a dataset reduction method, addresses this problem by synthesizing a small typical dataset from substantial data and has attracted much attention from the deep learning community. Existing dataset distillation methods can be taxonomized into meta-learning and data matching frameworks according to whether they explicitly mimic the performance of target data. Although dataset distillation has shown surprising performance in compressing datasets, there are still several limitations such as distilling high-resolution data or data with complex label spaces. This paper provides a holistic understanding of dataset distillation from multiple aspects, including distillation frameworks and algorithms, factorized dataset distillation, performance comparison, and applications. Finally, we discuss challenges and promising directions to further promote future studies on dataset distillation.

Abstract:
Meta-learning has emerged as an efficient approach for constructing target models based on support sets. For example, the meta-learned embeddings enable the construction of target nearest-neighbor classifiers for specific tasks by pulling instances closer to their same-class neighbors. However, a single instance can be annotated from various latent attributes, making visually similar instances inside or across support sets have different labels and diverse relationships with others. Consequently, a uniform meta-learned strategy for inferring the target model from the support set fails to capture the instance-wise ambiguous similarity. To this end, we propose Learning to Decompose Network (LeadNet) to contextualize the meta-learned “support-to-target” strategy, leveraging the context of instances with one or mixed latent attributes in a support set. In particular, the comparison relationship between instances is decomposed w.r.t. multiple embedding spaces. LeadNet learns to automatically select the strategy associated with the right attribute via incorporating the change of comparison across contexts with polysemous embeddings. We demonstrate the superiority of LeadNet in various applications, including exploring multiple views of confusing data, out-of-distribution recognition, and few-shot image classification.

Abstract:
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug&play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs, and significantly advances the current state-of-the-art end task performance in both the fine-tuning and zero-shot settings.

Abstract:
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method. The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data while utilizing deep learning models. On one hand, PASS employs a generative model to create synthetic data that closely mirrors raw data while preserving its rank properties through data perturbation, thereby enhancing data diversity and bolstering privacy. By incorporating knowledge transfer from large pre-trained generative models, PASS enhances estimation accuracy, yielding refined distributional estimates of various statistics via Monte Carlo experiments. On the other hand, PAI boasts its statistically guaranteed validity. In pivotal inference, it enables precise conclusions even without prior knowledge of the pivotal's distribution. In non-pivotal situations, we enhance the reliability of synthetic data generation by training it with an independent holdout sample. We demonstrate the effectiveness of PAI in advancing uncertainty quantification in complex, data-driven tasks by applying it to diverse areas such as image synthesis, sentiment word analysis, multimodal inference, and the construction of prediction intervals.

Abstract:
In point clouds, some regions typically contain nodes from multiple categories, i.e., these regions have both homophilic and heterophilic nodes. However, most existing methods ignore the heterophily of edges during the aggregation of the neighborhood node features, which inevitably mixes unnecessary information of heterophilic nodes and leads to blurred boundaries of segmentation. To address this problem, we model the point cloud as a homophilic-heterophilic graph and propose a graph regulation network (GRN) to produce finer segmentation boundaries. The proposed method can adaptively adjust the propagation mechanism with the degree of neighborhood homophily. Moreover, we build a prototype feature extraction module, which is utilised to mine the homophily features of nodes from the global prototype space. Theoretically, we prove that our convolution operation can constrain the similarity of representations between nodes based on their degree of homophily. Extensive experiments on fully and weakly supervised point cloud semantic segmentation tasks demonstrate that our method achieves satisfactory performance. Especially in the case of weak supervision, that is, when each sample has only 1%-10% labeled points, the proposed method yields a significant improvement in segmentation performance.

Abstract:
Semi-supervised learning (SSL) suffers from severe performance degradation when labeled and unlabeled data come from inconsistent and imbalanced distributions. Nonetheless, there is a lack of theoretical guidance regarding a remedy for this issue. To bridge the gap between theoretical insights and practical solutions, we embark on an analysis of the generalization bound of classic SSL algorithms. This analysis reveals that distribution inconsistency between unlabeled and labeled data can lead to a large generalization error bound. Motivated by this theoretical insight, we present a Triplet Adaptation Framework (TAF) to reduce the distribution divergence and improve the generalization of SSL models. TAF comprises three adapters: Balanced Residual Adapter, aiming to map the class distribution of labeled and unlabeled data to a uniform distribution for reducing class distribution divergence; Representation Adapter, aiming to map the representation distribution of unlabeled data to that of labeled data for reducing representation distribution divergence; and Pseudo-Label Adapter, aiming to align the predicted pseudo-labels with the class distribution of unlabeled data, thereby preventing erroneous pseudo-labels from exacerbating representation divergence. These three adapters collaborate synergistically to reduce the generalization bound, ultimately achieving a more robust and generalizable SSL model. Extensive experiments across various robust SSL scenarios validate the efficacy of our method.

Abstract:
In this paper, we study the problem of efficiently and effectively embedding the high-dimensional spatio-spectral information of hyperspectral (HS) images, guided by feature diversity. Specifically, based on the theoretical formulation that feature diversity is correlated with the rank of the unfolded kernel matrix, we rectify 3D convolution by modifying its topology to enhance the rank upper-bound. This modification yields a rank-enhanced spatial-spectral symmetrical convolution set (ReS³-ConvSet), which not only learns diverse and powerful feature representations but also saves network parameters. Additionally, we propose a novel diversity-aware regularization (DA-Reg) term that directly acts on the feature maps to maximize independence among elements. To demonstrate the superiority of the proposed ReS³-ConvSet and DA-Reg, we apply them to various HS image processing and analysis tasks, including denoising, spatial super-resolution, and classification. Extensive experiments show that the proposed approaches outperform state-of-the-art methods both quantitatively and qualitatively to a significant extent.

Abstract:
Deep cooperative multi-agent reinforcement learning has demonstrated its remarkable success over a wide spectrum of complex control tasks. However, recent advances in multi-agent learning mainly focus on value decomposition while leaving entity interactions still intertwined, which easily leads to over-fitting on noisy interactions between entities. In this work, we introduce a novel interactiOn Pattern disenTangling (OPT) method, to disentangle the entity interactions into interaction prototypes, each of which represents an underlying interaction pattern within a subgroup of the entities. OPT facilitates filtering the noisy interactions between irrelevant entities and thus significantly improves generalizability as well as interpretability. Specifically, OPT introduces a sparse disagreement mechanism to encourage sparsity and diversity among discovered interaction prototypes. Then the model selectively restructures these prototypes into a compact interaction pattern by an aggregator with learnable weights. To alleviate the training instability issue caused by partial observability, we propose to maximize the mutual information between the aggregation weights and the history behaviors of each agent. Experiments on single-task, multi-task and zero-shot benchmarks demonstrate that the proposed method yields results superior to the state-of-the-art counterparts.

Abstract:
In the field of image descattering, the image formation models employed for restoration approaches are often simplified. These models assume that the scattering distribution is uniform in homogeneous media when transmission is fixed. Through specifically designed experiments, we discover that scattering exhibits non-uniform characteristics even in homogeneous media. Neglecting non-uniform scattering in these models limits their accuracy in representing scattering distribution, rendering existing image descattering approaches inadequate. To tackle these issues, this paper proposes a novel image formation model for image descattering, considering more physical parameters, such as zenith angle, azimuth angle, scattering phase function, and camera focal length. Our model describes the light transfer process in scattering media more accurately. For image descattering, we introduce corresponding algorithms for parameter estimation in our model and simultaneous restoration from degraded images. Experimental evaluations demonstrate the effectiveness of our proposed model in various tasks, including physical parameter estimation, pure-scattering removal, image dehazing, and underwater image restoration. In terms of parameter estimation, our results are close to the real values; in terms of underwater image restoration, our work outperforms the state-of-the-art methods; in terms of image dehazing, our work improves the performance of existing methods by replacing previous models with our model.

Abstract:
Even though the collaboration between traditional and neuromorphic event cameras brings prosperity to frame-event based vision applications, the performance is still confined by the resolution gap crossing two modalities in both spatial and temporal domains. This paper is devoted to bridging the gap by increasing the temporal resolution for images, i.e., motion deblurring, and the spatial resolution for events, i.e., event super-resolving, respectively. To this end, we introduce CrossZoom, a novel unified neural Network (CZ-Net) to jointly recover sharp latent sequences within the exposure period of a blurry input and the corresponding High-Resolution (HR) events. Specifically, we present a multi-scale blur-event fusion architecture that leverages the scale-variant properties and effectively fuses cross-modal information to achieve cross-enhancement. Attention-based adaptive enhancement and cross-interaction prediction modules are devised to alleviate the distortions inherent in Low-Resolution (LR) events and enhance the final results through the prior blur-event complementary information. Furthermore, we propose a new dataset containing HR sharp-blurry images and the corresponding HR-LR event streams to facilitate future research. Extensive qualitative and quantitative experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed method.

Abstract:
Deep neural networks have exhibited remarkable performance in image super-resolution (SR) tasks by learning a mapping from low-resolution (LR) images to high-resolution (HR) images. However, the SR problem is typically an ill-posed problem and existing methods would come with several limitations. First, the possible mapping space of SR can be extremely large since there may exist many different HR images that can be super-resolved from the same LR image. As a result, it is hard to directly learn a promising SR mapping from such a large space. Second, it is often inevitable to develop very large models with extremely high computational cost to yield promising SR performance. In practice, one can use model compression techniques to obtain compact models by reducing model redundancy. Nevertheless, it is hard for existing model compression methods to accurately identify the redundant components due to the extremely large SR mapping space. To alleviate the first challenge, we propose a dual regression learning scheme to reduce the space of possible SR mappings. Specifically, in addition to the mapping from LR to HR images, we learn an additional dual regression mapping to estimate the downsampling kernel and reconstruct LR images. In this way, the dual mapping acts as a constraint to reduce the space of possible mappings. To address the second challenge, we propose a dual regression compression (DRC) method to reduce model redundancy in both layer-level and channel-level based on channel pruning. Specifically, we first develop a channel number search method that minimizes the dual regression loss to determine the redundancy of each layer. Given the searched channel numbers, we further exploit the dual regression manner to evaluate the importance of channels and prune the redundant ones. Extensive experiments show the effectiveness of our method in obtaining accurate and efficient SR models.
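The dual regression constraint described above can be sketched on a toy 1-D signal. This is only an illustration under stated assumptions: the function names, the L1 loss, the weight `lam`, and the fixed (non-learned) upsampling and downsampling mappings are all hypothetical stand-ins, not the paper's networks or its exact objective.

```python
def upsample(lr, scale=2):
    """Toy primal mapping P (assumed): nearest-neighbour upsampling."""
    return [v for v in lr for _ in range(scale)]


def downsample(hr, scale=2):
    """Toy dual mapping D (assumed): average each block of `scale` samples."""
    return [sum(hr[i:i + scale]) / scale for i in range(0, len(hr), scale)]


def l1(a, b):
    """Mean absolute error between two equally long signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dual_regression_loss(lr, hr, primal=upsample, dual=downsample, lam=0.1):
    """Primal loss on the HR estimate plus a weighted dual loss on the
    LR signal reconstructed from that estimate."""
    sr = primal(lr)
    primal_loss = l1(sr, hr)       # how close P(lr) is to the HR target
    dual_loss = l1(dual(sr), lr)   # how well D(P(lr)) recovers lr
    return primal_loss + lam * dual_loss


lr = [1.0, 3.0]
hr = [1.0, 2.0, 3.0, 4.0]


def shifted(signal):
    """A second candidate mapping with the same primal loss as `upsample`
    on this example, but inconsistent with the LR input."""
    return [v + 1.0 for v in upsample(signal)]


loss_consistent = dual_regression_loss(lr, hr)
loss_shifted = dual_regression_loss(lr, hr, primal=shifted)
print(loss_consistent, loss_shifted)
```

Both candidate mappings have the same primal loss here, and only the dual term separates them, which is the sense in which the dual mapping narrows the space of acceptable SR mappings.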

Abstract:
Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks.

Abstract:
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and, thus, the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5%∼65.3%), instance segmentation (e.g. 21.8%∼54.0%), and panoptic segmentation (e.g. 14.7%∼43.3%). Code will be available.

Abstract:
Multimodal vision-language (VL) learning has noticeably pushed the tendency toward generic intelligence owing to emerging large foundation models. However, tracking, as a fundamental vision problem, surprisingly enjoys less bonus from recent flourishing VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning of current works. These nuisances motivate us to design more effective vision-language representation for tracking, meanwhile constructing a large database with language annotation for model learning. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve VL representation, we introduce a contrastive loss to align different modalities. To thoroughly evidence the effectiveness of our method, we integrate the proposed framework on three tracking methods with different designs, i.e., the CNN-based SiamCAR (Guo et al. 2020), the Transformer-based OSTrack (Ye et al. 2022), and the hybrid structure TransT (Chen et al. 2021). The experiments demonstrate that our framework can significantly improve all baselines on six benchmarks. Besides empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking with diversified multimodal messages.

Abstract:
Mining discriminative graph topological information plays an important role in promoting graph representation ability. However, it suffers from two main issues: (1) the difficulty/complexity of computing global inter-class/intra-class scatters, commonly related to mean and covariance of graph samples, for discriminant learning; (2) the huge complexity and variety of graph topological structure that is rather challenging to robustly characterize. In this paper, we propose the Wasserstein Discriminant Dictionary Learning (WDDL) framework to achieve discriminant learning on graphs with robust graph topology modeling, and hence facilitate graph-based pattern analysis tasks. Considering the difficulty of calculating global inter-class/intra-class scatters, a reference set of graphs (aka graph dictionary) is first constructed by generating representative graph samples (aka graph keys) with expressive topological structure. Then, a Wasserstein Graph Representation (WGR) process is proposed to project input graphs into a succinct dictionary space through the graph dictionary lookup. To further achieve discriminant graph learning, a Wasserstein discriminant loss (WD-loss) is defined on the graph dictionary, in which the graph keys are optimizable, to make the intra-class keys more compact and inter-class keys more dispersed. Hence, the calculation of global Wasserstein metric (W-metric) centers can be bypassed. For sophisticated topology mining in the WGR process, a joint-Wasserstein graph embedding module is constructed to model both between-node and between-edge relationships across inputs and graph keys by encapsulating both the Wasserstein metric (between cross-graph nodes) and proposed novel Kron–Gromov–Wasserstein (KGW) metric (between cross-graph adjacencies). Specifically, the KGW-metric comprehensively characterizes the cross-graph connection patterns with the Kronecker operation, then adaptively captures those salient patterns through connection pooling. 
To evaluate the proposed framework, we study two graph-based pattern analysis problems, i.e. graph classification and cross-modal retrieval, with the graph dictionary flexibly adjusted to cater to these two tasks. Extensive experiments are conducted to comprehensively compare with existing advanced methods, as well as dissect the critical component of our proposed architecture. The experimental results validate the effectiveness of the WDDL framework.

Abstract:
Visual categories that largely share the same set of local parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label, even for categories it has not encountered during training. Starting with a rigorous formalization of the notion of distinguishability between categories that share attributes, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries to tell them apart. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets and achieve state-of-the-art results on all of them, surpassing the performance of all existing works with a margin exceeding 4% in some cases. We additionally show that Relational Proxies also generalizes to the zero-shot setting, where it can efficiently leverage emergent relationships among attributes and image views to generalize to unseen categories, surpassing current state-of-the-art in both the non-generative and generative settings. Implementation is available at https://github.com/abhrac/relational-proxies.

Abstract:
The crux of effective out-of-distribution (OOD) detection lies in acquiring a robust in-distribution (ID) representation, distinct from OOD samples. While previous methods predominantly leaned on recognition-based techniques for this purpose, they often resulted in shortcut learning, lacking comprehensive representations. In our study, we conducted a comprehensive analysis, exploring distinct pretraining tasks and employing various OOD score functions. The results highlight that the feature representations pre-trained through reconstruction yield a notable enhancement and narrow the performance gap among various score functions. This suggests that even simple score functions can rival complex ones when leveraging reconstruction-based pretext tasks. Reconstruction-based pretext tasks adapt well to various score functions. As such, they hold promising potential for further expansion. Our OOD detection framework, MOODv2, employs the masked image modeling pretext task. Without bells and whistles, MOODv2 improves AUROC by 14.30% to 95.68% on ImageNet and achieves 99.98% on CIFAR-10.

Affiliations: State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Security Department of Alibaba Group, Hangzhou, China; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China; School of Computer Science and Technology, Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences, Beijing, China

Abstract:
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called Diversity-Promoting Collaborative Metric Learning (DPCML), with the hope of considering the commonly ignored minority interest of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system where users’ preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a Diversity Control Regularization Scheme (DCRS) is developed to accommodate the multi-vector representation strategy better. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require negative sampling to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML.
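The minimum-distance aggregation over a user's embedding set can be illustrated with a small sketch. The vectors, item names, and helper functions below are hypothetical, and Euclidean distance is an assumption; the sketch covers only the scoring rule stated in the abstract, not training, the assignment strategies, or DCRS.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def user_item_distance(user_vectors, item_vector):
    """Distance from an item to the closest of the user's vectors."""
    return min(euclidean(u, item_vector) for u in user_vectors)


def rank_items(user_vectors, items):
    """Item names sorted from most to least preferred (smallest distance first)."""
    return sorted(items, key=lambda name: user_item_distance(user_vectors, items[name]))


# A user with two distinct interests, one vector per interest (toy values).
multi_vector_user = [(0.0, 0.0), (10.0, 10.0)]
# The same user collapsed into a single averaged vector.
single_vector_user = [(5.0, 5.0)]

items = {
    "thriller": (0.5, 0.0),   # close to the first interest
    "cookbook": (10.0, 9.0),  # close to the second interest
    "manual": (5.0, 5.0),     # close to neither interest
}

multi_ranking = rank_items(multi_vector_user, items)
single_ranking = rank_items(single_vector_user, items)
print(multi_ranking)
print(single_ranking)
```

With two vectors, items near either interest rank ahead of the in-between item; with the single averaged vector, the in-between item ranks first, which is the kind of preference bias the abstract attributes to a unique user representation.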

Abstract:
Segmenting unknown or anomalous object instances is a critical task in autonomous driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects’ boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture to jointly address anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies/unknown objects: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; iii) a mask refinement solution to reduce false positives; and iv) a novel approach to mine unknown instances based on the mask-architecture properties. By comprehensive quantitative and qualitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.

Abstract:
Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer.

Abstract:
In recent years, Neural Fields (NFs) have emerged as an effective tool for encoding diverse continuous signals such as images, videos, audio, and 3D shapes. When applied to 3D data, NFs offer a solution to the fragmentation and limitations associated with prevalent discrete representations. However, given that NFs are essentially neural networks, it remains unclear whether and how they can be seamlessly integrated into deep learning pipelines for solving downstream tasks. This paper addresses this research problem and introduces nf2vec, a framework capable of generating a compact latent representation for an input NF in a single inference pass. We demonstrate that nf2vec effectively embeds 3D objects represented by the input NFs and showcase how the resulting embeddings can be employed in deep learning pipelines to successfully address various tasks, all while processing exclusively NFs. We test this framework on several NFs used to represent 3D surfaces, such as unsigned/signed distance and occupancy fields. Moreover, we demonstrate the effectiveness of our approach with more complex NFs that encompass both geometry and appearance of 3D objects such as neural radiance fields.

Abstract:
The challenge of semantic segmentation with scarce pixel-level annotations has induced many self-supervised works; however, most of them essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they need to resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from the real-time and end-to-end inference practice, but also escalates the problem from segmenting per image to clustering all pixels at once, which results in downgraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inferring paradigm where inferring is performed in an end-to-end manner. Specifically, based on our observations in probing dense representation by image-level self-supervised ViT, i.e. semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with attention map constraint to train a tailored Transformer Decoder with learnable prototypes and utilize adaptive prototypes for segmentation inference per image. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and the generalizability of our proposed method.

Abstract:
Creating an image focal stack requires multiple shots that capture images at different depths within the same scene. Such methods are not suitable for scenes undergoing continuous changes. Achieving an all-in-focus image from a single shot poses significant challenges, due to the highly ill-posed nature of rectifying defocus and deblurring from a single image. In this paper, to restore an all-in-focus image, we introduce the neuromorphic focal stack, which is defined as neuromorphic signal streams captured by an event or spike camera during a continuous focal sweep. Given an RGB image focused at any distance, we harness the high temporal resolution of neuromorphic signal streams. From neuromorphic signal streams, we automatically select refocusing timestamps and reconstruct corresponding refocused images to form a focal stack. Guided by the neuromorphic signal around the selected timestamps, we can merge the focal stack using proper weights and restore a sharp all-in-focus image. We test our method on two distinct neuromorphic cameras. Experimental results from both synthetic and real datasets demonstrate a marked improvement over existing state-of-the-art methods.
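The final merging step, combining refocused images with per-pixel weights, can be sketched as a normalized weighted average. This is a generic illustration only: in the paper the weights are guided by the neuromorphic signal, whereas the weights, image values, and function name below are hypothetical.

```python
def merge_focal_stack(stack, weights):
    """Per-pixel weighted average of refocused images.

    stack:   list of images, each a list of rows of floats
    weights: list of non-negative weight maps with the same shape
    """
    height, width = len(stack[0]), len(stack[0][0])
    merged = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(w[y][x] for w in weights)
            if total == 0:
                # No slice claims this pixel: fall back to the plain mean.
                merged[y][x] = sum(img[y][x] for img in stack) / len(stack)
            else:
                weighted = sum(img[y][x] * w[y][x] for img, w in zip(stack, weights))
                merged[y][x] = weighted / total
    return merged


# Two 1x3 refocused images: the first is assumed sharp on the left pixel,
# the second on the middle pixel; neither is trusted on the right pixel.
stack = [[[10.0, 0.0, 4.0]], [[0.0, 20.0, 8.0]]]
weights = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]]

all_in_focus = merge_focal_stack(stack, weights)
print(all_in_focus)
```

Each output pixel takes its value from the slices that carry weight at that location, and pixels with no weight fall back to the unweighted mean of the stack.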

Abstract:
Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.

Abstract:
Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch and pose. However, this pipeline is heavily limited by high computational cost and long inference latency, mainly attributed to two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation of the teacher network and the data stream of generative models on both space and time. Fast-Vid2Vid++ makes the first attempt at time dimension to transfer hierarchical features and time coherence knowledge to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in high-resolution and full-time domains. We transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI).
On standard benchmarks, Fast-Vid2Vid++ achieves a real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU.

Abstract:
Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness.

Abstract:
Directly regressing the non-rigid shape and camera pose from the individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the 3D sequence from the input 2D sequence. In this paper, we propose to solve deep sparse NRSfM from a sequence-to-sequence translation perspective, where the input 2D keypoints sequence is taken as a whole to reconstruct the corresponding 3D keypoints sequence in a self-supervised manner. First, we apply a shape-motion predictor on the input sequence to obtain an initial sequence of shapes and corresponding motions. Then, we propose the Context Layer, which enables the deep learning framework to effectively impose overall constraints on sequences based on the structural characteristics of non-rigid sequences. The Context Layer constructs modules for imposing the self-expressiveness regularity on non-rigid sequences with multi-head attention (MHA) as the core, together with the use of temporal encoding, both of which act simultaneously to constitute constraints on non-rigid sequences in the deep framework. Experimental results across different datasets such as Human3.6M, CMU Mocap, and InterHand prove the superiority of our framework. The code will be made publicly available.

Abstract:
Benefiting from advances in large-scale pre-training, foundation models have demonstrated remarkable capability in natural language processing and computer vision, among other fields. However, to achieve expert-level performance in specific applications, such models often need to be fine-tuned with domain-specific knowledge. In this paper, we focus on enabling vision-language models to unleash more potential for visual understanding tasks under few-shot tuning. Specifically, we propose a novel adapter, dubbed ClusterAdapter, which is based on a trainable multi-prototype clustering algorithm, for tuning the CLIP model. It can not only alleviate the concern of catastrophic forgetting of foundation models by introducing anchors to inherit common knowledge, but also improve the utilization efficiency of the few annotated samples by bringing in clustering and domain priors, thereby improving the performance of few-shot tuning. We have conducted extensive experiments on 11 common classification benchmarks. The results show our method significantly surpasses the original CLIP and achieves state-of-the-art (SOTA) performance under all benchmarks and settings. For example, under the 16-shot setting, our method exhibits a remarkable improvement over the original CLIP by 19.6%, and also surpasses TIP-Adapter and GraphAdapter by 2.7% and 2.2%, respectively, in terms of average accuracy across the 11 benchmarks.

Abstract:
Continual learning (CL) aims to learn new tasks without forgetting previous tasks. However, existing CL methods require a large amount of raw data, which is often unavailable due to copyright considerations and privacy risks. Instead, stakeholders usually release pre-trained machine learning models as a service (MLaaS), which users can access via APIs. This paper considers two practical-yet-novel CL settings: data-efficient CL (DECL-APIs) and data-free CL (DFCL-APIs), which achieve CL from a stream of APIs with partial or no raw data. Performing CL under these two new settings faces several challenges: unavailable full raw data, unknown model parameters, heterogeneous models of arbitrary architecture and scale, and catastrophic forgetting of previous APIs. To overcome these issues, we propose a novel data-free cooperative continual distillation learning framework that distills knowledge from a stream of APIs into a CL model by generating pseudo data, just by querying APIs. Specifically, our framework includes two cooperative generators and one CL model, forming their training as an adversarial game. We first use the CL model and the current API as fixed discriminators to train generators via a derivative-free method. Generators adversarially generate hard and diverse synthetic data to maximize the response gap between the CL model and the API. Next, we train the CL model by minimizing the gap between the responses of the CL model and the black-box API on synthetic data, to transfer the API's knowledge to the CL model. Furthermore, we propose a new regularization term based on network similarity to prevent catastrophic forgetting of previous APIs. Our method performs comparably to classic CL with full raw data on the MNIST and SVHN datasets in the DFCL-APIs setting. In the DECL-APIs setting, our method achieves 0.97×, 0.75×, and 0.69× the performance of classic CL on the more challenging CIFAR10, CIFAR100, and MiniImageNet, respectively.

Abstract:
Blind image restoration (IR) is a common yet challenging problem in computer vision. Classical model-based methods and recent deep learning (DL)-based methods represent two different methodologies for this problem, each with its own merits and drawbacks. In this paper, we propose a novel blind image restoration method that aims to integrate the advantages of both. Specifically, we construct a general Bayesian generative model for blind IR, which explicitly depicts the degradation process. In this proposed model, a pixel-wise non-i.i.d. Gaussian distribution is employed to fit the image noise. It is more flexible than the simple i.i.d. Gaussian or Laplacian distributions adopted in most conventional methods, and can therefore handle the more complicated noise types contained in the image degradation. To solve the model, we design a variational inference algorithm where all the expected posteriori distributions are parameterized as deep neural networks to increase their model capability. Notably, such an inference algorithm induces a unified framework to jointly deal with the tasks of degradation estimation and image restoration. Further, the degradation information estimated in the former task is utilized to guide the latter IR process. Experiments on two typical blind IR tasks, namely image denoising and super-resolution, demonstrate that the proposed method achieves superior performance over current state-of-the-art methods.

Abstract:
Currently prevalent multi-modal 3D detection methods rely on dense detectors that usually use dense Bird’s-Eye-View (BEV) feature maps. However, the cost of such BEV feature maps is quadratic to the detection range, making it not scalable for long-range detection. Recently, LiDAR-only fully sparse architecture has been gaining attention for its high efficiency in long-range perception. In this paper, we study how to develop a multi-modal fully sparse detector. Specifically, our proposed detector integrates the well-studied 2D instance segmentation into the LiDAR side, which is parallel to the 3D instance segmentation part in the LiDAR-only baseline. The proposed instance-based fusion framework maintains full sparsity while overcoming the constraints associated with the LiDAR-only fully sparse detector. Our framework showcases state-of-the-art performance on the widely used nuScenes dataset, Waymo Open Dataset, and the long-range Argoverse 2 dataset. Notably, the inference speed of our proposed method under the long-range perception setting is 2.7× faster than that of other state-of-the-art multimodal 3D detection methods.

Abstract:
While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose a more efficient and stronger VST version in this work, i.e., VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information at low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

Abstract:
Graph convolutional networks (GCNs) can quickly and accurately learn graph representations and have shown powerful performance in many graph learning domains. Despite their effectiveness, neighborhood awareness remains essential and challenging for GCNs. Existing methods usually perform neighborhood-aware steps only from the node or hop level, which leads to a lack of capability to learn the neighborhood information of nodes from both global and local perspectives. Moreover, most methods learn the nodes’ neighborhood information from a single view, ignoring the importance of multiple views. To address the above issues, we propose a multi-view adaptive neighborhood-aware approach to learn graph representations efficiently. Specifically, we propose three random feature masking variants to perturb some neighbors’ information to promote the robustness of graph convolution operators at node-level neighborhood awareness and exploit the attention mechanism to select important neighbors from the hop level adaptively. We also utilize the multi-channel technique and introduce a proposed multi-view loss to perceive neighborhood information from multiple perspectives. Extensive experiments show that our method can better obtain graph representation and has high accuracy.

Abstract:
Multi-graph Multi-label learning (Mgml) aims to classify a set of objects of interest, such as text or images, using a bag-of-graphs representation. Previous Mgml works have limitations as they only learn labels at the bag level, lose structural information in learning by transferring graphs into instances, and cannot handle noisy labels. This paper presents a robust coarse and fine-grained Noise Multi-graph Multi-label (cfMGNML) learning framework that builds the learning model over the graphs and empowers label prediction at both the coarse (bag) and fine-grained (graph in each bag) levels with noisy labels. To identify label noise, a label probability matrix is defined to act on the scoring function of each label, with a higher probability value indicating that the label is more likely to be the corresponding graph or bag label. The problem is regularized with the manifold constraint on the label probability matrix to preserve local relationships within the data and uncover its essential manifold structure. Meanwhile, a thresholding rank-loss objective is proposed to rank the labels for the graphs and bags and minimize the Hamming loss simultaneously in one step. To tackle the non-convex optimization problem, an effective sub-gradient descent algorithm is developed. Experiments over various datasets demonstrate that the proposed method achieves better performance than the state-of-the-art algorithms.

Abstract:
Action recognition from video data forms a cornerstone with wide-ranging applications. Single-view action recognition faces limitations due to its reliance on a single viewpoint. In contrast, multi-view approaches capture complementary information from various viewpoints for improved accuracy. Recently, event cameras have emerged as innovative bio-inspired sensors, leading to advancements in event-based action recognition. However, existing works predominantly focus on single-view scenarios, leaving a gap in multi-view event data exploitation, particularly in challenges like information deficit and semantic misalignment. To bridge this gap, we introduce HyperMV, a multi-view event-based action recognition framework. HyperMV converts discrete event data into frame-like representations and extracts view-related features using a shared convolutional network. By treating segments as vertices and constructing hyperedges using rule-based and KNN-based strategies, a multi-view hypergraph neural network that captures relationships across viewpoint and temporal features is established. The vertex attention hypergraph propagation is also introduced for enhanced feature fusion. To promote research in this area, we present the largest multi-view event-based action dataset $\text{THU}^{\text{MV-EACT}}\text{-50}$, comprising 50 actions from 6 viewpoints, which surpasses existing datasets by over tenfold. Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios, and also exceeds the state of the art in frame-based multi-view action recognition.

Abstract:
Causal discovery, the inference of causal relations among variables from data, is a fundamental problem of science. Nowadays, due to an increased awareness of data privacy concerns, there has been a shift towards distributed data collection, processing and storage. To meet the pressing need for distributed causal discovery, we propose a novel federated DAG learning method called distributed annealing on regularized likelihood score (DARLS) to learn a causal graph from data stored on multiple clients. DARLS simulates an annealing process to search over the space of topological sorts, where the optimal graphical structure compatible with a sort is found by distributed optimization. This distributed optimization relies on multiple rounds of communication between local clients and a central server to estimate the graphical structure. We establish its convergence to the solution obtained by an oracle with access to all the data. To the best of our knowledge, DARLS is the first distributed method for learning causal graphs with such finite-sample oracle guarantees. To establish the consistency of DARLS, we also derive new identifiability results for causal graphs parameterized by generalized linear models, which could be of independent interest. Through extensive simulation studies and a real-world application, we show that DARLS outperforms existing federated learning methods and is comparable to oracle methods on pooled data, demonstrating its great advantages in estimating causal networks from distributed data.

Abstract:
In this paper, we propose a general deep learning training framework XGrad which introduces weight prediction into the popular gradient-based optimizers to boost their convergence and generalization when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, the future weights are predicted according to the update rule of the used optimizer and are then applied to both the forward pass and backward propagation. In this way, during the whole training period, the optimizer always utilizes the gradients w.r.t. the future weights to update the DNN parameters, making the gradient-based optimizer achieve better convergence and generalization compared to the original optimizer without weight prediction. XGrad is rather straightforward to implement yet pretty effective in boosting the convergence of gradient-based optimizers and the accuracy of DNN models. Empirical results concerning five popular optimizers including SGD with momentum, Adam, AdamW, AdaBelief, and AdaM3 demonstrate the effectiveness of our proposal. The experimental results validate that XGrad can attain higher model accuracy than the baseline optimizers when training the DNN models.
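The weight-prediction step described in this abstract can be illustrated with a minimal sketch for SGD with momentum. Everything below is an assumption made for illustration rather than the authors' implementation: the toy loss f(w) = 0.5 w², the one-step-ahead prediction rule `w_future = w - lr * mu * v`, and the function names. With this particular rule the loop coincides with the familiar Nesterov-style lookahead.

```python
def grad(w):
    # Gradient of the toy loss f(w) = 0.5 * w**2 (assumed for illustration).
    return w


def sgdm_with_weight_prediction(w, steps=200, lr=0.1, mu=0.9):
    """SGD with momentum where the gradient is taken at predicted future weights."""
    v = 0.0  # momentum buffer
    for _ in range(steps):
        # Predict the future weights from the optimizer's own update rule,
        # using the momentum currently stored in the buffer (hypothetical rule).
        w_future = w - lr * mu * v
        # Forward and backward passes use the predicted weights.
        g = grad(w_future)
        # The usual momentum update is then applied to the real weights.
        v = mu * v + g
        w = w - lr * v
    return w


if __name__ == "__main__":
    print(sgdm_with_weight_prediction(5.0))
```

The point of the sketch is only the ordering of operations: predict, evaluate the gradient at the prediction, then update the unpredicted weights.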

Abstract:
Open-set segmentation can be conceived by complementing closed-set classification with anomaly detection. Many of the existing dense anomaly detectors operate through generative modelling of regular data or by discriminating with respect to negative data. These two approaches optimize different objectives and therefore exhibit different failure modes. Consequently, we propose a novel anomaly score that fuses generative and discriminative cues. Our score can be implemented by upgrading any closed-set segmentation model with dense estimates of dataset posterior and unnormalized data likelihood. The resulting dense hybrid open-set models require negative training images that can be sampled from an auxiliary negative dataset, from a jointly trained generative model, or from a mixture of both sources. We evaluate our contributions on benchmarks for dense anomaly detection and open-set segmentation. The experiments reveal strong open-set performance in spite of negligible computational overhead.

Abstract:
Deep reinforcement learning agents usually need to collect a large number of interactions to solve a single task. In contrast, meta-reinforcement learning (meta-RL) aims to quickly adapt to new tasks using a small amount of experience by leveraging the knowledge from training on a set of similar tasks. State-of-the-art context-based meta-RL algorithms use the context to encode the task information and train a policy conditioned on the inferred latent task encoding. However, most recent works are limited to parametric tasks, where a handful of variables control the full variation in the task distribution, and also failed to work in non-stationary environments due to the few-shot adaptation setting. To address those limitations, we propose MEta-reinforcement Learning with Task Self-discovery (MELTS), which adaptively learns qualitatively different nonparametric tasks and adapts to new tasks in a zero-shot manner. We introduce a novel deep clustering framework (DPMM-VAE) based on an infinite mixture of Gaussians, which combines the Dirichlet process mixture model (DPMM) and the variational autoencoder (VAE), to simultaneously learn task representations and cluster the tasks in a self-adaptive way. Integrating DPMM-VAE into MELTS enables it to adaptively discover the multi-modal structure of the nonparametric task distribution, which previous methods using isotropic Gaussian random variables cannot model. In addition, we propose a zero-shot adaptation mechanism and a recurrence-based context encoding strategy to improve the data efficiency and make our algorithm applicable in non-stationary environments. On various continuous control tasks with both parametric and nonparametric variations, our algorithm produces a more structured and self-adaptive task latent space and also achieves superior sample efficiency and asymptotic performance compared with state-of-the-art meta-RL algorithms.

Abstract:
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such an imbalance considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. However, the performance of supervised contrastive learning is plagued by an inherent challenge: it necessitates sufficiently large batches of training data to construct contrastive pairs that cover all categories, yet this requirement is difficult to meet in the context of class-imbalanced data. To overcome this obstacle, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space, and samples contrastive pairs accordingly. In fact, estimating the distributions of all classes using features in a small batch, particularly for imbalanced data, is not feasible. Our key idea is to introduce a reasonable and simple assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions on unit space, which brings two-fold benefits. First, the distribution parameters can be estimated using only the first sample moment, which can be efficiently computed in an online manner across different batches. Second, based on the estimated distribution, the vMF distribution allows us to sample an infinite number of contrastive pairs and derive a closed form of the expected contrastive loss for efficient optimization. Beyond long-tailed problems, ProCo can be directly applied to semi-supervised learning by generating pseudo-labels for unlabeled data, which can subsequently be utilized to estimate the distribution of the samples inversely. Theoretically, we analyze the error bound of ProCo. Empirically, extensive experimental results on supervised/semi-supervised visual recognition and object detection tasks demonstrate that ProCo consistently outperforms existing methods across various datasets.
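To make the "first sample moment" remark concrete, here is a small sketch of estimating vMF parameters online for a single class. It is an illustration under assumptions, not the paper's code: the class name `OnlineVMF` is hypothetical, and the concentration is obtained with the widely used closed-form approximation κ ≈ r̄(p − r̄²)/(1 − r̄²), where r̄ is the norm of the mean of the unit-normalized features and p is the feature dimension.

```python
import math


class OnlineVMF:
    """Running estimate of vMF parameters for one class (hypothetical helper)."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.mean = [0.0] * dim  # running mean of unit-normalized features

    def update(self, batch):
        # Fold a new batch into the running first moment, one feature at a time.
        for x in batch:
            norm = math.sqrt(sum(c * c for c in x))
            self.n += 1
            for i, c in enumerate(x):
                self.mean[i] += (c / norm - self.mean[i]) / self.n

    def params(self):
        # Mean direction and approximate concentration from the first moment only.
        # Undefined when r is 0 or 1 (degenerate samples); not handled here.
        r = math.sqrt(sum(c * c for c in self.mean))
        mu = [c / r for c in self.mean]
        kappa = r * (self.dim - r * r) / (1.0 - r * r)
        return mu, kappa


if __name__ == "__main__":
    est = OnlineVMF(dim=2)
    est.update([[2.0, 0.0]])   # first batch
    est.update([[0.0, 3.0]])   # second batch
    print(est.params())
```

Because only a running mean is stored, batches that contain few or no samples of a class simply leave that class's estimate unchanged, which is what makes the online formulation attractive for imbalanced data.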

Abstract:
Self-supervised learning aims to learn representations that generalize effectively to downstream tasks. Many self-supervised approaches regard two views of an image as both the input and the self-supervised signals, assuming that either view contains the same task-relevant information and the shared information is (approximately) sufficient for predicting downstream tasks. Recent studies show that discarding superfluous information not shared between the views can improve generalization. Hence, the ideal representation is sufficient for downstream tasks and contains minimal superfluous information, termed minimal sufficient representation. One can learn this representation by maximizing the mutual information between the representation and the supervised view while eliminating superfluous information. Nevertheless, the computation of mutual information is notoriously intractable. In this work, we propose an objective termed multi-view entropy bottleneck (MVEB) to learn minimal sufficient representation effectively. MVEB simplifies the minimal sufficient learning to maximizing both the agreement between the embeddings of two views and the differential entropy of the embedding distribution. Our experiments confirm that MVEB significantly improves performance. For example, it achieves top-1 accuracy of 76.9% on ImageNet with a vanilla ResNet-50 backbone on linear evaluation. To the best of our knowledge, this is the new state-of-the-art result with ResNet-50.

Abstract:
The remarkable performance of recent stereo depth estimation models benefits from the successful use of convolutional neural networks to regress dense disparity. Akin to most tasks, this needs gathering training data that covers a number of heterogeneous scenes at deployment time. However, training samples are typically acquired continuously in practical applications, making the capability to learn new scenes continually even more crucial. For this purpose, we propose to perform continual stereo matching where a model is tasked to 1) continually learn new scenes, 2) overcome forgetting previously learned scenes, and 3) continuously predict disparities at inference. We achieve this goal by introducing a Reusable Architecture Growth (RAG) framework. RAG leverages task-specific neural unit search and architecture growth to learn new scenes continually in both supervised and self-supervised manners. It can maintain high reusability during growth by reusing previous units while obtaining good performance. Additionally, we present a Scene Router module to adaptively select the scene-specific architecture path at inference. Comprehensive experiments on numerous datasets show that our framework performs impressively in various weather, road, and city circumstances and surpasses the state-of-the-art methods in more challenging cross-dataset settings. Further experiments also demonstrate the adaptability of our method to unseen scenes, which can facilitate end-to-end stereo architecture learning and practical deployment.

Abstract:
In the real world, how to effectively learn consistent similarity measurement across different modalities is essential. Most of the existing similarity learning methods cannot deal well with cross-modal data due to the modality gap and have obvious performance degeneration when applied to cross-modal data. To tackle this problem, we propose a novel cross-modal similarity learning method, called Causality-Invariant Interactive Mining (CIIM), that can effectively capture informative relationships among different samples and modalities to derive the modality-consistent feature embeddings in the unified metric space. Our CIIM tackles the modality gap from two aspects, i.e., sample-wise and feature-wise. Specifically, we start from the sample-wise view and learn the single-modality and hybrid-modality proxies for exploring the cross-modal similarity with the elaborate metric losses. In this way, sample-to-sample and sample-to-proxy correlations are both taken into consideration. Furthermore, we conduct the causal intervention to eliminate the modality bias and reconstruct the invariant causal embedding in the feature-wise aspect. To this end, we force the learned embeddings to satisfy the specific properties of our causal mechanism and derive the causality-invariant feature embeddings in the unified metric space. Extensive experiments on two cross-modality tasks demonstrate the superiority of our proposed method over the state-of-the-art methods.

Abstract:
This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representation as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements. To address these problems, we present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). In pursuing effective multi-object modeling, AOT introduces the IDentification (ID) mechanism to allocate each object a unique identity. This approach enables the network to model the associations among all objects simultaneously, thus facilitating the tracking and segmentation of objects in a single network pass. To address the challenge of inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate scalable supervision and layer-wise ID-based attention. This enables online architecture scalability in VOS for the first time and overcomes ID embeddings’ representation limitations. Given the absence of a benchmark for VOS involving densely multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluated various AOT and AOST variants using extensive experiments across VOSW and five commonly used VOS benchmarks, including YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks. Moreover, we notably achieved the 1st position in the 3rd Large-scale Video Object Segmentation Challenge.

Abstract:
Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing it one-by-one. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene. We thus propose an improved occlusion-aware distance metric correctly handling opaque scenes. Furthermore, we present a neural network based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.

Abstract:
Fast adversarial training (FAT) is an efficient method to improve robustness in white-box attack scenarios. However, the original FAT suffers from catastrophic overfitting, which dramatically and suddenly reduces robustness after a few training epochs. Although various FAT variants have been proposed to prevent overfitting, they require long training times. In this paper, we investigate the relationship between adversarial example quality and catastrophic overfitting by comparing the training processes of standard adversarial training and FAT. We find that catastrophic overfitting occurs when the attack success rate of adversarial examples becomes worse. Based on this observation, we propose a positive prior-guided adversarial initialization to prevent overfitting by improving adversarial example quality without extra training time. This initialization is generated by using high-quality adversarial perturbations from the historical training process. We provide theoretical analysis for the proposed initialization and propose a prior-guided regularization method that boosts the smoothness of the loss function. Additionally, we design a prior-guided ensemble FAT method that averages the different model weights of historical models using different decay rates. Our proposed method, called FGSM-PGK, assembles the prior-guided knowledge, i.e., the prior-guided initialization and model weights, acquired during the historical training process. The proposed method can effectively improve the model's adversarial robustness in white-box attack scenarios. Evaluations on four datasets demonstrate the superiority of the proposed method.

Abstract:
AdamW modifies Adam by adding a decoupled weight decay that decays the network weights at every training iteration. For adaptive algorithms, this decoupled weight decay does not affect specific optimization steps, and differs from the widely used $\ell_2$-regularizer, which changes optimization steps by changing the first- and second-order gradient moments. Despite AdamW's great practical success, its convergence behavior and its generalization improvement over Adam and $\ell_2$-regularized Adam ($\ell_2$-Adam) have not yet been established. To solve this issue, we prove the convergence of AdamW and justify its generalization advantages over Adam and $\ell_2$-Adam. Specifically, AdamW provably converges but minimizes a dynamically regularized loss that combines the vanilla loss and a dynamical regularization induced by decoupled weight decay, thus behaving differently from Adam and $\ell_2$-Adam. Moreover, on both general nonconvex problems and PŁ-conditioned problems, we establish the stochastic gradient complexity of AdamW to find a stationary point. Such complexity is also applicable to Adam and $\ell_2$-Adam, and improves their previously known complexity, especially for over-parametrized networks. Besides, we prove that AdamW enjoys smaller generalization errors than Adam and $\ell_2$-Adam from the Bayesian posterior aspect. This result, for the first time, explicitly reveals the benefits of decoupled weight decay in AdamW. Experimental results validate our theory.
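
The distinction the abstract draws between decoupled weight decay and an $\ell_2$ penalty can be illustrated with a minimal scalar sketch. This is not the authors' code: the function name `adam_step`, its parameters, and the hyperparameter values are assumptions chosen for illustration, following the commonly used form of the Adam update.

```python
import math

def adam_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One scalar Adam update (hypothetical helper for illustration).

    l2 > 0 mimics l2-regularized Adam: the penalty gradient l2 * w is added
    to g, so it enters both moment estimates.
    decoupled_wd > 0 mimics AdamW: the decay lr * decoupled_wd * w is applied
    to the weight directly and never touches the moments.
    """
    g = g + l2 * w                       # l2 penalty changes the gradient
    m = beta1 * m + (1 - beta1) * g      # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * decoupled_wd * w
    return w, m, v

# Same starting point and raw gradient for the three variants.
w_plain, m_plain, v_plain = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
w_l2, m_l2, v_l2 = adam_step(1.0, 0.5, 0.0, 0.0, t=1, l2=0.1)
w_adamw, m_adamw, v_adamw = adam_step(1.0, 0.5, 0.0, 0.0, t=1, decoupled_wd=0.1)
```

In this sketch the AdamW call leaves the moment estimates identical to plain Adam and only shrinks the weight, whereas the $\ell_2$ variant alters both moments through the modified gradient.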

Abstract:
Offline reinforcement learning (RL) aims at learning an optimal policy from a static offline data set, without interacting with the environment. However, the theoretical understanding of the existing offline RL methods needs further studies, among which the conservatism of the learned Q-function and the learned policy is a major issue. In this article, we propose a simple and efficient offline RL with relaxed conservatism (ORL-RC) framework for addressing this concern by learning a Q-function that is close to the true Q-function under the learned policy. The conservatism of learned Q-functions and policies of offline RL methods is analyzed. The analysis results support that the conservatism can lead to policy performance degradation. We establish the convergence results of the proposed ORL-RC, and the bounds of learned Q-functions with and without sampling errors, respectively, suggesting that the gap between the learned Q-function and the true Q-function can be reduced by executing the conservative policy improvement. A practical implementation of ORL-RC is presented and the experimental results on the D4RL benchmark suggest that ORL-RC exhibits superior performance and substantially outperforms existing state-of-the-art offline RL methods.

Abstract:
Deep Neural Network classifiers are vulnerable to adversarial attacks, where an imperceptible perturbation could result in misclassification. However, the vulnerability of DNN-based image ranking systems remains under-explored. In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank of chosen candidates by adversarial perturbations. Specifically, the expected ranking order is first represented as a set of inequalities. Then a triplet-like objective function is designed to obtain the optimal perturbation. Conversely, an anti-collapse triplet defense is proposed to improve the ranking model robustness against all proposed attacks, where the model learns to prevent the adversarial attack from pulling the positive and negative samples close to each other. To comprehensively measure the empirical adversarial robustness of a ranking model with our defense, we propose an empirical robustness score, which involves a set of representative attacks against ranking models. Our adversarial ranking attacks and defenses are evaluated on MNIST, Fashion-MNIST, CUB200-2011, CARS196, and Stanford Online Products datasets. Experimental results demonstrate that our attacks can effectively compromise a typical deep ranking system. Nevertheless, our defense can significantly improve the ranking system's robustness and simultaneously mitigate a wide range of attacks.

Abstract:
Understanding human posture is a challenging topic, which encompasses several tasks, e.g., pose estimation, body mesh recovery and pose tracking. In this article, we propose a novel Distribution-Aware Single-stage (DAS) model for the pose-related tasks. The proposed DAS model estimates human position and localizes joints simultaneously, which requires only a single pass. Meanwhile, we utilize normalizing flow to enable DAS to learn the true distribution of joint locations, rather than making simple Gaussian or Laplacian assumptions. This provides a pivotal prior and greatly boosts the accuracy of regression-based methods, thus making DAS achieve comparable performance to the volumetric-based methods. We also introduce a recursive update strategy to progressively approach the regression target, reducing the difficulty of regression and improving the regression performance. We further adapt DAS to multi-person mesh recovery and pose tracking tasks and achieve considerable performance on both tasks. Comprehensive experiments on CMU Panoptic and MuPoTS-3D demonstrate the superior efficiency of DAS, specifically a 1.5 times speedup over the previous best method, and its state-of-the-art accuracy for multi-person pose estimation. Extensive experiments on 3DPW and PoseTrack2018 indicate the effectiveness and efficiency of DAS for human body mesh recovery and pose tracking, respectively, which demonstrates the generality of our proposed DAS model.

Abstract:
Most visual recognition studies rely heavily on crowd-labelled data in deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address the two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLM that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in the future VLM studies for visual recognition.

Abstract:
Over the past decade, domain adaptation has become a widely studied branch of transfer learning which aims to improve performance on target domains by leveraging knowledge from the source domain. Conventional domain adaptation methods often assume access to both source and target domain data simultaneously, which may not be feasible in real-world scenarios due to privacy and confidentiality concerns. As a result, the research of Source-Free Domain Adaptation (SFDA) has drawn growing attention in recent years, which only utilizes the source-trained model and unlabeled target data to adapt to the target domain. Despite the rapid explosion of SFDA work, there has been no timely and comprehensive survey in the field. To fill this gap, we provide a comprehensive survey of recent advances in SFDA and organize them into a unified categorization scheme based on the framework of transfer learning. Instead of presenting each approach independently, we modularize several components of each method to more clearly illustrate their relationships and mechanisms in light of the composite properties of each method. Furthermore, we compare the results of more than 30 representative SFDA methods on three popular classification benchmarks, namely Office-31, Office-home, and VisDA, to explore the effectiveness of various technical routes and the combination effects among them. Additionally, we briefly introduce the applications of SFDA and related fields. Drawing on our analysis of the challenges confronting SFDA, we offer some insights into future research directions and potential settings.

Abstract:
Conventional cameras capture image irradiance (RAW) on a sensor and convert it to RGB images using an image signal processor (ISP). The images can then be used for photography or visual computing tasks in a variety of applications, such as public safety surveillance and autonomous driving. One can argue that since RAW images contain all the captured information, the conversion of RAW to RGB using an ISP is not necessary for visual computing. In this paper, we propose a novel $\rho$-Vision framework to perform high-level semantic understanding and low-level compression using RAW images without the ISP subsystem used for decades. Considering the scarcity of available RAW image datasets, we first develop an unpaired CycleR2R network based on unsupervised CycleGAN to train modular unrolled ISP and inverse ISP (invISP) models using unpaired RAW and RGB images. We can then flexibly generate simulated RAW images (simRAW) using any existing RGB image dataset and finetune different models originally trained in the RGB domain to process real-world camera RAW images. We demonstrate object detection and image compression capabilities in RAW-domain using RAW-domain YOLOv3 and RAW image compressor (RIC) on camera snapshots. Quantitative results reveal that RAW-domain task inference provides better detection accuracy and compression efficiency compared to that in the RGB domain. Furthermore, the proposed $\rho$-Vision generalizes across various camera sensors and different task-specific models. An added benefit of employing the $\rho$-Vision is the elimination of the need for ISP, leading to potential reductions in computations and processing times.

Abstract:
We present a method for estimating dense continuous-time optical flow from event data. Traditional dense optical flow methods compute the pixel displacement between two images. Due to missing information, these approaches cannot recover the pixel trajectories in the blind time between two images. In this work, we show that it is possible to compute per-pixel, continuous-time optical flow using events from an event camera. Events provide temporally fine-grained information about movement in pixel space due to their asynchronous nature and microsecond response time. We leverage these benefits to predict pixel trajectories densely in continuous time via parameterized Bézier curves. To achieve this, we build a neural network with strong inductive biases for this task: First, we build multiple sequential correlation volumes in time using event data. Second, we use Bézier curves to index these correlation volumes at multiple timestamps along the trajectory. Third, we use the retrieved correlation to update the Bézier curve representations iteratively. Our method can optionally include image pairs to boost performance further. To the best of our knowledge, our model is the first method that can regress dense pixel trajectories from event data. To train and evaluate our model, we introduce a synthetic dataset (MultiFlow) that features moving objects and ground truth trajectories for every pixel. Our quantitative experiments not only suggest that our method successfully predicts pixel trajectories in continuous time but also that it is competitive in the traditional two-view pixel displacement metric on MultiFlow and DSEC-Flow. Open source code and datasets are released to the public.
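
As a rough illustration of how a parameterized Bézier curve yields a pixel position at any timestamp, the sketch below evaluates a curve with de Casteljau's algorithm. The function name `bezier_point` and the control points are hypothetical; in the method described above the curve parameters would come from the network, whereas here they are fixed by hand.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

# Hypothetical control points for one pixel trajectory (x, y in pixels).
control = [(0.0, 0.0), (2.0, 4.0), (4.0, 0.0)]

# Query the trajectory at five normalized timestamps between the two frames.
trajectory = [bezier_point(control, k / 4) for k in range(5)]
```

Because the curve is defined for every `t`, the same set of control points answers queries at arbitrary intermediate times rather than only at the two frame timestamps.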

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Weihai, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; University of Chinese Academy of Sciences, Beijing, China; UBTECH Sydney Artificial Intelligence Centre and the School of Information Technologies, Faculty of Engineering and Information Technologies, The University of Sydney, Darlington, NSW, Australia

Abstract:
Although stereo image restoration has been extensively studied, most existing work focuses on restoring stereo images with limited horizontal parallax due to the binocular symmetry constraint. Stereo images with unlimited parallax (e.g., large ranges and asymmetrical types) are more challenging in real-world applications and have rarely been explored so far. To restore high-quality stereo images with unlimited parallax, this paper proposes an attention-guided correspondence learning method, which learns both self- and cross-views feature correspondence guided by parallax and omnidirectional attention. To learn cross-view feature correspondence, a Selective Parallax Attention Module (SPAM) is proposed to interact with cross-view features under the guidance of parallax attention that adaptively selects receptive fields for different parallax ranges. Furthermore, to handle asymmetrical parallax, we propose a Non-local Omnidirectional Attention Module (NOAM) to learn the non-local correlation of both self- and cross-view contexts, which guides the aggregation of global contextual features. Finally, we propose an Attention-guided Correspondence Learning Restoration Network (ACLRNet) upon SPAMs and NOAMs to restore stereo images by associating the features of two views based on the learned correspondence. Extensive experiments on five benchmark datasets demonstrate the effectiveness and generalization of the proposed method on three stereo image restoration tasks including super-resolution, denoising, and compression artifact reduction.

Abstract:
Deep neural networks have become prevalent in human analysis, boosting the performance of applications, such as biometric recognition, action recognition, as well as person re-identification. However, the performance of such networks scales with the available training data. In human analysis, the demand for large-scale datasets poses a severe challenge, as data collection is tedious, time-expensive, costly and must comply with data protection laws. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. This survey introduces the basic definitions and methodologies, essential when generating and employing synthetic data for human analysis. We summarise current state-of-the-art methods and the main benefits of using synthetic data. We also provide an overview of publicly available synthetic datasets and generation models. Finally, we discuss limitations, as well as open research problems in this field. This survey is intended for researchers and practitioners in the field of human analysis.

Abstract:
Distribution comparison plays a central role in many machine learning tasks like data classification and generative modeling. In this study, we propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions with low complexity. In particular, we first project two high-dimensional probability distributions using Hilbert curve to obtain a coupling between them, and then calculate the transport distance between these two distributions in the original space, according to the coupling. We show that HCP distance is a proper metric and is well-defined for probability measures with bounded supports. Furthermore, we demonstrate that the modified empirical HCP distance with the $L_p$ cost in the $d$-dimensional space converges to its population counterpart at a rate of no more than $O(n^{-\frac{1}{2\max\{d,p\}}})$. To suppress the curse-of-dimensionality, we also develop two variants of the HCP distance using (learnable) subspace projections. Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity and overcomes the drawbacks of the sliced Wasserstein distance.
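
The two-step idea described above (order samples along a Hilbert curve to obtain a coupling, then measure the transport cost in the original space) can be sketched for a toy case. The sketch assumes two equal-size samples of integer points on a 2D grid whose side `n` is a power of two; `hilbert_index` and `hcp_distance` are illustrative names, and the general construction in the paper is not limited to this setting.

```python
import math

def hilbert_index(n, x, y):
    """Position of integer cell (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            # Rotate/reflect the quadrant so the sub-curve is oriented correctly.
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hcp_distance(xs, ys, n, p=2):
    """Toy empirical distance: couple samples by Hilbert order, cost in 2D."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    key = lambda pt: hilbert_index(n, pt[0], pt[1])
    xs_sorted = sorted(xs, key=key)
    ys_sorted = sorted(ys, key=key)
    # The k-th point of one sample is matched with the k-th point of the other.
    total = sum(math.dist(a, b) ** p for a, b in zip(xs_sorted, ys_sorted))
    return (total / len(xs)) ** (1.0 / p)

sample_a = [(0, 0), (1, 1), (3, 0)]
sample_b = [(0, 1), (2, 2), (3, 3)]
value = hcp_distance(sample_a, sample_b, n=4)
```

Sorting dominates the cost here, which is the source of the low complexity relative to solving a full optimal transport problem.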

Abstract:
In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods.

Abstract:
Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, in this paper, we present a compact tracking framework, termed as MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between target and search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers, a hierarchical tracker MixCvT, and a non-hierarchical simple tracker MixViT. For these two trackers, we investigate a series of pre-training methods and uncover the different behaviors between supervised pre-training and self-supervised pre-training in our MixFormer trackers. We also extend the masked autoencoder pre-training to our MixFormer trackers and design the new competitive TrackMAE pre-training technique. Finally, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer trackers set a new state-of-the-art performance on seven tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, OTB100, TOTB and UAV123. In particular, our MixViT-L achieves AUC scores of 73.3% on LaSOT, 86.1% on TrackingNet and 82.8% on TOTB.

Abstract:
Self-supervised node representation learning aims to learn node representations from unlabelled graphs that rival the supervised counterparts. The key towards learning informative node representations lies in how to effectively gain contextual information from the graph structure. In this work, we present simple-yet-effective self-supervised node representation learning via aligning the hidden representations of nodes and their neighbourhood. Our first idea achieves such node-to-neighbourhood alignment by directly maximizing the mutual information between their representations, which, we prove theoretically, plays the role of graph smoothing. Our framework is optimized via a surrogate contrastive loss and a Topology-Aware Positive Sampling (TAPS) strategy is proposed to sample positives by considering the structural dependencies between nodes, which enables offline positive selection. Considering the excessive memory overheads of contrastive learning, we further propose a negative-free solution, where the main contribution is a Graph Signal Decorrelation (GSD) constraint to avoid representation collapse and over-smoothing. The GSD constraint unifies some of the existing constraints and can be used to derive new implementations to combat representation collapse. By applying our methods on top of simple MLP-based node representation encoders, we learn node representations that achieve promising node classification performance on a set of graph-structured datasets from small- to large-scale.

Abstract:
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves $-0.01$ CMOS (comparative mean opinion score) to human recordings at the sentence level, with Wilcoxon signed rank test at p-level $p \gg 0.05$, which demonstrates no statistically significant difference from human recordings for the first time.

Abstract:
Despite great strides made on fine-grained visual classification (FGVC), current methods are still heavily reliant on fully-supervised paradigms where ample expert labels are called for. Semi-supervised learning (SSL) techniques, acquiring knowledge from unlabeled data, provide a considerable means forward and have shown great promise for coarse-grained problems. However, existing SSL paradigms mostly assume in-category (i.e., category-aligned) unlabeled data, which hinders their effectiveness when repurposed for FGVC. In this paper, we put forward a novel design specifically aimed at making out-of-category data work for semi-supervised FGVC. We work off an important assumption that all fine-grained categories naturally follow a hierarchical structure (e.g., the phylogenetic tree of “Aves” that covers all bird species). It follows that, instead of operating on individual samples, we can predict sample relations within this tree structure as the optimization goal of SSL. Beyond this, we further introduce two strategies uniquely brought by these tree structures to achieve inter-sample consistency regularization and reliable pseudo-relation. Our experimental results reveal that (i) the proposed method yields good robustness against out-of-category data, and (ii) it can be equipped with prior art, boosting its performance and thus yielding state-of-the-art results.

Abstract:
Cloth-changing person reidentification (ReID) is a newly emerging research topic aimed at addressing the issues of large feature variations due to cloth-changing and pedestrian view/pose changes. Although significant progress has been achieved by introducing extra information (e.g., human contour sketching information, human body keypoints, and 3D human information), cloth-changing person ReID remains challenging because pedestrian appearance representations can change at any time. Moreover, human semantic information and pedestrian identity information are not fully explored. To solve these issues, we propose a novel identity-guided collaborative learning scheme (IGCL) for cloth-changing person ReID, in which human semantic information is effectively utilized and the unchanging identity is used to guide collaborative learning. First, we design a novel clothing attention degradation stream to reasonably reduce the interference caused by clothing information, where clothing attention and mid-level collaborative learning are employed. Second, we propose a human semantic attention and body jigsaw stream to highlight the human semantic information and simulate different poses of the same identity. In this way, the extracted features not only focus on human semantic information that is unrelated to the background but are also suitable for pedestrian pose variations. Moreover, a pedestrian identity enhancement stream is proposed to enhance the identity importance and extract more favorable identity robust features. Most importantly, all these streams are jointly explored in an end-to-end unified framework, and the identity is utilized to guide the optimization. Extensive experiments on six public cloth-changing person ReID datasets (LaST, LTCC, PRCC, NKUP, Celeb-reID-light, and VC-Clothes) demonstrate the superiority of the IGCL method. It outperforms existing methods on multiple datasets, and the extracted features have stronger representation and discrimination ability and are weakly correlated with clothing.

Abstract:
Fast person re-identification (ReID) aims to search person images quickly and accurately. The main idea of recent fast ReID methods is the hashing algorithm, which learns compact binary codes and performs fast Hamming distance computation and counting sort. However, a very long code is needed for high accuracy (e.g., 2048), which compromises search speed. In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which complementarily uses short and long codes, achieving both faster speed and better accuracy. It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID. Specifically, we design an All-in-One (AiO) module together with a Distance Threshold Optimization (DTO) algorithm. In AiO, we simultaneously learn and enhance multiple codes of different lengths in a single model. It learns multiple codes in a pyramid structure, and encourages shorter codes to mimic longer codes by self-distillation. DTO solves a complex threshold search problem by a simple optimization process, and the balance between accuracy and speed is easily controlled by a single parameter. It formulates the optimization target as an $F_\beta$ score that can be optimised by Gaussian cumulative distribution functions. Besides, we find that even a short code (e.g., 32) still takes a long time under a large-scale gallery due to the $O(n)$ time complexity. To solve the problem, we propose a gallery-size-free latent-attributes-based One-Shot-Filter (OSF) strategy, which always has $O(1)$ time complexity, to quickly filter most easy negative gallery images. Specifically, we design a Latent-Attribute-Learning (LAL) module supervised by a Single-Direction-Metric (SDM) Loss. LAL is derived from principal component analysis (PCA), which keeps the largest variance using the shortest feature vector, while enabling batch and end-to-end learning. Every logit of a feature vector represents a meaningful attribute. SDM is carefully designed for fine-grained attribute supervision, outperforming common metrics such as Euclidean and Cosine metrics. Experimental results on 2 datasets show that CtF+OSF is not only 2% more accurate but also 5× faster than contemporary hashing ReID methods. Compared with non-hashing ReID methods, CtF is 50× faster with comparable accuracy. OSF further speeds up CtF by 2×, and by up to 10× in total, with almost no accuracy drop.
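
The coarse-to-fine search idea (rank the whole gallery with short codes, then re-rank only a few candidates with long codes) can be sketched as follows. This is a simplified illustration and not the paper's algorithm: it keeps a fixed number of candidates (`top_k`, a hypothetical parameter) instead of the optimized distance thresholds described above, and uses Python's built-in sort instead of counting sort. The codes and item names are made up.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

def coarse_to_fine_search(query, gallery, top_k=2):
    """query: (short_code, long_code); gallery: list of (item_id, short_code, long_code)."""
    q_short, q_long = query
    # Coarse stage: rank every gallery item with the cheap short code.
    coarse = sorted(gallery, key=lambda item: hamming(q_short, item[1]))
    shortlist = coarse[:top_k]
    # Fine stage: re-rank only the shortlist with the expensive long code.
    refined = sorted(shortlist, key=lambda item: hamming(q_long, item[2]))
    return [item[0] for item in refined]

# Hypothetical 4-bit short codes and 8-bit long codes.
query = (0b1010, 0b10101100)
gallery = [
    ("a", 0b1010, 0b10100011),  # identical short code, long code differs in 4 bits
    ("b", 0b1011, 0b10101100),  # short code off by 1 bit, long code identical
    ("c", 0b0101, 0b10101100),  # short code off by 4 bits, dropped in the coarse stage
]
ranking = coarse_to_fine_search(query, gallery, top_k=2)
```

Only the shortlist pays for long-code comparisons, which is where the speed gain in such a scheme comes from; the example also shows the trade-off, since item "c" is discarded early despite a matching long code.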

Abstract:
The current success of Graph Neural Networks (GNNs) usually relies on loading the entire attributed graph for processing, which may be infeasible with limited memory resources, especially when the attributed graph is large. This paper is the first to propose a Binary Graph Convolutional Network (Bi-GCN), which binarizes both the network parameters and input node attributes and exploits binary operations instead of floating-point matrix multiplications for network compression and acceleration. Meanwhile, we also propose a new gradient approximation based back-propagation method to properly train our Bi-GCN. According to the theoretical analysis, our Bi-GCN can reduce the memory consumption by an average of $\sim$31x for both the network parameters and input data, and accelerate the inference speed by an average of $\sim$51x, on three citation networks, i.e., Cora, PubMed, and CiteSeer. Besides, we introduce a general approach to generalize our binarization method to other variants of GNNs, and achieve similar efficiencies. Although the proposed Bi-GCN and Bi-GNNs are simple yet efficient, these compressed networks may also possess a potential capacity problem, i.e., they may not have enough storage capacity to learn adequate representations for specific tasks. To tackle this capacity problem, an Entropy Cover Hypothesis is proposed to predict the lower bound of the width of Bi-GNN hidden layers. Extensive experiments have demonstrated that our Bi-GCN and Bi-GNNs can give comparable performances to the corresponding full-precision baselines on seven node classification datasets and verified the effectiveness of our Entropy Cover Hypothesis for solving the capacity problem.

Abstract:
Vision language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision language alignments at the object level. In this paper, we propose to learn multi-grained vision language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present $X^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. $X^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that $X^2$-VLM performs the best on base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of $X^2$-VLM results in high transferability for it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, $X^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training.

Abstract:
Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident for their incorrect predictions, i.e., misclassified samples from known classes, and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction or not. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between understanding calibration, OOD detection, and failure prediction.

Abstract:
Visual scenes are extremely diverse, not only because there are infinite possible combinations of objects and backgrounds but also because the observations of the same scene may vary greatly with the change of viewpoints. When observing a multi-object visual scene from multiple viewpoints, humans can perceive the scene compositionally from each viewpoint while achieving the so-called “object constancy” across different viewpoints, even though the exact viewpoints are untold. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have a similar ability. In this article, we consider a novel problem of learning compositional scene representations from multiple unspecified (i.e., unknown and unrelated) viewpoints without using any supervision and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. During the inference, latent representations are randomly initialized and iteratively updated by integrating the information in different viewpoints with neural networks. Experiments on several specifically designed synthetic datasets have shown that the proposed method can effectively learn from multiple unspecified viewpoints.

Abstract:
This paper focuses on the problem of semi-supervised domain adaptation for time-series forecasting, which is underexplored in literature, despite being often encountered in practice. Existing methods on time-series domain adaptation mainly follow the paradigm designed for static data, which cannot handle domain-specific complex conditional dependencies raised by data offset, time lags, and variant data distributions. In order to address these challenges, we analyze variational conditional dependencies in time-series data and find that the causal structures are usually stable among domains, and further raise the causal conditional shift assumption. Enlightened by this assumption, we consider the causal generation process for time-series data and propose an end-to-end model for the semi-supervised domain adaptation problem on time-series forecasting. Our method can not only discover the Granger-Causal structures among cross-domain data but also address the cross-domain time-series forecasting problem with accurate and interpretable predicted results. We further theoretically analyze the superiority of the proposed method, where the generalization error on the target domain is bounded by the empirical risks and by the discrepancy between the causal structures from different domains. Experimental results on both synthetic and real data demonstrate the effectiveness of our method for the semi-supervised domain adaptation method on time-series forecasting.

Abstract:
The widespread success of deep learning in solving machine learning problems has fueled its adoption in many fields, from speech recognition to drug discovery and medical imaging. However, deep learning systems are extremely fragile: imperceptibly small modifications to their input data can cause the models to produce erroneous output. It is very easy to generate such adversarial perturbations even for state-of-the-art models, yet immunization against them has proven exceptionally challenging. Despite over a decade of research on this problem, our solutions are still far from satisfactory and many open problems remain. In this work, we survey some of the most important contributions in the field of adversarial robustness. We pay particular attention to the reasons why past attempts at improving robustness have been insufficient, and we identify several promising areas for future research.

Abstract:
Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up, two-stage or point-based, one-stage approach, which often suffers from high time complexity or suboptimal designs. In this paper, we propose a novel SGG method to address the aforementioned issues, formulating the task as a bipartite graph construction problem. Specifically, we create a transformer-based end-to-end framework to generate the entity and entity-aware predicate proposal set, and infer directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Based on the bipartite graph assembling paradigm, we further propose a new technical design to improve the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves optimal performance and time complexity. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on three challenging benchmarks, surpassing most of the existing approaches and enjoying higher efficiency in inference.

Abstract:
Human-Object Interaction (HOI), as an important problem in computer vision, requires locating the human-object pair and identifying the interactive relationships between them. The HOI instance has a greater span in space, scale, and task than the individual object instance, making its detection more susceptible to noisy backgrounds. To alleviate the disturbance of noisy backgrounds on HOI detection, it is necessary to consider the input image information to generate fine-grained anchors which are then leveraged to guide the detection of HOI instances. However, this raises two challenges: i) how to extract pivotal features from images with complex background information is still an open question; ii) how to semantically align the extracted features and query embeddings is also a difficult issue. In this paper, a novel end-to-end transformer-based framework (FGAHOI) is proposed to alleviate the above problems. FGAHOI comprises three dedicated components, namely multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM), and a task-aware merging mechanism (TAM). MSS extracts features of humans, objects and interaction areas from noisy backgrounds for HOI instances of various scales. HSAM and TAM semantically align and merge the extracted features and query embeddings in the hierarchical spatial and task perspectives in turn. Meanwhile, a novel Stage-wise Training Strategy is designed to reduce the training pressure caused by the overly complex tasks handled by FGAHOI. In addition, we propose two ways to measure the difficulty of HOI detection and a novel dataset, i.e., HOI-SDC, for the two challenges (Uneven Distributed Area in Human-Object Pairs and Long Distance Visual Modeling of Human-Object Pairs) of HOI instance detection. Experiments are conducted on three benchmarks: HICO-DET, HOI-SDC and V-COCO. Our model outperforms the state-of-the-art HOI detection methods, and the extensive ablations reveal the merits of our proposed contribution.

Abstract:
Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows that there is still considerable room for building a stronger vision learner. Toward this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches where the online branch is an asymmetric encoder-decoder and the momentum branch is a momentum updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances the feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components, i.e., pixel shifting for generating plausible positive views and feature decoder for complementing features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves the representation quality and transfer performance over its MIM counterpart. CMAE achieves the state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves 85.3% top-1 accuracy on ImageNet and 52.5% mIoU on ADE20k, surpassing previous best results by 0.7% and 1.8% respectively.
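The phrase "momentum updated encoder" is not spelled out in the abstract. A minimal sketch, assuming the exponential-moving-average rule commonly used for momentum encoders in contrastive learning, is given below. The function name `momentum_update`, the coefficient `m`, and the flat parameter lists are hypothetical and chosen only for illustration.

```python
def momentum_update(momentum_params, online_params, m=0.99):
    """Exponential moving average update (assumed rule, not confirmed by the abstract).

    Each momentum parameter moves a small step toward its online counterpart:
        theta_momentum <- m * theta_momentum + (1 - m) * theta_online
    Parameters are plain lists of floats here to keep the sketch self-contained.
    """
    if len(momentum_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [m * pm + (1.0 - m) * po for pm, po in zip(momentum_params, online_params)]


# Hypothetical parameters: the momentum branch lags behind the online branch.
momentum_branch = [1.0, 2.0]
online_branch = [0.0, 4.0]
momentum_branch = momentum_update(momentum_branch, online_branch, m=0.9)
print(momentum_branch)
```

With `m` close to 1 the momentum branch changes slowly, which gives the online branch a stable target for the contrastive objective.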

Abstract:
We propose self-adaptive training—a unified training algorithm that dynamically calibrates and enhances training processes by model predictions without incurring an extra computational cost—to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data and this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training processes: self-adaptive training improves the generalization of deep networks under noise and enhances the self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently-discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of the state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.

Abstract:
The multiple kernel alignment (MKA) maximization criterion has been widely applied to multiple kernel clustering (MKC), and many variants have been recently developed. Although these variants demonstrate superior clustering performance in various applications, none of them can effectively handle incomplete MKC, where some or all of the pre-specified base kernel matrices are incomplete. To address this issue, we propose to integrate the imputation of incomplete kernel matrices and MKA maximization for clustering into a unified learning framework. The clustering of MKA maximization guides the imputation of incomplete kernel elements, and the completed kernel matrices are in turn combined to conduct the subsequent MKC. These two procedures are alternately performed until convergence. In this way, the imputation and MKC processes are seamlessly connected, with the aim of achieving better clustering performance. Besides theoretically analyzing the clustering generalization error bound, we empirically evaluate the clustering performance on several multiple kernel learning (MKL) benchmark datasets, and the results indicate the superiority of our algorithm over existing state-of-the-art counterparts. Our codes and data are publicly available at https://xinwangliu.github.io/.
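For readers unfamiliar with kernel alignment, the sketch below computes the classic uncentered alignment between two kernel matrices, i.e., their Frobenius inner product normalized by their Frobenius norms. This is a generic illustration and not the unified objective proposed in the abstract; the function names and the toy matrices are assumptions.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product of two equally sized matrices given as lists of rows."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def kernel_alignment(k1, k2):
    """Uncentered kernel alignment: <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    denom = math.sqrt(frobenius_inner(k1, k1) * frobenius_inner(k2, k2))
    if denom == 0.0:
        raise ValueError("alignment is undefined for an all-zero kernel matrix")
    return frobenius_inner(k1, k2) / denom


# Hypothetical 2x2 kernels: an identity kernel and a kernel that puts both samples in one cluster.
identity_kernel = [[1.0, 0.0], [0.0, 1.0]]
one_cluster_kernel = [[1.0, 1.0], [1.0, 1.0]]
print(kernel_alignment(identity_kernel, identity_kernel))
print(kernel_alignment(identity_kernel, one_cluster_kernel))
```

Alignment is invariant to positive rescaling of either kernel, so it measures agreement in structure rather than magnitude. For positive semidefinite kernels it lies between 0 and 1.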

Abstract:
We propose Recognition as Part Composition (RPC), an image encoding approach inspired by human cognition. It is based on the cognitive theory that humans recognize complex objects by components, and that they build a small compact vocabulary of concepts to represent each instance with. RPC encodes images by first decomposing them into salient parts, and then encoding each part as a mixture of a small number of prototypes, each representing a certain concept. We find that this type of learning inspired by human cognition can overcome hurdles faced by deep convolutional networks in low-shot generalization tasks, like zero-shot learning, few-shot learning and unsupervised domain adaptation. Furthermore, we find a classifier using an RPC image encoder is fairly robust to adversarial attacks, that deep neural networks are known to be prone to. Given that our image encoding principle is based on human cognition, one would expect the encodings to be interpretable by humans, which we find to be the case via crowd-sourcing experiments. Finally, we propose an application of these interpretable encodings in the form of generating synthetic attribute annotations for evaluating zero-shot learning methods on new datasets.

Abstract:
As autonomous decision-making agents move from narrow operating environments to unstructured worlds, learning systems must move from a closed-world formulation to an open-world and few-shot setting in which agents continuously learn new classes from small amounts of information. This stands in stark contrast to modern machine learning systems that are typically designed with a known set of classes and a large number of examples for each class. In this work we extend embedding-based few-shot learning algorithms to the open-world recognition setting. We combine Bayesian non-parametric class priors with an embedding-based pre-training scheme to yield a highly flexible framework which we refer to as few-shot learning for open world recognition (FLOWR). We benchmark our framework on open-world extensions of the common MiniImageNet and TieredImageNet few-shot learning datasets. Our results show, compared to prior methods, strong classification accuracy performance and up to a 12% improvement in H-measure (a measure of novel class detection) from our non-parametric open-world few-shot learning scheme.

Abstract:
Zero-shot object detection (ZSD), the task that extends conventional detection models to detecting objects from unseen categories, has emerged as a new challenge in computer vision. Most existing approaches tackle the ZSD task with a strict mapping-transfer strategy that may lead to suboptimal ZSD results: 1) the learning process of these models neglects the available semantic information on unseen classes, which can easily bias towards the seen categories; 2) the original visual feature space is not well-structured for the ZSD task due to the lack of discriminative information. To address these issues, we develop a novel Semantics-Guided Contrastive Network for ZSD, named ContrastZSD, a detection framework that first brings contrastive learning mechanism into the realm of zero-shot detection. Particularly, ContrastZSD incorporates two semantics-guided contrastive learning subnets that contrast between region-category and region-region pairs respectively. The pairwise contrastive tasks take advantage of supervision signals derived from both the ground truth label and class similarity information. By performing supervised contrastive learning over those explicit semantic supervision, the model can learn more knowledge about unseen categories to avoid the bias problem to seen concepts, while optimizing the visual data structure to be more discriminative for better visual-semantic alignment. Extensive experiments are conducted on two popular benchmarks for ZSD, i.e., PASCAL VOC and MS COCO. Results show that our method outperforms the previous state-of-the-art on both ZSD and generalized ZSD tasks.

Abstract:
In the task incremental learning problem, deep learning models suffer from catastrophic forgetting of previously seen classes/tasks as they are trained on new classes/tasks. This problem becomes even harder when some of the test classes do not belong to the training class set, i.e., the task incremental generalized zero-shot learning problem. We propose a novel approach to address the task incremental learning problem for both the non zero-shot and zero-shot settings. Our proposed approach, called Rectification-based Knowledge Retention (RKR), applies weight rectifications and affine transformations for adapting the model to any task. During testing, our approach can use the task label information (task-aware) to quickly adapt the network to that task. We also extend our approach to make it task-agnostic so that it can work even when the task label information is not available during testing. Specifically, given a continuum of test data, our approach predicts the task and quickly adapts the network to the predicted task. We experimentally show that our proposed approach achieves state-of-the-art results on several benchmark datasets for both non zero-shot and zero-shot task incremental learning.

Abstract:
The emergence of Graph Convolutional Network (GCN) has greatly boosted the progress of graph learning. However, two disturbing factors, noise and redundancy in graph data, and lack of interpretation for prediction results, impede further development of GCN. One solution is to recognize a predictive yet compressed subgraph to get rid of the noise and redundancy and obtain the interpretable part of the graph. This setting of subgraph is similar to the information bottleneck (IB) principle, which is less studied on graph-structured data and GCN. Inspired by the IB principle, we propose a novel subgraph information bottleneck (SIB) framework to recognize such subgraphs, named IB-subgraph. However, the intractability of mutual information and the discrete nature of graph data makes the objective of SIB notoriously hard to optimize. To this end, we introduce a bilevel optimization scheme coupled with a mutual information estimator for irregular graphs. Moreover, we propose a continuous relaxation for subgraph selection with a connectivity loss for stabilization. We further theoretically prove the error bound of our estimation scheme for mutual information and the noise-invariant nature of IB-subgraph. Extensive experiments on graph learning and large-scale point cloud tasks demonstrate the superior property of IB-subgraph.

Abstract:
Most unsupervised person Re-Identification (ReID) works produce pseudo-labels by measuring feature similarity without considering the domain discrepancy among cameras, leading to degraded accuracy in pseudo-label computation across cameras. This paper aims to address this challenge by decomposing the similarity computation into two stages, i.e., the intra-domain and inter-domain computations. The intra-domain similarity directly leverages CNN features learned within each camera, and hence generates pseudo-labels on different cameras to train the ReID model in a multi-branch network. The inter-domain similarity considers the classification scores of each sample on different cameras as a new feature vector. This new feature effectively alleviates the domain discrepancy among cameras and generates more reliable pseudo-labels. We further propose the Instance and Camera Style Normalization (ICSN) to enhance the robustness to domain discrepancy. ICSN alleviates the intra-camera variations by adaptively learning a combination of instance and batch normalization. ICSN also boosts the robustness to inter-camera variations through TNorm which converts the original style of features into target styles. The proposed method achieves competitive performance on multiple datasets under fully unsupervised, intra-camera supervised and domain generalization settings, e.g., it achieves rank-1 accuracy of 64.4% on the MSMT17 dataset, outperforming the recent unsupervised methods by 20+%.

Abstract:
6D object pose estimation is a fundamental yet challenging problem in computer vision. Convolutional Neural Networks (CNNs) have recently proven to be capable of predicting reliable 6D pose estimates even under monocular settings. Nonetheless, CNNs are identified as being extremely data-driven, and acquiring adequate annotations is oftentimes very time-consuming and labor intensive. To overcome this limitation, we propose a novel monocular 6D pose estimation approach by means of self-supervised learning, removing the need for real annotations. After training our proposed network fully supervised with synthetic RGB data, we leverage current trends in noisy student training and differentiable rendering to further self-supervise the model on unannotated real RGB(-D) samples, seeking a visually and geometrically optimal alignment. Moreover, employing both visible and amodal mask information, our self-supervision becomes very robust towards challenging scenarios such as occlusion. Extensive evaluations demonstrate that our proposed self-supervision outperforms all other methods relying on synthetic data or employing elaborate techniques from the domain adaptation realm. Notably, our self-supervised approach consistently improves over its synthetically trained baseline and often almost closes the gap towards its fully supervised counterpart.

Abstract:
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments. The source code is available at https://github.com/SCLBD/MCG-Blackbox.

Abstract:
Real world data often exhibits a long-tailed and open-ended (i.e., with unseen classes) distribution. A practical recognition system must balance between majority (head) and minority (tail) classes, generalize across the distribution, and acknowledge novelty upon the instances of unseen classes (open classes). We define Open Long-Tailed Recognition++ (OLTR++) as learning from such naturally distributed data and optimizing for the classification accuracy over a balanced test set which includes both known and open classes. OLTR++ handles imbalanced classification, few-shot learning, open-set recognition, and active learning in one integrated algorithm, whereas existing classification approaches often focus only on one or two aspects and deliver poorly over the entire spectrum. The key challenges are: 1) how to share visual knowledge between head and tail classes, 2) how to reduce confusion between tail and open classes, and 3) how to actively explore open classes with learned knowledge. Our algorithm, OLTR++, maps images to a feature space such that visual concepts can relate to each other through a memory association mechanism and a learned metric (dynamic meta-embedding) that both respects the closed world classification of seen classes and acknowledges the novelty of open classes. Additionally, we propose an active learning scheme based on visual memory, which learns to recognize open classes in a data-efficient manner for future expansions. On three large-scale open long-tailed datasets we curated from ImageNet (object-centric), Places (scene-centric), and MS1M (face-centric) data, as well as three standard benchmarks (CIFAR-10-LT, CIFAR-100-LT, and iNaturalist-18), our approach, as a unified framework, consistently demonstrates competitive performance. Notably, our approach also shows strong potential for the active exploration of open classes and the fairness analysis of minority groups.

Abstract:
Zero-shot learning (ZSL) aims to recognize objects from unseen classes based only on labeled images from seen classes. Most existing ZSL methods focus on optimizing feature spaces or generating visual features of unseen classes, both in conventional ZSL and generalized zero-shot learning (GZSL). However, since the learned feature spaces are suboptimal, there exist many virtual connections where visual features and semantic attributes do not correspond to each other. To reduce virtual connections, in this paper, we propose to discover comprehensive and fine-grained object parts by building explanatory graphs based on convolutional feature maps, then aggregate object parts to train a part-net to obtain prediction results. Since the aggregated object parts contain comprehensive visual features for activating semantic attributes, the virtual connections can be reduced to a large extent. Since part-net aims to extract local fine-grained visual features, some attributes related to global structures are ignored. To take advantage of both local and global visual features, we design a feature distiller to distill local features into a master-net which aims to extract global features. The experimental results on the AWA2, CUB, FLO, and SUN datasets demonstrate that our proposed method clearly outperforms state-of-the-art methods in both conventional ZSL and GZSL tasks.

Abstract:
Steepest descent algorithms, which are commonly used in deep learning, use the gradient as the descent direction, either as-is or after a direction shift using preconditioning. In many scenarios calculating the gradient is numerically hard due to complex or non-differentiable cost functions, specifically next to singular points. This has been commonly overcome by increased DNN model sizes and complexity. In this work we propose a novel mechanism we refer to as Cost Unrolling, for improving the ability of a given DNN model to solve a complex cost function, without modifying its architecture or increasing computational complexity. We focus on the derivation of the Total Variation (TV) smoothness constraint commonly used in unsupervised cost functions. We introduce an iterative differentiable alternative to the TV smoothness constraint, which is demonstrated to produce more stable gradients during training, enable faster convergence and improve the predictions of a given DNN model. We test our method in several tasks, including image denoising and unsupervised optical flow. Replacing the TV smoothness constraint with our loss during DNN training, we report improved results in all tested scenarios. Specifically, our method improves flows predicted at occluded regions, a crucial task by itself, resulting in sharper motion boundaries.
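As background for the Total Variation (TV) smoothness constraint mentioned above, the sketch below computes the standard anisotropic TV of a 2D grid: the sum of absolute differences between vertically and horizontally adjacent values. It illustrates the conventional constraint only and is not the Cost Unrolling alternative, whose details the abstract does not give. The function name and the toy grids are assumptions.

```python
def total_variation(image):
    """Anisotropic total variation of a 2D grid given as a list of equally long rows.

    Sums |difference| over vertically and horizontally adjacent pairs. The absolute
    value is not differentiable where neighbours are equal, which is one example of
    the kind of singular point that makes gradients of such a cost hard to compute.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:  # vertical neighbour
                tv += abs(image[i + 1][j] - image[i][j])
            if j + 1 < cols:  # horizontal neighbour
                tv += abs(image[i][j + 1] - image[i][j])
    return tv


# Hypothetical 2x2 grids: a constant patch and a patch with one vertical edge.
flat_patch = [[1.0, 1.0], [1.0, 1.0]]
edge_patch = [[0.0, 1.0], [0.0, 1.0]]
print(total_variation(flat_patch))  # 0.0
print(total_variation(edge_patch))  # 2.0
```

A constant patch has zero TV, while each unit jump between neighbours adds one, so minimizing TV favours piecewise-smooth predictions such as flow fields with sharp motion boundaries.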

Abstract:
Image matting is a fundamental and challenging problem in computer vision and graphics. Most existing matting methods leverage a user-supplied trimap as an auxiliary input to produce a good alpha matte. However, obtaining a high-quality trimap is itself arduous. Recently, some hint-free methods have emerged; however, their matting quality is still far behind that of trimap-based methods. The main reason is that some hints are essential for removing semantic ambiguity and improving matting quality. There is thus a trade-off between interaction cost and matting quality. To balance performance and user-friendliness, we propose an improved deep image matting framework which is trimap-free and only needs sparse user click or scribble interaction to minimize the needed auxiliary constraints while still allowing interactivity. Moreover, we introduce uncertainty estimation that predicts which parts need polishing and conduct uncertainty-guided refinement. To trade off runtime against refinement quality, users can also choose different refinement modes. Experimental results show that our method performs better than existing trimap-free methods and comparably to state-of-the-art trimap-based methods with minimal user effort. Finally, we demonstrate the extensibility of our framework to video human matting without any structure modification, by adding optical flow-based sparse hint propagation and temporal consistency regularization imposed on the single frame.

Abstract:
Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module to embed corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against the state-of-the-art models on three widely-used benchmark datasets, including microsoft research video description corpus (MSVD), MSR-video to text (MSR-VTT), and video-and-TEXt (VATEX).

Abstract:
Offline reinforcement learning (RL) harnesses the power of massive datasets for resolving sequential decision problems. Most existing papers only discuss defending against out-of-distribution (OOD) actions while we investigate a broader issue, the false correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL). The proposed algorithm introduces an annealing behavior cloning regularizer to help produce a high-quality estimation of uncertainty which is critical for eliminating false correlations from suboptimality. Theoretically, we justify the rationality of the proposed method and prove its convergence to the optimal policy with a sublinear rate under mild assumptions.

Abstract:
As an effective tool for network compression, pruning techniques have been widely used to reduce the large number of parameters in deep neural networks (NNs). Nevertheless, unstructured pruning is limited by having to deal with sparse and irregular weights. By contrast, structured pruning can help eliminate this drawback, but it requires complex criteria to determine which components to prune. Therefore, this paper presents a new method termed BUnit-Net, which directly constructs compact NNs by stacking designed basic units, without requiring additional judgement criteria. Given the basic units of various architectures, they are combined and stacked systematically to build up compact NNs which involve fewer weight parameters due to the independence among the units. In this way, BUnit-Net can achieve the same compression effect as unstructured pruning while the weight tensors remain regular and dense. We formulate BUnit-Net in diverse popular backbones in comparison with the state-of-the-art pruning methods on different benchmark datasets. Moreover, two new metrics are proposed to evaluate the trade-off of compression performance. Experimental results show that BUnit-Net can achieve comparable classification accuracy while saving around 80% FLOPs and 73% parameters. That is, stacking basic units provides a promising new way for network compression.

Abstract:
Ensuring safety and achieving human-level driving performance remain challenges for autonomous vehicles, especially in safety-critical situations. As a key component of artificial intelligence, reinforcement learning is promising and has shown great potential in many complex tasks; however, its lack of safety guarantees limits its real-world applicability. Hence, further advancing reinforcement learning, especially from the safety perspective, is of great importance for autonomous driving. As revealed by cognitive neuroscientists, the amygdala of the brain can elicit defensive responses against threats or hazards, which is crucial for survival in and adaptation to risky environments. Drawing inspiration from this scientific discovery, we present a fear-neuro-inspired reinforcement learning framework to realize safe autonomous driving through modeling the amygdala functionality. This new technique facilitates an agent to learn defensive behaviors and achieve safe decision making with fewer safety violations. Through experimental tests, we show that the proposed approach enables the autonomous driving agent to attain state-of-the-art performance compared to the baseline agents and perform comparably to 30 certified human drivers, across various safety-critical scenarios. The results demonstrate the feasibility and effectiveness of our framework while also shedding light on the crucial role of simulating the amygdala function in the application of reinforcement learning to safety-critical autonomous driving domains.

Abstract:
Networks are used as highly expressive tools in different disciplines. In recent years, the analysis and mining of temporal networks have attracted substantial attention. Frequent pattern mining is considered an essential task in the network science literature. In addition to the numerous applications, the investigation of frequent pattern mining in networks directly impacts other analytical approaches, such as clustering, quasi-clique and clique mining, and link prediction. In nearly all the algorithms proposed for frequent pattern mining in temporal networks, the networks are represented as sequences of static networks. Then, the inter- or intra-network patterns are mined. This type of representation imposes a computation-expressiveness trade-off to the mining problem. In this paper, we propose a novel representation that can preserve the temporal aspects of the network losslessly. Then, we introduce the concept of constrained interval graphs (CIGs). Next, we develop a series of algorithms for mining the complete set of frequent temporal patterns in a temporal network data set. We also consider four different definitions of isomorphism for accommodating minor variations in temporal data of networks. Implementing the algorithm for three real-world data sets proves the practicality of the proposed approach and its capability to discover unknown patterns in various settings.

Abstract:
Graph Neural Networks (GNNs) are proposed without considering the agnostic distribution shifts between training graphs and testing graphs, inducing the degeneration of the generalization ability of GNNs in Out-Of-Distribution (OOD) settings. The fundamental reason for such degeneration is that most GNNs are developed based on the I.I.D. hypothesis. In such a setting, GNNs tend to exploit subtle statistical correlations existing in the training set for predictions, even when these correlations are spurious. This learning mechanism is inherited from the common characteristics of machine learning approaches. However, such spurious correlations may change in the wild testing environments, leading to the failure of GNNs. Therefore, eliminating the impact of spurious correlations is crucial for stable GNN models. To this end, in this paper, we argue that the spurious correlation exists among subgraph-level units and analyze the degeneration of GNNs from a causal view. Based on the causal view analysis, we propose a general causal representation framework for stable GNN, called StableGNN. The main idea of this framework is to extract high-level representations from raw graph data first and resort to the distinguishing ability of causal inference to help the model get rid of spurious correlations. Particularly, to extract meaningful high-level representations, we exploit a differentiable graph pooling layer to extract subgraph-based representations in an end-to-end manner. Furthermore, inspired by the confounder balancing techniques from causal inference, based on the learned high-level representations, we propose a causal variable distinguishing regularizer to correct the biased training distribution by learning a set of sample weights. Hence, GNNs would concentrate more on the true connection between discriminative substructures and labels. Extensive experiments are conducted on both synthetic datasets with various distribution shift degrees and eight real-world OOD graph datasets. The results verify that the proposed model StableGNN not only outperforms state-of-the-art methods but also provides a flexible framework to enhance existing GNNs. In addition, the interpretability experiments validate that StableGNN could leverage causal structures for predictions.

Abstract:
Graph neural networks (GNNs) have shown remarkable performance on homophilic graph data while being far less impressive when handling non-homophilic graph data due to the inherent low-pass filtering property of GNNs. In general, since real-world graphs are often complex mixtures of diverse subgraph patterns, learning a universal spectral filter on the graph from the global perspective as in most current works may still suffer from great difficulty in adapting to the variation of local patterns. On the basis of the theoretical analysis of local patterns, we rethink the existing spectral filtering methods and propose the Node-oriented spectral Filtering for Graph Neural Network (namely NFGNN). By estimating the node-oriented spectral filter for each node, NFGNN is provided with the capability of precise local node positioning via the generalized translated operator, thus discriminating the variations of local homophily patterns adaptively. Meanwhile, the utilization of re-parameterization brings a good trade-off between global consistency and local sensibility for learning the node-oriented spectral filters. Furthermore, we theoretically analyze the localization property of NFGNN, demonstrating that the signal after adaptive filtering is still positioned around the corresponding node. Extensive experimental results demonstrate that the proposed NFGNN achieves more favorable performance.

Abstract:
Face identity editing (FIE) shows great value in AI content creation. Low-resolution FIE approaches have achieved tremendous progress, but high-quality FIE still struggles. Two major challenges hinder the development of higher-resolution, higher-performance FIE: the lack of a high-resolution dataset and computational complexity that is prohibitive for mobile platforms. To address both issues, we establish a novel large-scale, high-quality dataset tailored for FIE. Based on our SimSwap (Chen et al. 2020), we propose an upgraded version named SimSwap++ with significantly boosted model efficiency. SimSwap++ features two major innovations for high-performance model compression. First, a novel computational primitive named Conditional Dynamic Convolution (CD-Conv) is proposed to address the inefficiency of conditional schemes (e.g., AdaIN) in tiny models. CD-Conv achieves anisotropic processing and injection with significantly lower complexity compared to standard conditional operators, e.g., modulated convolution. Second, a Morphable Knowledge Distillation (MKD) is presented to further trim the overall model. Unlike conventional homogeneous teacher-student structures, MKD is designed to be heterogeneous and mutually compensable, endowing the student with the multi-path morphable property; thus, our student maximally inherits the teacher's knowledge after distillation while further reducing its complexity through structure re-parameterization. Extensive experiments demonstrate that our SimSwap++ achieves state-of-the-art performance (97.55% ID accuracy on FaceForensics++) with extremely low complexity (2.5 GFLOPs).

Abstract:
Negative flips are errors introduced in a classification system when a legacy model is updated. Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing a new model to imitate the old models, or use ensembles, which multiply inference cost prohibitively. We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary, but often exhibit large deviations in the distance among their logits. Based on the observation, we present a method, called Ensemble Logit Difference Inhibition (Elodi), to train a classification system that achieves paragon performance in both error rate and NFR, at the inference cost of a single model. The method distills a homogeneous ensemble to a single student model which is used to update the classification system. Elodi also introduces a generalized distillation objective, Logit Difference Inhibition (LDI), which only penalizes the logit difference of a subset of classes with the highest logit values. On multiple image classification benchmarks, model updates with Elodi demonstrate superior accuracy retention and NFR reduction.
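
The negative flip rate mentioned above can be illustrated with a short sketch. It assumes the commonly used definition (the fraction of all test samples that the old model classifies correctly and the new model misclassifies); the function name and the toy predictions are hypothetical and are not taken from the paper.

```python
def negative_flip_rate(labels, old_preds, new_preds):
    """Fraction of samples the old model got right and the new model gets wrong.

    Assumed definition: NFR = (1/N) * #{i : old_i == y_i and new_i != y_i}.
    """
    if not (len(labels) == len(old_preds) == len(new_preds)):
        raise ValueError("all inputs must have the same length")
    if len(labels) == 0:
        raise ValueError("inputs must be non-empty")
    flips = sum(
        1
        for y, old, new in zip(labels, old_preds, new_preds)
        if old == y and new != y
    )
    return flips / len(labels)


# Toy example (made-up predictions): only the second sample is a negative flip.
labels = [0, 1, 2, 1, 0]
old_preds = [0, 1, 2, 0, 0]
new_preds = [0, 2, 2, 1, 0]
print(negative_flip_rate(labels, old_preds, new_preds))
```

Note that the new model in this toy example has the same accuracy as the old one (4 of 5 correct), yet it still introduces a negative flip, which is why NFR is reported alongside the error rate.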

Abstract:
Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate NAS from the expensive training process. The key idea behind zero-shot NAS approaches is to design proxies that can predict the accuracy of some given networks without training the network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical understanding of deep learning and have shown great potential on several datasets and NAS benchmarks. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies.

Abstract:
Pre-trained visual-language (ViL) models have demonstrated good zero-shot capability in video understanding tasks, where they were usually adapted through fine-tuning or temporal modeling. However, in the task of open-vocabulary temporal action localization (OV-TAL), such adaptation reduces the robustness of ViL models against different data distributions, leading to a misalignment between visual representations and text descriptions of unseen action categories. As a result, existing methods often strike a trade-off between action detection and classification. To address this issue, this paper proposes DeTAL, a simple but effective two-stage approach for OV-TAL. DeTAL decouples action detection from action classification to avoid the compromise between them, so that state-of-the-art methods for closed-set action localization can be readily adapted to OV-TAL, which significantly improves the performance. Meanwhile, DeTAL can easily tackle the scenario where action category annotations are unavailable in the training dataset. In the experiments, we propose a new cross-dataset setting to evaluate the zero-shot capability of different methods. The results demonstrate that DeTAL outperforms the state-of-the-art methods for OV-TAL on both THUMOS14 and ActivityNet1.3.

Abstract:
Dynamic networks have become a pivotal area of study in deep learning due to their ability to selectively activate computing units (such as layers or channels) or dynamically allocate computation to information-rich regions. This capability significantly curtails unnecessary computations, adapting to varying inputs. Despite these advantages, the practical efficiency of dynamic models often falls short of theoretical computation. This discrepancy arises from three primary challenges: 1) a lack of a unified framework across different dynamic inference paradigms due to the fragmented research landscape; 2) an excessive focus on algorithm design at the expense of scheduling strategies, which are essential for optimizing resource utilization on hardware; and 3) the complexity of latency evaluation, since most current libraries cater to static operators. To tackle these issues, we introduce Latency-Aware Unified Dynamic Networks (LAUDNet), a general framework that integrates three fundamental dynamic paradigms–spatially-adaptive computation, layer skipping, and channel skipping–into a single unified formulation. LAUDNet not only refines algorithmic design but also enhances scheduling optimization with the aid of a latency predictor. This predictor efficiently and accurately predicts the inference latency of dynamic operators on specific hardware setups. Our empirical assessments across multiple vision tasks–image classification, object detection, and instance segmentation–confirm that LAUDNet significantly bridges the gap between theoretical and practical efficiency. For instance, LAUDNet cuts down the practical latency of its static counterpart, ResNet-101, by over 50% on hardware platforms like V100, RTX 3090, and TX2 GPUs. Additionally, LAUDNet excels in the accuracy-efficiency trade-off compared to other methods.

Abstract:
Mixed-precision Deep Neural Networks (DNNs) provide an efficient solution for hardware deployment, especially under resource constraints, while maintaining model accuracy. Identifying the ideal bit precision for each layer, however, remains a challenge given the vast array of models, datasets, and quantization schemes, leading to an expansive search space. Recent literature has addressed this challenge, resulting in several promising frameworks. This paper offers a comprehensive overview of the standard quantization classifications prevalent in existing studies. A detailed survey of current mixed-precision frameworks is provided, with an in-depth comparative analysis highlighting their respective merits and limitations. The paper concludes with insights into potential avenues for future research in this domain.

Abstract:
Despite their remarkable performance, deep neural networks remain mostly “black boxes”, suggesting inexplicability and hindering their wide applications in fields requiring making rational decisions. Here we introduce HOPE (High-order Polynomial Expansion), a method for expanding a network into a high-order Taylor polynomial on a reference input. Specifically, we derive the high-order derivative rule for composite functions and extend the rule to neural networks to obtain their high-order derivatives quickly and accurately. From these derivatives, we can then derive the Taylor polynomial of the neural network, which provides an explicit expression of the network's local interpretations. We combine the Taylor polynomials obtained under different reference inputs to obtain the global interpretation of the neural network. Numerical analysis confirms the high accuracy, low computational complexity, and good convergence of the proposed method. Moreover, we demonstrate HOPE's wide applications built on deep learning, including function discovery, fast inference, and feature selection. We compared HOPE with other XAI methods and demonstrated our advantages.

Abstract:
Vision Transformers have recently been the most popular network architecture in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application in downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify the self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (≥ 7×7) nested in convolutional layers, and we observe a consistent performance improvement when gradually increasing the kernel size from 5×5 to 21×21. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that our Conv2Former outperforms existing popular ConvNets and vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.

Abstract:
Adversarial training (AT) is widely considered the most promising strategy to defend against adversarial attacks and has drawn increasing interest from researchers. However, the existing AT methods still suffer from two challenges. First, they are unable to handle unrestricted adversarial examples (UAEs), which are built from scratch, as opposed to restricted adversarial examples (RAEs), which are created by adding perturbations bounded by an l_p norm to observed examples. Second, the existing AT methods often achieve adversarial robustness at the expense of standard generalizability (i.e., the accuracy on natural examples) because they make a tradeoff between them. To overcome these challenges, we propose a unique viewpoint that understands UAEs as imperceptibly perturbed unobserved examples. Also, we find that the tradeoff results from the separation of the distributions of adversarial examples and natural examples. Based on these ideas, we propose a novel AT approach called Provable Unrestricted Adversarial Training (PUAT), which can provide a target classifier with comprehensive adversarial robustness against both UAEs and RAEs, and simultaneously improve its standard generalizability. Particularly, PUAT utilizes partially labeled data to achieve effective UAE generation by accurately capturing the natural data distribution through a novel augmented triple-GAN. At the same time, PUAT extends the traditional AT by introducing the supervised loss of the target classifier into the adversarial loss and achieves the alignment between the UAE distribution, the natural data distribution, and the distribution learned by the classifier, with the collaboration of the augmented triple-GAN. Finally, solid theoretical analysis and extensive experiments conducted on widely used benchmarks demonstrate the superiority of PUAT.

Abstract:
Open-set Semi-supervised Learning (OSSL) considers a realistic setting in which unlabeled data may come from classes unseen in the labeled set, i.e., out-of-distribution (OOD) data, which could cause performance degradation in conventional SSL models. To handle this issue, in addition to the traditional in-distribution (ID) classifier, some existing OSSL approaches employ an extra OOD detection module to avoid the potential negative impact of the OOD data. Nevertheless, these approaches typically employ the entire set of open-set data during their training process, which may contain data unfriendly to the OSSL task that can negatively influence the model performance. This inspires us to develop a robust open-set data selection strategy for OSSL. Through a theoretical understanding from the perspective of learning theory, we propose Wise Open-set Semi-supervised Learning (WiseOpen), a generic OSSL framework that selectively leverages the open-set data for training the model. By applying a gradient-variance-based selection mechanism, WiseOpen exploits a friendly subset instead of the whole open-set dataset to enhance the model's capability of ID classification. Moreover, to reduce the computational expense, we also propose two practical variants of WiseOpen by adopting low-frequency update and loss-based selection respectively. Extensive experiments demonstrate the effectiveness of WiseOpen in comparison with the state-of-the-art.

Abstract:
Although face swapping has attracted much attention in recent years, it remains a challenging problem. Existing methods leverage a large number of data samples to explore the intrinsic properties of face swapping without considering the semantic information of face images. Moreover, the representation of the identity information tends to be fixed, leading to suboptimal face swapping. In this paper, we present a simple yet efficient method named FaceSwapper, for one-shot face swapping based on Generative Adversarial Networks. Our method consists of a disentangled representation module and a semantic-guided fusion module. The disentangled representation module comprises an attribute encoder and an identity encoder, which aims to achieve the disentanglement of the identity and attribute information. The identity encoder is more flexible, and the attribute encoder contains more attribute details than its competitors. Benefiting from the disentangled representation, FaceSwapper can swap face images progressively. In addition, semantic information is introduced into the semantic-guided fusion module to control the swapped region and model the pose and expression more accurately. Experimental results show that our method achieves state-of-the-art results on benchmark datasets with fewer training samples.

Abstract:
Non-line-of-sight (NLOS) imaging aims to reconstruct the three-dimensional hidden scenes by using time-of-flight photon information after multiple diffuse reflections. The under-sampled scanning data can facilitate fast imaging. However, the resulting reconstruction problem becomes a serious ill-posed inverse problem, the solution of which is highly likely to be degraded due to noises and distortions. In this paper, we propose novel NLOS reconstruction models based on curvature regularization, i.e., the object-domain curvature regularization model and the dual (signal and object)-domain curvature regularization model. In what follows, we develop efficient optimization algorithms relying on the alternating direction method of multipliers (ADMM) with the backtracking stepsize rule, for which all solvers can be implemented on GPUs. We evaluate the proposed algorithms on both synthetic and real datasets, which achieve state-of-the-art performance, especially in the compressed sensing setting. Based on GPU computing, our algorithm is the most effective among iterative methods, balancing reconstruction quality and computational time.

Abstract:
Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations. The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics with only 2D labeling which can be either manual or machine-generated. Our key technical innovation is to leverage differentiable rendering of color and semantics to bridge 2D observations and unknown 3D space, using the observed RGB images and 2D semantics as supervision, respectively. We additionally develop a learning pipeline and corresponding method to enable learning from imperfect predicted 2D labels, which could be additionally acquired by synthesizing in an augmented set of virtual training views complementing the original real captures, enabling more efficient self-supervision loop for semantics. As a result, our end-to-end trainable solution jointly addresses geometry completion, colorization, and semantic mapping from limited RGB-D images, without relying on any 3D ground-truth information. Our method achieves state-of-the-art performance of semantic scene completion on two large-scale benchmark datasets MatterPort3D and ScanNet, surpasses baselines even with costly 3D annotations in predicting both geometry and semantics. To our knowledge, our method is also the first 2D-driven method addressing completion and semantic segmentation of real-world 3D scans simultaneously.

Abstract:
A fundamental limitation of object detectors is that they suffer from “spatial bias”, and in particular perform less satisfactorily when detecting objects near image borders. For a long time, there has been a lack of effective ways to measure and identify spatial bias, and little is known about where it comes from and how severe it is. To this end, we present a new zone evaluation protocol, extending the traditional evaluation to a more generalized one, which measures the detection performance over zones, yielding a series of Zone Precisions (ZPs). For the first time, we provide numerical results, showing that object detectors perform quite unevenly across the zones. Surprisingly, the detector's performance in the 96% border zone of the image does not reach the AP value (Average Precision, commonly regarded as the average detection performance in the entire image zone). To better understand spatial bias, a series of heuristic experiments are conducted. Our investigation rules out two intuitive conjectures about spatial bias: the object scale and the absolute positions of objects barely influence it. We find that the key lies in the human-imperceptible divergence in data patterns between objects in different zones, which eventually forms a visible performance gap between the zones. With these findings, we finally discuss a future direction for object detection, namely, the spatial disequilibrium problem, which aims at pursuing a balanced detection ability over the entire image zone. By broadly evaluating 10 popular object detectors and 5 detection datasets, we shed light on the spatial bias of object detectors. We hope this work could raise a focus on detection robustness.

Abstract:
Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, this task aims at locating the frame span from the video that can semantically answer the question, i.e., visual answer. Existing methods tend to solve the TAGV problem with a visual span-based predictor, taking visual information to predict the start and end frames in the video. However, due to the weak correlations between the semantic features of the textual question and visual answer, current methods using the visual span-based predictor do not work well in the TAGV task. In this paper, we propose a visual-prompt text span localization (VPTSL) method, which introduces the timestamped subtitles for a text span-based predictor. Specifically, the visual prompt is a learnable feature embedding, which brings visual knowledge to the pre-trained language model. Meanwhile, the text span-based predictor learns joint semantic representations from the input text question, video subtitles, and visual prompt feature with the pre-trained language model. Thus, the TAGV is reformulated as the task of the visual-prompt subtitle span localization for the visual answer. Extensive experiments on five instructional video datasets, namely MedVidQA, TutorialVQA, VehicleVQA, CrossTalk and Coin, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, which demonstrates the effectiveness of the proposed visual prompt and text span-based predictor.

Abstract:
Over the past few years, monocular depth estimation and completion have received increasing attention from the computer vision community because of their widespread applications. In this paper, we introduce novel physics (geometry)-driven deep learning frameworks for these two tasks by assuming that 3D scenes are composed of piece-wise planes. Instead of directly estimating the depth map or completing the sparse depth map, we propose to estimate the surface normal and plane-to-origin distance maps or complete the sparse surface normal and distance maps as intermediate outputs. To this end, we develop a normal-distance head that outputs pixel-level surface normal and distance. After that, the surface normal and distance maps are regularized by a plane-aware consistency constraint and then transformed into depth maps. Furthermore, we integrate an additional depth head to strengthen the robustness of the proposed frameworks. Extensive experiments on the NYU-Depth-v2, KITTI and SUN RGB-D datasets demonstrate that our method outperforms prior state-of-the-art monocular depth estimation and completion competitors.
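
The conversion from a surface normal and plane-to-origin distance to depth can be sketched with standard pinhole geometry. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with intrinsics fx, fy, cx, cy, a plane written as n · X = d in camera coordinates, and a hypothetical function name. A 3D point on the ray through pixel (u, v) is X = z · r with r = ((u - cx)/fx, (v - cy)/fy, 1), so z = d / (n · r).

```python
def depth_from_plane(u, v, normal, distance, fx, fy, cx, cy, eps=1e-8):
    """Depth z at pixel (u, v) for the plane n . X = d (camera coordinates).

    Assumes a pinhole camera; returns None when the viewing ray is
    (numerically) parallel to the plane, where depth is undefined.
    """
    # Back-projected viewing ray with unit z-component.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n_i * r_i for n_i, r_i in zip(normal, ray))
    if abs(denom) < eps:
        return None
    return distance / denom


# A fronto-parallel plane (normal along the optical axis) at distance 2.0
# has depth 2.0 at every pixel, whatever the intrinsics (values are made up).
print(depth_from_plane(100, 50, (0.0, 0.0, 1.0), 2.0, 500.0, 500.0, 320.0, 240.0))
```

Pixels lying on the same plane share one (normal, distance) pair, which is what makes a plane-level consistency constraint on these two maps meaningful before they are converted to depth.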

Abstract:
Gaussian Process Regression (GPR) is a popular regression method which, unlike most Machine Learning techniques, provides estimates of uncertainty for its predictions. These uncertainty estimates, however, are based on the assumption that the model is well-specified, an assumption that is violated in most practical applications, since the required knowledge is rarely available. As a result, the produced uncertainty estimates can become very misleading; for example, the prediction intervals (PIs) produced for the 95% confidence level may cover much less than 95% of the true labels. To address this issue, this paper introduces an extension of GPR based on a Machine Learning framework called Conformal Prediction (CP). This extension guarantees the production of PIs with the required coverage even when the model is completely misspecified. The proposed approach combines the advantages of GPR with the valid coverage guarantee of CP, while the experimental results demonstrate its superiority over existing methods.
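
To make the coverage guarantee concrete, here is a minimal sketch of generic split (inductive) conformal prediction for regression. It is not the GPR-specific method of the paper: it assumes a held-out calibration set, absolute-error nonconformity scores, and any point predictor, and the function names and numbers are hypothetical. With n calibration scores, the interval half-width is the k-th smallest score, where k = ceil((n + 1)(1 - alpha)); if k > n the interval is unbounded.

```python
import math


def conformal_halfwidth(cal_true, cal_pred, alpha=0.05):
    """Half-width q such that [pred - q, pred + q] has coverage >= 1 - alpha.

    Uses absolute residuals on a calibration set (split conformal prediction),
    assuming calibration and test points are exchangeable.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    if n == 0:
        raise ValueError("calibration set must be non-empty")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return scores[k - 1]


def prediction_interval(pred, halfwidth):
    """Symmetric prediction interval around a point prediction."""
    return (pred - halfwidth, pred + halfwidth)


# Made-up calibration data; residuals are 0.5, 0.0, 1.0, 0.25.
cal_true = [1.0, 2.0, 3.0, 4.0]
cal_pred = [1.5, 2.0, 2.0, 4.25]
q = conformal_halfwidth(cal_true, cal_pred, alpha=0.5)
print(prediction_interval(10.0, q))
```

The guarantee relies only on exchangeability of the calibration and test data, not on the predictor being well-specified, which is the property the abstract contrasts with standard GPR intervals.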

Abstract:
Complementary label learning (CLL) requires annotators to give irrelevant labels instead of relevant labels for instances. Currently, CLL has shown its promising performance on multi-class data by estimating a transition matrix. However, current multi-class CLL techniques cannot work well on multi-labeled data since they assume each instance is associated with one label while each multi-labeled instance is relevant to multiple labels. Here, we show theoretically how the estimated transition matrix in multi-class CLL could be distorted in multi-labeled cases as they ignore co-existing relevant labels. Moreover, theoretical findings reveal that calculating a transition matrix from label correlations in multi-labeled CLL (ML-CLL) needs multi-labeled data, while this is unavailable for ML-CLL. To solve this issue, we propose a two-step method to estimate the transition matrix from candidate labels. Specifically, we first estimate an initial transition matrix by decomposing the multi-label problem into a series of binary classification problems, then the initial transition matrix is corrected by label correlations to enforce the addition of relationships among labels. We further show that the proposal is classifier-consistent, and additionally introduce an MSE-based regularizer to alleviate the tendency of BCE loss overfitting to noises. Experimental results have demonstrated the effectiveness of the proposed method.

Abstract:
Restoring high-quality images from degraded hazy observations is a fundamental and essential task in the field of computer vision. While deep models have achieved significant success with synthetic data, their effectiveness in real-world scenarios remains uncertain. To improve adaptability in real-world environments, we construct an entirely new computational framework by making efforts from three key aspects: imaging perspective, structural modules, and training strategies. To simulate the often-overlooked multiple degradation attributes found in real-world hazy images, we develop a new hazy imaging model that encapsulates multiple degradation factors, helping to bridge the domain gap between synthetic and real-world image spaces. In contrast to existing approaches that primarily address the inverse imaging process, we design a new dehazing network following the “localization-and-removal” pipeline. The degradation localization module helps the network capture discriminative haze-related feature information, and the degradation removal module focuses on eliminating dependencies between features by learning a weighting matrix of training samples, thereby avoiding the spurious correlations of extracted features found in existing deep methods. We also define a new Gaussian perceptual contrastive loss to further constrain the network to update in the direction of natural dehazing. In terms of multiple full/no-reference image quality indicators and subjective visual effects on the challenging RTTS, URHI, and Fattal real hazy datasets, the proposed method outperforms the current state-of-the-art methods.

Abstract:
Adversarial Training is a practical approach for improving the robustness of deep neural networks against adversarial attacks. Although bringing reliable robustness, the performance towards clean examples is negatively affected after Adversarial Training, which means a trade-off exists between accuracy and robustness. Recently, some studies have tried to use knowledge distillation methods in Adversarial Training, achieving competitive performance in improving the robustness but the accuracy for clean samples is still limited. In this paper, to mitigate the accuracy-robustness trade-off, we introduce the Balanced Multi-Teacher Adversarial Robustness Distillation (B-MTARD) to guide the model's Adversarial Training process by applying a strong clean teacher and a strong robust teacher to handle the clean examples and adversarial examples, respectively. During the optimization process, to ensure that different teachers show similar knowledge scales, we design the Entropy-Based Balance algorithm to adjust the teacher's temperature and keep the teachers’ information entropy consistent. Besides, to ensure that the student has a relatively consistent learning speed from multiple teachers, we propose the Normalization Loss Balance algorithm to adjust the learning weights of different types of knowledge. A series of experiments conducted on three public datasets demonstrate that B-MTARD outperforms the state-of-the-art methods against various adversarial attacks.
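The link between a teacher's temperature and the entropy of its softened predictions can be sketched as follows. This is not the authors' Entropy-Based Balance algorithm, whose update rule the abstract does not give; it is a simple bisection, under the assumption of made-up logits, that finds the temperature at which one teacher's softmax output has the same entropy as another's.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_for_entropy(logits, target, lo=0.05, hi=50.0, iters=60):
    """Bisection on temperature; softmax entropy does not decrease as temperature rises."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical logits: a confident "clean" teacher and a flatter "robust" teacher.
clean_logits = [6.0, 1.0, 0.5, 0.2]
robust_logits = [2.0, 1.5, 1.0, 0.5]
target = entropy(softmax(robust_logits, 1.0))
t_clean = temperature_for_entropy(clean_logits, target)
```

Because the clean teacher here is much more peaked, it needs a temperature above 1 before its softened output carries as much entropy as the robust teacher's.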

Affiliations: School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China; Tencent Data Platform, Shenzhen, China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; School of Computer Science and Technology, Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences, Beijing, China

Abstract:
Rank aggregation with pairwise comparisons is widely encountered in sociology, politics, economics, psychology, sports, etc. Given the enormous social impact and the consequent incentives, the potential adversary has a strong motivation to manipulate the ranking list. However, the ideal attack opportunity and the excessive adversarial capability cause the existing methods to be impractical. To fully explore the potential risks, we leverage an online attack on the vulnerable data collection process. Since it is independent of rank aggregation and lacks effective protection mechanisms, we disrupt the data collection process by fabricating pairwise comparisons without knowledge of the future data or the true distribution. From the game-theoretic perspective, the confrontation scenario between the online manipulator and the ranker who takes control of the original data source is formulated as a distributionally robust game that deals with the uncertainty of knowledge. Then we demonstrate that the equilibrium in the above game is potentially favorable to the adversary by analyzing the vulnerability of the sampling algorithms such as Bernoulli and reservoir methods. According to the above theoretical analysis, different sequential manipulation policies are proposed under a Bayesian decision framework and a large class of parametric pairwise comparison models. For attackers with complete knowledge, we establish the asymptotic optimality of the proposed policies. To increase the success rate of the sequential manipulation with incomplete knowledge, a distributionally robust estimator, which replaces the maximum likelihood estimation in a saddle point problem, provides a conservative data generation solution. Finally, the corroborating empirical evidence shows that the proposed method manipulates the results of rank aggregation methods in a sequential manner.
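For readers unfamiliar with the reservoir method mentioned above, the following is a minimal sketch of classical reservoir sampling (Algorithm R) applied to a stream of pairwise comparisons. The stream contents and the `(winner, loser)` tuple layout are hypothetical; the sketch shows only the sampler, not the manipulation policies analyzed in the paper.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of (winner, loser) comparisons.
comparisons = [(i, i + 1) for i in range(100)]
sample = reservoir_sample(comparisons, 10, random.Random(0))
```

Each arriving comparison is accepted with a probability that depends only on its position in the stream, not on its content, so the sampler itself offers no protection against fabricated comparisons.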

Abstract:
Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called Structured Point Cloud Videos (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences.

Abstract:
Generative models have made huge progress in photorealistic image synthesis in recent years. To enable humans to steer the image generation process and customize the output, many works explore the interpretable dimensions of the latent space in GANs. Existing methods edit the attributes of the output image such as orientation or color scheme by varying the latent code along certain directions. However, these methods usually require additional human annotations for each pretrained model, and they mostly focus on editing global attributes. In this work, we propose a self-supervised approach to improve the spatial steerability of GANs without searching for steerable directions in the latent space or requiring extra annotations. Specifically, we design randomly sampled Gaussian heatmaps to be encoded into the intermediate layers of generative models as spatial inductive bias. Along with training the GAN model from scratch, these heatmaps are aligned with the emerging attention of the GAN's discriminator in a self-supervised learning manner. During inference, users can interact with the spatial heatmaps in an intuitive manner, enabling them to edit the output image by adjusting the scene layout, moving, or removing objects. Moreover, we incorporate DragGAN into our framework, which facilitates fine-grained manipulation within a reasonable time and supports a coarse-to-fine editing process. Extensive experiments show that the proposed method not only enables spatial editing over human faces, animal faces, outdoor scenes, and complicated multi-object indoor scenes but also brings improvement in synthesis quality.

Abstract:
Large-scale datasets with point-wise semantic and instance labels are crucial to 3D instance segmentation but also expensive. To leverage unlabeled data, previous semi-supervised 3D instance segmentation approaches have explored self-training frameworks, which rely on high-quality pseudo labels for consistency regularization. They intuitively utilize both instance and semantic pseudo labels in a joint learning manner. However, semantic pseudo labels contain considerable noise arising from the imbalanced category distribution and the natural confusion of similar but distinct categories, which leads to severe collapses in self-training. Motivated by the observation that 3D instances are non-overlapping and spatially separable, we ask whether we can solely rely on instance consistency regularization for improved semi-supervised segmentation. To this end, we propose a novel self-training network InsTeacher3D to explore and exploit pure instance knowledge from unlabeled data. We first build a parallel base 3D instance segmentation model DKNet, which distinguishes each instance from the others via discriminative instance kernels without reliance on semantic segmentation. Based on DKNet, we further design a novel instance consistency regularization framework to generate and leverage high-quality instance pseudo labels. Experimental results on multiple large-scale datasets show that InsTeacher3D significantly outperforms prior state-of-the-art semi-supervised approaches.

Abstract:
Image completion has made tremendous progress with convolutional neural networks (CNNs), because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatial-invariant kernels), CNNs do not perform well in understanding global structures or in naturally supporting pluralistic completion. Recently, transformers have demonstrated their power in modeling long-term relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of the coarse priors, guided by the high-resolution masked images. To decode diversified outputs from transformers, auto-regressive sampling is the most common method, but it has extremely low efficiency. We overcome this issue by proposing a new decoding strategy, temperature annealing probabilistic sampling (TAPS), which achieves a peak inference speedup of more than 70× while maintaining the high quality and diversity of the sampled global structures. Moreover, we find that a fully convolutional architecture leads to suboptimal solutions for guided upsampling. To render more realistic and coherent contents, we design a novel module, named texture-aware guided attention, to concurrently consider the procedures of texture copy and generation, and we introduce several important modifications to resolve boundary artifacts.
Through extensive experiments, we found that the proposed method vastly outperforms state-of-the-art methods in four aspects: 1) a large performance boost in image fidelity, even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic datasets such as ImageNet; and 4) much higher decoding efficiency than previous auto-regressive methods.

Abstract:
Source-free domain adaptation has developed rapidly in recent years. In this setting, a well-trained source model, rather than the source data, is adapted to the target domain, offering the potential to address privacy concerns and protect intellectual property. However, a number of feature alignment techniques in prior domain adaptation methods are not feasible in this challenging problem setting. Therefore, we resort to probing inherent domain-invariant feature learning and propose a curriculum-style self-training approach for source-free domain adaptive semantic segmentation. In particular, we introduce a curriculum-style entropy minimization method to explore the implicit knowledge from the source model, which fits the trained source model to the target data using certain information from easy-to-hard predictions. We then train the segmentation network by the proposed complementary curriculum-style self-training, which utilizes the negative and positive pseudo labels in a curriculum-learning manner. Although negative pseudo-labels with high uncertainty cannot identify the correct labels, they can reliably indicate absent classes. Moreover, we employ an information propagation scheme to further reduce the intra-domain discrepancy within the target domain, which could act as a standard post-processing method for the domain adaptation field. Furthermore, we extend the proposed method to a more challenging black-box source model scenario where only the source model's predictions are available. Extensive experiments validate that our method yields state-of-the-art performance on source-free semantic segmentation tasks for both synthetic-to-real and adverse conditions datasets.

Abstract:
While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations out of their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is capable of coping with unseen and complex degradations more gracefully without complicated loss designs. The key of our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model and then gradually transmit from this intermediate state to the HQ target by recursively applying a pre-trained diffusion model. The transition distribution only relies on a restoration backbone that is trained with $L_1$ loss on some synthetic data, which favorably avoids the cumbersome training process in existing methods. Moreover, the transition distribution can contract the error of the restoration backbone and thus makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations.

Abstract:
Automatic medical image segmentation is a crucial topic in the medical domain and successively a critical counterpart in the computer-aided diagnosis paradigm. U-Net is the most widespread image segmentation architecture due to its flexibility, optimized modular design, and success in all medical image modalities. Over the years, the U-Net model has received tremendous attention from academic and industrial researchers who have extended it to address the scale and complexity created by medical tasks. These extensions are commonly related to enhancing the U-Net's backbone, bottleneck, or skip connections, or including representation learning, or combining it with a Transformer architecture, or even addressing probabilistic prediction of the segmentation map. Having a compendium of different previously proposed U-Net variants makes it easier for machine learning researchers to identify relevant research questions and understand the challenges of the biological tasks that challenge the model. In this work, we discuss the practical aspects of the U-Net model and organize each variant model into a taxonomy. Moreover, to measure the performance of these strategies in a clinical application, we propose fair evaluations of some unique and famous designs on well-known datasets. Furthermore, we provide a comprehensive implementation library with trained models. In addition, for ease of future studies, we created an online list of U-Net papers with their possible official implementation.

Abstract:
This work presents a novel and effective method for fitting multidimensional ellipsoids (i.e., ellipsoids embedded in $\mathbb{R}^n$) to scattered data contaminated by noise and outliers. Unlike conventional algebraic or geometric fitting paradigms that assume each measurement point is a noisy version of its nearest point on the ellipsoid, we approach the problem as a Bayesian parameter estimation process and maximize the posterior probability of a certain ellipsoidal solution given the data. We establish a more robust correlation between these points based on the predictive distribution within the Bayesian framework, i.e., considering each model point as a potential source for generating each measurement. Concretely, we incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain, ensuring ellipsoid-specific results regardless of inputs. We then establish the connection between measurement points and model data via Bayes’ rule to enhance the method's robustness against noise. Because it is independent of the spatial dimension, the proposed method not only delivers high-quality fittings to challenging elongated ellipsoids but also generalizes well to multidimensional spaces. To address outlier disturbances, often overlooked by previous approaches, we further introduce a uniform distribution on top of the predictive distribution to significantly enhance the algorithm's robustness against outliers. Thanks to the uniform prior, our maximum a posteriori probability estimation coincides with a more tractable maximum likelihood estimation problem, which is subsequently solved by a numerically stable Expectation Maximization (EM) framework. Moreover, we introduce an $\varepsilon$-accelerated technique to expedite the convergence of EM considerably. We also investigate the relationship between our algorithm and conventional least-squares-based ones, during which we theoretically prove our method's superior robustness.
To the best of our knowledge, this is the first comprehensive method capable of performing multidimensional ellipsoid-specific fitting within the Bayesian optimization paradigm under diverse disturbances. We evaluate it across lower and higher dimensional spaces in the presence of heavy noise, outliers, and substantial variations in axis ratios. Also, we apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks. In all these test contexts, our method consistently delivers flexible, robust, ellipsoid-specific performance, and achieves the state-of-the-art results.

Abstract:
Heterogeneous Information Networks (HINs) are information networks with multiple types of nodes and edges. The concept of meta-path, i.e., a sequence of entity types and relation types connecting two entities, is proposed to provide the meta-level explainable semantics for various HIN tasks. Traditionally, meta-paths are primarily used for schema-simple HINs, e.g., bibliographic networks with only a few entity types, where meta-paths are often enumerated with domain knowledge. However, the adoption of meta-paths for schema-complex HINs, such as knowledge bases (KBs) with hundreds of entity and relation types, has been limited due to the computational complexity associated with meta-path enumeration. Additionally, effectively assessing meta-paths requires enumerating relevant path instances, which adds further complexity to the meta-path learning process. To address these challenges, we propose SchemaWalk, an inductive meta-path learning framework for schema-complex HINs. We represent meta-paths with schema-level representations to support the learning of the scores of meta-paths for varying relations, mitigating the need of exhaustive path instance enumeration for each relation. Further, we design a reinforcement-learning based path-finding agent, which directly navigates the network schema (i.e., schema graph) to learn policies for establishing meta-paths with high coverage and confidence for multiple relations. Extensive experiments on real data sets demonstrate the effectiveness of our proposed paradigm.

Abstract:
Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code is available at https://github.com/THU-LYJ-Lab/dmt.

Abstract:
DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervised manner to minimize the mean-square approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal components analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor’s entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6 dB signal-to-noise ratio improvement over standard denoising methods for signal corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.

Abstract:
How to identify and segment camouflaged objects from the background is challenging. Inspired by the multi-head self-attention in Transformers, we present a simple masked separable attention (MSA) for camouflaged object detection. We first separate the multi-head self-attention into three parts, which are responsible for distinguishing the camouflaged objects from the background using different mask strategies. Furthermore, we propose to capture high-resolution semantic representations progressively based on a simple top-down decoder with the proposed MSA to attain precise segmentation results. These structures plus a backbone encoder form a new model, dubbed CamoFormer. Extensive experiments show that CamoFormer achieves new state-of-the-art performance on three widely-used camouflaged object detection benchmarks. To better evaluate the performance of the proposed CamoFormer around the border regions, we propose to use two new metrics, i.e., BR-M and BR-F. There are on average ∼5% relative improvements over previous methods in terms of S-measure and weighted F-measure.

Abstract:
Lookahead is a popular stochastic optimizer that can accelerate the training process of deep neural networks. However, the solutions found by Lookahead often generalize worse than those found by its base optimizers, such as SGD and Adam. To address this issue, we propose Sharpness-Aware Lookahead (SALA), a novel optimizer that aims to identify flat minima that generalize well. SALA divides the training process into two stages. In the first stage, the direction towards flat regions is determined by leveraging a quadratic approximation of the optimization trajectory, without incurring any extra computational overhead. In the second stage, however, it is determined by Sharpness-Aware Minimization (SAM), which is particularly effective in improving generalization at the terminal phase of training. In contrast to Lookahead, SALA retains the benefits of accelerated convergence while also enjoying superior generalization performance compared to the base optimizer. Theoretical analysis of the expected excess risk, as well as empirical results on canonical neural network architectures and datasets, demonstrate the advantages of SALA over Lookahead. It is noteworthy that with approximately 25% more computational overhead than the base optimizer, SALA can achieve the same generalization performance as SAM which requires twice the training budget of the base optimizer.
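For context, the base Lookahead scheme that SALA builds on can be sketched in a few lines: a fast copy of the weights takes k steps with the base optimizer, then the slow weights move a fraction alpha toward the result. The sketch below uses plain gradient descent as the base optimizer on a toy quadratic; the hyperparameter values and the test function are illustrative assumptions, and neither SALA's quadratic-approximation stage nor its SAM stage is shown.

```python
def lookahead_sgd(grad, w0, lr=0.1, k=5, alpha=0.5, outer_steps=50):
    """Base Lookahead with gradient descent as the inner (fast) optimizer."""
    slow = list(w0)
    for _ in range(outer_steps):
        fast = list(slow)  # fast weights restart from the slow weights
        for _ in range(k):
            g = grad(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # Slow weights interpolate toward where the fast weights ended up.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quad_grad(w):
    """Gradient of the toy objective f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return [w[0], 10.0 * w[1]]


w_final = lookahead_sgd(quad_grad, [3.0, -2.0])
```

On this toy problem the slow weights contract toward the minimizer at the origin on every outer step.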

Abstract:
Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database. Recent works achieve outstanding progress on CVGL benchmarks. However, existing methods still suffer from poor performance in cross-area evaluation, in which the training and testing data are captured from completely distinct areas. We attribute this deficiency to the lack of ability to extract the geometric layout of visual features and models’ overfitting to low-level details. Our preliminary work (Zhang et al. 2022) introduced a Geometric Layout Extractor (GLE) to capture the geometric layout from input features. However, the previous GLE does not fully exploit information in the input feature. In this work, we propose GeoDTR+ with an enhanced GLE module that better models the correlations among visual features. To fully explore the LS techniques from our preliminary work, we further propose Contrastive Hard Samples Generation (CHSG) to facilitate model training. Extensive experiments show that GeoDTR+ achieves state-of-the-art (SOTA) results in cross-area evaluation on CVUSA (Workman et al. 2015), CVACT (Liu and Li, 2019), and VIGOR (Zhu et al. 2021) by a large margin (16.44%, 22.71%, and 13.66% without polar transformation) while keeping the same-area performance comparable to existing SOTA. Moreover, we provide detailed analyses of GeoDTR+.

Abstract:
Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior arts can lead to the false negative issue, where the sounds semantically similar to visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state-of-the-arts. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks.

Abstract:
Many studies have achieved excellent performance in analyzing graph-structured data. However, learning graph-level representations for graph classification is still a challenging task. Existing graph classification methods usually pay less attention to the fusion of node features and ignore the effects of different-hop neighborhoods on nodes in the graph convolution process. Moreover, they discard some nodes directly during the graph pooling process, resulting in the loss of graph information. To tackle these issues, we propose a new Graph Multi-Convolution and Attention Pooling based graph classification method (GMCAP). Specifically, the designed Graph Multi-Convolution (GMConv) layer explicitly fuses node features learned from different perspectives. The proposed weight-based aggregation module combines the outputs of all GMConv layers, for adaptively exploiting the information over different-hop neighborhoods to generate informative node representations. Furthermore, the designed Local information and Global Attention based Pooling (LGAPool) utilizes the local information of a graph to select several important nodes and aggregates the information of unselected nodes to the selected ones by a global attention mechanism when reconstructing a pooled graph, thus effectively reducing the loss of graph information. Extensive experiments show that GMCAP outperforms the state-of-the-art methods on graph classification tasks, demonstrating that GMCAP can learn graph-level representations effectively.

Abstract:
We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the “-” structure and the “|” structure. The “-” structure utilizes an iterative learning strategy to process correspondences, while the “|” structure integrates all feature information of the “-” structure and produces inlier weights. Moreover, within the “|” structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves the permutation-equivariance manner for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks.

Abstract:
With prior knowledge of seen objects, humans have a remarkable ability to recognize novel objects using shared and distinct local attributes. This is significant for the challenging tasks of zero-shot learning (ZSL) and fine-grained visual classification (FGVC), where the discriminative attributes of objects have played an important role. Inspired by human visual attention, neural networks have widely exploited the attention mechanism to learn the locally discriminative attributes for challenging tasks. Although existing works have greatly promoted the development of these fields, they mainly focus on learning the region embeddings of different attribute features and neglect the importance of discriminative attribute localization. It is also unclear whether the learned attention truly matches the real human attention. To tackle this problem, this paper proposes to employ real human gaze data for visual recognition networks to learn from human attention. Specifically, we design a unified Attribute Attention Network (A$^2$Net) that learns from human attention for both ZSL and FGVC tasks. The overall model consists of an attribute attention branch and a baseline classification network. On top of the image feature maps provided by the baseline classification network, the attribute attention branch employs attribute prototypes to produce attribute attention maps and attribute features. The attribute attention maps are converted to gaze-like attentions to be aligned with real human gaze attention. To guarantee the effectiveness of attribute feature learning, we further align the extracted attribute features with attribute-defined class embeddings. To facilitate learning from human gaze attention for the visual recognition problems, we design a bird classification game to collect real human gaze data using the CUB dataset via an eye-tracker device. Experiments on ZSL and FGVC tasks without/with real human gaze data validate the benefits and accuracy of our proposed model.
This work supports the promising benefits of collecting human gaze datasets and automatic gaze estimation algorithms learning from human attention for high-level computer vision tasks.

Abstract:
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, leading to confusion between novel classes and semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching performed individually and reliability judgment for improving discrimination ability. Building on the idea of special tokens in language modeling representing sentence-level embeddings, we introduce a trusty token that enables distinguishing novel classes from known ones in prediction. To evaluate our approach, we conduct experiments on two benchmark datasets, PASCAL VOC 2012 and COCO-Stuff 164K. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4% and 1.7%, respectively, with negligible overheads. The code is available here.
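
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of Intersection over Union for a single class, computed on flattened binary masks. The function name `iou`, the toy masks, and the empty-union convention are illustrative assumptions; this is not the TagCLIP evaluation code.

```python
def iou(pred, target):
    """Intersection over Union of two equal-length binary masks (flattened)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labeled positive in both masks.
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    # Pixels labeled positive in at least one mask.
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy example (hypothetical data): 2 shared positives, 4 positives overall.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
score = iou(pred, target)
```

In benchmark practice the per-class scores are usually averaged over the classes of interest (here, the unseen ones); that averaging step is omitted from this sketch.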

Abstract:
Partial label learning (PLL) is a form of weakly supervised learning, where each training example is linked to a set of candidate labels, among which only one label is correct. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels. However, in practice, this assumption may not hold true, as the candidate labels are often instance-dependent. In this paper, we address the instance-dependent PLL problem and assume that each example is associated with a latent label distribution where the incorrect label with a high degree is more likely to be annotated as a candidate label. Motivated by this consideration, we propose two methods, Valen and Milen, which train the predictive model by utilizing the latent label distributions recovered by the label enhancement process. Specifically, Valen recovers the latent label distributions by inferring the variational posterior density parameterized by an inference model with the deduced evidence lower bound. Milen recovers the latent label distribution by adopting the variational approximation to bound the mutual information among the latent label distribution, observed labels, and augmented instances. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed methods.

Abstract:
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include Composed Video Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.

Abstract:
Deep optics has been endeavoring to capture hyperspectral images of dynamic scenes, where the optical encoder plays an essential role in deciding the imaging performance. Our key insight is that the optical encoder of a deep optics system should be both fabrication-friendly, so that it is faithfully realized in the implementation phase, and decoder-friendly, so that it fully interacts with the decoder in the design phase. In this paper, we propose the non-serial quantization-aware deep optics (NSQDO), which consists of the fabrication-friendly quantization-aware model (QAM) and the decoder-friendly non-serial manner (NSM). The QAM integrates the quantization process into the optimization and adaptively adjusts the physical height of each quantization level, reducing the deviation of the physical encoder from the numerical simulation through the awareness of and adaptation to the quantization operation of the diffractive optical element (DOE) physical structure. The NSM bridges the encoder and the decoder with full interaction through bidirectional hint connections and makes the connections flexible with a gating mechanism, boosting the power of joint optimization in deep optics. The proposed NSQDO improves the fabrication-friendliness and decoder-friendliness of the encoder and makes the deep optics framework more practical and powerful. Extensive synthetic simulation and real hardware experiments demonstrate the superior performance of the proposed method.

Abstract:
A recent trend in Non-Rigid Structure-from-Motion (NRSfM) is to express local, differential constraints between pairs of images, from which the surface normal at any point can be obtained by solving a system of polynomial equations. While this approach is more successful than its counterparts relying on global constraints, the resulting methods face two main problems: First, most of the equation systems they formulate are of high degree and must be solved using computationally expensive polynomial solvers. Some methods use polynomial reduction strategies to simplify the system, but this adds some phantom solutions. In any event, an additional mechanism is employed to pick the best solution, which adds to the computation without any guarantees on the reliability of the solution. Second, these methods formulate constraints between a pair of images. Even if there is enough motion between them, they may suffer from local degeneracies that make the resulting estimates unreliable without any warning mechanism. In this paper, we solve these problems for isometric/conformal NRSfM. We show that, under widely applicable assumptions, we can derive a new system of equations in terms of the surface normals, whose two solutions can be obtained in closed-form and can easily be disambiguated locally. Our formalism also allows us to assess how reliable the estimated local normals are and to discard them if they are not. Our experiments show that our reconstructions, obtained from two or more views, are significantly more accurate than those of state-of-the-art methods, while also being faster.

Abstract:
Raw depth images captured in indoor scenarios frequently exhibit extensive missing values due to the inherent limitations of the sensors and environments. For example, transparent materials frequently elude detection by depth sensors; surfaces may introduce measurement inaccuracies due to their polished textures, extended distances, and oblique incidence angles from the sensor. The presence of incomplete depth maps imposes significant challenges for subsequent vision applications, prompting the development of numerous depth completion techniques to mitigate this problem. Numerous methods excel at reconstructing dense depth maps from sparse samples, but they often falter when faced with extensive contiguous regions of missing depth values, a prevalent and critical challenge in indoor environments. To overcome these challenges, we design a novel two-branch end-to-end fusion network named RDFC-GAN, which takes a pair of RGB and incomplete depth images as input to predict a dense and completed depth map. The first branch employs an encoder-decoder structure, by adhering to the Manhattan world assumption and utilizing normal maps from RGB-D information as guidance, to regress the local dense depth values from the raw depth map. The other branch applies an RGB-depth fusion CycleGAN, adept at translating RGB imagery into detailed, textured depth maps while ensuring high fidelity through cycle consistency. We fuse the two branches via adaptive fusion modules named W-AdaIN and train the model with the help of pseudo depth maps. Comprehensive evaluations on NYU-Depth V2 and SUN RGB-D datasets show that our method significantly enhances depth completion performance particularly in realistic indoor settings.

Abstract:
Interactive image restoration aims to construct an interactive pathway between users and restoration networks, which empowers users to modulate the restoration results according to their own demands. However, existing methods are primarily limited to training their networks with predefined and simplistic synthetic degradations. Consequently, these methods often encounter significant performance degradation when confronted with real-world degradations that deviate from their assumptions. Furthermore, existing interactive image restoration approaches solely support global modulation, wherein a single modulation factor governs the reconstruction process for the entire image. In this paper, we propose a novel method to perform real-world and intricate image super-resolution in an interactive manner. Specifically, we propose a metric-learning-based degradation estimation strategy to estimate not only the overall degradation level of the entire image but also the finer-grained, pixel-wise degradation within real-world scenarios. This enables local control over the restoration results by selectively modulating the corresponding regions based on the densely-estimated degradation map. Additionally, a new metric-augmented loss is proposed to further enhance the performance of real-world image super-resolution. Through extensive experimentation, we demonstrate the efficacy of our method in achieving exceptional modulation and restoration performance in real-world image super-resolution tasks, all while maintaining an appealing model complexity.

Abstract:
The design of neural networks typically involves trial-and-error, a time-consuming process for obtaining an optimal architecture, even for experienced researchers. Additionally, it is widely accepted that loss functions of deep neural networks are generally non-convex with respect to the parameters to be optimised. We propose the Layer-wise Convex Theorem to ensure that the loss is convex with respect to the parameters of a given layer, achieved by constraining each layer to be an overdetermined system of non-linear equations. Based on this theorem, we developed an end-to-end algorithm (the AutoNet) to automatically generate layer-wise convex networks (LCNs) for any given training set. We then demonstrate the performance of the AutoNet-generated LCNs (AutoNet-LCNs) compared to state-of-the-art models on three electrocardiogram (ECG) classification benchmark datasets, with further validation on two non-ECG benchmark datasets for more general tasks. The AutoNet-LCN was able to find networks customised for each dataset without manual fine-tuning in under 2 GPU-hours, and the resulting networks outperformed the state-of-the-art models with fewer than 5% of the parameters on all five benchmark datasets. The efficiency and robustness of the AutoNet-LCN markedly reduce model discovery costs and enable efficient training of deep learning models in resource-constrained settings.

Abstract:
Aside from graph neural networks (GNNs) attracting significant attention as a powerful framework revolutionizing graph representation learning, there has been an increasing demand for explaining GNN models. Although various explanation methods for GNNs have been developed, most studies have focused on instance-level explanations, which produce explanations tailored to a given graph instance. In our study, we propose Prototype-bAsed GNN-Explainer (PAGE), a novel model-level GNN explanation method that explains what the underlying GNN model has learned for graph classification by discovering human-interpretable prototype graphs. Our method produces explanations for a given class, thus being capable of offering more concise and comprehensive explanations than those of instance-level explanations. First, PAGE selects embeddings of class-discriminative input graphs on the graph-level embedding space after clustering them. Then, PAGE discovers a common subgraph pattern by iteratively searching for high matching node tuples using node-level embeddings via a prototype scoring function, thereby yielding a prototype graph as our explanation. Using six graph classification datasets, we demonstrate that PAGE qualitatively and quantitatively outperforms the state-of-the-art model-level explanation method. We also carry out systematic experimental studies by demonstrating the relationship between PAGE and instance-level explanation methods, the robustness of PAGE to input data scarce environments, and the computational efficiency of the proposed prototype scoring function in PAGE.

Abstract:
Salient object ranking (SOR) aims to segment salient objects in an image and simultaneously predict their saliency rankings, according to the shifted human attention over different objects. The existing SOR approaches mainly focus on object-based attention, e.g., the semantics and appearance of objects. However, we find that the scene context plays a vital role in SOR, in which the saliency ranking of the same object varies considerably across different scenes. In this paper, we thus make the first attempt towards explicitly learning scene context for SOR. Specifically, we establish a large-scale SOR dataset of 24,373 images with rich context annotations, i.e., scene graphs, segmentation, and saliency rankings. Inspired by the data analysis on our dataset, we propose a novel graph hypernetwork, named HyperSOR, for context-aware SOR. In HyperSOR, an initial graph module is developed to segment objects and construct an initial graph by considering both geometry and semantic information. Then, a scene graph generation module with a multi-path graph attention mechanism is designed to learn semantic relationships among objects based on the initial graph. Finally, a saliency ranking prediction module dynamically adopts the learned scene context through a novel graph hypernetwork, for inferring the saliency rankings. Experimental results show that our HyperSOR can significantly improve the performance of SOR.

Affiliations: College of Electronic Science and Technology, National University of Defense Technology (NUDT), Changsha, Hunan, China; College of Systems Engineering, National University of Defense Technology (NUDT), Changsha, Hunan, China; College of Liberal Arts and Sciences, National University of Defense Technology (NUDT), Changsha, Hunan, China; College of Intelligence Science and Technology, National University of Defense Technology (NUDT), Changsha, Hunan, China; Center for Machine Vision and Signal Analysis, Oulu University, Oulu, Finland

Abstract:
Visual speech, referring to the visual domain of speech, has attracted increasing attention due to its wide applications, such as public security, medical treatment, military defense, and film entertainment. As a powerful AI strategy, deep learning techniques have extensively promoted the development of visual speech learning. Over the past five years, numerous deep learning based methods have been proposed to address various problems in this area, especially automatic visual speech recognition and generation. To push forward future research on visual speech, this paper will present a comprehensive review of recent progress in deep learning methods on visual speech analysis. We cover different aspects of visual speech, including fundamental problems, challenges, benchmark datasets, a taxonomy of existing methods, and state-of-the-art performance. Besides, we also identify gaps in current research and discuss inspiring future research directions.

Abstract:
Estimating reliable geometric model parameters from data with severe outliers is a fundamental and important task in computer vision. This paper attempts to sample high-quality subsets and select model instances to estimate parameters in multi-structural data. To address this, we propose an effective method called Latent Semantic Consensus (LSC). The principle of LSC is to preserve the latent semantic consensus in both data points and model hypotheses. Specifically, LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses, respectively. Then, LSC explores the distributions of points in the two latent semantic spaces, to remove outliers, generate high-quality model hypotheses, and effectively estimate model instances. Finally, LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting, due to its deterministic fitting nature and efficiency. Compared with several state-of-the-art model fitting methods, our LSC is significantly superior in both accuracy and speed on synthetic data and real images.

Abstract:
Self-supervised representation learning for 3D point clouds has attracted increasing attention. However, existing methods in the field of 3D computer vision generally use fixed embeddings to represent the latent features, and impose hard constraints on the embeddings to make the latent feature values of the positive samples converge to consistency, which limits the ability of feature extractors to generalize over different data domains. To address this issue, we propose a Generative Variational-Contrastive Learning (GVC) model, where Gaussian distribution is used to construct a continuous, smoothed representation of the latent features. A distribution constraint and cross-supervision are constructed to improve the transfer ability of the feature extractor over synthetic and real-world data. Specifically, we design a variational contrastive module to constrain the feature distribution instead of feature values corresponding to each sample in the latent space. Moreover, a generative cross-supervision module is introduced to preserve the invariance features and promote the consistency of feature distribution among positive samples. Experimental results demonstrate that GVC achieves SOTA on different downstream tasks. In particular, with only pre-training on the synthetic dataset, GVC achieves a lead of 8.4% and 14.2% when transferring to the real-world dataset in the linear classification and few-shot classification.

Abstract:
Data-free knowledge distillation (DFKD) improves the student model (S) by mimicking the class probability from a pre-trained teacher model (T) without training data. Under such a setting, an ideal scenario is that T can help generate "good" samples from a generator (G) to maximally benefit S. However, existing methods suffer from non-ideal generated samples under the disturbance of the gap (i.e., either too large or too small) between the class probabilities of T and S; for example, generated samples with too large a gap may exhibit excessive information for S, while too small a gap leads to limited knowledge in the samples, resulting in poor generalization. Meanwhile, they fail to judge the "goodness" of the generated samples for S since the fixed T is not necessarily ideal. In this paper, we aim to answer what is inside the gap box, together with how to yield "good" generated samples for DFKD. To this end, we propose a Gap-Sensitive Sample Generation (GapSSG) approach, by revisiting the empirical distilled risk from a data-free perspective, which confirms the existence of an ideal teacher (T^), while theoretically implying: (1) the gap disturbance originates from the mismatch between T and T^, hence the class probabilities of T enable the approximation to those of T^; and (2) "good" samples should maximally benefit S via T's class probabilities, owing to the unknown T^. To this end, we unpack the gap box between T and S as two findings: the inherent gap to perceive T and T^; the derived gap to monitor S and T^. Benefiting from the derived gap that focuses on the adaptability of generated samples to S, we attempt to track the student's training route (a series of training epochs) to capture the category distribution of S; upon which, a regulatory factor is further devised to approximate T^ over the inherent gap, so as to generate "good" samples for S. Furthermore, during the distillation process, a sample-balanced strategy is introduced to tackle the overfitting and missing knowledge issues between the generated partial and critical samples by training G. Theoretical and empirical studies verify the advantages of GapSSG over the state of the art.

Affiliations: Department of Intelligent Data Science, College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Trustworthy Machine Learning Lab, School of Computer Science, Faculty of Engineering, University of Sydney, Darlington, NSW, Australia; State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China; Department of Automation, University of Science and Technology of China, Hefei, Anhui, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong

Abstract:
Given data with noisy labels, over-parameterized deep networks suffer from overfitting mislabeled data, resulting in poor generalization. The memorization effect of deep networks shows that although the networks have the ability to memorize all noisy data, they would first memorize clean training data, and then gradually memorize mislabeled training data. A simple and effective method that exploits the memorization effect to combat noisy labels is early stopping. However, early stopping cannot distinguish the memorization of clean data and mislabeled data, resulting in the network still inevitably overfitting mislabeled data in the early training stage. In this paper, to decouple the memorization of clean data and mislabeled data, and further reduce the side effect of mislabeled data, we perform additive decomposition on network parameters. Namely, all parameters are additively decomposed into two groups, i.e., parameters $\mathbf{w}$ are decomposed as $\mathbf{w} = \boldsymbol{\sigma} + \boldsymbol{\gamma}$. Afterward, the parameters $\boldsymbol{\sigma}$ are considered to memorize clean data, while the parameters $\boldsymbol{\gamma}$ are considered to memorize mislabeled data. Benefiting from the memorization effect, the updates of the parameters $\boldsymbol{\sigma}$ are encouraged to fully memorize clean data in early training, and then discouraged with the increase of training epochs to reduce interference of mislabeled data. The updates of the parameters $\boldsymbol{\gamma}$ are the opposite. In testing, only the parameters $\boldsymbol{\sigma}$ are employed to enhance generalization. Extensive experiments on both simulated and real-world benchmarks confirm the superior performance of our method.
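
The bookkeeping behind the additive decomposition can be illustrated with a toy one-parameter regression. The sketch below is only an assumption-laden illustration: the linear schedule that scales the two updates, the function name `train_decomposed`, and the data are all hypothetical, and the abstract does not specify how the authors encourage or discourage updates. A one-parameter linear model also does not exhibit the memorization effect itself; the example only shows how the forward pass uses the sum of the two parts while the updates are split between them.

```python
def train_decomposed(xs, ys, epochs=50, lr=0.05):
    """Fit y ~ w * x with w = sigma + gamma and epoch-dependent update sharing."""
    sigma, gamma = 0.0, 0.0
    for t in range(epochs):
        w = sigma + gamma  # the forward pass always uses the sum
        # Gradient of the mean squared error with respect to w; because
        # w = sigma + gamma, it is also the gradient for sigma and for gamma.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        frac = t / (epochs - 1)  # hypothetical schedule running from 0 to 1
        sigma -= lr * (1.0 - frac) * grad  # sigma: large updates early, small late
        gamma -= lr * frac * grad          # gamma: small updates early, large late
    return sigma, gamma


# Toy data (hypothetical): three points follow y = 2x, the last one is mislabeled.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 0.0]
sigma, gamma = train_decomposed(xs, ys)
w_train = sigma + gamma  # parameter used during training
w_test = sigma           # only sigma is kept at test time
```

Because the two update scales sum to one at every epoch, the combined parameter follows ordinary gradient descent, while most of the early progress is stored in `sigma`.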

Abstract:
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still have difficulty in handling low-textured planar regions, which are common in indoor scenes. An approach to solving this issue is to incorporate planar constraints into the depth map estimation in multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that the planar constraints can be conveniently integrated into the recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world assumption and the Atlanta-world assumption, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve the inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin on 3D reconstruction quality.

Abstract:
Matrix factorization is a popular framework for modeling low-rank data matrices. Motivated by manifold learning problems, this paper proposes a quadratic matrix factorization (QMF) framework to learn the curved manifold on which the dataset lies. Unlike local linear methods such as the local principal component analysis, QMF can better exploit the curved structure of the underlying manifold. Algorithmically, we propose an alternating minimization algorithm to optimize QMF and establish its theoretical convergence properties. To avoid possible over-fitting, we then propose a regularized QMF algorithm and discuss how to tune its regularization parameter. Finally, we elaborate how to apply the regularized QMF to manifold learning problems. Experiments on a synthetic manifold learning dataset and three real-world datasets, including the MNIST handwritten dataset, a cryogenic electron microscopy dataset, and the Frey Face dataset, demonstrate the superiority of the proposed method over its competitors.

Abstract:
The hybrid deep models of Vision Transformer (ViT) and Convolution Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks. Scaling up the input resolution of such hybrid backbones naturally strengthens model capacity, but inevitably suffers from heavy computational cost that scales quadratically. Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs. HIRI-ViT is built upon the seminal idea of decomposing the typical CNN operations into two parallel CNN branches in a cost-efficient manner. One high-resolution branch directly takes primary high-resolution features as inputs, but uses fewer convolution operations. The other low-resolution branch first performs down-sampling and then utilizes more convolution operations over such low-resolution features. Experiments on both the recognition task (ImageNet-1K dataset) and dense prediction tasks (COCO and ADE20K datasets) demonstrate the superiority of HIRI-ViT. More remarkably, under comparable computational cost (∼5.0 GFLOPs), HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448×448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224×224 inputs.

Abstract:
Molecular property prediction plays a fundamental role in AI-aided drug discovery to identify candidate molecules, which is also essentially a few-shot problem due to the lack of labeled data. In this paper, we propose Property-Aware Relation networks (PAR) to handle this problem. We first introduce a property-aware molecular encoder to transform the generic molecular embeddings to property-aware ones. Then, we design a query-dependent relation graph learning module to estimate the molecular relation graph and refine molecular embeddings w.r.t. the target property. Thus, the facts that both property-related information and relationships among molecules change across different properties are utilized to better learn and propagate molecular embeddings. Generally, PAR can be regarded as a combination of metric-based and optimization-based few-shot learning methods. We further extend PAR to Transferable PAR (T-PAR) to handle the distribution shift, which is common in drug discovery. The keys are joint sampling and relation graph learning schemes, which simultaneously learn molecular embeddings from both source and target domains. Extensive results on benchmark datasets show that PAR and T-PAR consistently outperform existing methods on few-shot and transferable few-shot molecular property prediction tasks, respectively. Besides, ablation and case studies are conducted to validate the rationality of our designs in PAR and T-PAR.

Abstract:
Clustering is a fundamental topic in machine learning, and various methods have been proposed, among which K-Means (KM) and min cut clustering are typical ones. However, they may produce empty or skewed clustering results, which are not as expected. For KM, constrained clustering methods have been fully studied, while for min cut clustering they still need to be developed. In this paper, we propose a parameter-insensitive min cut clustering with flexible size constraints. Specifically, we add lower limits on the number of samples for each cluster, which avoids the trivial solution in min cut clustering. To the best of our knowledge, this is the first attempt at directly incorporating size constraints into min cut. However, the resulting problem is NP-hard and difficult to solve. Thus, upper limits are also added, but the problem remains difficult to solve. Therefore, an additional variable that is equivalent to the label matrix is introduced, and the augmented Lagrangian multiplier (ALM) method is used to decouple the constraints. In the experiments, we find that our algorithm is less sensitive to the lower bound and is practical in image segmentation. A large number of experiments demonstrate the effectiveness of our proposed algorithm.
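
The effect of a lower size limit can be seen on a tiny graph. The sketch below is a brute-force illustration only, feasible for a handful of nodes: the graph, the function names, and the exhaustive search are assumptions made for illustration and are unrelated to the ALM-based solver described in the abstract.

```python
from itertools import combinations


def cut_weight(weights, part_a):
    """Sum of edge weights crossing between part_a and the remaining nodes."""
    return sum(w for (u, v), w in weights.items() if (u in part_a) != (v in part_a))


def best_two_way_cut(nodes, weights, min_size=1):
    """Exhaustively search 2-way partitions with at least min_size nodes per side."""
    best = None
    for k in range(min_size, len(nodes) - min_size + 1):
        for part_a in combinations(nodes, k):
            side = set(part_a)
            cost = cut_weight(weights, side)
            if best is None or cost < best[0]:
                best = (cost, side)
    return best


# Hypothetical graph: a tight triangle {0, 1, 2}, a pair {3, 4} linked to it
# by one bridge edge, and node 5 hanging off node 4 by a single weak edge.
nodes = [0, 1, 2, 3, 4, 5]
weights = {
    (0, 1): 5, (1, 2): 5, (0, 2): 5,  # triangle
    (2, 3): 2,                        # bridge between the two groups
    (3, 4): 5,
    (4, 5): 1,                        # weak link to node 5
}

unconstrained = best_two_way_cut(nodes, weights, min_size=1)
constrained = best_two_way_cut(nodes, weights, min_size=2)
```

Without a size limit, the cheapest cut simply isolates node 5 (cost 1), which is the skewed solution the abstract warns about. Requiring at least two nodes per cluster instead yields the split {0, 1, 2} versus {3, 4, 5} at cost 2.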

Abstract:
Optical aberration is a ubiquitous degeneration in realistic lens-based imaging systems. Optical aberrations are caused by the differences in the optical path length when light travels through different regions of the camera lens with different incident angles. The blur and chromatic aberrations manifest significant discrepancies when the optical system changes. This work designs a transferable and effective image simulation system of simple lenses via multi-wavelength, depth-aware, spatially-variant four-dimensional point spread function (4D-PSF) estimation by changing a small number of lens-dependent parameters. The image simulation system can alleviate the overhead of dataset collection and exploits the principle of computational imaging for effective optical aberration correction. With the guidance of domain knowledge about the image formation model provided by the 4D-PSFs, we establish a multi-scale optical aberration correction network for degraded image reconstruction, which consists of a scene depth estimation branch and an image restoration branch. Specifically, we propose to predict adaptive filters with the depth-aware PSFs and carry out dynamic convolutions, which facilitate the model's generalization in various scenes. We also employ convolution and self-attention mechanisms for global and local feature extraction and realize a spatially-variant restoration. The multi-scale feature extraction complements the features across different scales and provides fine details and contextual features. Extensive experiments demonstrate that our proposed algorithm performs favorably against state-of-the-art restoration methods.

Abstract:
Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various DeepFake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM$^4$). DGM$^4$ aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM$^4$ dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs: 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.

Abstract:
Group re-identification (GReID) aims to correctly associate group images belonging to the same group identity, which is a crucial task for video surveillance. Existing methods only model the member feature representations inside each image (regarded as spatial members), which leads to potential failures in long-term video surveillance due to cloth-changing behaviors. Therefore, we focus on a new task called cloth-changing group re-identification (CCGReID), which needs to consider group relationship modeling in GReID and robust group representation against cloth-changing members. In this paper, we propose the separable spatial-temporal residual graph (SSRG) for CCGReID. Unlike existing GReID methods, SSRG considers both spatial members inside each group image and temporal members among multiple group images with the same identity. Specifically, SSRG constructs full graphs for each group identity within the batched data, which will be completely and non-redundantly separated into the spatial member graph (SMG) and temporal member graph (TMG). SMG aims to extract group features from spatial members, and TMG improves the robustness of the cloth-changing members by feature propagation. The separability enables SSRG to be available in the inference rather than only assisting supervised training. The residual guarantees efficient SSRG learning for SMG and TMG. To expedite research in CCGReID, we develop two datasets, including GroupPRCC and GroupVC, based on the existing CCReID datasets. The experimental results show that SSRG achieves state-of-the-art performance, including the best accuracy and low degradation (only 2.15% on GroupVC). Moreover, SSRG can be well generalized to the GReID task. As a weakly supervised method, SSRG surpasses the performance of some supervised methods and even approaches the best performance on the CSG dataset.

Affiliations: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Industrial Engineering, University of Houston, Houston, TX, USA; Department of Data Science, New Jersey Institute of Technology, Newark, NJ, USA; Department of Computer Science, Guangzhou Maritime University, Guangzhou, Guangdong, China; School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, Guangdong, China; Hikvision Research Institute, Hangzhou, Zhejiang, China; Department of Computer Science and Engineering, and the John Hopcroft Center, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Various attribution methods have been developed to explain deep neural networks (DNNs) by inferring the attribution/importance/contribution score of each input variable to the final output. However, existing attribution methods are often built upon different heuristics. There remains a lack of a unified theoretical understanding of why these methods are effective and how they are related. Furthermore, there is still no universally accepted criterion to compare whether one attribution method is preferable over another. In this paper, we resort to Taylor interactions and for the first time, we discover that fourteen existing attribution methods, which define attributions based on fully different heuristics, actually share the same core mechanism. Specifically, we prove that attribution scores of input variables estimated by the fourteen attribution methods can all be mathematically reformulated as a weighted allocation of two typical types of effects, i.e., independent effects of each input variable and interaction effects between input variables. The essential difference among these attribution methods lies in the weights of allocating different effects. Inspired by these insights, we propose three principles for fairly allocating the effects, which serve as new criteria to evaluate the faithfulness of attribution methods. In summary, this study can be considered as a new unified perspective to revisit fourteen attribution methods, which theoretically clarifies essential similarities and differences among these methods. Besides, the proposed new principles enable people to make a direct and fair comparison among different methods under the unified perspective.

Abstract:
Meta-learning empowers learning systems with the ability to acquire knowledge from multiple tasks, enabling faster adaptation and generalization to new tasks. This review provides a comprehensive technical overview of meta-learning, emphasizing its importance in real-world applications where data may be scarce or expensive to obtain. The article covers the state-of-the-art meta-learning approaches and explores the relationship between meta-learning and multi-task learning, transfer learning, domain adaptation and generalization, self-supervised learning, personalized federated learning, and continual learning. By highlighting the synergies between these topics and the field of meta-learning, the article demonstrates how advancements in one area can benefit the field as a whole, while avoiding unnecessary duplication of efforts. Additionally, the article delves into advanced meta-learning topics such as learning from complex multi-modal task distributions, unsupervised meta-learning, learning to efficiently adapt to data distribution shifts, and continual meta-learning. Lastly, the article highlights open problems and challenges for future research in the field. By synthesizing the latest research developments, this article provides a thorough understanding of meta-learning and its potential impact on various machine learning applications. We believe that this technical overview will contribute to the advancement of meta-learning and its practical implications in addressing real-world problems.

Abstract:
Point-based object localization (POL), which pursues high-performance object sensing under low-cost data annotation, has attracted increased attention. However, the point annotation mode inevitably introduces semantic variance due to the inconsistency of annotated points. To handle this problem, existing POL methods rely heavily on strict annotation rules, which are difficult to define and apply. In this study, we propose coarse point refinement (CPR), which to the best of our knowledge is the first attempt to alleviate semantic variance from an algorithmic perspective. CPR reduces the semantic variance by selecting a semantic centre point in a neighbourhood region to replace the initial annotated point. Furthermore, we design a sampling region estimation module to dynamically compute a sampling region for each object and use a cascaded structure to achieve end-to-end optimization. We further integrate a variance regularization into the structure to concentrate the predicted scores, yielding CPR++. We observe that CPR++ can obtain scale information and further reduce the semantic variance in a global region, thus guaranteeing high-performance object localization. Extensive experiments on four challenging datasets validate the effectiveness of both CPR and CPR++. We hope our work can inspire more research on designing algorithms rather than annotation rules to address the semantic variance problem in POL.

Abstract:
Change captioning aims to describe the semantic change between two similar images. In this process, as the most typical distractor, viewpoint change leads to the pseudo changes about appearance and position of objects, thereby overwhelming the real change. Besides, since the visual signal of change appears in a local region with weak feature, it is difficult for the model to directly translate the learned change features into the sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes, and build reliable cross-modal alignment between the change features and linguistic words during caption generation. Specifically, a multi-aspect relation learning network is designed to 1) explore the fine-grained changes under irrelevant distractors (e.g., viewpoint change) by embedding the relations of semantics and relative position into the features of each image; 2) learn two view-invariant image representations by strengthening their global contrastive alignment relation, so as to help capture a stable difference representation; 3) provide the model with the prior knowledge about whether and where the semantic change happened by measuring the relation between the representations of captured difference and the image pair. Through the above manner, the model can learn effective change features for caption generation. Further, we introduce the syntax knowledge of Part-of-Speech (POS) and devise a POS-based visual switch to calibrate the transformer decoder. The POS-based visual switch dynamically utilizes visual information during different word generation based on the POS of words. This enables the decoder to build reliable cross-modal alignment, so as to generate a high-level linguistic sentence about change. Extensive experiments show that the proposed method achieves the state-of-the-art performance on the three public datasets.

Abstract:
Transformers have been widely used for video processing owing to the multi-head self attention (MHSA) mechanism. However, the MHSA mechanism encounters an intrinsic difficulty for video inpainting, since the features associated with the corrupted regions are degraded and incur inaccurate self attention. This problem, termed query degradation, may be mitigated by first completing optical flows and then using the flows to guide the self attention, which was verified in our previous work, the flow-guided transformer (FGT). We further exploit the flow guidance and propose FGT++ to pursue more effective and efficient video inpainting. First, we design a lightweight flow completion network by using local aggregation and edge loss. Second, to address the query degradation, we propose a flow guidance feature integration module, which uses the motion discrepancy to enhance the features, together with a flow-guided feature propagation module that warps the features according to the flows. Third, we decouple the transformer along the temporal and spatial dimensions, where flows are used to select the tokens through a temporally deformable MHSA mechanism, and global tokens are combined with the inner-window local tokens through a dual-perspective MHSA mechanism. Experiments show that FGT++ outperforms existing video inpainting networks both qualitatively and quantitatively.

Abstract:
Stochastic optimization of the Area Under the Precision-Recall Curve (AUPRC) is a crucial problem for machine learning. Despite extensive studies on AUPRC optimization, generalization is still an open problem. In this work, we present the first trial in the algorithm-dependent generalization of stochastic AUPRC optimization. The obstacles to our destination are three-fold. First, according to the consistency analysis, the majority of existing stochastic estimators are biased with biased sampling strategies. To address this issue, we propose a stochastic estimator with sampling-rate-invariant consistency and reduce the consistency error by estimating the full-batch scores with score memory. Second, standard techniques for algorithm-dependent generalization analysis cannot be directly applied to listwise losses. To fill this gap, we extend the model stability from instance-wise losses to listwise losses. Third, AUPRC optimization involves a compositional optimization problem, which brings complicated computations. In this work, we propose to reduce the computational complexity by matrix spectral decomposition. Based on these techniques, we derive the first algorithm-dependent generalization bound for AUPRC optimization. Motivated by theoretical results, we propose a generalization-induced learning framework, which improves the AUPRC generalization by equivalently increasing the batch size and the number of valid training examples. Practically, experiments on image retrieval and long-tailed classification speak to the effectiveness and soundness of our framework.
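As an informal illustration of why AUPRC is a listwise quantity, the sketch below computes average precision (a common empirical estimate of AUPRC) in plain Python and compares the full-set value with the mean of per-batch values. The function name `average_precision`, the toy scores, and the batch split are assumptions made for illustration; this is not the estimator proposed in the paper.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision@rank taken at each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0

# Hypothetical toy data: four examples, two positives.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]

full = average_precision(scores, labels)

# Split the same data into two mini-batches and average the per-batch values.
batch_a = average_precision(scores[:2], labels[:2])
batch_b = average_precision(scores[2:], labels[2:])
batch_mean = (batch_a + batch_b) / 2

print(full, batch_mean)
```

Each positive's contribution depends on how every other example is ranked, so the per-batch average (1.0 here) need not match the full-set value (about 0.83). This is one simple way to see why batch-based estimates can deviate from the full-data quantity.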

Abstract:
One-shot skeleton action recognition, which aims to learn a skeleton action recognition model with a single training sample, has attracted increasing interest due to the challenge of collecting and annotating large-scale skeleton action data. However, most existing studies match skeleton sequences by comparing their feature vectors directly which neglects spatial structures and temporal orders of skeleton data. This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching. We represent skeleton data at multiple spatial and temporal scales and achieve optimal feature matching from two perspectives. The first is multi-scale matching which captures the scale-wise semantic relevance of skeleton data at multiple spatial and temporal scales simultaneously. The second is cross-scale matching which handles different motion magnitudes and speeds by capturing sample-wise relevance across multiple scales. Extensive experiments over three large-scale datasets (NTU RGB+D, NTU RGB+D 120, and PKU-MMD) show that our method achieves superior one-shot skeleton action recognition, and outperforms SOTA consistently by large margins.

Abstract:
Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. Due to this, model performance drops drastically when evaluated on label-scarce datasets having visually distinct images, termed as domain adaptation problem. There are a plethora of works to adapt classification and segmentation models to label-scarce target dataset through unsupervised domain adaptation. Considering that detection is a fundamental task in computer vision, many recent works have focused on developing novel domain adaptive detection techniques. Here, we describe in detail the domain adaptation problem for detection and present an extensive survey of the various methods. Furthermore, we highlight strategies proposed and the associated shortcomings. Subsequently, we identify multiple aspects of the problem that are most promising for future research. We believe that this survey shall be valuable to the pattern recognition experts working in the fields of computer vision, biometrics, medical imaging, and autonomous navigation by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research.

Abstract:
We present incomplete gamma kernels, a generalization of Locally Optimal Projection (LOP) operators. In particular, we reveal the relation of the classical localized $L_1$ estimator, used in the LOP operator for point cloud denoising, to the common Mean Shift framework via a novel kernel. Furthermore, we generalize this result to a whole family of kernels that are built upon the incomplete gamma function and each represents a localized $L_p$ estimator. By deriving various properties of the kernel family concerning distributional, Mean Shift induced, and other aspects such as strict positive definiteness, we obtain a deeper understanding of the operator's projection behavior. From these theoretical insights, we illustrate several applications ranging from an improved Weighted LOP (WLOP) density weighting scheme and a more accurate Continuous LOP (CLOP) kernel approximation to the definition of a novel set of robust loss functions. These incomplete gamma losses include the Gaussian and LOP loss as special cases and can be applied to various tasks including normal filtering. Furthermore, we show that the novel kernels can be included as priors into neural networks. We demonstrate the effects of each application in a range of quantitative and qualitative experiments that highlight the benefits induced by our modifications.

Abstract:
Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, one of the first diffusion model-based text-driven motion generation frameworks, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation.

Abstract:
Branch-and-bound-based consensus maximization stands out due to its important ability of retrieving the globally optimal solution to outlier-affected geometric problems. However, while the discovery of such solutions carries high scientific value, its application in practical scenarios is often prohibited by its computational complexity growing exponentially as a function of the dimensionality of the problem at hand. In this work, we present a novel, general technique that allows us to branch over an $(n-1)$-dimensional space for an $n$-dimensional problem. The remaining degree of freedom can be solved globally optimally within each bound calculation by applying the efficient interval stabbing technique. While each individual bound derivation is harder to compute owing to the additional need for solving a sorting problem, the reduced number of intervals and tighter bounds in practice lead to a significant reduction in the overall number of required iterations. Besides an abstract introduction of the approach, we present applications to four fundamental geometric computer vision problems: camera resectioning, relative camera pose estimation, point set registration, and rotation and focal length estimation. Through our exhaustive tests, we demonstrate significant speed-up factors at times exceeding two orders of magnitude, thereby increasing the viability of globally optimal consensus maximizers in online application scenarios.
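Interval stabbing itself is a classical sort-and-sweep routine, sketched below in generic form. How the intervals are derived from measurements is problem-specific and not shown; the function name `max_stabbing` and the sample intervals are illustrative assumptions, with each interval standing for the range of the remaining parameter over which one measurement would count as an inlier.

```python
def max_stabbing(intervals):
    """Return (point, count): a point covered by the most closed intervals [lo, hi]."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))  # 0 = interval opens; sorts before a close at the same x
        events.append((hi, 1))  # 1 = interval closes
    events.sort()               # the sorting step dominates the cost: O(m log m)

    best_count, best_point, active = 0, None, 0
    for x, kind in events:
        if kind == 0:
            active += 1
            if active > best_count:
                best_count, best_point = active, x
        else:
            active -= 1
    return best_point, best_count

# Hypothetical intervals for four measurements.
point, count = max_stabbing([(0, 3), (1, 4), (2, 5), (6, 7)])
print(point, count)  # 2 3
```

Because the sweep returns the exact maximum over the one-dimensional parameter, that dimension does not need to be subdivided by branching, which matches the role the abstract assigns to the remaining degree of freedom.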

Abstract:
Fine-grained image retrieval mainly focuses on learning salient features from the seen subcategories as discriminative embedding while neglecting the problems behind zero-shot settings. We argue that retrieving fine-grained objects from unseen subcategories may rely on more diverse clues, which are easily restrained by the salient features learnt from seen subcategories. To address this issue, we propose a novel Content-aware Rectified Activation model, which enables this model to suppress the activation on salient regions while preserving their discrimination, and spread activation to adjacent non-salient regions, thus mining more diverse discriminative features for retrieving unseen subcategories. Specifically, we construct a content-aware rectified prototype (CARP) by perceiving semantics of salient regions. CARP acts as a channel-wise non-destructive activation upper bound and can be selectively used to suppress salient regions for obtaining the rectified features. Moreover, two regularizations are proposed: 1) a semantic coherency constraint that imposes a restriction on semantic coherency of CARP and salient regions, aiming at propagating the discriminative ability of salient regions to CARP, 2) a feature-navigated constraint to further guide the model to adaptively balance the discrimination power of rectified features and the suppression power of salient features. Experimental results on fine-grained and product retrieval benchmarks demonstrate that our method consistently outperforms the state-of-the-art methods.

Abstract:
Generative Adversarial Networks (GANs) have significantly advanced image synthesis through mapping randomly sampled latent codes to high-fidelity synthesized images. However, applying well-trained GANs to real image editing remains challenging. A common solution is to find an approximate latent code that can adequately recover the input image to edit, which is also known as GAN inversion. To invert a GAN model, prior works typically focus on reconstructing the target image at the pixel level, yet few studies are conducted on whether the inverted result can well support manipulation at the semantic level. This work fills in this gap by proposing in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. In this way, we manage to sufficiently reuse the knowledge learned by GANs for image reconstruction, facilitating a wide range of editing applications without any retraining. We further make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property. Such a trade-off sheds light on how a GAN model represents an image with various semantics encoded in the learned latent distribution.

Abstract:
Most existing learning-based deraining methods are trained with supervision on synthetic rainy-clean pairs. The domain gap between synthetic and real rain makes them generalize poorly to complex real rainy scenes. Moreover, the existing methods mainly utilize the property of the image or rain layers independently, while few of them have considered their mutually exclusive relationship. To solve the above dilemma, we explore the intrinsic intra-similarity within each layer and inter-exclusiveness between two layers and propose an unsupervised non-local contrastive learning (NLCL) deraining method. The non-local self-similarity image patches as the positives are tightly pulled together and rain patches as the negatives are remarkably pushed away, and vice versa. On one hand, the intrinsic self-similarity knowledge within positive/negative samples of each layer helps us discover more compact representations; on the other hand, the mutually exclusive property between the two layers enriches the discriminative decomposition. Thus, the internal self-similarity within each layer (similarity) and the external exclusive relationship of the two layers (dissimilarity), serving as a generic image prior, jointly enable us to differentiate the rain from the clean image without supervision. We further discover that the intrinsic dimension of the non-local image patches is generally higher than that of the rain patches. This insight motivates us to design an asymmetric contrastive loss that precisely models the compactness discrepancy of the two layers, thereby improving the discriminative decomposition. In addition, recognizing the limited quality of existing real rain datasets, which are often small-scale or obtained from the internet, we collect a large-scale real dataset under various rainy weathers that contains high-resolution rainy images. Extensive experiments conducted on different real rainy datasets demonstrate that the proposed method obtains state-of-the-art performance in real deraining.

Abstract:
Though very popular, the Expectation-Maximisation (EM) algorithm for the Gaussian mixture model is well known to perform poorly for non-Gaussian distributions or in the presence of outliers or noise. In this paper, we propose a Flexible EM-like Clustering Algorithm (FEMCA), a new clustering algorithm that follows an EM procedure. It is based on estimates of both cluster centers and covariances. In addition, using a semi-parametric paradigm, the method estimates an unknown scale parameter per data point. This allows the algorithm to accommodate heavier-tailed distributions, noise, and outliers without significantly losing efficiency in various classical scenarios. We first present the general underlying model for independent, but not necessarily identically distributed, samples of elliptical distributions. We then derive and analyze the proposed algorithm in this context, showing in particular important distribution-free properties of the underlying data distributions. The convergence and accuracy properties of the algorithm are first analyzed on synthetic data. Finally, we show that FEMCA outperforms other classical unsupervised methods from the literature, such as k-means, EM for Gaussian mixture models and its recent modifications, or spectral clustering, when applied to real data sets such as MNIST, NORB, and 20newsgroups.

Abstract:
Neural Architecture Search (NAS), aiming at automatically designing neural architectures by machines, has been considered a key step toward automatic machine learning. One notable NAS branch is the weight-sharing NAS, which significantly improves search efficiency and allows NAS algorithms to run on ordinary computers. Despite receiving high expectations, this category of methods suffers from low search effectiveness. By employing a generalization boundedness tool, we demonstrate that the devil behind this drawback is the untrustworthy architecture rating with the oversized search space of the possible architectures. Addressing this problem, we modularize a large search space into blocks with small search spaces and develop a family of models with the distilling neural architecture (DNA) techniques. These proposed models, namely a DNA family, are capable of resolving multiple dilemmas of the weight-sharing NAS, such as scalability, efficiency, and multi-modal compatibility. Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a sub-search space using heuristic algorithms. Moreover, under a certain computational complexity constraint, our method can seek architectures with different depths and widths. Extensive experimental evaluations show that our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively. Additionally, we provide in-depth empirical analysis and insights into neural architecture ratings.

Abstract:
Making line segment detectors more reliable under motion blur is one of the most important challenges for practical applications, such as visual SLAM and 3D line mapping. Existing line segment detection methods suffer severe performance degradation in accurately detecting and locating line segments when motion blur occurs. Event data, in contrast, show strong complementary characteristics to images, with minimal blur and edge awareness at high temporal resolution, which is potentially beneficial for reliable line segment recognition. To robustly detect line segments under motion blur, we propose to leverage the complementary information of images and events. Specifically, we first design a general frame-event feature fusion network to extract and fuse the detailed image textures and low-latency event edges, which consists of a channel-attention-based shallow fusion module and a self-attention-based dual hourglass module. We then utilize the state-of-the-art wireframe parsing networks to detect line segments on the fused feature map. Moreover, due to the lack of line segment detection datasets with pairwise motion-blurred images and events, we contribute two datasets, i.e., synthetic FE-Wireframe and realistic FE-Blurframe, for network training and evaluation. Extensive analyses on the component configurations demonstrate the design effectiveness of our fusion network. When compared to the state-of-the-art methods, the proposed approach achieves the highest detection accuracy while maintaining comparable real-time performance. In addition to being robust to motion blur, our method also exhibits superior performance for line detection in high dynamic range scenes.

Abstract:
Multilayer perceptron (MLP) has become the de facto backbone in two-view correspondence learning, for it can extract effective deep features from unordered correspondences individually. However, the problem of natively lacking context information limits its performance although many context-capturing modules are appended in the follow-up studies. In this paper, from a novel perspective, we design a correspondence learning network called ConvMatch that for the first time can leverage a convolutional neural network (CNN) as the backbone, inherently capable of context aggregation. Specifically, with the observation that sparse motion vectors and a dense motion field can be converted into each other with interpolating and sampling, we regularize the putative motion vectors by estimating the dense motion field implicitly, then rectify the errors caused by outliers in local areas with CNN, and finally obtain correct motion vectors from the rectified motion field. Moreover, we propose global information injection and bilateral convolution, to fit the overall spatial transformation better and accommodate the discontinuities of the motion field in case of large scene disparity. Extensive experiments reveal that ConvMatch consistently outperforms state-of-the-arts for relative pose estimation, homography estimation, and visual localization.

Abstract:
Federated learning aims to train models collaboratively across different clients without sharing data for privacy considerations. However, one major challenge for this learning paradigm is the data heterogeneity problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe dimensional collapse, in which representations tend to reside in a lower-dimensional space instead of the ambient space. This dimensional collapse phenomenon severely curtails the expressive power of models, leading to significant degradation in the performance. Next, via experiments, we make more observations and posit two reasons that result in this phenomenon: 1) dimensional collapse on local models; 2) the operation of global averaging on local model parameters. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity results in dimensional collapse. To remedy this problem caused by the data heterogeneity, we propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, FedDecorr applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. FedDecorr, which is implementation-friendly and computationally efficient, yields consistent improvements over various baselines on five standard benchmark datasets including CIFAR10, CIFAR100, TinyImageNet, Office-Caltech10, and DomainNet.
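One simple way to encourage uncorrelated representation dimensions is to penalize the off-diagonal entries of the batch correlation matrix. The NumPy sketch below does this; it is an assumption about the general form of such a regularizer and may differ from the exact term used in FedDecorr. The function name `decorrelation_penalty` and the toy data are illustrative.

```python
import numpy as np

def decorrelation_penalty(z, eps=1e-8):
    """Mean squared off-diagonal correlation of a batch of representations z (n x d)."""
    z = np.asarray(z, dtype=float)
    n, d = z.shape
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)  # standardize each dimension
    corr = z.T @ z / n                                 # d x d correlation matrix
    off_diag = corr - np.diag(np.diag(corr))           # keep only cross-dimension terms
    return float((off_diag ** 2).sum() / (d * d))

rng = np.random.default_rng(0)

# Collapsed representations: all three dimensions are multiples of one latent factor.
x = rng.normal(size=(1000, 1))
collapsed = np.hstack([x, 2 * x, -x])

# Spread-out representations: three independent dimensions.
independent = rng.normal(size=(1000, 3))

print(decorrelation_penalty(collapsed))    # close to 6/9, since all pairs are fully correlated
print(decorrelation_penalty(independent))  # close to 0
```

In training, a term like this would be added to the local task loss with a weighting coefficient, so that collapsed representations incur a larger penalty than spread-out ones.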

Abstract:
Recent research on multi-agent reinforcement learning (MARL) has shown that action coordination of multi-agents can be significantly enhanced by introducing communication learning mechanisms. Meanwhile, graph neural network (GNN) provides a promising paradigm for communication learning of MARL. Under this paradigm, agents and communication channels can be regarded as nodes and edges in the graph, and agents can aggregate information from neighboring agents through GNN. However, this GNN-based communication paradigm is susceptible to adversarial attacks and noise perturbations, and how to achieve robust communication learning under perturbations has been largely neglected. To this end, this paper explores this problem and introduces a robust communication learning mechanism with graph information bottleneck optimization, which can optimally realize the robustness and effectiveness of communication learning. We introduce two information-theoretic regularizers to learn the minimal sufficient message representation for multi-agent communication. The regularizers aim at maximizing the mutual information (MI) between the message representation and action selection while minimizing the MI between the agent feature and message representation. Besides, we present a MARL framework that can integrate the proposed communication mechanism with existing value decomposition methods. Experimental results demonstrate that the proposed method is more robust and efficient than state-of-the-art GNN-based MARL methods.

Abstract:
Since higher-order tensors are naturally suitable for representing multi-dimensional data in the real world, e.g., color images and videos, low-rank tensor representation has become one of the emerging areas in machine learning and computer vision. However, classical low-rank tensor representations can solely represent multi-dimensional discrete data on meshgrid, which hinders their potential applicability in many scenarios beyond meshgrid. To break this barrier, we propose a low-rank tensor function representation (LRTFR) parameterized by multilayer perceptrons (MLPs), which can continuously represent data beyond meshgrid with powerful representation abilities. Specifically, the suggested tensor function, which maps an arbitrary coordinate to the corresponding value, can continuously represent data in an infinite real space. Parallel to discrete tensors, we develop two fundamental concepts for tensor functions, i.e., the tensor function rank and low-rank tensor function factorization, and utilize MLPs to parameterize factor functions of the tensor function factorization. We theoretically justify that both low-rank and smooth regularizations are harmoniously unified in LRTFR, which leads to high effectiveness and efficiency for continuous data representation. Extensive multi-dimensional data recovery applications arising from image processing (image inpainting and denoising), machine learning (hyperparameter optimization), and computer graphics (point cloud upsampling) substantiate the superiority and versatility of our method as compared with state-of-the-art methods. Especially, the experiments beyond the original meshgrid resolution (hyperparameter optimization) or even beyond meshgrid (point cloud upsampling) validate the favorable performances of our method for continuous representation.

Affiliations: Sydney AI Center, School of Computer Science, Faculty of Engineering, University of Sydney, Darlington, NSW, Australia; Australian AI Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW, Australia; PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, China; Department of Automation, University of Science and Technology of China, Hefei, China; School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China

Abstract:
The sample selection approach is very popular in learning with noisy labels. As deep networks “learn pattern first”, prior methods built on sample selection share a similar training procedure: the small-loss examples can be regarded as clean examples and used for helping generalization, while the large-loss examples are treated as mislabeled ones and excluded from network parameter updates. However, such a procedure is debatable on two counts: (a) it does not consider the bad influence of noisy labels in selected small-loss examples; (b) it does not make good use of the discarded large-loss examples, which may be clean or have meaningful information for generalization. In this paper, we propose regularly truncated M-estimators (RTME) to address the above two issues simultaneously. Specifically, RTME can alternately switch modes between truncated M-estimators and original M-estimators. The former can adaptively select small-loss examples without knowing the noise rate and reduce the side-effects of noisy labels in them. The latter involves possibly clean examples that have large losses to help generalization. Theoretically, we demonstrate that our strategies are label-noise-tolerant. Empirically, comprehensive experimental results show that our method can outperform multiple baselines and is robust to broad noise types and levels.

Abstract:
Our goal with this survey is to provide an overview of the state-of-the-art deep learning methods for face generation and editing using StyleGAN. The survey covers the evolution of StyleGAN, from PGGAN to StyleGAN3, and explores relevant topics such as suitable metrics for training, different latent representations, GAN inversion to latent spaces of StyleGAN, face image editing, cross-domain face stylization, face restoration, and even Deepfake applications. We aim to provide an entry point into the field for readers who have basic knowledge about the field of deep learning and are looking for an accessible introduction and overview.

Abstract:
This paper addresses the problem of mapping high-dimensional data to a low-dimensional space, in the presence of other known features. This problem is ubiquitous in science and engineering as there are often controllable/measurable features in most applications. To solve this problem, this paper proposes a broad class of methods, which is referred to as conditional multidimensional scaling (MDS). An algorithm for optimizing the objective function of conditional MDS is also developed. The convergence of this algorithm is proven under mild assumptions. Conditional MDS is illustrated with kinship terms, facial expressions, textile fabrics, car-brand perception, and cylinder machining examples. These examples demonstrate the advantages of conditional MDS over conventional dimension reduction in improving the estimation quality of the reduced-dimension space and simplifying visualization and knowledge discovery tasks. Computer codes for this work are available in the open-source cml R package.

Abstract:
The objective of Active Learning is to strategically label a subset of the dataset to maximize performance within a predetermined labeling budget. In this study, we harness features acquired through self-supervised learning. We introduce a straightforward yet potent metric, Cluster Distance Difference, to identify diverse data. Subsequently, we introduce a novel framework, Balancing Active Learning (BAL), which constructs adaptive sub-pools to balance diverse and uncertain data. Our approach outperforms all established active learning methods on widely recognized benchmarks by 1.20%. Moreover, we assess the efficacy of our proposed framework under extended settings, encompassing both larger and smaller labeling budgets. Experimental results demonstrate that, when labeling 80% of the samples, the performance of the current SOTA method declines by 0.74%, whereas our proposed BAL achieves performance comparable to the full dataset.

Abstract:
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. However, modeling global correlations with multi-head self-attention (MSA) layers leads to two widely recognized issues: the massive computational resource consumption and the lack of intrinsic inductive bias for modeling local visual patterns. To solve both issues, we devise a simple yet effective method named Single-Path Vision Transformer pruning (SPViT), to efficiently and automatically compress the pre-trained ViTs into compact models with proper locality added. Specifically, we first propose a novel weight-sharing scheme between MSA and convolutional operations, delivering a single-path space to encode all candidate operations. In this way, we cast the operation search problem as finding which subset of parameters to use in each MSA layer, which significantly reduces the computational cost and optimization difficulty, and the convolution kernels can be well initialized using pre-trained MSA parameters. Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers. Similarly, we further employ learnable gates to encode the fine-grained MLP expansion ratios of FFN layers. In this way, our SPViT optimizes the learnable gates to automatically explore from a vast and unified search space and flexibly adjust the MSA-FFN pruning proportions for each individual dense model. We conduct extensive experiments on two representative ViTs showing that our SPViT achieves a new SOTA for pruning on ImageNet-1k. For example, our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.

Abstract:
Federated learning (FL) is a popular collaborative training framework that aggregates model parameters of decentralized local clients. However, most FL methods unreasonably assume that the data categories of the FL framework are known and fixed in advance. Moreover, some new local clients that collect novel categories unseen by other clients may be introduced to FL training irregularly. These issues cause the global model to undergo catastrophic forgetting on old categories when local clients receive new categories consecutively under limited memory for storing old categories. To tackle the above issues, we propose a novel Local-Global Anti-forgetting (LGA) model. It ensures no local clients are left behind as they learn new classes continually, by addressing local and global catastrophic forgetting. Specifically, to tackle the class imbalance of local clients and surmount local forgetting, we develop a category-balanced gradient-adaptive compensation loss and a category gradient-induced semantic distillation loss. They can balance heterogeneous forgetting speeds of hard-to-forget and easy-to-forget old categories, while ensuring consistent class-relations within different tasks. Moreover, a proxy server is designed to tackle global forgetting caused by Non-IID class imbalance between different clients. It augments perturbed prototype images of new categories collected from local clients via self-supervised prototype augmentation, thus improving robustness in choosing the best old global model for the local-side semantic distillation loss. Experiments on representative datasets verify the superior performance of our model against comparison methods.

Abstract:
Robust multi-view learning with incomplete information has received significant attention due to issues such as incomplete correspondences and incomplete instances that commonly affect real-world multi-view applications. Existing approaches heavily rely on paired samples to realign or impute defective ones, but such preconditions cannot always be satisfied in practice due to the complexity of data collection and transmission. To address this problem, we present a novel framework called SeMantic Invariance LEarning (SMILE) for multi-view clustering with incomplete information that does not require any paired samples. To be specific, we discover the existence of invariant semantic distribution across different views, which enables SMILE to alleviate the cross-view discrepancy to learn consensus semantics without requiring any paired samples. The resulting consensus semantics remains unaffected by cross-view distribution shifts, making them useful for realigning/imputing defective instances and forming clusters. We demonstrate the effectiveness of SMILE through extensive comparison experiments with 13 state-of-the-art baselines on five benchmarks. Our approach improves the clustering accuracy of NoisyMNIST from 19.3%/23.2% to 82.7%/69.0% when the correspondences/instances are fully incomplete. We will release the code after acceptance.

Abstract:
The traditional 3D object retrieval (3DOR) task is under the close-set setting, which assumes the categories of objects in the retrieval stage are all seen in the training stage. Existing methods under this setting may tend to only lazily discriminate their categories, while not learning a generalized 3D object embedding. Under such circumstances, it is still a challenging and open problem in real-world applications due to the existence of various unseen categories. In this paper, we first introduce the open-set 3DOR task to expand the applications of the traditional 3DOR task. Then, we propose the Hypergraph-Based Multi-Modal Representation (HGM²R) framework to learn 3D object embeddings from multi-modal representations under the open-set setting. The proposed framework is composed of two modules, i.e., the Multi-Modal 3D Object Embedding (MM3DOE) module and the Structure-Aware and Invariant Knowledge Learning (SAIKL) module. By utilizing the collaborative information of modalities derived from the same 3D object, the MM3DOE module is able to overcome the distinction across different modality representations and generate unified 3D object embeddings. Then, the SAIKL module utilizes the constructed hypergraph structure to model the high-order correlation among 3D objects from both seen and unseen categories. The SAIKL module also includes a memory bank that stores typical representations of 3D objects. By aligning with those memory anchors in the memory bank, the aligned embeddings can integrate the invariant knowledge to exhibit a powerful generalized capacity toward unseen categories. We formally prove that hypergraph modeling has better representative capability on data correlation than graph modeling. We generate four multi-modal datasets for the open-set 3DOR task, i.e., OS-ESB-core, OS-NTU-core, OS-MN40-core, and OS-ABO-core, in which each 3D object contains three modality representations: multi-view, point clouds, and voxel.
Experiments on these four datasets show that the proposed method can significantly outperform existing methods. In particular, the proposed method outperforms the state-of-the-art by 12.12%/12.88% in terms of mAP on the OS-MN40-core/OS-ABO-core dataset, respectively. Results and visualizations demonstrate that the proposed method can effectively extract the generalized 3D object embeddings on the open-set 3DOR task and achieve satisfactory performance.

Abstract:
We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds GT bounding boxes with noise into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR results in a remarkable improvement (+1.9 AP) under the same setting and achieves 46.0 AP and 49.5 AP trained for 12 and 50 epochs with the ResNet-50 backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with 50% of the training epochs. We also demonstrate the effectiveness of denoising training in CNN-based detectors (Faster R-CNN), segmentation models (Mask2Former, Mask DINO), and more DETR-based models (DETR, Anchor DETR, Deformable DETR).
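
The denoising idea above (perturb ground-truth boxes, then train the decoder to recover the originals) can be illustrated with a small sketch. Everything here is an assumption made for the example: the `(cx, cy, w, h)` box format, the names `add_box_noise` and `l1_box_loss`, the uniform noise model, and the `noise_scale` value. The abstract does not specify the noise distribution or the reconstruction loss, and a real pipeline would also clip boxes to the image.

```python
import random

def add_box_noise(boxes, noise_scale, rng):
    """Perturb (cx, cy, w, h) boxes; noise_scale is a hypothetical knob."""
    noised = []
    for cx, cy, w, h in boxes:
        # Shift the center by at most noise_scale * half the box size,
        # so the noised center stays inside the original box for scale <= 1.
        dx = rng.uniform(-1.0, 1.0) * noise_scale * w / 2.0
        dy = rng.uniform(-1.0, 1.0) * noise_scale * h / 2.0
        # Rescale width and height by a factor in [1 - scale, 1 + scale].
        sw = 1.0 + rng.uniform(-1.0, 1.0) * noise_scale
        sh = 1.0 + rng.uniform(-1.0, 1.0) * noise_scale
        noised.append((cx + dx, cy + dy, w * sw, h * sh))
    return noised

def l1_box_loss(pred, target):
    """Mean absolute error over all box coordinates."""
    total = sum(abs(p - t) for pb, tb in zip(pred, target) for p, t in zip(pb, tb))
    return total / (4 * len(target))

gt = [(0.5, 0.5, 0.2, 0.4), (0.3, 0.7, 0.1, 0.1)]
# Noised boxes would be fed to the decoder as extra queries.
queries = add_box_noise(gt, noise_scale=0.4, rng=random.Random(0))
# A model that outputs the noised boxes unchanged pays this reconstruction loss.
baseline_loss = l1_box_loss(queries, gt)
```

Because each noised query has a known ground-truth box, its target does not depend on bipartite matching, which is the property the abstract credits for the faster convergence.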

Abstract:
In this paper, we present a new framework named DIML to achieve more interpretable deep metric learning. Unlike traditional deep metric learning methods that simply produce a global similarity given two images, DIML computes the overall similarity through the weighted sum of multiple local part-wise similarities, making it easier for humans to understand the mechanism of how the model distinguishes two images. Specifically, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between feature maps of the two images. We also devise a multi-scale matching strategy, which considers both global and local similarities and can significantly reduce the computational costs in the application of image retrieval. To handle the view variance in some complicated scenarios, we propose to use cross-correlation as the marginal distribution of the optimal transport to leverage semantic information to locate the important region in the images. Our framework is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. To extend our DIML to more advanced architectures like vision Transformers (ViTs), we further propose truncated attention rollout and partial similarity to overcome the lack of locality in ViTs. We evaluate our method on three major benchmarks of deep metric learning including CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability.

Abstract:
Undoubtedly, Deep Neural Networks (DNNs), from AlexNet to ResNet to Transformer, have sparked revolutionary advancements in diverse computer vision tasks. The scale of DNNs has grown exponentially due to the rapid development of computational resources. Despite the tremendous success, DNNs typically depend on massive amounts of training data (especially the recent various foundation models) to achieve high performance and are brittle in that their performance can degrade severely with small changes in their operating environment. Generally, collecting massive-scale training datasets is costly or even infeasible, as for certain fields, only very limited or no examples at all can be gathered. Nevertheless, collecting, labeling, and vetting massive amounts of practical training data is certainly difficult and expensive, as it requires the painstaking efforts of experienced human annotators or experts, and in many cases it is prohibitively costly or impossible for reasons such as privacy, safety, or ethical issues.

Abstract:
State-of-the-art deep learning models are often trained with a large amount of costly labeled training data. However, requiring exhaustive manual annotations may degrade the model's generalizability in the limited-label regime. Semi-supervised learning and unsupervised learning offer promising paradigms to learn from an abundance of unlabeled visual data. Recent progress in these paradigms has indicated the strong benefits of leveraging unlabeled data to improve model generalization and provide better model initialization. In this survey, we review the recent advanced deep learning algorithms on semi-supervised learning (SSL) and unsupervised learning (UL) for visual recognition from a unified perspective. To offer a holistic understanding of the state-of-the-art in these areas, we propose a unified taxonomy. We categorize existing representative SSL and UL with comprehensive and insightful analysis to highlight their design rationales in different learning scenarios and applications in different computer vision tasks. Lastly, we discuss the emerging trends and open challenges in SSL and UL to shed light on future critical research directions.

Abstract:
Unsupervised pre-training aims at learning transferable features that are beneficial for downstream tasks. However, most state-of-the-art unsupervised methods concentrate on learning global representations for image-level classification tasks instead of discriminative local region representations, which limits their transferability to region-level downstream tasks, such as object detection. To improve the transferability of pre-trained features to object detection, we present Deeply Unsupervised Patch Re-ID (DUPR), a simple yet effective method for unsupervised visual representation learning. The patch Re-ID task treats each individual patch as a pseudo-identity and contrastively learns its correspondence in two views, enabling us to obtain discriminative local features for object detection. Then the proposed patch Re-ID is performed in a deeply unsupervised manner, appealing to object detection, which usually requires multi-level feature maps. Extensive experiments demonstrate that DUPR outperforms state-of-the-art unsupervised pre-training methods and even the ImageNet supervised pre-training on various downstream tasks related to object detection.

Abstract:
In heavy rain video, rain streak and rain accumulation are the most common causes of degradation. They occlude background information and can significantly impair the visibility. Most existing methods rely heavily on the synthetic training data, and thus raise the domain gap problem that prevents the trained models from performing adequately in real testing cases. Unlike these methods, we introduce a self-learning method to remove both rain streaks and rain accumulation without using any ground-truth clean images in training our model, which consequently can alleviate the domain gap issue. The main idea is based on the assumptions that (1) adjacent clean frames can be aligned or warped from one frame to another frame, (2) rain streaks are distributed randomly in the temporal domain, (3) the rain streak/accumulation related variables/priors can be inferred reliably from the information within the images/sequences. Based on these assumptions, we construct an augmented Self-Learned Deraining Network (SLDNet+) to remove both rain streaks and rain accumulation by utilizing temporal correlation, consistency, and rain-related priors. For the temporal correlation, our SLDNet+ takes rain degraded adjacent frames as its input, aligns them, and learns to predict the clean version of the current frame. For the temporal consistency, a new loss is designed to build a robust mapping between the predicted clean frame and non-rain regions from the adjacent rain frames. For the rain-streak-related prior, the rain streak removal network is optimized jointly with motion estimation and rain region detection; while for the rain-accumulation-related prior, a novel non-local video rain accumulation removal method is developed to estimate the accumulation-lines from the whole input video and to offer better color constancy and temporal smoothness. 
Extensive experiments show the effectiveness of our approach, which provides superior results compared with the existing state-of-the-art methods both quantitatively and qualitatively. The source code will be made publicly available at: https://github.com/flyywh/CVPR-2020-Self-Rain-Removal-Journal.

Abstract:
Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim at designing a general solution to optimize the latent manifolds to improve the performance on classification, segmentation, completion and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ the variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer to as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the generated semantics of the training data, e.g., the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems including text sequence, images, 3D point clouds and volumetric data, validating the advantage of our proposed method.

Abstract:
The objective of few-shot learning is to design a system that can adapt to a given task with only a few examples while achieving generalization. Model-agnostic meta-learning (MAML), which has recently gained popularity for its simplicity and flexibility, learns a good initialization for fast adaptation to a task in the few-data regime. However, its performance has been relatively limited, especially when novel tasks are different from tasks previously seen during training. In this work, instead of searching for a better initialization, we focus on designing a better fast adaptation process. Consequently, we propose a new task-adaptive weight update rule that greatly enhances the fast adaptation process. Specifically, we introduce a small meta-network that can generate per-step hyperparameters for each given task: learning rate and weight decay coefficients. The experimental results validate that learning a good weight update rule for fast adaptation is an equally important component that has drawn relatively less attention in the recent few-shot learning approaches. Surprisingly, fast adaptation from random initialization with ALFA can already outperform MAML. Furthermore, the proposed weight-update rule is shown to consistently improve the task-adaptation capability of MAML across diverse problem domains: few-shot classification, cross-domain few-shot classification, regression, visual tracking, and video frame interpolation.
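
A weight update with per-step learning rate and weight decay coefficients, as described above, can be sketched as an inner loop of the form theta <- beta * theta - alpha * grad. This is a minimal illustration under stated assumptions: the names `adapt`, `grad_fn`, and `hyper_fn` are hypothetical, and `hyper_fn` is a hand-written stand-in for the meta-network, whose inputs and architecture the abstract does not specify.

```python
def adapt(theta, grad_fn, hyper_fn, steps):
    """Inner-loop adaptation with per-step hyperparameters.

    grad_fn(theta) returns the task-loss gradient.
    hyper_fn(step, theta, grads) returns (alpha, beta): a learning rate and
    a weight-decay coefficient for this step (stand-in for a meta-network).
    """
    for step in range(steps):
        grads = grad_fn(theta)
        alpha, beta = hyper_fn(step, theta, grads)
        # beta scales the current weights (decay); alpha scales the gradient.
        theta = [beta * t - alpha * g for t, g in zip(theta, grads)]
    return theta

# Toy task: minimize 0.5 * (theta - 3)^2, whose gradient is theta - 3.
def toy_grad(theta):
    return [t - 3.0 for t in theta]

# Constant hyperparameters stand in for the meta-network's output.
def constant_hypers(step, theta, grads):
    return 0.5, 1.0

adapted = adapt([0.0], toy_grad, constant_hypers, steps=3)  # → [2.625]
```

With `beta` fixed to 1 and a constant `alpha`, the loop reduces to plain gradient descent, which is the fixed inner-loop rule that a learned, task-conditioned rule would replace.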

Abstract:
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. The crux of few-shot learning is to extract prior knowledge from related tasks to enable fast adaptation to a new task with a limited amount of data. In this paper, we propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel. Specifically, we propose learning variational random features in a data-driven manner to obtain task-specific kernels by leveraging the shared knowledge provided by related tasks in a meta-learning setting. We treat the random feature basis as the latent variable, which is estimated by variational inference. The shared knowledge from related tasks is incorporated into a context inference of the posterior, which we achieve via a long short-term memory module. To establish more expressive kernels, we deploy conditional normalizing flows based on coupling layers to achieve a richer posterior distribution over random Fourier bases. The resultant kernels are more informative and discriminative, which further improves the few-shot learning. To evaluate our method, we conduct extensive experiments on both few-shot image classification and regression tasks. A thorough ablation study demonstrates the effectiveness of each introduced component in our method. The benchmark results on fourteen datasets demonstrate MetaKernel consistently delivers at least comparable and often better performance than state-of-the-art alternatives.
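
For readers unfamiliar with random Fourier features, the sketch below shows the standard fixed-basis construction that approximates a Gaussian (RBF) kernel. It is not MetaKernel itself: here the frequencies are sampled once from a fixed Gaussian, whereas the abstract describes inferring the basis per task with variational inference. The names `make_rff` and `rbf_kernel` and all parameter values are assumptions for illustration.

```python
import math
import random

def make_rff(dim, num_features, sigma=1.0, seed=0):
    """Build a random Fourier feature map for an RBF kernel of width sigma."""
    rng = random.Random(seed)
    # Frequencies from N(0, 1/sigma^2) and phases from U[0, 2*pi).
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(num_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    return features

def rbf_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

phi = make_rff(dim=2, num_features=5000)
x, y = [0.3, -0.1], [0.5, 0.4]
# The inner product of feature vectors approximates the kernel value.
approx = sum(a * c for a, c in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y)
```

The approximation error shrinks roughly with the inverse square root of the number of features, so the choice of basis matters most when few features are used.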

Abstract:
Automatic few-shot font generation aims to solve a well-defined, real-world problem because manual font designs are expensive and sensitive to the expertise of designers. Existing methods learn to disentangle style and content elements by developing a universal style representation for each font style. However, this approach limits the model in representing diverse local styles because it is unsuitable for complicated letter systems. For example, Chinese characters consist of a varying number of components (often called “radicals”) with a highly complex structure. In this paper, we propose a novel font generation method that learns localized styles, namely component-wise style representations, instead of universal styles. The proposed style representations enable synthesizing complex local details in text designs. However, learning component-wise styles solely from a few reference glyphs is infeasible when a target script has a large number of components, for example, over 200 for Chinese. To reduce the number of required reference glyphs, we represent component-wise styles by a product of component and style factors inspired by low-rank matrix factorization. Owing to the combination of strong representation and a compact factorization strategy, our method shows remarkably better few-shot font generation results (with only eight reference glyphs) than other state-of-the-art methods. Moreover, strong locality supervision, such as the location of each component, skeleton, or strokes, was not utilized. The source code is available at https://github.com/clovaai/lffont.

Abstract:
Adversarial domain adaptation has been an effective approach for learning domain-invariant features by adversarial training. In this paper, we propose a novel adversarial domain adaptation approach defined in the spherical feature space, in which we define spherical classifier for label prediction and spherical domain discriminator for discriminating domain labels. In the spherical feature space, we develop a spherical robust pseudo-label loss to utilize pseudo-labels robustly, which weights the importance of the estimated labels of target domain data by the posterior probability of correct labeling, modeled by the Gaussian-uniform mixture model in the spherical space. Our proposed approach can be generally applied to both unsupervised and semi-supervised domain adaptation settings. In particular, to tackle the semi-supervised domain adaptation setting where a few labeled target domain data are available for training, we propose a novel reweighted adversarial training strategy for effectively reducing the intra-domain discrepancy within the target domain. We also present theoretical analysis for the proposed method based on the domain adaptation theory. Extensive experiments are conducted on multiple benchmarks for object recognition, digit recognition, and face recognition. The results show that our method either surpasses or is competitive compared with the recent methods for both unsupervised and semi-supervised domain adaptation. Ablation studies also confirm the effectiveness of the spherical classifier, spherical discriminator, spherical robust pseudo-label loss, and reweighted adversarial training strategy.

Abstract:
Early screening is essential for effective intervention and treatment of individuals with mental disorders. Functional magnetic resonance imaging (fMRI) is a noninvasive tool for depicting neural activity and has demonstrated strong potential as a technique for identifying mental disorders. Due to the difficulty in data collection and diagnosis, imaging data from patients are rare at a single site, whereas abundant healthy control data are available from public datasets. However, joint use of these data from multiple sites for classification model training is hindered by cross-domain distribution discrepancy and diverse label spaces. Herein, we propose few-shot domain-adaptive anomaly detection (FAAD) to achieve cross-site anomaly detection of brain images based on only a few labeled samples. We introduce domain adaptation to mitigate cross-domain distribution discrepancy and jointly align the general and conditional feature distributions of imaging data across multiple sites. We utilize fMRI data of healthy subjects in the Human Connectome Project (HCP) as the source domain and fMRI images from six independent sites, including patients with mental disorders and demographically matched healthy controls, as target domains. Experiments showed the superiority of the proposed method compared with binary classification, traditional anomaly detection methods, and several recognized domain adaptation methods.

Abstract:
Tracking visual objects from a single initial exemplar in the testing phase has been broadly cast as a one-/few-shot problem, i.e., one-shot learning for initial adaptation and few-shot learning for online adaptation. The recent few-shot online adaptation methods incorporate the prior knowledge from large amounts of annotated training data via complex meta-learning optimization in the offline phase. This helps the online deep trackers to achieve fast adaptation and reduce overfitting risk in tracking. In this paper, we propose a simple yet effective recursive least-squares estimator-aided online learning approach for few-shot online adaptation without requiring offline training. It allows an in-built memory retention mechanism for the model to remember the knowledge about the object seen before, and thus the seen data can be safely removed from training. This also bears certain similarities to the emerging continual learning field in preventing catastrophic forgetting. This mechanism enables us to unveil the power of modern online deep trackers without incurring too much extra computational cost. We evaluate our approach based on two networks in the online learning families for tracking, i.e., multi-layer perceptrons in RT-MDNet and convolutional neural networks in DiMP. The consistent improvements on several challenging tracking benchmarks demonstrate its effectiveness and efficiency.
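As an illustration of the recursive least-squares idea mentioned above, the following is a minimal sketch of the textbook RLS update for a plain linear model. It is not the tracker-integrated estimator of the paper; the class name, the `delta` initialization, and the `forgetting` factor are assumptions chosen for illustration. The matrix `P` summarizes all previously seen samples, which is why old data need not be kept around.

```python
import numpy as np


class RecursiveLeastSquares:
    """Textbook RLS for y ~ w @ x (illustrative; names are hypothetical)."""

    def __init__(self, dim, delta=1e3, forgetting=1.0):
        self.w = np.zeros(dim)          # current weight estimate
        self.P = np.eye(dim) * delta    # inverse-covariance-like memory matrix
        self.lam = forgetting           # 1.0 means no forgetting

    def update(self, x, y):
        """Consume one sample (x, y) and return the prediction error before the update."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)  # Kalman-style gain vector
        err = y - self.w @ x             # a priori prediction error
        self.w = self.w + gain * err
        # P is symmetric, so x^T P equals (P x)^T.
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return err
```

Each call to `update` costs O(dim^2) and touches only the newest sample, so past samples can be discarded once they have been absorbed into `P`.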

Abstract:
Federated learning is an important privacy-preserving multi-party learning paradigm, involving collaborative learning with others and local updating on private data. Model heterogeneity and catastrophic forgetting are two crucial challenges, which greatly limit the applicability and generalizability. This paper presents a novel FCCL+, federated correlation and similarity learning with non-target distillation, facilitating both intra-domain discriminability and inter-domain generalization. For the heterogeneity issue, we leverage irrelevant unlabeled public data for communication between the heterogeneous participants. We construct a cross-correlation matrix and align the instance similarity distribution on both the logits and feature levels, which effectively overcomes the communication barrier and improves the generalization ability. For catastrophic forgetting in the local updating stage, FCCL+ introduces Federated Non Target Distillation, which retains inter-domain knowledge while avoiding the optimization conflict issue, fully distilling privileged inter-domain information through depicting posterior class relations. Considering that there is no standard benchmark for evaluating existing heterogeneous federated learning under the same setting, we present a comprehensive benchmark with extensive representative methods under four domain shift scenarios, supporting both heterogeneous and homogeneous federated settings. Empirical results demonstrate the superiority of our method and the efficiency of modules on various scenarios. The benchmark code for reproducing our results is available at https://github.com/WenkeHuang/FCCL.

Abstract:
Blockchain data mining has the potential to reveal the operational status and behavioral patterns of anonymous participants in blockchain systems, thus providing valuable insights into system operation and participant behavior. However, traditional blockchain analysis methods suffer from the problems of being unable to handle the data due to its large volume and complex structure. With powerful computing and analysis capabilities, graph learning can solve the current problems through handling each node's features and linkage relationships separately and exploring the implicit properties of data from a graph perspective. This paper systematically reviews the blockchain data mining tasks based on graph learning approaches. First, we investigate the blockchain data acquisition method, integrate the currently available data analysis tools, and divide the sampling method into rule-based and cluster-based techniques. Second, we classify the graph construction into transaction-based blockchain and account-based methods, and comprehensively analyze the existing blockchain feature extraction methods. Third, we compare the existing graph learning algorithms on blockchain and classify them into traditional machine learning-based, graph representation-based, and graph deep learning-based methods. Finally, we propose future research directions and open issues which are promising to address.

Abstract:
Image captioning is a core challenge in computer vision, attracting significant attention. Traditional methods prioritize caption quality, often overlooking style control. Our research enhances method controllability, enabling descriptions of varying detail. By integrating a length level embedding into current models, they can produce detailed or concise captions, increasing diversity. We introduce a length-level reranking transformer to correlate image and text complexity, optimizing caption length for informativeness without redundancy. Additionally, with caption length increase, computational complexity grows due to the autoregressive (AR) design of existing methods. To address this, our non-autoregressive (NAR) model maintains constant complexity regardless of caption length. We've developed a training approach that includes refinement sequence training and sequence-level knowledge distillation to close the performance gap between NAR and AR models. In testing, our models set new standards for caption quality on the MS COCO dataset and offer enhanced controllability and diversity. Our NAR model excels over AR models in these aspects and shows greater efficiency with longer captions. With advanced training techniques, our NAR's caption quality rivals that of leading AR models.

Abstract:
Non-adversarial generative models are relatively easy to train and have less mode collapse than adversarial models. However, they are not very accurate in approximating the target distribution in latent space because they don't have a discriminator. To this end, we develop a novel divide-and-conquer model called Tessellated Wasserstein Auto-Encoders (TWAE) which has less statistical error in approximating the target distribution. TWAE tessellates the support of the target distribution into a given number of regions using the centroidal Voronoi tessellation (CVT) technique and designs data batches according to the tessellation instead of random shuffling for accurate computation of discrepancy. Theoretically, we demonstrate that the error in estimating the discrepancy decreases as the number of samples $n$ and the number of regions $m$ of the tessellation increase at rates of $\mathcal{O}(\frac{1}{\sqrt{n}})$ and $\mathcal{O}(\frac{1}{\sqrt{m}})$, respectively. TWAE is very flexible to different non-adversarial metrics and can significantly enhance their generative performance in terms of Fréchet inception distance (FID) compared to existing ones. Furthermore, numerical results demonstrate that TWAE is competitive to the adversarial model and shows powerful generative ability.
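To make the tessellation step more concrete, here is a minimal sketch that approximates a centroidal Voronoi tessellation of a sample set with Lloyd's algorithm. Lloyd's algorithm is a standard way to compute a CVT and is used here as a stand-in; the abstract does not state which procedure TWAE uses, and the function name and parameters are hypothetical.

```python
import numpy as np


def lloyd_cvt(points, num_regions, iters=20, seed=0):
    """Approximate a CVT of `points` (shape [N, D]) with Lloyd iterations.

    Returns (centers, labels); labels[i] is the Voronoi region of points[i].
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=num_regions, replace=False)
    centers = points[start].astype(float)

    def assign(c):
        # Squared Euclidean distance from every point to every center.
        d = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return d.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for k in range(num_regions):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a region is empty
                centers[k] = members.mean(axis=0)
    return centers, assign(centers)
```

Under this sketch, a batch could then be formed from the points that share one label, rather than by random shuffling.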

Abstract:
When the number of parallel sentences available to train a neural machine translation system is scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the parallel data available. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance as compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, and the resulting systems are much more robust against domain shift and produce fewer hallucinations.

Abstract:
Event streams provide a novel paradigm to describe visual scenes by capturing intensity variations above specific thresholds along with various types of noise. Existing event generation methods usually rely on one-way mappings using hand-crafted parameters and noise rates, which may not adequately suit diverse scenarios and event cameras. To address this limitation, we propose a novel approach to learn a bidirectional mapping between the feature space of event streams and their inherent parameters, enabling the generation of reliable event streams with enhanced generalization capabilities. We first randomly generate a vast number of parameters and synthesize massive event streams using an event simulator. Subsequently, an event-based normalizing flow network is proposed to learn the invertible mapping between the representation of a synthetic event stream and its parameters. The invertible mapping is implemented by incorporating an intensity-guided conditional affine simulation mechanism, facilitating better alignment between event features and parameter spaces. Additionally, we impose constraints on event sparsity, edge distribution, and noise distribution through novel event losses, further emphasizing event priors in the bidirectional mapping. Our framework surpasses state-of-the-art methods in video reconstruction, optical flow estimation, and parameter estimation tasks on synthetic and real-world datasets, exhibiting excellent generalization across diverse scenes and cameras.

Abstract:
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in minutes-long videos with multiple action classes. As a long-range video understanding task, researchers have developed an extended collection of methods and examined their performance using various benchmarks. Despite the rapid growth of TAS techniques in recent years, no systematic survey has been conducted in these sectors. This survey analyzes and summarizes the most significant contributions and trends. In particular, we first examine the task definition, common benchmarks, types of supervision, and prevalent evaluation measures. In addition, we systematically investigate two essential techniques of this topic, i.e., frame representation and temporal modeling, which have been studied extensively in the literature. We then conduct a thorough review of existing TAS works categorized by their levels of supervision and conclude our survey by identifying and emphasizing several research gaps.

Abstract:
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled to 1B parameters by taking the advantage of the scalable model capacity and high parallelism, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose++ model is proposed to deal with heterogeneous body keypoint categories via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Our largest single model ViTPose-G sets a new record on the MS COCO test set without model ensemble. Furthermore, our ViTPose++ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.

Abstract:
Scene graph generation is a structured prediction task aiming to explicitly model objects and their relationships via constructing a visually-grounded scene graph for an input image. Currently, the message passing neural network based mean field variational Bayesian methodology is the ubiquitous solution for such a task, in which the variational inference objective is often assumed to be the classical evidence lower bound. However, the variational approximation inferred from such loose objective generally underestimates the underlying posterior, which often leads to inferior generation performance. In this paper, we propose a novel importance weighted structure learning method aiming to approximate the underlying log-partition function with a tighter importance weighted lower bound, which is computed from multiple samples drawn from a reparameterizable Gumbel-Softmax sampler. A generic entropic mirror descent algorithm is applied to solve the resulting constrained variational inference task. The proposed method achieves the state-of-the-art performance on various popular scene graph generation benchmarks.
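The reparameterizable Gumbel-Softmax sampler mentioned above can be sketched as follows. This is only the generic sampler written in NumPy under assumed names (`gumbel_softmax_sample`, `tau`); how the paper combines several such samples into its importance weighted bound is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Draw one relaxed one-hot sample from categorical `logits`.

    Lower `tau` pushes the sample closer to a hard one-hot vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via inverse transform; the lower bound avoids log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the randomness enters only through the additive noise, the output is a deterministic function of the logits once the noise is fixed; in an autodiff framework this is what allows gradients to flow through the sample.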

Abstract:
The fusion of federated learning and differential privacy can provide more comprehensive and rigorous privacy protection, thus attracting extensive interest from both academia and industry. However, facing the system-level challenge of device heterogeneity, most current synchronous FL paradigms exhibit low efficiency due to the straggler effect, which can be significantly reduced by Asynchronous FL (AFL). Yet AFL has never been comprehensively studied, which imposes a major challenge in the utility optimization of DP-enhanced AFL. Here, theoretically motivated multi-stage adaptive private algorithms are proposed to improve the trade-off between model utility and privacy for DP-enhanced AFL. In particular, we first build two DP-enhanced AFL frameworks with consideration of universal factors for different adversary models. Then, we give a solid analysis of the model convergence of AFL, based on which DP can be adaptively achieved with high utility. Through extensive experiments on different training models and benchmark datasets, we demonstrate that the proposed algorithms achieve the overall best performance, improve test accuracy by up to 24% with the same privacy loss, and converge faster than the state-of-the-art algorithms. Our frameworks provide an analytical way for private AFL and adapt to more complex FL application scenarios.

Abstract:
In this article, we introduce SMPLicit, a novel generative model to jointly represent body pose, shape and clothing geometry; and LayerNet, a deep network that given a single image of a person simultaneously performs detailed 3D reconstruction of body and clothes. In contrast to existing learning-based approaches that require training specific models for each type of garment, SMPLicit can represent in a unified manner different garment topologies (e.g. from sleeveless tops to hoodies and open jackets), while controlling other properties like garment size or tightness/looseness. LayerNet follows a coarse-to-fine multi-stage strategy by first predicting smooth cloth geometries from SMPLicit, which are then refined by an image-guided displacement network that gracefully fits the body recovering high-frequency details and wrinkles. LayerNet achieves competitive accuracy in the task of 3D reconstruction against current ‘garment-agnostic’ state of the art for images of people in up-right positions and controlled environments, and consistently surpasses these methods on challenging body poses and uncontrolled settings. Furthermore, the semantically rich outcome of our approach is suitable for performing Virtual Try-on tasks directly on 3D, a task which, so far, has only been addressed in the 2D domain.

Abstract:
This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a new scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and better rate-distortion performance than existing baseline methods.

Abstract:
Batch normalization (BN) is used by default in many modern deep neural networks due to its effectiveness in accelerating training convergence and boosting inference performance. Recent studies suggest that the effectiveness of BN is due to the Lipschitzness of the loss and gradient, rather than the reduction of internal covariate shift. However, questions remain about whether Lipschitzness is sufficient to explain the effectiveness of BN and whether there is room for vanilla BN to be further improved. To answer these questions, we first prove that when stochastic gradient descent (SGD) is applied to optimize a general non-convex problem, three effects will help convergence to be faster and better: (i) reduction of the gradient Lipschitz constant, (ii) reduction of the expectation of the square of the stochastic gradient, and (iii) reduction of the variance of the stochastic gradient. We demonstrate that vanilla BN only with ReLU can induce the three effects above, rather than Lipschitzness, but vanilla BN with other nonlinearities like Sigmoid, Tanh, and SELU will result in degraded convergence performance. To improve vanilla BN, we propose a new normalization approach, dubbed complete batch normalization (CBN), which changes the placement position of normalization and modifies the structure of vanilla BN based on the theory. It is proven that CBN can elicit all the three effects above, regardless of the nonlinear activation used. Extensive experiments on benchmark datasets CIFAR10, CIFAR100, and ILSVRC2012 validate that CBN makes the training convergence faster, and the training loss converges to a smaller local minimum than vanilla BN. Moreover, CBN helps networks with multiple nonlinear activations (Sigmoid, Tanh, ReLU, SELU, and Swish) achieve higher test accuracy steadily. Specifically, benefitting from CBN, the classification accuracies for networks with Sigmoid, Tanh, and SELU are boosted by more than 15.0%, 4.5%, and 4.0% on average, respectively, which is even comparable to the performance for ReLU.

Abstract:
Anomaly detection has recently gained increasing attention in the field of computer vision, likely due to its broad set of applications ranging from product fault detection on industrial production lines and impending event detection in video surveillance to finding lesions in medical scans. Regardless of the domain, anomaly detection is typically framed as a one-class classification task, where the learning is conducted on normal examples only. An entire family of successful anomaly detection methods is based on learning to reconstruct masked normal inputs (e.g. patches, future frames, etc.) and exerting the magnitude of the reconstruction error as an indicator for the abnormality level. Unlike other reconstruction-based methods, we present a novel self-supervised masked convolutional transformer block (SSMCTB) that comprises the reconstruction-based functionality at a core architectural level. The proposed self-supervised block is extremely flexible, enabling information masking at any layer of a neural network and being compatible with a wide range of neural architectures. In this work, we extend our previous self-supervised predictive convolutional attentive block (SSPCAB) with a 3D masked convolutional layer, a transformer for channel-wise attention, as well as a novel self-supervised objective based on Huber loss. Furthermore, we show that our block is applicable to a wider variety of tasks, adding anomaly detection in medical images and thermal videos to the previously considered tasks based on RGB images and surveillance videos. We exhibit the generality and flexibility of SSMCTB by integrating it into multiple state-of-the-art neural models for anomaly detection, bringing forth empirical results that confirm considerable performance improvements on five benchmarks: MVTec AD, BRATS, Avenue, ShanghaiTech, and Thermal Rare Event.
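For readers unfamiliar with the Huber loss named above, the following is a small sketch of the standard definition applied to a reconstruction and its target. The threshold `delta` and the function name are assumptions; how SSMCTB applies the loss to its masked regions is not described in the abstract.

```python
import numpy as np


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    residual = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return float(np.mean(np.where(residual <= delta, quadratic, linear)))
```

Compared with a squared error, the linear branch limits the influence of large residuals, which is the usual reason for choosing this loss.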

Abstract:
We propose a novel visual SLAM method that integrates text objects tightly by treating them as semantic features via fully exploring their geometric and semantic prior. The text object is modeled as a texture-rich planar patch whose semantic meaning is extracted and updated on the fly for better data association. With the full exploration of locally planar characteristics and semantic meaning of text objects, the SLAM system becomes more accurate and robust even under challenging conditions such as image blurring, large viewpoint changes, and significant illumination variations (day and night). We tested our method in various scenes with the ground truth data. The results show that integrating texture features leads to a superior SLAM system that can match images across day and night. The reconstructed semantic 3D text map could be useful for navigation and scene understanding in robotic and mixed reality applications.

Abstract:
Detecting diverse objects, including ones never-seen-before during training, is critical for the safe application of object detectors. To this end, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect unknown objects without the reliance on an auxiliary dataset. For this task, it is important to reduce the impact of lacking unknown data for supervision and leverage in-distribution (ID) data to improve the model's discrimination. In this paper, we propose a method of Two-Stream Information Bottleneck (TIB), consisting of a standard IB and a dedicated Reverse Information Bottleneck (RIB). Specifically, after extracting the features of an ID image, we first define a standard IB network to disentangle instance representations that are beneficial for localizing and recognizing objects. Meanwhile, we present RIB to obtain simulative OOD features to alleviate the impact of lacking unknown data. Different from standard IB aiming to extract task-relevant compact representations, RIB is to obtain task-irrelevant representations by reversing the optimization objective of the standard IB. Next, to further enhance the discrimination, a mixture of information bottlenecks is designed to sufficiently capture object-related information. Experimental results on OOD-OD, open-vocabulary object detection, incremental object detection, and open-set object detection show the superiorities of our method.

Abstract:
As a classical feature compression technique, quantization is usually coupled with inverted indices for scalable image retrieval. Most quantization methods explicitly divide feature space into Voronoi cells, and quantize feature vectors in each cell into the centroids learned from data distribution. However, it is difficult for Voronoi decomposition to achieve a discriminative space partition for semantic image retrieval. In this paper, we explore semantic-aware feature space partition by deep neural network instead of Voronoi cells. To this end, we propose a new deep probabilistic quantization method, abbreviated as DeepIndex, which constructs inverted indices without explicit centroid learning. In our method, the deep neural network takes an image as input and outputs its probability of being put into each inverted index list. During training, we progressively quantize each image into the inverted lists with the top-$T$ maximal probabilities, and calculate the reward of each trial based on retrieval accuracy. We optimize the deep neural network to maximize the probability of the inverted list with maximal reward. In this way, the retrieval performance is directly optimized, leading to a more semantically discriminative space partition than other quantization methods. The experiments on public image datasets demonstrate the effectiveness of our DeepIndex method on semantic image retrieval.

Abstract:
We propose a novel discriminative feature learning method via Max-Min Ratio Analysis (MMRA) for exclusively dealing with the long-standing “worst-case class separation” problem. Existing technologies simply consider maximizing the minimal pairwise distance on all class pairs in the low-dimensional subspace, which is unable to separate overlapped classes entirely, especially when the distribution of samples within the same class diverges. We propose a new criterion, i.e., Max-Min Ratio Analysis (MMRA), that focuses on maximizing the minimal ratio value of between-class and within-class scatter to greatly enlarge the separability of the overlapped pairwise classes. Furthermore, we develop two novel discriminative feature learning models for dimensionality reduction and metric learning based on our MMRA criterion. However, solving such a non-smooth non-convex max-min ratio problem is challenging. As an important theoretical contribution in this paper, we systematically derive an alternative iterative algorithm based on a general max-min ratio optimization framework to solve a general max-min ratio problem with rigorous proofs of convergence. More importantly, we also present another solver based on a bisection search strategy to solve the SDP problem efficiently. To evaluate the effectiveness of the proposed methods, we conduct extensive pattern classification and image retrieval experiments on several artificial datasets and real-world ScRNA-seq datasets, and experimental results demonstrate the effectiveness of the proposed methods.

Abstract:
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models’ ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. And then, we design a transformer decoder to establish spatial and cross-task interaction globally, and a novel UP-Transformer block is devised to increase the resolutions of multi-task features gradually and establish cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing significant state-of-the-art performances.

Abstract:
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture the local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The fused features can also be used to predict semantic labels, allowing our method to reconstruct and segment the 3D scene simultaneously. Furthermore, we propose an efficient self-supervised fine-tuning scheme that refines scene geometry based on input images through differentiable volume rendering. This fine-tuning scheme improves reconstruction quality on the fine-tuned scenes, as well as the generalization to similar test scenes. The experiments on ScanNet, 7-Scenes and Replica datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed.

Abstract:
Deep Neural Networks (DNNs) are known to be vulnerable to both backdoor and adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct robustness problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, this paper reveals that there is an intriguing connection between them: (1) planting a backdoor into a model will significantly affect the model's adversarial examples and (2) for an infected model, its adversarial examples have features similar to those of the triggered images. Based on these observations, a novel Progressive Unified Defense (PUD) algorithm is proposed to defend against backdoor and adversarial attacks simultaneously. Specifically, our PUD has a progressive model purification scheme to jointly erase backdoors and enhance the model's adversarial robustness. At the early stage, the adversarial examples of infected models are utilized to erase backdoors. With the backdoor gradually erased, our model purification can naturally turn into a stage to boost the model's robustness against adversarial attacks. Besides, our PUD algorithm can effectively identify poisoned images, which allows the initial extra dataset not to be completely clean. Extensive experimental results show that our discovered connection between backdoor and adversarial attacks is ubiquitous, no matter what type of backdoor attack. The proposed PUD outperforms the state-of-the-art backdoor defense, including the model repairing-based and data filtering-based methods. Besides, it also has the ability to compete with the most advanced adversarial defense methods. The code is available here.

Abstract:
Structure-guided image completion aims to inpaint a local region of an image according to an input guidance map from users. While such a task enables many practical applications for interactive editing, existing methods often struggle to hallucinate realistic object instances in complex natural scenes. Such a limitation is partially due to the lack of semantic-level constraints inside the hole region as well as the lack of a mechanism to enforce realistic object generation. In this work, we propose a learning paradigm that consists of semantic discriminators and object-level discriminators for improving the generation of complex semantics and objects. Specifically, the semantic discriminators leverage pretrained visual features to improve the realism of the generated visual concepts. Moreover, the object-level discriminators take aligned instances as inputs to enforce the realism of individual objects. Our proposed scheme significantly improves the generation quality and achieves state-of-the-art results on various tasks, including segmentation-guided completion, edge-guided manipulation and panoptically-guided manipulation on Places2 datasets. Furthermore, our trained model is flexible and can support multiple editing use cases, such as object insertion, replacement, removal and standard inpainting. In particular, our trained model combined with a novel automatic image completion pipeline achieves state-of-the-art results on the standard inpainting task.

Abstract:
Video snapshot compressive imaging (SCI) utilizes a 2D detector to capture sequential video frames and compress them into a single measurement. Various reconstruction methods have been developed to recover the high-speed video frames from the snapshot measurement. However, most existing reconstruction methods are incapable of efficiently capturing long-range spatial and temporal dependencies, which are critical for video processing. In this paper, we propose a flexible and robust approach based on the graph neural network (GNN) to efficiently model non-local interactions between pixels in space and time regardless of the distance. Specifically, we develop a motion-aware dynamic GNN for better video representation, i.e., represent each node as the aggregation of relative neighbors under the guidance of frame-by-frame motions, which consists of motion-aware dynamic sampling, cross-scale node sampling, global knowledge integration, and graph aggregation. Extensive results on both simulation and real data demonstrate both the effectiveness and efficiency of the proposed approach, and the visualization illustrates the intrinsic dynamic sampling operations of our proposed model for boosting the video SCI reconstruction results. The code and model will be released.

Abstract:
The superior performance of modern computer vision backbones (e.g., vision Transformers learned on ImageNet-1K/22K) usually comes with a costly training procedure. This study contributes to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some ‘easier-to-learn’ discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the ‘easier-to-learn’ patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these two aspects and design curriculum learning schedules by proposing tailored searching algorithms. Moreover, we present useful techniques for deploying our approach efficiently in challenging practical scenarios, such as large-scale parallel training, and limited input/output or data pre-processing speed.
The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. As an off-the-shelf approach, it reduces the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer) by 1.5-3.0× on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
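
The Fourier-spectrum cropping mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name low_frequency_crop and the bandwidth parameter are hypothetical, and the rescaling step is an assumption made so that the mean intensity is preserved. The idea shown is only that keeping the central window of the shifted spectrum yields a smaller image containing just the lower-frequency components.

```python
import numpy as np


def low_frequency_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Return a bandwidth x bandwidth image built from the lowest frequencies.

    `image` has shape (H, W) or (H, W, C). Hypothetical helper for
    illustration only; not taken from the EfficientTrain++ code base.
    """
    h, w = image.shape[:2]
    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    # Keep the central window, which holds the lowest spatial frequencies.
    top = h // 2 - bandwidth // 2
    left = w // 2 - bandwidth // 2
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth]
    # Undo the shift and transform back to the (smaller) spatial domain.
    small = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1))
    # Assumed normalisation: compensate for the change in the number of
    # samples so that a constant image keeps its value.
    scale = (bandwidth * bandwidth) / (h * w)
    # The real part is taken because an even-sized window keeps an
    # unpaired Nyquist term, leaving a small imaginary residue.
    return np.real(small) * scale
```

Because the output is smaller than the input, a model trained on such inputs processes fewer pixels per example, which is where the computational saving in this kind of scheme would come from.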

Abstract:
We study multi-sensor fusion for 3D semantic segmentation that is important to scene understanding for many applications, such as autonomous driving and robotics. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between the two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to effectively exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we project point clouds to the camera coordinate using perspective projection, and process both inputs from LiDAR and cameras in 2D space while preventing the information loss of RGB images. Then, we propose a two-stream network to extract features from the two modalities, separately. The extracted features are fused by effective residual-based fusion modules. Moreover, we introduce additional perception-aware losses to measure the perceptual difference between the two modalities. Last, we propose an improved version of PMF, i.e., EPMF, which is more efficient and effective by optimizing data pre-processing and network architecture under perspective projection. Specifically, we propose cross-modal alignment and cropping to obtain tight inputs and reduce unnecessary computational costs. We then explore more efficient contextual modules under perspective projection and fuse the LiDAR features into the camera stream to boost the performance of the two-stream network. Extensive experiments on benchmark data sets show the superiority of our method. For example, on nuScenes test set, our EPMF outperforms the state-of-the-art method, i.e., RangeFormer, by 0.9% in mIoU.
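
The perspective projection step described above can be sketched as follows. This is a minimal pinhole-camera illustration, not the PMF/EPMF code: it assumes the LiDAR points have already been transformed into the camera frame, and the names project_points, K, height, and width are hypothetical.

```python
import numpy as np


def project_points(points_cam: np.ndarray, K: np.ndarray,
                   height: int, width: int):
    """Project N x 3 camera-frame points to pixel coordinates.

    Returns (uv, depth) for the points that lie in front of the camera
    and inside the image. K is an assumed 3 x 3 pinhole intrinsic matrix.
    """
    z = points_cam[:, 2]
    in_front = z > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out.
    safe_z = np.where(in_front, z, 1.0)
    # Homogeneous image coordinates: each row is K @ [x, y, z].
    uvw = points_cam @ K.T
    u = uvw[:, 0] / safe_z
    v = uvw[:, 1] / safe_z
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1)[inside], z[inside]
```

The returned pixel coordinates and depths could then be rasterized into a 2D map aligned with the RGB image, which is what allows both modalities to be processed in 2D space.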

Abstract:
We introduce PICFormer, a novel framework for Pluralistic Image Completion using a transFormer based architecture, that achieves both high quality and diversity at a much faster inference speed. Our key contribution is to introduce a code-shared codebook learning using a restrictive CNN on small and non-overlapping receptive fields (RFs) for the local visible token representation. This results in a compact yet expressive discrete representation, facilitating efficient modeling of global visible context relations by the transformer. Unlike the prevailing autoregressive approaches, we propose to sample all tokens simultaneously, leading to more than 100× faster inference speed. To enhance appearance consistency between visible and generated regions, we further propose a novel attention-aware layer (AAL), designed to better exploit distantly related high-frequency features. Through extensive experiments, we demonstrate that PICFormer efficiently learns semantically rich discrete codes, resulting in significantly improved image quality. Moreover, our diverse image completion framework surpasses state-of-the-art methods on multiple image completion datasets.

Abstract:
This paper proposes a novel transformer-based framework to generate accurate class-specific object localization maps for weakly supervised semantic segmentation (WSSS). Leveraging the insight that the attended regions of the one-class token in the standard vision transformer can generate class-agnostic localization maps, we investigate the transformer's capacity to capture class-specific attention for class-discriminative object localization by learning multiple class tokens. We present the Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with patch tokens. This is facilitated by a class-aware training strategy that establishes a one-to-one correspondence between output class tokens and ground-truth class labels. We also introduce a Contrastive-Class-Token (CCT) module to enhance the learning of discriminative class tokens, enabling the model to better capture the unique characteristics of each class. Consequently, the proposed framework effectively generates class-discriminative object localization maps from the class-to-patch attentions associated with different class tokens. To refine these localization maps, we propose the utilization of patch-level pairwise affinity derived from the patch-to-patch transformer attention. Furthermore, the proposed framework seamlessly complements the Class Activation Mapping (CAM) method, yielding significant improvements in WSSS performance on PASCAL VOC 2012 and MS COCO 2014. These results underline the importance of the class token for WSSS.

Abstract:
Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem, by introducing a novel correctable landmark discovery scheme based on two large models ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors due to the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for an elegant combination of our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. Especially, our CONSOLE establishes the new state-of-the-art results on R2R and R4R in unseen scenarios.

Abstract:
Transformers, originally devised for natural language processing (NLP), have also produced significant successes in computer vision (CV). Due to their strong expression power, researchers are investigating ways to deploy transformers for reinforcement learning (RL), and transformer-based models have manifested their potential in representative RL benchmarks. In this paper, we collect and dissect recent advances concerning the transformation of RL with transformers (transformer-based RL (TRL)) to explore the development trajectory and future trends of this field. We group the existing developments into two categories: architecture enhancements and trajectory optimizations, and examine the main applications of TRL in robotic manipulation, text-based games (TBGs), navigation, and autonomous driving. Architecture enhancement methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework, facilitating more precise modeling of agents and environments compared to traditional deep RL techniques. However, these methods are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the “deadly triad”. Trajectory optimization methods treat RL problems as sequence modeling problems and train a joint state-action model over entire trajectories under the behavior cloning framework; such approaches are able to extract policies from static datasets and fully use the long-sequence modeling capabilities of transformers. Given these advancements, the limitations and challenges in TRL are reviewed and proposals regarding future research directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.

Abstract:
Recently, perception tasks based on Bird’s-Eye View (BEV) representation have drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer-based transformation or depth representation. Fast-BEV consists of five parts. We propose (1) a lightweight, deployment-friendly view transformation which quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder which leverages multi-scale information for better performance, and (3) an efficient BEV encoder which is specifically designed to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, and (5) a multi-frame feature fusion mechanism to leverage the temporal information. Among them, (1) and (3) make Fast-BEV fast at inference and deployment-friendly on on-vehicle chips, while (2), (4) and (5) ensure that Fast-BEV has competitive performance. Together, these make Fast-BEV a solution with high performance, fast inference speed, and deployment-friendliness on the on-vehicle chips of autonomous driving. In experiments on the 2080Ti platform, our R50 model can run 52.6 FPS with 47.3% NDS on the nuScenes validation set, exceeding the 41.3 FPS and 47.5% NDS of the BEVDepth-R50 model (Li et al. 2022) and 30.2 FPS and 45.7% NDS of the BEVDet4D-R50 model (J. Huang and G. Huang, 2022). Our largest model (R101@900×1600) establishes a competitive 53.5% NDS on the nuScenes validation set.
We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips.

Abstract:
Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio. According to the nature of audio to lip motions mapping, the same speech content may have different appearances even for the same person at different occasions. Such one-to-many mapping problem brings ambiguity during training and thus causes inferior visual results. Although this one-to-many mapping could be alleviated in part by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it is still insufficient since the prediction is produced without enough information (e.g., emotions, wrinkles, etc.). In this paper, we propose MemFace to complement the missing information with an implicit memory and an explicit memory that follow the sense of the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that our proposed MemFace surpasses all the state-of-the-art results across multiple scenarios consistently and significantly.

Abstract:
Graph Convolutional Networks (GCNs) have shown outstanding performance in skeleton-based behavior recognition. However, their opacity hampers further development. Research on the explainability of deep learning has provided solutions to this issue, with Class Activation Map (CAM) algorithms being one class of explainable methods. However, existing CAM algorithms applied to GCNs often compute the contribution of individual nodes independently, overlooking the interactions between nodes in the skeleton. Therefore, we propose a game theory based class activation map for GCN (GT-CAM). First, GT-CAM integrates Shapley values with gradient weights to calculate node importance, producing an activation map that highlights the critical role of nodes in decision-making. It also reveals the cooperative dynamics between nodes or local subgraphs for a more comprehensive explanation. Second, to reduce the computational burden of Shapley values, we propose a method for calculating Shapley values of node coalitions. Lastly, to evaluate the rationality of coalition partitioning, we propose a rationality evaluation method based on bipartite game interaction and cooperative game theory. Additionally, we introduce an efficient calculation method for the coalition rationality coefficient based on the Monte Carlo method. Experimental results demonstrate that GT-CAM outperforms other competitive interpretation methods in visualization and quantitative analysis.
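
To give a sense of why Shapley values are costly and how sampling can help, the sketch below shows the standard permutation-sampling (Monte Carlo) estimator of Shapley values. It is a generic textbook technique, not GT-CAM's coalition-level procedure or its rationality coefficient; the function monte_carlo_shapley and the value callback are hypothetical names, and the "players" could stand for skeleton nodes or node coalitions.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, Sequence


def monte_carlo_shapley(
    players: Sequence[Hashable],
    value: Callable[[FrozenSet[Hashable]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for p in order:
            # Marginal contribution of p when joining the current coalition.
            coalition.add(p)
            current = value(frozenset(coalition))
            totals[p] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}
```

Exact Shapley values require evaluating every subset of players, which grows exponentially with their number; the estimator above instead costs one value evaluation per player per sampled ordering. Grouping nodes into coalitions, as the abstract describes, reduces the number of players and hence the cost further.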

Abstract:
Deep learning-based solutions have achieved impressive performance in semantic segmentation but often require large amounts of training data with fine-grained annotations. To alleviate this requirement, a variety of weakly supervised annotation strategies have been proposed, among which scribble supervision is emerging as a popular one due to its user-friendly annotation process. However, the sparsity and diversity of scribble annotations make it nontrivial to train a network to produce deterministic and consistent predictions directly. To address these issues, in this paper we propose a holistic solution involving the design of the network structure, loss and training procedure, named CC4S, to improve Certainty and Consistency for Scribble-Supervised Semantic Segmentation. Specifically, to reduce uncertainty, CC4S embeds a random walk module into the network structure to make neural representations uniformly distributed within similar semantic regions, which works together with a soft entropy loss function to force the network to produce deterministic predictions. To encourage consistency, CC4S adopts self-supervised training and imposes the consistency loss on the eigenspace of the probability transition matrix in the random walk module (which we name the neural eigenspace). Such self-supervision inherits the category-level discriminability from the neural eigenspace and meanwhile helps the network focus on producing consistent predictions for the salient parts and neglect semantically heterogeneous backgrounds. Finally, to further improve the performance, CC4S uses the network predictions as pseudo-labels and retrains the network with an extra color constraint regularizer. In comprehensive experiments, CC4S achieves performance comparable to that of fully supervised methods and shows promising robustness under extreme supervision cases.

Abstract:
With the emergence of AI generated content, cross-modal retrieval of 2D and 3D data has obtained increasing research attention. In practical applications, massive amounts of 2D and 3D data need expensive annotation, which would make labels scarce. Even worse, complicated heterogeneous relationships between 2D and 3D data make the problem more challenging. In this research, we study the problem of semi-supervised 2D and 3D cross-modal retrieval and provide a novel method named Hierarchical Alignment with Ambiguous Pseudo-labeling (HOPE) for this problem. The core of HOPE is to align two modalities in the common space from a hierarchical perspective. Specifically, HOPE not only enforces each sample to approach its respective modality-invariant anchors from an individual view, but also measures both prototypes and distribution for both modalities for discrepancy reduction from a group view. To handle label scarcity with limited error accumulation, HOPE employs two branches of perturbed networks to generate ambiguous candidates, which guides the cross-branch supervision using a margin-based ranking objective. In addition, we retrieve reliable unlabeled samples for each anchor with curriculum learning and class balance, which are added into labeled datasets to clear ambiguity. Extensive experiments on various benchmark datasets validate the superiority of the proposed HOPE.

Abstract:
Domain adaptive detection aims to improve the generalization of detectors on the target domain. To reduce the discrepancy in feature distributions between two domains, recent approaches achieve domain adaption through feature alignment in different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features in alignment, degrading detection. To address this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities, including pixel-, instance-, and category-levels, simultaneously to align two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. In addition, we introduce multi-granularity discriminators to identify which domain, source or target, samples of different granularities come from. Note that MGA not only leverages instance discriminability in different categories but also exploits category consistency between two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that explores model assessments for model update to improve pseudo labels and alleviate the local misalignment problem, boosting detection robustness. Extensive experiments on multiple domain adaption scenarios validate the superiority of MGA over other approaches on FCOS and Faster R-CNN detectors.

Abstract:
Object pose estimation constitutes a critical area within the domain of 3D vision. While contemporary state-of-the-art methods that leverage real-world pose annotations have demonstrated commendable performance, the procurement of such real training data incurs substantial costs. This paper focuses on a specific setting wherein only 3D CAD models are utilized as a priori knowledge, devoid of any background or clutter information. We introduce a novel method, CPPF++, designed for sim-to-real category-level pose estimation. This method builds upon the foundational point-pair voting scheme of CPPF, reformulating it through a probabilistic view. To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty by estimating the probabilistic distribution of each point pair within the canonical space. Furthermore, we augment the contextual information provided by each voting unit through the introduction of N-point tuples. To enhance the robustness and accuracy of the model, we incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a tuple feature ensemble. Alongside these methodological advancements, we introduce a new category-level pose estimation dataset, named DiversePose 300. Empirical evidence demonstrates that our method significantly surpasses previous sim-to-real approaches and achieves comparable or superior performance on novel datasets.

Abstract:
Integer programming with block structures has received considerable attention recently and is widely used in many practical applications such as train timetabling and vehicle routing problems. It is known to be NP-hard due to the presence of integer variables. We define a novel augmented Lagrangian function by directly penalizing the inequality constraints and establish the strong duality between the primal problem and the augmented Lagrangian dual problem. Then, a customized augmented Lagrangian method is proposed to address the block-structures. In particular, the minimization of the augmented Lagrangian function is decomposed into multiple subproblems by decoupling the linking constraints and these subproblems can be efficiently solved using the block coordinate descent method. We also establish the convergence property of the proposed method. To make the algorithm more practical, we further introduce several refinement techniques to identify high-quality feasible solutions. Numerical experiments on a few interesting scenarios show that our proposed algorithm often achieves a satisfactory solution and is quite effective.

Abstract:
The U-Net-like coarse-to-fine network design is currently the dominant choice for dense prediction tasks. Although this design can often achieve competitive performance, it suffers from some inherent limitations, such as training error propagation from low to high resolution and the dependency on the deeper and heavier backbones. To design an effective network that performs better, we instead propose Recurrent Multiscale Feature Modulation (R-MSFM), a new lightweight network design for self-supervised monocular depth estimation. R-MSFM extracts per-pixel features, builds a multiscale feature modulation module, and performs recurrent depth refinement through a parameter-shared decoder at a fixed resolution. This network design enables our R-MSFM to maintain a more lightweight architecture and fundamentally avoid error propagation caused by the coarse-to-fine design. Furthermore, we introduce the mask geometry consistency loss to facilitate our R-MSFM for geometry consistent depth learning. This loss penalizes the inconsistency of the estimated depths between adjacent views within the nonoccluded and nonstationary regions. Experimental results demonstrate the superiority of our proposed R-MSFM both at model size and inference speed, and show state-of-the-art results on two datasets: KITTI and Make3D.

Abstract:
We propose the integrally pre-trained transformer pyramid network (iTPN), which jointly optimizes the network backbone and the neck so that the transfer gap between representation models and downstream tasks is minimal. iTPN has two elaborated designs: 1) the first pre-trained feature pyramid upon a vision transformer (ViT), and 2) multi-stage supervision of the feature pyramid using masked feature modeling (MFM). iTPN is updated to Fast-iTPN, which reduces computational memory overhead and accelerates inference through two flexible designs: 1) token migration, which drops redundant tokens of the backbone while replenishing them in the feature pyramid without attention operations, and 2) token gathering, which reduces the computation cost caused by global attention by introducing a few gathering tokens. The base/large-level Fast-iTPN achieves 88.75%/89.5% top-1 accuracy on ImageNet-1K. With a 1× training schedule using DINO, the base/large-level Fast-iTPN achieves 58.4%/58.8% box AP on COCO object detection, and 57.5%/58.7% mIoU on ADE20K semantic segmentation using MaskDINO. Fast-iTPN can accelerate the inference procedure by up to 70% with negligible performance loss, demonstrating its potential to be a powerful backbone for downstream vision tasks.

Abstract:
Active learning (AL) aims to design label-efficient algorithms by labeling the most representative samples. It reduces annotation cost and has attracted increasing attention from the community. However, previous AL methods suffer from the inadequacy of annotations and unreliable uncertainty estimation. Moreover, we find that they ignore the intra-diversity of selected samples, which leads to sampling redundancy. In view of these challenges, we propose an inductive state-relabeling adversarial AL model (ISRA) that consists of a unified representation generator, an inductive state-relabeling discriminator, and a heuristic clique rescaling module. The generator introduces contrastive learning to leverage unlabeled samples for self-supervised training, where the mutual information is utilized to improve the representation quality for AL selection. Then, we design an inductive uncertainty indicator to learn the state score from labeled data and relabel unlabeled data with different importance for better discrimination of instructive samples. To solve the problem of sampling redundancy, the heuristic clique rescaling module measures the intra-diversity of candidate samples and recurrently rescales them to select the most informative samples. Experiments conducted on eight datasets and two imbalanced scenarios show that our model outperforms the previous state-of-the-art AL methods. As an extension to the cross-modal AL task, we apply ISRA to image captioning, where it also achieves superior performance.

Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; Digital Content and Media Sciences Research Division, National Institute of Informatics, Chiyoda City, Japan; College of Software and Technology, Zhejiang University, Hangzhou, China; School of Cyber Science and Engineering, Wuhan University, Wuhan, China; Computer Vision Lab, ETH Zurich, Zürich, Switzerland

Abstract:
Despite the impressive achievements of Deep Neural Networks (DNNs) in computer vision, their vulnerability to adversarial attacks remains a critical concern. Extensive research has demonstrated that incorporating sophisticated perturbations into input images can lead to a catastrophic degradation in DNNs’ performance. This perplexing phenomenon not only exists in the digital space but also in the physical world. Consequently, it becomes imperative to evaluate the security of DNNs-based systems to ensure their safe deployment in real-world scenarios, particularly in security-sensitive applications. To facilitate a profound understanding of this topic, this paper presents a comprehensive overview of physical adversarial attacks. First, we distill four general steps for launching physical adversarial attacks. Building upon this foundation, we uncover the pervasive role of artifacts carrying adversarial perturbations in the physical world. These artifacts influence each step. To denote them, we introduce a new term: adversarial medium. Then, we take the first step to systematically evaluate the performance of physical adversarial attacks, taking the adversarial medium as a first attempt. Our proposed evaluation metric, hiPAA, comprises six perspectives: Effectiveness, Stealthiness, Robustness, Practicability, Aesthetics, and Economics. We also provide comparative results across task categories, together with insightful observations and suggestions for future research directions.

Abstract:
Gradient inversion attacks (GIAs), which aim to reconstruct the private training data of clients (participating parties in distributed training) through the shared parameters, have posed significant challenges to the emerging paradigm of distributed learning. To counteract GIAs, a large number of privacy-preserving methods for distributed learning scenarios have emerged. However, these methods have significant limitations, either compromising the usability of the global model or consuming substantial additional computational resources. Furthermore, despite the extensive efforts dedicated to defense methods, the underlying causes of data leakage in distributed learning still have not been thoroughly investigated. Therefore, this paper tries to reveal the potential reasons behind the successful implementation of existing GIAs, explore variations in the robustness of models against GIAs during the training process, and investigate the impact of different model structures on attack performance. Based on these explorations and analyses, this paper proposes a plug-and-play GIA defense method, which augments the training data by a designed vicinal distribution. Sufficient empirical experiments demonstrate that this easy-to-implement method can ensure a basic level of privacy without compromising the usability of the global model.

Abstract:
Self-supervised learning (SSL) opens up huge opportunities for medical image analysis, which is well known for its lack of annotations. However, aggregating massive (unlabeled) 3D medical images like computerized tomography (CT) remains challenging due to high imaging cost and privacy restrictions. In our pilot study, we advocated bringing a wealth of 2D images like X-rays as compensation for the lack of 3D data, aiming to build a universal medical self-supervised representation learning framework, called UniMiSS. In particular, we designed a pyramid U-like medical Transformer (MiT) as the backbone to enable UniMiSS to perform SSL with both 2D and 3D images. UniMiSS surpasses current 3D-specific SSL in effectiveness and versatility, excelling in various downstream tasks and overcoming the limitations of dimensionality. However, the initial version did not fully explore the anatomical correlations between 2D and 3D images due to the absence of paired multi-modal patient data. In this extension, we introduce UniMiSS+, which leverages digitally reconstructed radiographs (DRR) technology to simulate X-rays from CT volumes, providing access to paired data. Benefiting from the paired group, we introduce an extra pair-wise constraint to boost the cross-modality correlation learning, which can also be adopted as a cross-dimension regularization to further improve the representations. We conduct extensive experiments on multiple 3D/2D medical image analysis tasks, including segmentation and classification. The results show that our UniMiSS+ achieves promising performance on various downstream tasks, not only outperforming ImageNet pre-training and other advanced SSL counterparts but also improving on the predecessor UniMiSS pre-training.

Abstract:
Multi-view learning has attracted increasing attention in recent years. However, traditional approaches focus only on the difference while ignoring the consistency among views. This may render some views, namely those affected by abnormal data or noise, ineffective in the process of view learning. Besides, current datasets have gradually become high-dimensional and large-scale. Therefore, this paper proposes a novel multi-view compressed subspace learning method via a low-rank tensor constraint, which incorporates the clustering process and multi-view learning into a unified framework. First, for each view, we take a subset of the samples to build a small-size dictionary, which greatly reduces both redundant information and computation cost. Then, to find the consistency and difference among views, we impose a low-rank tensor constraint on these representations and further design an auto-weighted mechanism to learn the optimal representation. Last, because the learned representation is non-square, a bipartite graph is introduced, and under the structured constraint, the clustering results can be obtained directly from this graph without any post-processing. Extensive experiments on synthetic and real-world benchmark datasets demonstrate the efficacy and efficiency of our method, especially for views with noise or outliers.

Abstract:
Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generation (SGG) that aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes. Compared to SGG, PSG has several challenging problems: pixel-level segment outputs and full relationship exploration (it also considers relations between things and stuff). Thus, current PSG methods have limited performance, which hinders downstream tasks or applications. This work aims to design a novel and strong baseline for PSG. To achieve that, we first conduct an in-depth analysis to identify the bottleneck of current PSG models, finding that inter-object pair-wise recall is a crucial factor that was ignored by previous PSG methods. Based on this and the recent query-based frameworks, we present a novel framework: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pair-wise relationships between subjects and objects. Moreover, we also observed the sparse nature of object pairs for both. Motivated by this, we design a lightweight Matrix Learner within the PPN, which directly learns pair-wise relationships for pair proposal generation. Through extensive ablation and analysis, our approach significantly improves upon a solid segmenter-based baseline. Notably, our method achieves over 10% absolute gains compared to our baseline, PSGFormer.

Abstract:
The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in the domain of multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection by addressing three key aspects specifically tailored for this task. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network, which represents queries as nodes in a graph and enables effective modeling of object interactions within a social context. Second, to solve the problem of missing hard cases in the proposed output of the encoder in the current frame, we incorporate the output of the previous frame to initialize the query input of the decoder. Finally, it is challenging for the network to distinguish the positive query from other highly similar queries that are not the best match; such similar queries are insufficiently suppressed and turn into redundant prediction boxes. To address this issue, our proposed IoU regularization term encourages similar queries to be distinct during the refinement. Through extensive experiments, we demonstrate the effectiveness of our approach in handling challenging scenarios, while incurring only a minor additional computational overhead.

Affiliations: School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden; Department of Electrical and Electronic Engineering, Imperial College London, London, U.K.; Artificial Intelligence and its Applications Institute, School of Informatics, University of Edinburgh, Edinburgh, U.K.; Department of Control Science and Engineering, Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China; AGH University, Kraków, Poland; College of Computing & Data Science, Nanyang Technological University, Singapore

Abstract:
Many machine learning problems can be formulated as non-convex multi-player games. Due to non-convexity, it is challenging to obtain the existence condition of the global Nash equilibrium (NE) and design theoretically guaranteed algorithms. This paper studies a class of non-convex multi-player games, where players’ payoff functions consist of canonical functions and quadratic operators. We leverage conjugate properties to transform the complementary problem into a variational inequality (VI) problem using a continuous pseudo-gradient mapping. We prove the existence condition of the global NE as the solution to the VI problem satisfies a duality relation. We then design an ordinary differential equation to approach the global NE with an exponential convergence rate. For practical implementation, we derive a discretized algorithm and apply it to two scenarios: multi-player games with generalized monotonicity and multi-player potential games. In the two settings, step sizes are required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$ to yield the convergence rates of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$, respectively. Extensive experiments on robust neural network training and sensor network localization validate our theory. Our code is available at https://github.com/GuanpuChen/Global-NE.

Abstract:
Rejecting outlier correspondences is one of the critical steps for successful feature-based two-view geometry estimation, and it depends heavily on local context exploration. Recent advances focus on devising elaborate local context extractors while typically adopting explicit neighborhood relationship modeling at a specific scale. This is intrinsically flawed and inflexible, because 1) severe outliers often populate the putative correspondences and 2) the uncertainty in the distribution of inliers and outliers makes the network incapable of capturing adequate and reliable local context from such neighborhoods, resulting in the failure of pose estimation. This prospective study proposes a novel network called U-Match that has the flexibility to enable implicit local context awareness at multiple levels, naturally circumventing the aforementioned issues that plague most existing studies. Specifically, to aggregate multi-level local context implicitly, a hierarchy-aware graph representation module is designed to flexibly encode and decode hierarchical features. Moreover, considering that global context always works collaboratively with local context, an orthogonal local-and-global information fusion module is presented to integrate complementary local and global context in a redundancy-free manner, thus yielding compact feature representations to facilitate correspondence learning. Thorough experimentation across relative pose estimation, homography estimation, visual localization, and point cloud registration affirms U-Match's remarkable capabilities.

Abstract:
In this paper, we introduce Neural Collaborative Search (NCS), a novel learning-based framework for efficiently solving pickup and delivery problems (PDPs). NCS pioneers the collaboration between the latest prevalent neural construction and neural improvement models, establishing a collaborative framework where an improvement model iteratively refines solutions initiated by a construction model. Our NCS collaboratively trains the two models via reinforcement learning with an effective shared-critic mechanism. In addition, the construction model enhances the improvement model with high-quality initial solutions via curriculum learning, while the improvement model accelerates the convergence of the construction model through imitation learning. Besides the new framework design, we also propose Neural Neighborhood Search (N2S), an efficient improvement model employed within the NCS framework. N2S exploits a tailored Markov decision process formulation and two customized decoders for removing and then reinserting a pair of pickup-delivery nodes, thereby learning a ruin-repair search process for addressing the precedence constraints in PDPs efficiently. To balance the computation cost between encoders and decoders, N2S streamlines the existing encoder design through a light Synthesis Attention mechanism that allows the vanilla self-attention to synthesize various features regarding a route solution. Moreover, a diversity enhancement scheme is further leveraged to improve the performance during the inference of N2S. Our NCS and N2S are both generic, and extensive experiments on two canonical PDP variants show that they can produce state-of-the-art results among existing neural methods. Remarkably, our NCS and N2S can surpass the well-known LKH3 solver, especially on the more constrained PDP variant.

Abstract:
Combining LiDAR points and images for robust semantic segmentation has shown great potential. However, the heterogeneity between the two modalities (e.g., the density and the field of view) poses challenges in establishing a bijective mapping between each point and pixel. This modality alignment problem introduces new challenges in network design and data processing for cross-modal methods. Specifically, 1) some points are projected outside the image planes and therefore have no corresponding pixels; 2) the complexity of maintaining geometric consistency limits the deployment of many data augmentation techniques. To address these challenges, we propose a cross-modal knowledge imputation and transition approach. First, we introduce a bidirectional feature fusion strategy that imputes missing image features and performs cross-modal fusion simultaneously. This allows us to generate reliable predictions even when images are missing. Second, we propose a Uni-to-Multi modal Knowledge Distillation (U2MKD) framework, leveraging the transfer of informative features from a single-modality teacher to a cross-modality student. This overcomes the issues of augmentation misalignment and enables us to train the student effectively. Extensive experiments on the nuScenes, Waymo, and SemanticKITTI datasets demonstrate the effectiveness of our approach. Notably, our method achieves an 8.3 mIoU gain over the LiDAR-only baseline on the nuScenes validation set and achieves state-of-the-art performance on the three datasets.
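
The first alignment issue mentioned above, points that fall outside the image plane, can be sketched with a plain pinhole camera model. This is a minimal illustration under assumed names (`project_point`, `has_pixel`) and assumed intrinsics `fx`, `fy`, `cx`, `cy`; it is not the projection or imputation code of the paper.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    if z <= 0:  # behind the camera: no valid projection
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def has_pixel(point, fx, fy, cx, cy, width, height):
    """True if the point lands inside the image and thus has image features."""
    uv = project_point(point, fx, fy, cx, cy)
    return uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height
```

Points for which `has_pixel` is false are the ones whose image features would have to be imputed rather than sampled.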

Abstract:
We propose an end-to-end visuomotor navigation framework that leverages Neural Radiance Fields (NeRF) for spatial cognition. To the best of our knowledge, this is the first effort to integrate such implicit spatial representation with embodied policy end-to-end for cognitive decision-making. Consequently, our system does not necessitate modularized designs nor transformations into explicit scene representations for downstream control. The NeRF-based memory is constructed online during navigation, without relying on any environmental priors. To enhance the extraction of decision-critical historical insights from the rigid and implicit structure of NeRF, we introduce a spatial information extraction mechanism named Structural Radiance Attention (SRA). SRA empowers the agent to grasp complex scene structures and task objectives, thus paving the way for the development of intelligent behavioral patterns. Our comprehensive testing in image-goal navigation tasks demonstrates that our approach significantly outperforms existing navigation models. We demonstrate that SRA markedly improves the agent's understanding of both the scene and the task by retrieving historical information stored in NeRF memory. The agent also learns exploratory awareness from our pipeline to better adapt to low signal-to-noise memory signals in unknown scenes. We deploy our navigation system on a mobile robot in real-world scenarios, where it exhibits evident cognitive capabilities while ensuring real-time performance.

Abstract:
Correspondence pruning, which aims at identifying correct correspondences (inliers) from initial ones, plays a crucial role in a variety of feature-matching-based tasks. Seeking consistent $k$-nearest neighbors in both coordinate and feature spaces is a prevalent strategy employed in previous approaches. However, the vicinity of an inlier contains numerous irregular false correspondences (outliers), which leads them to mistakenly become neighbors according to the similarity constraint of nearest neighbors. To tackle this issue, we propose a global-graph space to seek consistent neighbors with similar graph structures. This is achieved by using a global connected graph to explicitly render the affinity relationship between correspondences based on the spatial and feature consistency. Furthermore, to enhance the robustness of the method for various matching scenes, we develop a neighbor consistency block to adequately leverage the potential of three types of neighbors. The consistency can be progressively mined by sequentially extracting intra-neighbor context and exploring inter-neighbor interactions. Ultimately, we present a Neighbor Consistency Mining Network (NCMNet) to estimate the parametric models and remove outliers. Extensive experimental results demonstrate that the proposed method outperforms other state-of-the-art methods on various benchmarks for two-view geometry estimation. Meanwhile, four extended tasks, including remote sensing image registration, point cloud registration, 3D reconstruction, and visual localization, are conducted to test the generalization ability.
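
The prevalent strategy described above, keeping only neighbors that are nearest in both the coordinate space and the feature space, can be sketched as follows. This is a brute-force illustration of that baseline idea under assumed names (`knn_indices`, `consistent_neighbors`); it does not implement the global-graph space or NCMNet.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the indices of its k nearest neighbors (Euclidean)."""
    result = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append([j for _, j in dists[:k]])
    return result

def consistent_neighbors(coords, feats, k):
    """Keep only neighbors that are among the k nearest in both spaces."""
    nn_coord = knn_indices(coords, k)
    nn_feat = knn_indices(feats, k)
    return [sorted(set(a) & set(b)) for a, b in zip(nn_coord, nn_feat)]
```

A correspondence that is spatially close but dissimilar in feature space (or the reverse) is dropped from the neighbor set, which is exactly where nearby outliers can still slip in when they happen to satisfy both constraints.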

Abstract:
Dual-pixel (DP) imaging sensors are increasingly adopted by modern cameras. A DP camera captures a pair of images in a single snapshot by splitting each pixel in half. Several previous studies show how to recover depth information by treating the DP pair as an approximate stereo pair. However, unlike classic stereo disparity, dual-pixel disparity occurs only in image regions with defocus blur. Heavy defocus blur in DP pairs affects the performance of depth estimation approaches based on matching. Therefore, we treat the blur removal and the depth estimation as a joint problem. We investigate the formation of the DP pair, which links the blur and depth information, rather than blindly removing the blur effect. We propose a mathematical DP model that can improve depth estimation by exploiting the blur. This exploration motivated us to propose our previous work, an end-to-end DDDNet (DP-based Depth and Deblur Network), which jointly estimates depth and restores the image in a supervised fashion. However, collecting the ground-truth (GT) depth map for the DP pair is challenging and limits the depth estimation potential of the DP sensor. Therefore, we propose an extension of the DDDNet, called WDDNet (Weakly-supervised Depth and Deblur Network), which includes an efficient reblur solver that does not require GT depth maps for training. To achieve this, we convert all-in-focus images into supervisory signals for unsupervised depth estimation in our WDDNet. We jointly estimate an all-in-focus image and a disparity map, then use a Reblur and Fstack module to regularize the disparity estimation and image restoration. We conducted extensive experiments on synthetic and real data to demonstrate the competitive performance of our method when compared to state-of-the-art (SOTA) supervised approaches.

Abstract:
A coverage assumption is critical with policy gradient methods, because while the objective function is insensitive to updates in unlikely states, the agent may need improvements in those states to reach a nearly optimal payoff. However, this assumption can be infeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative state space pure exploration strategy improving coverage of any restart distribution $\rho$. Using $\rho$ and intrinsic rewards, Curious Explorer produces a sequence of policies, each one more exploratory than the previous one, and outputs a restart distribution with coverage based on the state visitation distribution of the exploratory policies. The main results of this paper are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.

Abstract:
Estimating 3D human texture from a single image is essential in graphics and vision. It requires learning a mapping function from input images of humans with diverse poses into the parametric (uv) space and reasonably hallucinating invisible parts. To achieve a high-quality 3D human texture estimation, we propose a framework that adaptively samples the input by a deformable convolution where offsets are learned via a deep neural network. Additionally, we describe a novel cycle consistency loss that improves view generalization. We further propose to train our framework with an uncertainty-based pixel-level image reconstruction loss, which enhances color fidelity. We compare our method against the state-of-the-art approaches and show significant qualitative and quantitative improvements.

Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong; Research Institute for Future Media Computing, Shenzhen University, Shenzhen, China; Department of Informatics, University of Leicester, Leicester, U.K.; Faculty of Information Science and Engineering and the Institute for Advanced Ocean Study, Ocean University of China, Qingdao, China; National Key Laboratory for Multimedia Information Processing and National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China

Abstract:
Photometric stereo recovers the surface normals of an object from multiple images with varying shading cues, i.e., modeling the relationship between surface orientation and intensity at each pixel. Photometric stereo excels in per-pixel resolution and fine reconstruction detail. However, it is a complicated problem because of the non-linear relationship caused by non-Lambertian surface reflectance. Recently, various deep learning methods have shown a powerful ability to handle non-Lambertian surfaces in photometric stereo. This paper provides a comprehensive review of existing deep learning-based calibrated photometric stereo methods utilizing orthographic cameras and directional light sources. We first analyze these methods from different perspectives, including input processing, supervision, and network architecture. We then summarize the performance of deep learning photometric stereo models on the most widely used benchmark dataset, which demonstrates the advanced performance of deep learning-based photometric stereo methods. Finally, we give suggestions and propose future research trends based on the limitations of existing models.

Abstract:
Generating realistic 3D human motion has been a fundamental goal of the game/animation industry. This work presents a novel transition generation technique that can bridge the actions of people in the foreground by generating 3D poses and shapes in-between photos, allowing 3D animators/novice users to easily create/edit 3D motions. To achieve this, we propose an adaptive motion network (ADAM-Net) that effectively learns human motion from masked action sequences to generate kinematically compliant 3D poses and shapes in-between given temporally-sparse photos. Three core learning designs underpin ADAM-Net. First, we introduce a random masking process that randomly masks images from an action sequence and fills masked regions in latent space by interpolation of unmasked images to simulate various transitions under given temporally-sparse photos. Second, we propose a long-range adaptive motion (L-ADAM) attention module that leverages visual cues observed from human motion to adaptively recalibrate the range that needs attention in a sequence, along with a multi-head cross-attention. Third, we develop a short-range adaptive motion (S-ADAM) attention module that weightedly selects and integrates adjacent feature representations at different levels to strengthen temporal correlation. By coupling these designs, the results demonstrate that ADAM-Net excels not only in generating 3D poses and shapes in-between photos, but also in classic 3D human pose and shape estimation.

Abstract:
Digitalization of large-scale urban scenes (in particular buildings) has been a long-standing open problem, which is attributed to the challenges in data acquisition, such as incomplete scene coverage, lack of semantics, low efficiency, and low reliability in path planning. In this paper, we address these challenges in urban building reconstruction from aerial images, and we propose an effective workflow and a few novel algorithms for efficient 3D building instance proxy reconstruction for large urban scenes. Specifically, we propose a novel learning-based approach to instance segmentation of urban buildings from aerial images followed by a voting-based algorithm to fuse the multi-view instance information to a sparse point cloud (reconstructed using a standard Structure from Motion pipeline). Our method enables effective instance segmentation of the building instances from the point cloud. We also introduce a layer-based surface reconstruction method dedicated to the 3D reconstruction of building proxies from extremely sparse point clouds. Extensive experiments on both synthetic and real-world aerial images of large urban scenes have demonstrated the effectiveness of our approach. The generated scene proxy models can already provide a promising 3D surface representation of the buildings in large urban scenes, and when applied to aerial path planning, the instance-enhanced building proxy models can significantly improve data completeness and accuracy, yielding highly detailed 3D building models.

Abstract:
Transformers have shown remarkable performance; however, their architecture design is a time-consuming process that demands expertise and trial-and-error. Thus, it is worthwhile to investigate efficient methods for automatically searching high-performance Transformers via Transformer Architecture Search (TAS). To improve search efficiency, training-free proxy-based methods have been widely adopted in Neural Architecture Search (NAS). However, these proxies have been found to generalize poorly to Transformer search spaces, as confirmed by several studies and our own experiments. This paper presents an effective scheme for TAS called TRansformer Architecture search with ZerO-cost pRoxy guided evolution (T-Razor) that achieves exceptional efficiency. First, through theoretical analysis, we discover that the synaptic diversity of multi-head self-attention (MSA) and the saliency of multi-layer perceptron (MLP) are correlated with the performance of corresponding Transformers. The properties of synaptic diversity and synaptic saliency motivate us to introduce the ranks of synaptic diversity and saliency, denoted as DSS++, for evaluating and ranking Transformers. DSS++ incorporates correlation information among sampled Transformers to provide unified scores for both synaptic diversity and synaptic saliency. We then propose a block-wise evolution search guided by DSS++ to find optimal Transformers. DSS++ determines the positions for mutation and crossover, enhancing the exploration ability. Experimental results demonstrate that our T-Razor performs competitively against the state-of-the-art manually or automatically designed Transformer architectures across four popular Transformer search spaces. Significantly, T-Razor improves the search efficiency across different Transformer search spaces, e.g., reducing required GPU days from more than 24 to less than 0.4 and outperforming existing zero-cost approaches.
We also apply T-Razor to the BERT search space and find that the searched Transformers achieve competitive GLUE results on several Natural Language Processing (NLP) datasets. This work provides insights into training-free TAS, revealing the usefulness of evaluating Transformers based on the properties of their different blocks.

Abstract:
Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks.

Abstract:
Transformer based methods have achieved great success in image inpainting recently. However, we find that these solutions regard each pixel as a token, thus suffering from an information loss issue from two aspects: 1) They downsample the input image into much lower resolutions for efficiency consideration. 2) They quantize $256^3$ RGB values to a small number (such as 512) of quantized color values. The indices of quantized pixels are used as tokens for the inputs and prediction targets of the transformer. To mitigate these issues, we propose a new transformer based framework called “PUT”. Specifically, to avoid input downsampling while maintaining computation efficiency, we design a patch-based auto-encoder P-VQVAE. The encoder converts the masked image into non-overlapped patch tokens and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by input quantization, an Un-quantized Transformer is applied. It directly takes features from the P-VQVAE encoder as input without any quantization and only regards the quantized tokens as prediction targets. Furthermore, to make the inpainting process more controllable, we introduce semantic and structural conditions as extra guidance. Extensive experiments show that our method greatly outperforms existing transformer based methods on image fidelity and achieves much higher diversity and better fidelity than state-of-the-art pluralistic inpainting methods on complex large-scale datasets (e.g., ImageNet).
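
The information loss from color quantization mentioned above can be made concrete with a toy nearest-color lookup. The sketch uses a hypothetical 8-entry palette instead of the roughly 512 entries named in the abstract, and the helper `quantize` is an illustrative name, not part of PUT or P-VQVAE.

```python
def quantize(pixel, palette):
    """Return the index of the palette color closest to an RGB pixel."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, palette[i])),
    )

# Hypothetical 8-color palette: the corners of the RGB cube.
palette = [(r, g, b) for r in (0, 255) for g in (0, 255) for b in (0, 255)]

pixel = (200, 30, 90)
token = quantize(pixel, palette)   # index that would serve as the token
restored = palette[token]          # color the token decodes back to
error = tuple(abs(p - c) for p, c in zip(pixel, restored))
```

Here the pixel maps to the pure-red entry, so `restored` differs from the input in every channel; a larger palette shrinks this error but cannot remove it, which is the motivation for feeding un-quantized features to the transformer.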

Abstract:
Most artificial lights exhibit subtle fluctuations in intensity and frequency in response to the influence of the grid's alternating current, providing the potential to estimate the Electric Network Frequency (ENF) from conventional frame-based videos. Nevertheless, the performance of Video-based ENF (V-ENF) estimation largely relies on the imaging quality and thus may suffer from significant interference caused by non-ideal sampling, scene diversity, motion interference, and extreme lighting conditions. In this paper, we show that the ENF can be extracted without the above limitations from a new modality provided by the so-called event camera, a neuromorphic sensor that encodes the light intensity variations and asynchronously emits events with extremely high temporal resolution and high dynamic range. Specifically, we formulate and validate the physical mechanism for the ENF captured in events and then propose a simple yet robust Event-based ENF (E-ENF) estimation method through mode filtering and harmonic enhancement. To validate the effectiveness, we build the first Event-Video ENF Dataset (EV-ENFD) and its extension EV-ENFD+ with diverse scenarios, including static, dynamic, and extreme lighting scenes. Comprehensive experiments have been conducted on our proposed datasets, showcasing that our proposed E-ENF significantly outperforms the V-ENF in extracting accurate ENF traces, especially in challenging environments.
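
To give a feel for frequency estimation from an intensity signal, the sketch below picks the strongest component among a few candidate frequencies of a synthetic flicker trace. It is a generic spectral-peak search, not the E-ENF method (which relies on mode filtering and harmonic enhancement); the sampling rate, the candidate grid, and the assumption that the light flickers at twice the grid frequency are all illustrative.

```python
import cmath
import math

def dominant_frequency(samples, rate, candidates):
    """Return the candidate frequency (Hz) with the largest DFT magnitude."""
    def magnitude(freq):
        return abs(sum(
            s * cmath.exp(-2j * math.pi * freq * n / rate)
            for n, s in enumerate(samples)
        ))
    return max(candidates, key=magnitude)

# Synthetic intensity trace: a 100 Hz flicker (assumed: twice a 50 Hz grid).
rate = 1000.0  # samples per second (hypothetical)
samples = [math.sin(2 * math.pi * 100.0 * n / rate) for n in range(1000)]
candidates = [99.0 + 0.5 * i for i in range(5)]  # 99.0, 99.5, ..., 101.0 Hz
flicker = dominant_frequency(samples, rate, candidates)
enf_estimate = flicker / 2.0  # halve under the twice-grid-frequency assumption
```

On real video or event data the trace is far noisier and non-uniformly sampled, which is why a plain peak search like this one is only a starting point.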

Abstract:
Neural Radiance Field (NeRF) has achieved substantial progress in novel view synthesis given multi-view images. Recently, some works have attempted to train a NeRF from a single image with 3D priors. They mainly focus on a limited field of view with a few occlusions, which greatly limits their scalability to real-world 360-degree panoramic scenarios with large-size occlusions. In this paper, we present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama. Notably, PERF allows 3D roaming in a complex scene without expensive and tedious image collection. To achieve this goal, we propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first predict a panoramic depth map as initialization given a single panorama and reconstruct visible 3D regions with volume rendering. Then we introduce a collaborative RGBD inpainting approach into a NeRF for completing RGB images and depth maps from random views, which is derived from an RGB Stable Diffusion model and a monocular depth estimator. Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent geometry between a newly-sampled view and reference views. The two components are integrated into the learning of NeRFs in a unified optimization framework and achieve promising results. Extensive experiments on Replica and a new dataset PERF-in-the-wild demonstrate the superiority of our PERF over state-of-the-art methods. Our PERF can be widely used for real-world applications, such as panorama-to-3D, text-to-3D, and 3D scene stylization applications.

Abstract:
This paper presents a new end-to-end signal classification method using the signed cumulative distribution transform (SCDT). We adopt a transport generative model to define the classification problem. We then make use of mathematical properties of the SCDT to render the problem easier in transform domain, and solve for the class of an unknown sample using a nearest local subspace (NLS) search algorithm in SCDT domain. Experiments show that the proposed method provides high accuracy classification results while being computationally cheap, data efficient, and robust to out-of-distribution samples with respect to the existing end-to-end classification methods. The implementation of the proposed method in Python language is integrated as a part of the software package PyTransKit [1].

Abstract:
We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.6% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.55% in text detection and spotting tasks, along with a 47.1% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 4.7% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection.

Abstract:
The main challenge for fine-grained few-shot image classification is to learn feature representations with higher inter-class and lower intra-class variations, with a mere few labelled samples. Conventional few-shot learning methods however cannot be naively adopted for this fine-grained setting – a quick pilot study reveals that they in fact push for the opposite (i.e., lower inter-class variations and higher intra-class variations). To alleviate this problem, prior works predominately use a support set to reconstruct the query image and then utilize metric learning to determine its category. Upon careful inspection, we further reveal that such unidirectional reconstruction methods only help to increase inter-class variations and are not effective in tackling intra-class variations. In this paper, we introduce a bi-reconstruction mechanism that can simultaneously accommodate for inter-class and intra-class variations. In addition to using the support set to reconstruct the query set for increasing inter-class variations, we further use the query set to reconstruct the support set for reducing intra-class variations. This design effectively helps the model to explore more subtle and discriminative features which is key for the fine-grained problem in hand. Furthermore, we also construct a self-reconstruction module to work alongside the bi-directional module to make the features even more discriminative. We introduce the snapshot ensemble method in the episodic learning strategy – a simple trick to further improve model performance without increasing training costs. Experimental results on three widely used fine-grained image classification datasets, as well as general and cross-domain few-shot image datasets, consistently show considerable improvements compared with other methods.

Abstract:
Universal approximation capability, also referred to as universality, is an important property of deep neural networks, endowing them with the potency to accurately represent the underlying target function in learning tasks. In practice, the architecture of deep neural networks largely influences the performance of the models. However, most existing methodologies for designing neural architectures, such as the heuristic manual design or neural architecture search, ignore the universal approximation property, thus losing a potential safeguard about the performance. In this paper, we propose a unified framework to design the architectures of deep neural networks with a universality guarantee based on first-order optimization algorithms, where the forward pass is interpreted as the updates of an optimization algorithm. The (explicit or implicit) network is designed by replacing each gradient term in the algorithm with a learnable module similar to a two-layer network or its derivatives. Specifically, we explore the realm of width-bounded neural networks, a common practical scenario, showcasing their universality. Moreover, adding operations of normalization, downsampling, and upsampling does not hurt the universality. To the best of our knowledge, this is the first work that width-bounded networks with universal approximation guarantee can be designed in a principled way. Our framework can inspire a variety of neural architectures including some renowned structures such as ResNet and DenseNet, as well as novel innovations. The experimental results on image classification problems demonstrate that the newly inspired networks are competitive and surpass the baselines of ResNet, DenseNet, as well as the advanced ConvNeXt and ViT, testifying to the effectiveness of our framework.

Abstract:
One-to-one matching is a crucial design in DETR-like object detection frameworks. It enables the DETR to perform end-to-end detection. However, it also faces challenges of lacking positive sample supervision and slow convergence speed. Several recent works proposed the one-to-many matching mechanism to accelerate training and boost detection performance. We revisit these methods and model them in a unified format of augmenting the object queries. In this paper, we propose two methods that realize one-to-many matching from a different perspective of augmenting images or image features. The first method is One-to-many Matching via Data Augmentation (denoted as DataAug-DETR). It spatially transforms the images and includes multiple augmented versions of each image in the same training batch. Such a simple augmentation strategy already achieves one-to-many matching and surprisingly improves DETR's performance. The second method is One-to-many matching via Feature Augmentation (denoted as FeatAug-DETR). Unlike DataAug-DETR, it augments the image features instead of the original images and includes multiple augmented features in the same batch to realize one-to-many matching. FeatAug-DETR significantly accelerates DETR training and boosts detection performance while keeping the inference speed unchanged. We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and $\mathcal{H}$-Deformable-DETR. Without extra training data, FeatAug-DETR shortens the training convergence periods of Deformable-DETR (Zhu et al. 2020) to 24 epochs and achieves 58.3 AP on COCO val2017 set with Swin-L as the backbone.

Abstract:
Event cameras show great potential in 3D hand pose estimation, especially in addressing the challenges of fast motion and high dynamic range in a low-power way. However, due to the asynchronous differential imaging mechanism, it is challenging to design event representations that encode hand motion information, especially when the hands are not moving (causing motion ambiguity), and it is infeasible to fully annotate the temporally dense event stream. In this paper, we propose EvHandPose, with novel hand flow representations in an Event-to-Pose module for accurate hand pose estimation and for alleviating the motion ambiguity issue. To solve the problem under sparse annotation, we design contrast maximization and hand-edge constraints in a Pose-to-IWE (Image with Warped Events) module and formulate EvHandPose in a weakly supervised framework. We further build EvRealHands, the first large-scale real-world event-based hand pose dataset on several challenging scenes to bridge the real-synthetic domain gap. Experiments on EvRealHands demonstrate that EvHandPose outperforms previous event-based methods under all evaluation scenes, achieves accurate and stable hand pose estimation with high temporal resolution in fast motion and strong light scenes compared with RGB-based methods, generalizes well to outdoor scenes and another type of event camera, and shows the potential for the hand gesture recognition task.

Abstract:
Despite recent progress in Graph Neural Networks (GNNs), explaining predictions made by GNNs remains a challenging and nascent problem. The leading method mainly considers the local explanations, i.e., important subgraph structure and node features, to interpret why a GNN model makes the prediction for a single instance, e.g. a node or a graph. As a result, the explanation generated is painstakingly customized at the instance level. The unique explanation interpreting each instance independently is not sufficient to provide a global understanding of the learned GNN model, leading to the lack of generalizability and hindering it from being used in the inductive setting. Besides, training the explanation model explaining for each instance is time-consuming for large-scale real-life datasets. In this study, we address these key challenges and propose PGExplainer, a parameterized explainer for GNNs. PGExplainer adopts a deep neural network to parameterize the generation process of explanations, which renders PGExplainer a natural approach to multi-instance explanations. Compared to the existing work, PGExplainer has better generalization ability and can be utilized in an inductive setting without training the model for new instances. Thus, PGExplainer is much more efficient than the leading method with significant speed-up. In addition, the explanation networks can also be utilized as a regularizer to improve the generalization power of existing GNNs when jointly trained with downstream tasks. Experiments on both synthetic and real-life datasets show highly competitive performance with up to 24.7% relative improvement in AUC on explaining graph classification over the leading baseline.

Abstract:
Generative Adversarial Networks (GANs) are widely-used generative models for synthesizing complex and realistic data. However, mode collapse, where the diversity of generated samples is significantly lower than that of real samples, poses a major challenge for further applications. Our theoretical analysis demonstrates that the generator loss function is non-convex with respect to its parameters when there are multiple modes in real data. In particular, parameters that result in generated distributions with perfect partial mode coverage of the real distribution are the local minima of the generator loss function. To address mode collapse, we propose a unified framework called Dynamic GAN. This method detects collapsed samples in the generator by thresholding on observable discriminator outputs, divides the training set based on these collapsed samples, and trains a dynamic conditional model on the partitions. The theoretical outcome ensures progressive mode coverage and experiments on synthetic and real-world data sets demonstrate that our method surpasses several GAN variants. In conclusion, we examine the root cause of mode collapse and offer a novel approach to quantitatively detect and resolve it in GANs.

Affiliations: Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China; Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China

Abstract:
Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods often tend to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance the VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging fashion. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these datasets. Third, we propose a typology that presents the development process, similarities and differences, robustness comparison, and technical features of existing debiasing methods. Furthermore, we analyze and discuss the robustness of representative vision-and-language pre-training models on VQA. Finally, through a thorough review of the available literature and experimental analysis, we discuss the key areas for future research from various viewpoints.

Abstract:
Randomness is widely introduced in neural network training to simplify model optimization or avoid the over-fitting problem. Among them, dropout and its variations in different aspects (e.g., data, model structure) are prevalent in regularizing the training of deep neural networks. Though effective and performing well, the randomness introduced by these dropout-based methods causes nonnegligible inconsistency between training and inference. In this paper, we introduce a simple consistency training strategy to regularize such randomness, namely R-Drop, which forces two output distributions sampled by each type of randomness to be consistent. Specifically, R-Drop minimizes the bidirectional KL-divergence between two output distributions produced by dropout-based randomness for each training sample. Theoretical analysis reveals that R-Drop can reduce the above inconsistency by reducing the inconsistency among the sampled sub structures and bridging the gap between the loss calculated by the full model and sub structures. Experiments on 7 widely-used deep learning tasks (23 datasets in total) demonstrate that R-Drop is universally effective for different types of neural networks (i.e., feed-forward, recurrent, and graph neural networks) and different learning paradigms (supervised, parameter-efficient, and semi-supervised). In particular, it achieves state-of-the-art performances with the vanilla Transformer model on WMT14 English → German translation (30.91 BLEU) and WMT14 English → French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models.
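
The bidirectional KL term described in this abstract can be illustrated with a minimal sketch. It assumes two logit vectors obtained from two dropout-perturbed forward passes on the same sample; the function names, the weight `alpha`, and the 1/2 averaging factor are illustrative assumptions, not details stated in the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def cross_entropy(p, label, eps=1e-12):
    """Negative log-likelihood of the true label under distribution p."""
    return -math.log(p[label] + eps)

def r_drop_loss(logits_a, logits_b, label, alpha=1.0):
    """Task loss on both passes plus a symmetric KL consistency term.

    logits_a and logits_b stand for the outputs of two forward passes of
    the same input with different dropout masks (assumed, not computed here).
    alpha is a hypothetical weight on the consistency term.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    task = cross_entropy(p_a, label) + cross_entropy(p_b, label)
    sym_kl = 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))
    return task + alpha * sym_kl

# Example: two slightly different predictions for a 3-class sample.
loss = r_drop_loss([2.0, 0.5, -1.0], [1.0, 1.5, -0.5], label=0)
```

When the two passes agree exactly, the consistency term vanishes and only the task loss remains, which is the behavior the regularizer pushes toward.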

Abstract:
Video compression is indispensable to most video analysis systems. Despite saving the transportation bandwidth, it also deteriorates downstream video understanding tasks, especially at low-bitrate settings. To systematically investigate this problem, we first thoroughly review the previous methods, revealing that three principles, i.e., task-decoupled, label-free, and data-emerged semantic prior, are critical to a machine-friendly coding framework but are not fully satisfied so far. In this paper, we propose a traditional-neural mixed coding framework that simultaneously fulfills all these principles, by taking advantage of both traditional codecs and neural networks (NNs). On one hand, the traditional codecs can efficiently encode the pixel signal of videos but may distort the semantic information. On the other hand, highly non-linear NNs are proficient in condensing video semantics into a compact representation. The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved w.r.t. the coding procedure, which is spontaneously learned from unlabeled data in a self-supervised manner. The videos collaboratively decoded from two streams (codec and NN) are of rich semantics, as well as visually photo-realistic, empirically boosting several mainstream downstream video analysis task performances without any post-adaptation procedure. Furthermore, by introducing the attention mechanism and adaptive modeling scheme, the video semantic modeling ability of our approach is further enhanced. Finally, we build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach. All codes, data, and models will be open-sourced for facilitating future research.

Abstract:
We investigate the explainability of graph neural networks (GNNs) as a step toward elucidating their working mechanisms. While most current methods focus on explaining graph nodes, edges, or features, we argue that, as the inherent functional mechanism of GNNs, message flows are more natural for performing explainability. To this end, we propose a novel method here, known as FlowX, to explain GNNs by identifying important message flows. To quantify the importance of flows, we propose to follow the philosophy of Shapley values from cooperative game theory. To tackle the complexity of computing all coalitions’ marginal contributions, we propose a flow sampling scheme to compute Shapley value approximations as initial assessments of further training. We then propose an information-controlled learning algorithm to train flow scores toward diverse explanation targets: necessary or sufficient explanations. Experimental studies on both synthetic and real-world datasets demonstrate that our proposed FlowX and its variants lead to improved explainability of GNNs.
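
The idea of approximating Shapley values by sampling, rather than enumerating every coalition, can be sketched generically. The snippet below is a standard Monte Carlo permutation estimator over abstract "players"; it is not the flow sampling scheme of FlowX itself, and the names `shapley_sampling` and `value_fn` as well as the toy game are hypothetical.

```python
import random

def shapley_sampling(players, value_fn, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings.

    value_fn maps a frozenset of players to a real-valued payoff.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}

# Toy game: the payoff is 1 only when both "a" and "b" are present.
def toy_value(coalition):
    return 1.0 if {"a", "b"} <= coalition else 0.0

estimates = shapley_sampling(["a", "b", "c"], toy_value)
```

In this toy game the estimates for "a" and "b" approach 0.5 as more permutations are sampled, while "c" contributes nothing in any ordering. In an explainer, the players would be the units being scored and the payoff would be derived from the model's output.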

Abstract:
Deepfake videos are now widely spread over the Internet, which severely undermines public trust and social security. Although increasingly reliable detectors have recently emerged to counter this new tampering technique, some challenging issues remain: most Deepfake video detectors built on supervised learning require a large number of accurately labeled samples for training. When the number of correctly labeled training samples is insufficient, or the training data are maliciously poisoned by adversaries, the supervised classifier is likely to be unreliable for detection. To tackle this issue, we propose a fully unsupervised Deepfake detector. In particular, throughout training and testing, we have no access to the true labels of the samples. First, we design a novel pseudo-label generator for labeling the training samples, where traditional hand-crafted features are used to characterize both types of samples. Second, the training samples with pseudo-labels are fed into the proposed enhanced contrastive learner, in which discriminative features are further extracted and iteratively refined under the guidance of the contrastive loss. Last, relying on the inter-frame correlation, we complete the final binary classification between real and fake videos. Extensive experiments empirically verify the effectiveness of the proposed unsupervised Deepfake detector on benchmark datasets including FF++, Celeb-DF, DFD, DFDC, and UADFV. Furthermore, the proposed detector is superior to the current unsupervised method and comparable to the baseline supervised methods. More importantly, when the labeled data are poisoned by malicious adversaries or the training data are insufficient, the proposed unsupervised Deepfake detector shows a clear advantage.

Abstract:
Scene-dependent adaptive compressive sensing (CS) has been a long-pursued goal with great potential to significantly improve the performance of CS. However, with no access to the ground truth, how to design a scene-dependent adaptive strategy remains an open problem. In this paper, a restricted isometry property (RIP) condition-based error-clamping is proposed, which can directly predict the reconstruction error, i.e., the difference between the current-stage reconstructed image and the ground truth image, and adaptively allocate more samples to regions with larger reconstruction error at the next sampling stage. Furthermore, we propose a CS reconstruction network composed of Progressively inverse transform and Alternating Bi-directional Multi-grid Network, named PiABM-Net, that can efficiently utilize multi-scale information for reconstructing the target image. The effectiveness of the proposed adaptive and cascaded CS method is demonstrated with extensive quantitative and qualitative experiments, compared with the state-of-the-art CS algorithms.

Abstract:
Since acquiring perfect supervision is usually difficult, real-world machine learning tasks often confront inaccurate, incomplete, or inexact supervision, collectively referred to as weak supervision. In this work, we present WSAUC, a unified framework for weakly supervised AUC optimization problems, which covers noisy label learning, positive-unlabeled learning, multi-instance learning, and semi-supervised learning scenarios. Within the WSAUC framework, we first frame the AUC optimization problems in various weakly supervised scenarios as a common formulation of minimizing the AUC risk on contaminated sets, and demonstrate that the empirical risk minimization problems are consistent with the true AUC. Then, we introduce a new type of partial AUC, specifically, the reversed partial AUC (rpAUC), which serves as a robust training objective for AUC maximization in the presence of contaminated labels. WSAUC offers a universal solution for AUC optimization in various weakly supervised scenarios by maximizing the empirical rpAUC. Theoretical and experimental results under multiple settings support the effectiveness of WSAUC on a range of weakly supervised AUC optimization tasks.
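
For readers less familiar with the quantity being optimized, the empirical AUC and the corresponding 0-1 AUC risk can be written as a pairwise comparison between positive and negative scores. This is a generic sketch of the standard definition only; it does not implement the reversed partial AUC (rpAUC) or any part of WSAUC, and the function names are hypothetical.

```python
def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def auc_risk(scores_pos, scores_neg):
    """Pairwise 0-1 ranking risk, i.e. one minus the empirical AUC."""
    return 1.0 - empirical_auc(scores_pos, scores_neg)

# Three of the four pairs are ordered correctly, so the AUC is 0.75.
auc = empirical_auc([0.9, 0.3], [0.5, 0.1])
```

In a weakly supervised setting the two score lists would come from contaminated sets (for example, unlabeled data standing in for negatives), which is the situation the abstract's common formulation addresses.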

Abstract:
Many machine learning algorithms are known to be fragile on simple instance-independent noisy labels. However, noisy labels in real-world data are more devastating since they are produced by more complicated mechanisms in an instance-dependent manner. In this paper, we target this practical challenge of Instance-Dependent Noisy Labels by jointly training (1) a model reversely engineering the noise generating mechanism, which produces an instance-dependent mapping between the clean label posterior and the observed noisy label and (2) a robust classifier that produces clean label posteriors. Compared to previous methods, the former model is novel and enables end-to-end learning of the latter directly from noisy labels. An extensive empirical study indicates that the time-consistency of data is critical to the success of training both models and motivates us to develop a curriculum selecting training data based on their dynamics on the two models’ outputs over the course of training. We show that the curriculum-selected data provide both clean labels and high-quality input-output pairs for training the two models. Therefore, it leads to promising and robust classification performance even in notably challenging settings of instance-dependent noisy labels where many SoTA methods could easily fail. Extensive experimental comparisons and ablation studies further demonstrate the advantages and significance of the time-consistency curriculum in learning from instance-dependent noisy labels on multiple benchmark datasets.

Abstract:
Data association is at the core of many computer vision tasks, e.g., multiple object tracking, image matching, and point cloud registration. However, current data association solutions have some defects: they mostly ignore the intra-view context information; besides, they either train deep association models in an end-to-end way and hardly utilize the advantage of optimization-based assignment methods, or only use an off-the-shelf neural network to extract features. In this paper, we propose a general learnable graph matching method to address these issues. Specifically, we model the intra-view relationships as an undirected graph. Then data association turns into a general graph matching problem between graphs. Furthermore, to make optimization end-to-end differentiable, we relax the original graph matching problem into continuous quadratic programming and then incorporate training into a deep graph neural network with KKT conditions and the implicit function theorem. In the MOT task, our method achieves state-of-the-art performance on several MOT datasets. For image matching, our method outperforms state-of-the-art methods on a popular indoor dataset, ScanNet. For point cloud registration, we also achieve competitive results.

Abstract:
Supervised person re-identification (Re-ID) approaches are sensitive to label-corrupted data, which is inevitable and generally ignored in the field of person Re-ID. In this paper, we propose a two-stage noise-tolerant paradigm (TSNT) for label-corrupted person Re-ID. Specifically, at stage one, we present a self-refining strategy to separately train each network in TSNT by concentrating more on pure samples. These pure samples are progressively refurbished via mining the consistency between annotations and predictions. To enhance the tolerance of TSNT to noisy labels, at stage two, we employ a co-training strategy to collaboratively supervise the learning of the two networks. Concretely, a rectified cross-entropy loss is proposed to learn the mutual information from the peer network by assigning large weights to the refurbished reliable samples. Moreover, a noise-robust triplet loss is formulated for further improving the robustness of TSNT by increasing inter-class distances and reducing intra-class distances in the label-corrupted dataset, where a constraint condition for reliability discrimination is carefully designed to select reliable triplets. Extensive experiments demonstrate the superiority of TSNT. For instance, on the Market1501 dataset, our paradigm achieves 90.3% rank-1 accuracy (6.2% improvement over the state-of-the-art method) under a noise ratio of 20%.

Abstract:
Structure from Motion (SfM) is a fundamental computer vision problem which has not been well handled by deep learning. One of the promising solutions is to apply explicit structural constraint, e.g., 3D cost volume, into the neural network. Obtaining accurate camera poses from images alone can be challenging, especially with complicated environmental factors. Existing methods usually assume accurate camera poses from GT or other methods, which is unrealistic in practice and additional sensors are needed. In this work, we design a physical driven architecture, namely DeepSFM, inspired by traditional Bundle Adjustment, which consists of two cost volume based architectures to iteratively refine depth and pose. The explicit constraints on both depth and pose, when combined with the learning components, bring merit from both traditional BA and emerging deep learning technology. To speed up the learning and inference efficiency, we apply the Gated Recurrent Units (GRUs)-based depth and pose update modules with coarse to fine cost volumes on the iterative refinements. In addition, with the extended residual depth prediction module, our model can be adapted to dynamic scenes effectively. Extensive experiments on various datasets show that our model achieves state-of-the-art performance with superior robustness against challenging inputs.

Abstract:
Deep neural networks are very successful on many vision tasks, but hard to interpret due to their black box nature. To overcome this, various post-hoc attribution methods have been proposed to identify image regions most influential to the models’ decisions. Evaluating such methods is challenging since no ground truth attributions exist. We thus propose three novel evaluation schemes to more reliably measure the faithfulness of those methods, to make comparisons between them more fair, and to make visual inspection more systematic. To address faithfulness, we propose a novel evaluation setting (DiFull) in which we carefully control which parts of the input can influence the output in order to distinguish possible from impossible attributions. To address fairness, we note that different methods are applied at different layers, which skews any comparison, and so evaluate all methods on the same layers (ML-Att) and discuss how this impacts their performance on quantitative metrics. For more systematic visualizations, we propose a scheme (AggAtt) to qualitatively evaluate the methods on complete datasets. We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models. Finally, we propose a post-processing smoothing step that significantly improves the performance of some attribution methods, and discuss its applicability.

Abstract:
In this paper, we propose FGPR: a Federated Gaussian process ($\mathcal{GP}$) regression framework that uses an averaging strategy for model aggregation and stochastic gradient descent for local computations. Notably, the resulting global model excels in personalization as FGPR jointly learns a shared prior across all devices. The predictive posterior is then obtained by exploiting this shared prior and conditioning on local data, which encodes personalized features from a specific dataset. Theoretically, we show that FGPR converges to a critical point of the full log-marginal likelihood function, subject to statistical errors. This result offers standalone value as it brings federated learning theoretical results to correlated paradigms. Through extensive case studies on several regression tasks, we show that FGPR excels in a wide range of applications and is a promising approach for privacy-preserving multi-fidelity data modeling.
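
The combination of local gradient steps and server-side averaging mentioned in this abstract follows the general federated averaging pattern, which can be sketched as below. This is a toy illustration only: a simple quadratic loss stands in for the negative log-marginal likelihood of a Gaussian process, the gradients are exact rather than stochastic, and weighting by device data size is an assumption. All function names are hypothetical.

```python
def local_sgd(theta, grad_fn, steps=5, lr=0.1):
    """Run a few gradient steps on one device, starting from the global model."""
    theta = list(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def federated_round(theta_global, device_grad_fns, device_sizes, steps=5, lr=0.1):
    """One communication round: local updates, then a size-weighted average."""
    total = sum(device_sizes)
    local_models = [local_sgd(theta_global, g, steps, lr) for g in device_grad_fns]
    return [
        sum(n * model[i] for n, model in zip(device_sizes, local_models)) / total
        for i in range(len(theta_global))
    ]

def make_quadratic_grad(center):
    """Gradient of the toy loss 0.5 * (theta - center)^2 (stand-in objective)."""
    return lambda theta: [t - c for t, c in zip(theta, center)]

# Two devices with different local optima and different data sizes.
grads = [make_quadratic_grad([0.0]), make_quadratic_grad([4.0])]
sizes = [1, 3]
theta = [10.0]
for _ in range(50):
    theta = federated_round(theta, grads, sizes)
```

With this toy objective the shared parameter settles at the size-weighted average of the local optima. In a GP setting, `theta` would instead hold kernel and noise hyperparameters, and each device's gradient would be computed from mini-batches of its own data.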

Abstract:
Lossy image compression is a fundamental technology in media transmission and storage. Variable-rate approaches have recently gained much attention to avoid the usage of a set of different models for compressing images at different rates. During the media sharing, multiple re-encodings with different rates would be inevitably executed. However, existing Variational Autoencoder (VAE)-based approaches would be readily corrupted in such circumstances, resulting in the occurrence of strong artifacts and the destruction of image fidelity. Based on the theoretical findings of preserving image fidelity via invertible transformation, we aim to tackle the issue of high-fidelity fine variable-rate image compression and thus propose the Invertible Continuous Codec (I2C). We implement the I2C in a mathematical invertible manner with the core Invertible Activation Transformation (IAT) module. I2C is constructed upon a single-rate Invertible Neural Network (INN) based model and the quality level (QLevel) would be fed into the IAT to generate scaling and bias tensors. Extensive experiments demonstrate that the proposed I2C method outperforms state-of-the-art variable-rate image compression methods by a large margin, especially after multiple continuous re-encodings with different rates, while having the ability to obtain a very fine variable-rate control without any performance compromise.

Abstract:
Detection of human body and its parts has been intensively studied. However, most of CNNs-based detectors are trained independently, making it difficult to associate detected parts with body. In this paper, we focus on the joint detection of human body and its parts. Specifically, we propose a novel extended object representation integrating center-offsets of body parts, and construct an end-to-end generic Body-Part Joint Detector (BPJDet). In this way, body-part associations are neatly embedded in a unified representation containing both semantic and geometric contents. Therefore, we can optimize multi-loss to tackle multi-tasks synergistically. Moreover, this representation is suitable for anchor-based and anchor-free detectors. BPJDet does not suffer from error-prone post matching, and keeps a better trade-off between speed and accuracy. Furthermore, BPJDet can be generalized to detect body-part or body-parts of either human or quadruped animals. To verify the superiority of BPJDet, we conduct experiments on datasets of body-part (CityPersons, CrowdHuman and BodyHands) and body-parts (COCOHumanParts and Animals5C). While keeping high detection accuracy, BPJDet achieves state-of-the-art association performance on all datasets. Besides, we show benefits of advanced body-part association capability by improving performance of two representative downstream applications: accurate crowd head detection and hand contact estimation.

Affiliations: School of Instrument Science and Engineering, the State Key Laboratory of Digital Medical Engineering, the School of Biological Science and Medical Engineering, Southeast University, Nanjing, China; State Key Laboratory of Digital Medical Engineering, School of Instrument Science and Engineering, Southeast University, Nanjing, China; Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi, Kuala Lumpur, Malaysia; School of Computer Science and Engineering, Southeast University, Nanjing, China; Department of Engineering Science, University of Oxford, Oxford, U.K.

Abstract:
Noisy labels are often encountered in datasets, but learning with them is challenging. Although natural discrepancies between clean and mislabeled samples in a noisy category exist, most techniques in this field still gather them indiscriminately, which leaves them only partially robust. In this paper, we reveal both empirically and theoretically that the learning robustness can be improved by assuming deep features with the same labels follow a student distribution, resulting in a more intuitive method called student loss. By embedding the student distribution and exploiting the sharpness of its curve, our method is naturally data-selective and can offer extra strength to resist mislabeled samples. This ability makes clean samples aggregate tightly in the center, while mislabeled samples scatter, even if they share the same label. Additionally, we employ the metric learning strategy and develop a large-margin student (LT) loss for better capability. It should be noted that our approach is the first work that adopts the prior probability assumption in feature representation to decrease the contributions of mislabeled samples. This strategy can enhance various losses to join the student loss family, even if they are already robust losses. Experiments demonstrate that our approach is more effective under inaccurate supervision. Enhanced LT losses significantly outperform various state-of-the-art methods in most cases. Improvements of over 50% can even be obtained under some conditions.

Abstract:
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transformations in DNNs by our novel B-cos transformation. As we show, a sequence (network) of such transformations induces a single linear transformation that faithfully summarises the full model computations. Moreover, the B-cos transformation is designed such that the weights align with relevant signals during optimisation. As a result, those induced linear transformations become highly interpretable and highlight task-relevant features. Importantly, the B-cos transformation is designed to be compatible with existing architectures and we show that it can easily be integrated into virtually all of the latest state of the art models for computer vision—e.g. ResNets, DenseNets, ConvNext models, as well as Vision Transformers—by combining the B-cos-based explanations with normalisation and attention layers, all whilst maintaining similar accuracy on ImageNet. Finally, we show that the resulting explanations are of high visual quality and perform well under quantitative interpretability metrics.
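
The abstract does not spell out the B-cos transformation itself. As a hedged illustration, the sketch below uses the form usually attributed to it, in which the linear response of a unit with normalized weight is scaled by |cos(x, w)|^(B-1). This formula is recalled from the broader B-cos literature rather than taken from the text above, so treat it as an assumption; the function name `bcos` is hypothetical.

```python
import math

def bcos(x, w, b=2.0):
    """Assumed B-cos response of one unit: |cos(x, w)|^(b-1) * (w_hat . x)."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    if x_norm == 0.0:
        return 0.0
    linear = sum(wi * xi for wi, xi in zip(w, x)) / w_norm  # w_hat . x
    cos = linear / x_norm
    return abs(cos) ** (b - 1.0) * linear

aligned = bcos([3.0, 0.0], [1.0, 0.0])        # input parallel to the weight
oblique = bcos([3.0, 4.0], [1.0, 0.0])        # cosine similarity 0.6
plain = bcos([3.0, 4.0], [1.0, 0.0], b=1.0)   # b = 1 reduces to a linear unit
```

With b greater than 1, poorly aligned inputs are down-weighted relative to a plain linear unit, which is the alignment pressure the abstract refers to.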

Abstract:
Partial-label learning (PLL) utilizes instances with PLs, where a PL includes several candidate labels but only one is the true label (TL). In PLL, the identification-based strategy (IBS) purifies each PL on the fly to select the (most likely) TL for training; the average-based strategy (ABS) treats all candidate labels equally for training and lets trained models predict the TL. Although PLL research has focused on IBS for better performance, ABS is also worthy of study since modern IBS behaves like ABS at the beginning of training to prepare for PL purification and TL selection. In this paper, we analyze why ABS was unsatisfactory and propose how to improve it. Theoretically, we propose two problem settings of PLL and prove that average PL losses (APLLs) with bounded multi-class losses are always robust, while APLLs with unbounded losses may be non-robust, which is the first robustness analysis for PLL. Experimentally, we have two promising findings: ABS using bounded losses can match/exceed state-of-the-art performance of IBS using unbounded losses; after using robust APLLs to warm start, IBS can further improve upon itself. Our work draws attention to ABS research, which can in turn boost IBS and push forward the whole PLL.
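
To make the contrast between bounded and unbounded losses concrete, the sketch below averages a base loss over the candidate labels of one instance. Using mean absolute error as the bounded loss and cross-entropy as the unbounded one is an illustrative assumption; the abstract does not name specific losses.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mae_loss(probs, label):
    # L1 distance to the one-hot target; always lies in [0, 2].
    return 2.0 * (1.0 - probs[label])

def ce_loss(probs, label):
    # Cross-entropy; grows without bound as probs[label] approaches 0.
    return -math.log(probs[label])

def average_pl_loss(logits, candidates, base_loss):
    # Average-based strategy: every candidate label is treated equally.
    probs = softmax(logits)
    return sum(base_loss(probs, y) for y in candidates) / len(candidates)

logits = [8.0, 0.0, -8.0]   # the model is confident in class 0
candidates = [0, 2]         # class 2 is a false-positive candidate
apl_mae = average_pl_loss(logits, candidates, mae_loss)
apl_ce = average_pl_loss(logits, candidates, ce_loss)
```

Here the false candidate can contribute at most 2 to the bounded average, whereas its cross-entropy term dominates the unbounded average.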

Abstract:
Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariant to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.

Abstract:
When given a group of relevant images for co-salient object detection (Co-SOD), humans first summarize consensus cues from the whole group and then search for co-salient objects in each image. Most previous methods do not consider robustness, scalability, or stability in the summarization stage and adopt a simple fusion strategy to fuse consensus and image features in the searching stage. Our work presents a novel consensus-aware dynamic convolution (CADC) model directly from the “summarize and search” perspective to explicitly and effectively perform Co-SOD. For the summarization stage, we extract robust individual image features by a pooling method and integrate them to generate consensus features via self-attention, thus modeling the scalability and stability. Then, we simultaneously learn two types of consensus-aware dynamic kernels, i.e., a common kernel to capture group-wise common knowledge and adaptive kernels to mine image-specific consensus cues. For the second stage, we adopt dynamic convolution to perform object searching. A novel data synthesis strategy is also developed for model training. Although CADC has obtained competitive performance, we argue that incrementally learning dynamic kernels and representations is more intuitive and natural instead of using a simultaneous scheme, thus presenting our CADC++, an extension of CADC. Concretely, we first adopt the common kernel based dynamic convolution to capture coarse common cues as priors and then use the adaptive kernel based dynamic convolution for mining image-specific details. We also propose a recursive guidance strategy to further explore deep interactions among the two kinds of kernels and image features. Besides, we annotate several challenging attributes for Co-SOD datasets and perform attribute-based evaluation and robustness analysis to promote thorough model evaluation for the Co-SOD field. Extensive experimental results on four benchmark datasets verify both the effectiveness and robustness of our proposed method.

Abstract:
For multi-modal image processing, network interpretability is essential due to the complicated dependency across modalities. Recently, a promising research direction for interpretable networks is to incorporate dictionary learning into deep learning through an unfolding strategy. However, the existing multi-modal dictionary learning models are both single-layer and single-scale, which restricts the representation ability. In this paper, we first introduce a multi-scale multi-modal convolutional dictionary learning (M²CDL) model, which is performed in a multi-layer strategy, to associate different image modalities in a coarse-to-fine manner. Then, we propose a unified framework, namely DeepM²CDL, derived from the M²CDL model for both multi-modal image restoration (MIR) and multi-modal image fusion (MIF) tasks. The network architecture of DeepM²CDL fully matches the optimization steps of the M²CDL model, which gives each network module good interpretability. Different from handcrafted priors, both the dictionary and sparse feature priors are learned through the network. The performance of the proposed DeepM²CDL is evaluated on a wide variety of MIR and MIF tasks, which shows its superiority over many state-of-the-art methods both quantitatively and qualitatively. In addition, we also visualize the multi-modal sparse features and dictionary filters learned from the network, which demonstrates the good interpretability of the DeepM²CDL network.

Abstract:
The remarkable performance of deep Convolutional neural networks (CNNs) is generally attributed to their deeper and wider architectures, which can come with significant computational costs. Pruning neural networks has thus gained interest since it effectively lowers storage and computational costs. In contrast to weight pruning, which results in unstructured models, structured pruning provides the benefit of realistic acceleration by producing models that are friendly to hardware implementation. The special requirements of structured pruning have led to the discovery of numerous new challenges and the development of innovative solutions. This article surveys the recent progress towards structured pruning of deep CNNs. We summarize and compare the state-of-the-art structured pruning techniques with respect to filter ranking methods, regularization methods, dynamic execution, neural architecture search, the lottery ticket hypothesis, and the applications of pruning. While discussing structured pruning algorithms, we briefly introduce the unstructured pruning counterpart to emphasize their differences. Furthermore, we provide insights into potential research opportunities in the field of structured pruning. A curated list of neural network pruning papers can be found at: https://github.com/he-y/Awesome-Pruning. A dedicated website offering a more interactive comparison of structured pruning methods can be found at: https://huggingface.co/spaces/he-yang/Structured-Pruning-Survey.

Abstract:
Matching hand-drawn sketches with photos (a.k.a. sketch-photo recognition or re-identification) faces the information asymmetry challenge due to the abstract nature of the sketch modality. Existing works tend to learn shared embedding spaces with CNN models by discarding the appearance cues for photo images or introducing GAN for sketch-photo synthesis. The former unavoidably loses discriminability, while the latter contains ineffaceable generation noise. In this paper, we make the first attempt to design an information-aligned sketch transformer (SketchTrans++) via cross-modal disentangled prototype learning, since the transformer has shown great promise for discriminative visual modelling. Specifically, we design an asymmetric disentanglement scheme with a dynamically updatable auxiliary sketch (A-sketch) to align the modality representations without sacrificing information. The asymmetric disentanglement decomposes the photo representations into sketch-relevant and sketch-irrelevant cues, transferring sketch-irrelevant knowledge into the sketch modality to compensate for the missing information. Moreover, considering the feature discrepancy between the two modalities, we present a modality-aware prototype contrastive learning method that mines representative modality-sharing information using the modality-aware prototypes rather than the original feature representations. Extensive experiments on category- and instance-level sketch-based datasets validate the superiority of our proposed method under various metrics.

Abstract:
Various methods have been proposed to defend against adversarial attacks. However, their performance lacks sufficient theoretical guarantees, which leads to two problems: First, a deficiency of necessary adversarial training samples might attenuate the normal gradient's back-propagation, which potentially leads to overfitting and gradient masking. Second, point-wise adversarial sampling offers an insufficient support region for adversarial data and thus cannot form a robust decision boundary. To solve these issues, we provide a theoretical analysis to reveal the relationship between robust accuracy and the complexity of the training set in adversarial training. As a result, we propose a novel training scheme called Variational Adversarial Defense. Based on the distribution of adversarial samples, this novel construction upgrades the defense scheme from local point-wise to distribution-wise, yielding an enlarged support region for safeguarding robust training and thus holding greater promise for defending against attacks. The proposed method features the following advantages: 1) Instead of seeking adversarial examples point-by-point (in a sequential way), we draw diverse adversarial examples from the inferred distribution; and 2) Augmenting the training set by a larger support region consolidates the smoothness of the decision boundary. Finally, the proposed method is analyzed via the Taylor expansion technique, which gives our solution natural interpretability.

Abstract:
This paper presents a novel technique for the dense reconstruction of light fields (LFs) from sparse input views. Our approach leverages the Epipolar Focus Spectrum (EFS) representation, which models the LF in the transformed spatial-focus domain, avoiding the dependence on the scene depth and providing a high-quality basis for dense LF reconstruction. Previous EFS-based LF reconstruction methods learn the cross-view, occlusion, depth and shearing terms simultaneously, which makes the training difficult due to stability and convergence problems and further results in limited reconstruction performance for challenging scenarios. To address this issue, we conduct a theoretical study on the transformation between the EFSs derived from one LF with sparse and dense angular samplings, and propose that a dense EFS can be decomposed explicitly into a linear combination of the EFS of the sparse input, the sheared EFS, and a high-order occlusion term. The devised learning-based framework, which takes the under-sampled EFS and its sheared version as input, provides high-quality reconstruction results, especially in large disparity areas. Comprehensive experimental evaluations show that our approach outperforms state-of-the-art methods, with advantages of up to > 4 dB in reconstructing scenes containing thin structures.

Abstract:
While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers’ performance, i.e., the enlarging self-attention maps and amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. The CrossFormer incorporating PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks.
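
One way to picture the short-distance/long-distance split is as two different partitions of the token grid into attention groups. The sketch below assigns group ids under an assumed rule: adjacent g x g tokens share a short-distance group, and tokens sampled at a fixed interval share a long-distance group. The abstract does not specify the grouping rule, so both functions are hypothetical illustrations.

```python
def short_distance_groups(size, g):
    """Group id per token: each adjacent g x g window forms one group."""
    per_row = size // g
    return [[(r // g) * per_row + (c // g) for c in range(size)]
            for r in range(size)]

def long_distance_groups(size, interval):
    """Group id per token: tokens `interval` apart fall in the same group."""
    return [[(r % interval) * interval + (c % interval) for c in range(size)]
            for r in range(size)]

sda = short_distance_groups(size=4, g=2)
lda = long_distance_groups(size=4, interval=2)
```

In both partitions each token attends only within its own group of 4 tokens instead of all 16, which is where the reduced computational burden comes from, while the long-distance groups still span the whole grid.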

Abstract:
Partial label learning (PLL) is an important problem that allows each training example to be labeled with a coarse candidate set with the ground-truth label included. However, in a more practical but challenging scenario, the annotator may miss the ground-truth and provide a wrong candidate set, which is known as the noisy PLL problem. To remedy this problem, we propose the PiCO+ framework that simultaneously disambiguates the candidate sets and mitigates label noise. Core to PiCO+, we develop a novel label disambiguation algorithm PiCO that consists of a contrastive learning module along with a novel class prototype-based disambiguation method. Theoretically, we show that these two components are mutually beneficial, and can be rigorously justified from an expectation-maximization (EM) algorithm perspective. To handle label noise, we extend PiCO to PiCO+, which further performs distance-based clean sample selection, and learns robust classifiers by a semi-supervised contrastive learning algorithm. Beyond this, we further investigate the robustness of PiCO+ in the context of out-of-distribution noise and incorporate a novel energy-based rejection method for improved robustness. Extensive experiments demonstrate that our proposed methods significantly outperform the current state-of-the-art approaches in standard and noisy PLL tasks and even achieve comparable results to fully supervised learning.
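
The prototype-based disambiguation step can be sketched as two moving-average updates: class prototypes drift toward the embeddings assigned to them, and each example's soft target drifts toward the candidate class whose prototype is closest. The momentum values, the dot-product similarity, and the function names below are assumptions for illustration, not details given in the abstract.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def update_prototype(proto, embedding, momentum=0.9):
    # Moving average of embeddings, re-projected onto the unit sphere.
    mixed = [momentum * p + (1.0 - momentum) * e
             for p, e in zip(proto, embedding)]
    return normalize(mixed)

def update_pseudo_target(target, embedding, prototypes, candidates,
                         momentum=0.9):
    # Pick the nearest prototype among candidate labels only.
    sims = {c: sum(p * e for p, e in zip(prototypes[c], embedding))
            for c in candidates}
    best = max(sims, key=sims.get)
    return [momentum * t + (1.0 - momentum) * (1.0 if k == best else 0.0)
            for k, t in enumerate(target)]

prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embedding = normalize([0.9, 0.1])
candidates = [0, 1]             # class 2 is not a candidate
target = [0.5, 0.5, 0.0]        # uniform over the candidate set
new_target = update_pseudo_target(target, embedding, prototypes, candidates)
new_proto = update_prototype(prototypes[0], embedding)
```

Repeating the update gradually concentrates the target mass on one candidate while non-candidate classes stay at zero.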

Abstract:
Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for targets that have the same category as others. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, the context disentangling disentangles the referent and context features, which achieves better discrimination between them. The prototype inheriting inherits the prototypes discovered from the disentangled visual features by a prototype bank to fully utilize the seen data, especially for the open-vocabulary scene. The fused features, obtained by applying the Hadamard product to disentangled linguistic and visual features of prototypes to avoid sharply adjusting the importance between the two types of features, are then attached with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios.

Abstract:
A noisy training set usually leads to the degradation of the generalization and robustness of neural networks. In this article, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method, to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may no longer be satisfied, and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which provably controls the False-Selection-Rate (FSR) in the selected clean data. To improve the efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel to make the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR.

Abstract:
Visual-language navigation (VLN) is a challenging task that requires embodied agents to follow natural language instructions to navigate in previously unseen environments. However, the existing literature puts most emphasis on interpreting instructions into actions, only delivering “dumb” wayfinding agents which cannot actively use natural language to communicate with humans. In this article, we devise Lana, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with a single model. More specifically, two encoders, respectively for route and language encoding, are built and shared by two decoders, respectively for action prediction and instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We further extend Lana by exploiting object semantics during route encoding. This leads to Lana+, a more powerful framework that simulates the way humans refer to landmarks for instruction composition and wayfinding. We empirically verify that, compared with recent advanced task-specific solutions, Lana attains better performance on both instruction following and generation, with nearly half the complexity. In addition, endowed with language generation capability, Lana can explain its behaviors to humans and assist human wayfinding. Benefiting from landmark information, Lana+ exhibits even more impressive performance. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots.

Abstract:
Neural radiance fields (NeRF) have shown great success in novel view synthesis. However, recovering high-quality details from real-world scenes is still challenging for the existing NeRF-based approaches, due to potentially imperfect calibration information and scene representation inaccuracy. Even with high-quality training frames, the synthetic novel views produced by NeRF models still suffer from notable rendering artifacts, such as noise and blur. To address this, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm that learns a degradation-driven inter-viewpoint mixer. Specifically, we design a NeRF-style degradation modeling approach and construct large-scale training data, enabling deep neural networks to effectively remove NeRF-native rendering artifacts. Moreover, beyond the degradation removal, we propose an inter-viewpoint aggregation framework that fuses highly related high-quality training images, pushing the performance of cutting-edge NeRF models to entirely new levels and producing highly photo-realistic synthetic views. Based on this paradigm, we further present NeRFLiX++ with a stronger two-stage NeRF degradation simulator and a faster inter-viewpoint mixer, achieving superior performance with significantly improved computational efficiency. Notably, NeRFLiX++ is capable of restoring photo-realistic ultra-high-resolution outputs from noisy low-resolution NeRF-rendered views. Extensive experiments demonstrate the excellent restoration ability of NeRFLiX++ on various novel view synthesis benchmarks.

Abstract:
The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image with its corresponding modification text. Existing efforts face two key limitations: 1) ignoring the multiple query-target matching factors; 2) ignoring the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these two limitations is non-trivial due to the following challenges: 1) how to effectively model the multiple matching factors in a latent way without direct supervision signals; 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, in this work, we first propose a CLIP-Transformer based muLtI-factor Matching Network (LIMN), which consists of three key modules: disentanglement-based latent factor tokens mining, dual aggregation-based matching token learning, and dual query-target matching modeling. Thereafter, we design an iterative dual self-training paradigm to further enhance the performance of LIMN by fully utilizing the potential unlabeled reference-target image pairs in a weakly-supervised manner. Specifically, we denote the iterative dual self-training paradigm enhanced LIMN as LIMN+. Extensive experiments on four datasets, including FashionIQ, Shoes, CIRR, and Fashion200K, show that our proposed LIMN and LIMN+ significantly surpass the state-of-the-art baselines.

Abstract:
Facial editing aims to manipulate the facial attributes of a given face image. Nowadays, with the development of generative models, users can easily generate 2D and 3D facial images with high fidelity and 3D-aware consistency. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face to a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual “semantic field” in the GAN latent space. 1) Unlike previous works that regard the editing as traversing straight lines in the latent space, here the fine-grained editing is formulated as finding a curving trajectory that respects the fine-grained attribute landscape on the semantic field. 2) The curvature at each step is location-specific and determined by the input image as well as the users’ language requests. 3) To engage the users in a meaningful dialog, our system generates language feedback by considering both the user request and the current state of the semantic field. We demonstrate the effectiveness of our proposed framework on both 2D and 3D-aware generative models. We term the semantic field for the 3D-aware models as “tri-plane” flow, as it corresponds to the changes not only in the color space but also in the density space. We also contribute CelebA-Dialog, a visual-language facial editing dataset to facilitate large-scale study. Specifically, each image has manually annotated fine-grained attribute annotations as well as template-based textual descriptions in natural language. Extensive quantitative and qualitative experiments demonstrate the superiority of our framework in terms of 1) the smoothness of fine-grained editing, 2) the identity/attribute preservation, and 3) the visual photorealism and dialog fluency. Notably, the user study validates that our overall system is consistently favored by around 80% of the participants.

Abstract:
The cross-model transferability of adversarial examples makes black-box attacks practical. However, it typically requires access to inputs of the same modality as the black-box models to attain reliable transferability. Unfortunately, the collection of datasets may be difficult in security-critical scenarios. Hence, developing cross-modal attacks for fooling models with different modalities of inputs would pose a serious threat to real-world DNN applications. The above considerations motivate us to investigate the cross-modal transferability of adversarial examples. In particular, we aim to generate video adversarial examples from white-box image models to attack video CNN and ViT models. We introduce the Image To Video (I2V) attack based on the observation that image and video models share similar low-level features. For each video frame, I2V optimizes perturbations by reducing the similarity of intermediate features between benign and adversarial frames on image models. Then I2V combines adversarial frames together to generate video adversarial examples. I2V can be easily extended to simultaneously perturb multi-layer features extracted from an ensemble of image models. To efficiently integrate various features, we introduce an adaptive approach to re-weight the contributions of each layer based on its cosine similarity values from the previous attack step. Experimental results demonstrate the effectiveness of the proposed method.
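
The per-frame objective described above, lowering the similarity between benign and adversarial intermediate features, can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the one-layer ReLU "image model" with fixed weights, the three-pixel frames, and the exhaustive search over the corners of the perturbation box, which stands in for the gradient-based optimizer a real attack would use.

```python
import itertools
import math

def features(frame, weights):
    # Toy intermediate features: one linear layer followed by ReLU.
    return [max(0.0, sum(w * x for w, x in zip(row, frame)))
            for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def attack_frame(frame, weights, eps):
    # Brute-force substitute for gradient steps: try every corner of the
    # L-infinity ball and keep the one with the lowest feature similarity.
    clean = features(frame, weights)
    best_frame, best_sim = list(frame), 1.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(frame)):
        candidate = [x + eps * s for x, s in zip(frame, signs)]
        sim = cosine(features(candidate, weights), clean)
        if sim < best_sim:
            best_frame, best_sim = candidate, sim
    return best_frame, best_sim

WEIGHTS = [[1.0, -1.0, 0.5], [0.3, 0.8, -0.2], [-0.5, 0.4, 1.0]]
video = [[0.5, 0.2, 0.8], [0.4, 0.3, 0.7]]   # two tiny "frames"
results = [attack_frame(f, WEIGHTS, eps=0.1) for f in video]
adv_video = [r[0] for r in results]           # frames recombined into a video
similarities = [r[1] for r in results]
```

Each frame is perturbed independently against the image model, and the adversarial frames are then stacked back into a video, mirroring the two steps in the abstract.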

Abstract:
Successful point cloud registration relies on accurate correspondences established upon powerful descriptors. However, existing neural descriptors either leverage a rotation-variant backbone whose performance declines under large rotations, or encode local geometry that is less distinctive. To address this issue, we introduce RIGA to learn descriptors that are Rotation-Invariant by design and Globally-Aware. From the Point Pair Features (PPFs) of sparse local regions, rotation-invariant local geometry is encoded into geometric descriptors. Global awareness of 3D structures and geometric context is subsequently incorporated, both in a rotation-invariant fashion. More specifically, 3D structures of the whole frame are first represented by our global PPF signatures, from which structural descriptors are learned to help geometric descriptors sense the 3D world beyond local regions. Geometric context from the whole scene is then globally aggregated into descriptors. Finally, the description of sparse regions is interpolated to dense point descriptors, from which correspondences are extracted for registration. To validate our approach, we conduct extensive experiments on both object- and scene-level data. With large rotations, RIGA surpasses the state-of-the-art methods by a margin of 8° in terms of the Relative Rotation Error on ModelNet40 and improves the Feature Matching Recall by at least 5 percentage points on 3DLoMatch.
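
Point Pair Features are rotation-invariant because they are built only from distances and angles. The sketch below uses the common four-component definition (distance between the points, the two normal-to-offset angles, and the angle between the normals) and checks numerically that rotating the pair leaves the feature unchanged. The abstract does not state which PPF variant RIGA uses, so this standard form is an assumption.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def angle(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def ppf(p1, n1, p2, n2):
    d = sub(p2, p1)
    dist = math.sqrt(sum(x * x for x in d))
    return (dist, angle(n1, d), angle(n2, d), angle(n1, n2))

def rotate_z(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

p1, n1 = [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]
p2, n2 = [1.0, 2.0, 0.5], [0.0, 1.0, 0.0]
before = ppf(p1, n1, p2, n2)
theta = 1.234  # arbitrary rotation angle in radians
after = ppf(rotate_z(p1, theta), rotate_z(n1, theta),
            rotate_z(p2, theta), rotate_z(n2, theta))
max_diff = max(abs(a - b) for a, b in zip(before, after))
```

The same check passes for any rigid rotation, since rotations preserve both vector lengths and the angles between vectors.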

Abstract:
Nondestructive detection methods, based on vibrational spectroscopy, are vitally important in a wide range of applications including industrial chemistry, pharmacy and national defense. Recently, deep learning has been introduced into vibrational spectroscopy showing great potential. Different from images, text, etc. that offer large labeled data sets, vibrational spectroscopic data is very limited, which requires novel concepts beyond transfer and meta learning. To tackle this, we propose a task-enhanced augmentation network (TeaNet). The key component of TeaNet is a reconstruction module that inputs randomly masked spectra and outputs reconstructed samples that are similar to the original ones, but include additional variations learned from the domain. These augmented samples are used to train the classification model. The reconstruction and prediction parts are trained simultaneously, end-to-end with back-propagation. Results on both synthetic and real-world datasets verified the superiority of the proposed method. In the most difficult synthetic scenarios TeaNet outperformed CNN by 17%. We visualized and analysed the neuron responses of TeaNet and CNN, and found that TeaNet's ability to identify discriminant wavenumbers was excellent compared to CNN. Our approach is general and can be easily adapted to other domains, offering a solution to more accurate and interpretable few-shot learning.
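
The input side of the reconstruction module, a randomly masked spectrum, can be sketched in a few lines. The masking ratio, zero as the mask value, and the function name `random_mask` are illustrative assumptions; the abstract only says that spectra are randomly masked before reconstruction.

```python
import random

def random_mask(spectrum, ratio, rng):
    """Zero out a random subset of wavenumber bins; return the masked copy
    together with the sorted indices that were masked."""
    n_mask = int(round(ratio * len(spectrum)))
    masked_idx = set(rng.sample(range(len(spectrum)), n_mask))
    masked = [0.0 if i in masked_idx else v for i, v in enumerate(spectrum)]
    return masked, sorted(masked_idx)

rng = random.Random(0)  # fixed seed so the example is repeatable
spectrum = [0.2, 0.9, 1.4, 0.7, 0.3, 0.1, 0.6, 1.1, 0.8, 0.4]
masked, idx = random_mask(spectrum, ratio=0.3, rng=rng)
```

A reconstruction network would then be trained to map `masked` back toward `spectrum`, and its outputs would serve as augmented training samples for the classifier.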

Abstract:
Rolling shutter temporal super-resolution (RSSR), which aims to synthesize intermediate global shutter (GS) video frames between two consecutive rolling shutter (RS) frames, has made remarkable progress with the development of deep convolutional neural networks over the past years. Existing methods cascade multiple separated networks to sequentially estimate intermediate motion fields and synthesize target GS frames. Nevertheless, they are typically complex, do not facilitate the interaction of complementary motion and appearance information, and suffer from problems such as pixel aliasing or poor interpretation. In this paper, we derive the uniform bilateral motion fields for RS-aware backward warping, which endows our network with a more explicit geometric meaning by injecting spatio-temporal consistency information through time-offset embedding. More importantly, we develop a unified, single-stage RSSR pipeline to recover the latent GS video in a coarse-to-fine manner. It first extracts pyramid features from the given inputs, and then refines the bilateral motion fields together with the anchor frame until the desired output is generated. With the help of our proposed bilateral cost volume, which uses the anchor frame as a common reference to model the correlation with two RS frames, the gradually refined anchor frames not only facilitate intermediate motion estimation, but also compensate for contextual details, making additional frame synthesis or refinement networks unnecessary. Meanwhile, an asymmetric bilateral motion model built on top of the symmetric bilateral motion model further improves the generality and adaptability, yielding better GS video reconstruction performance. Extensive quantitative and qualitative experiments on synthetic and real data demonstrate that our method achieves new state-of-the-art results.

Abstract:
Backpropagation (BP) is widely used for calculating gradients in deep neural networks (DNNs). Often applied along with stochastic gradient descent (SGD) or its variants, BP is considered the de facto choice in a variety of machine learning tasks including DNN training and adversarial attack/defense. Recently, a linear variant of BP named LinBP was introduced by Guo et al. (2020) for generating more transferable adversarial examples for black-box attacks. Although it has been shown to be empirically effective in black-box attacks, theoretical studies and convergence analyses of such a method are lacking. This paper serves as a complement, and to some extent an extension, to the paper by Guo et al. (2020), by providing theoretical analyses of LinBP in neural-network-involved learning tasks, including adversarial attack and model training. We demonstrate that, somewhat surprisingly, LinBP can lead to faster convergence in these tasks than BP under the same hyper-parameter settings. We confirm our theoretical results with extensive experiments.

Abstract:
Well-calibrated probabilistic regression models are a crucial learning component in robotics applications as datasets grow rapidly and tasks become more complex. Unfortunately, classical regression models are usually either probabilistic kernel machines with a flexible structure that does not scale gracefully with data or deterministic and vastly scalable automata, albeit with a restrictive parametric form and poor regularization. In this paper, we consider a probabilistic hierarchical modeling paradigm that combines the benefits of both worlds to deliver computationally efficient representations with inherent complexity regularization. The presented approaches are probabilistic interpretations of local regression techniques that approximate nonlinear functions through a set of local linear or polynomial units. Importantly, we rely on principles from Bayesian nonparametrics to formulate flexible models that adapt their complexity to the data and can potentially encompass an infinite number of components. We derive two efficient variational inference techniques to learn these representations and highlight the advantages of hierarchical infinite local regression models, such as dealing with non-smooth functions, mitigating catastrophic forgetting, and enabling parameter sharing and fast predictions. Finally, we validate this approach on large inverse dynamics datasets and test the learned models in real-world control scenarios.

Abstract:
This paper considers a network referred to as SoftGroup for accurate and scalable 3D instance segmentation. Existing state-of-the-art methods produce hard semantic predictions followed by grouping instance segmentation results. Unfortunately, errors stemming from hard decisions propagate into the grouping, resulting in poor overlap between predicted instances and ground truth and substantial false positives. To address the abovementioned problems, SoftGroup allows each point to be associated with multiple classes to mitigate the uncertainty stemming from semantic prediction. It also suppresses false positive instances by learning to categorize them as background. Regarding scalability, the existing fast methods require computational time on the order of tens of seconds on large-scale scenes, which is unsatisfactory and far from applicable for real-time. Our finding is that the $k$-Nearest Neighbor ($k$-NN) module, which serves as the prerequisite of grouping, introduces a computational bottleneck. SoftGroup is extended to resolve this computational bottleneck, referred to as SoftGroup++. The proposed SoftGroup++ reduces time complexity with octree $k$-NN and reduces search space with class-aware pyramid scaling and late devoxelization. Experimental results on various indoor and outdoor datasets demonstrate the efficacy and generality of the proposed SoftGroup and SoftGroup++. Their performances surpass the best-performing baseline by a large margin (6% to 16%) in terms of AP$_{50}$. On datasets with large-scale scenes, SoftGroup++ achieves a 6× speed boost on average compared to SoftGroup. Furthermore, SoftGroup can be extended to perform object detection and panoptic segmentation with nontrivial improvements over existing methods.
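
The contrast between hard and soft semantic decisions can be sketched in a few lines. The example below assumes that a point is associated with every class whose softmax score exceeds a threshold; the threshold rule, its value, and the function names are illustrative assumptions, since the abstract only states that each point may be associated with multiple classes.

```python
import math

def softmax(logits):
    # Numerically stable softmax over per-class logits of a single point.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_assign(scores):
    """Hard decision: keep only the single best class."""
    return max(range(len(scores)), key=lambda k: scores[k])

def soft_assign(scores, threshold):
    """Soft decision (assumed rule): keep every class scoring above threshold."""
    return [k for k, s in enumerate(scores) if s > threshold]

# A point whose two top classes are nearly tied (hypothetical logits).
scores = softmax([2.0, 1.8, -1.0])
hard = hard_assign(scores)
soft = soft_assign(scores, threshold=0.2)
```

With a hard decision the near-tied second class is discarded before grouping, so a wrong argmax cannot be recovered later; the soft assignment keeps both candidates available to the grouping stage.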

Abstract:
Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has become a popular strategy to significantly improve generalization performance. However, the contribution of pre-training to generalization performance is often overlooked and understudied, with limited theoretical understanding. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Second, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific.

Abstract:
External fingerprints (EFs) based only on epidermal information are vulnerable to spoofing attacks and non-ideal skin conditions. To address these shortcomings, internal fingerprints (IFs) collected using optical coherence tomography (OCT) have been proposed and widely researched. However, the development of IF is limited by the lack of in-depth research on the IF and the EF-IF interoperability, which is partially caused by the lack of a public OCT database. The obvious gap in the applications of EF and IF recognition motivated us to design and publish a comprehensive fingerprint database containing both traditional EFs and OCT IFs, denoted as ZJUT-EIFD. To the best of our knowledge, ZJUT-EIFD is the first public database that combines OCT and total internal reflection (TIR) via synchronous acquisition, with 399 different fingers from 60 subjects. In this article, the composition of the database, the quality of EFs and IFs, and the verification performance of different types of fingerprints are detailed. In addition, potential application directions of ZJUT-EIFD are demonstrated. ZJUT-EIFD can serve as a benchmark and an interoperability test bed for EF-IF research, which will promote the research and development of EF and IF.

Abstract:
Multi-view shape reconstruction has achieved impressive progress thanks to the latest advances in neural implicit rendering. However, existing methods based on the signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, named NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as surface representation. While a naive extension of an SDF-based neural renderer cannot scale to UDF, we formalize the rules of neural volume rendering for open surface reconstruction (e.g., self-consistent, unbiased, occlusion-aware), and derive a dedicated rendering weight function specially tailored for UDF. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including two typical open surface datasets MGN (Bhatnagar et al., 2019) and Deep Fashion 3D (Zhu et al., 2020). Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art methods in the task of multi-view surface reconstruction, especially for complex shapes with open boundaries.

Abstract:
Stereo matching is a fundamental building block for many vision and robotics applications. An informative and concise cost volume representation is vital for stereo matching of high accuracy and efficiency. In this article, we present a novel cost volume construction method, named attention concatenation volume (ACV), which generates attention weights from correlation clues to suppress redundant information and enhance matching-related information in the concatenation volume. The ACV can be seamlessly embedded into most stereo matching networks; the resulting networks can use a more lightweight aggregation network while achieving higher accuracy. We further design a fast version of ACV to enable real-time performance, named Fast-ACV, which generates high-likelihood disparity hypotheses and the corresponding attention weights from low-resolution correlation clues to significantly reduce computational and memory cost while maintaining satisfactory accuracy. Furthermore, we design a highly accurate network ACVNet and a real-time network Fast-ACVNet based on our ACV and Fast-ACV respectively, which achieve state-of-the-art performance on several benchmarks.

Abstract:
This article targets the task of novel category discovery (NCD), which aims to discover unknown categories when a certain number of classes are already known. The NCD task is challenging due to its closeness to real-world scenarios, where we have only encountered some partial classes and corresponding images. Unlike previous approaches to NCD, we propose a novel adaptive prototype learning method that leverages prototypes to emphasize category discrimination and alleviate the issue of missing annotations for novel classes. Concretely, the proposed method consists of two main stages: prototypical representation learning and prototypical self-training. In the first stage, we develop a robust feature extractor that could effectively handle images from both base and novel categories. This ability of instance and category discrimination of the feature extractor is boosted by self-supervised learning and adaptive prototypes. In the second stage, we utilize the prototypes again to rectify offline pseudo labels and train a final parametric classifier for category clustering. We conduct extensive experiments on four benchmark datasets, demonstrating our method’s effectiveness and robustness with state-of-the-art performance.

Abstract:
Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available at test time. In this work, we overcome this assumption operating on the open world setting, where no limit is imposed on the compositional space at test time, and the search space contains a large number of unseen compositions. To address this problem, we propose a new approach, Compositional Cosine Graph Embeddings (Co-CGE), based on two principles. First, Co-CGE models the dependency between states, objects and their compositions through a graph convolutional neural network. The graph propagates information from seen to unseen concepts, improving their representations. Second, since not all unseen compositions are equally feasible, and less feasible ones may damage the learned representations, Co-CGE estimates a feasibility score for each unseen composition, using the scores as margins in a cosine similarity-based loss and as weights in the adjacency matrix of the graphs. Experiments show that our approach achieves state-of-the-art performances in standard CZSL while outperforming previous methods in the open world scenario.

Abstract:
Understanding model decision under novel test scenarios is central to the community. A common practice is evaluating models on labeled test sets. However, many real-world scenarios see unlabeled test data, rendering the common supervised evaluation protocols infeasible. In this paper, we investigate such an important but under-explored problem, named Automatic model Evaluation (AutoEval). Specifically, given a trained classifier, we aim to estimate its accuracy on various unlabeled test datasets. We construct a meta-dataset: a dataset comprised of datasets (sample sets) created from original images via various transformations such as rotation and background substitution. Correlation studies on the meta-dataset show that classifier accuracy exhibits a strong negative linear relationship with distribution shift (Pearson’s correlation $r < -0.88$). This new finding inspires us to formulate AutoEval as a dataset-level regression problem. Specifically, we learn regression models (e.g., a regression neural network) to estimate classifier accuracy from overall feature statistics of a test set. In the experiment, we show that the meta-dataset contains sufficient and diverse sample sets, allowing us to train robust regression models and report reasonable and promising predictions of the classifier accuracy on various test sets. We also provide insights into application scopes, limitations, and potential future directions of AutoEval.
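
The dataset-level regression idea can be sketched with ordinary least squares on a single scalar shift measure. Everything numeric below is made-up toy data for illustration only, not a result from the paper; reducing each sample set to one scalar "shift" value and the function names are also assumptions, since the abstract describes regression on overall feature statistics without fixing their form.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy meta-dataset: one (shift, accuracy) pair per sample set. Made-up values.
shifts = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracies = [0.90, 0.82, 0.71, 0.63, 0.50]

r = pearson_r(shifts, accuracies)
slope, intercept = fit_line(shifts, accuracies)

def predict_accuracy(shift):
    # Estimate accuracy of the classifier on an unlabeled set from its shift.
    return slope * shift + intercept
```

Once the line is fitted on sample sets with known accuracy, `predict_accuracy` needs only the shift of a new unlabeled test set, which is the property that makes label-free evaluation possible.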

Abstract:
Diffusion, a fundamental internal mechanism emerging in many physical processes, describes the interaction among different objects. In many learning tasks with limited training samples, the diffusion connects the labeled and unlabeled data points and is a critical component for achieving high classification accuracy. Many existing deep learning approaches directly impose the fusion loss when training neural networks. In this work, inspired by the convection-diffusion ordinary differential equations (ODEs), we propose a novel diffusion residual network (Diff-ResNet), which internally introduces diffusion into the architectures of neural networks. Under the structured data assumption, it is proved that the proposed diffusion block can increase the distance-diameter ratio, which improves the separability of inter-class points and reduces the distance among local intra-class points. Moreover, this property can be easily adopted by the residual networks for constructing the separable hyperplanes. Extensive experiments on synthetic binary classification, semi-supervised graph node classification and few-shot image classification on various datasets validate the effectiveness of the proposed method.

Abstract:
Event cameras are ideally suited to capture High Dynamic Range (HDR) visual information without blur but provide poor imaging capability for static or slowly varying scenes. Conversely, conventional image sensors measure absolute intensity of slowly changing scenes effectively but do poorly on HDR or quickly changing scenes. In this paper, we present an asynchronous linear filter architecture, fusing event and frame camera data, for HDR video reconstruction and spatial convolution that exploits the advantages of both sensor modalities. The key idea is the introduction of a state that directly encodes the integrated or convolved image information and that is updated asynchronously as each event or each frame arrives from the camera. The state can be read-off as-often-as and whenever required to feed into subsequent vision modules for real-time robotic systems. Our experimental results are evaluated on both publicly available datasets with challenging lighting conditions and fast motions, along with a new dataset with HDR reference that we provide. The proposed AKF pipeline outperforms other state-of-the-art methods in both absolute intensity error (69.4% reduction) and image similarity indexes (average 35.5% improvement). We also demonstrate the integration of image convolution with linear spatial kernels Gaussian, Sobel, and Laplacian as an application of our architecture.

Abstract:
Given a high-level instruction, the task of Embodied Referring Expression (REVERIE) requires an embodied agent to localise a remote referred object via navigating in the unseen environment. Previous vision-language navigation methods utilise the provided fine-grained instruction as step-by-step navigation guidance to conduct strict instruction-following, while REVERIE aims to achieve efficient goal-oriented exploration according to the high-level command. In this work, we propose a Cross-modal Knowledge Reasoning (abbreviated as CKR+) framework, which incorporates the prior knowledge as decision guidance to learn the navigation scheme comprehensively. Specifically, we design a Room-Object Aware (ROA) mechanism to explicitly decouple the room- and object-related clues from instruction and visual observations. Moreover, we propose a Knowledge-enabled Entity Relation Reasoning (KERR+) module to leverage the structured knowledge from the knowledge graph explicitly and unstructured knowledge from pre-trained model implicitly, to learn the internal-external correlations among room- and object-entities for the agent to make proper decisions. We devise an Entity Prompter (EP) that embeds in the KERR+ module, which utilises the navigation history and visual entities as prompts to transfer knowledge from the pre-trained CLIP model. In addition, we develop a Reinforced End Decider (RED) to learn the stopping scheme specifically, which is achieved by a customised reinforcement learning strategy and knowledge enhanced matching. Two techniques are also introduced to improve navigation efficiency further. Extensive experiments conducted on the REVERIE benchmark demonstrate the effectiveness and superiority of our proposed methods, which boosts the key metrics, i.e., SPL and REVERIE-success rate, to 14.46% and 13.81% respectively.

Abstract:
Rule-based models, e.g., decision trees, are widely used in scenarios demanding high model interpretability for their transparent inner structures and good model expressivity. However, rule-based models are hard to optimize, especially on large data sets, due to their discrete parameters and structures. Ensemble methods and fuzzy/soft rules are commonly used to improve performance, but they sacrifice the model interpretability. To obtain both good scalability and interpretability, we propose a new classifier, named Rule-based Representation Learner (RRL), that automatically learns interpretable non-fuzzy rules for data representation and classification. To train the non-differentiable RRL effectively, we project it to a continuous space and propose a novel training method, called Gradient Grafting, that can directly optimize the discrete model using gradient descent. A novel design of logical activation functions is also devised to increase the scalability of RRL and enable it to discretize the continuous features end-to-end. Exhaustive experiments on ten small and four large data sets show that RRL outperforms the competitive interpretable approaches and can be easily adjusted to obtain a trade-off between classification accuracy and model complexity for different scenarios.

Abstract:
Data augmentation is an effective method to improve model robustness and generalization. Conventional data augmentation pipelines are commonly used as preprocessing modules for neural networks with predefined heuristics and restricted differentiability. Some recent works indicated that differentiable data augmentation (DDA) could effectively contribute to the training of neural networks and to augmentation policy search strategies. This survey provides a comprehensive and structured overview of the advances in DDA. Specifically, we focus on fundamental elements including differentiable operations, operation relaxations, and gradient estimations, then categorize existing DDA works accordingly, and investigate the utilization of DDA in selected practical applications, specifically neural augmentation networks and differentiable augmentation search. Finally, we discuss current challenges of DDA and future research directions.

Abstract:
Federated learning (FL) allows multiple clients to collaboratively learn a globally shared model through cycles of model aggregation and local model training, without the need to share data. Most existing FL methods train local models separately on different clients, and then simply average their parameters to obtain a centralized model on the server side. However, these approaches generally suffer from large aggregation errors and severe local forgetting, which are particularly bad in heterogeneous data settings. To tackle these issues, in this paper, we propose a novel FL framework that uses online Laplace approximation to approximate posteriors on both the client and server side. On the server side, a multivariate Gaussian product mechanism is employed to construct and maximize a global posterior, largely reducing the aggregation errors induced by large discrepancies between local models. On the client side, a prior loss that uses the global posterior probabilistic parameters delivered from the server is designed to guide the local training. Binding such learning constraints from other clients enables our method to mitigate local forgetting. Finally, we achieve state-of-the-art results on several benchmarks, clearly demonstrating the advantages of the proposed method.
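
The difference between plain parameter averaging and a Gaussian product on the server can be shown with a small worked example. The sketch below assumes each client posterior is a Gaussian with diagonal covariance, so the product reduces to precision-weighted averaging per coordinate; the diagonal restriction and the function name are simplifying assumptions, not details stated in the abstract.

```python
def gaussian_product_diag(means, variances):
    """Product of diagonal Gaussians N(mean_i, diag(var_i)).

    The (normalized) product is again Gaussian, with
    precision = sum of precisions and
    mean = precision-weighted average of the means.
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for j in range(dim):
        precision = sum(1.0 / v[j] for v in variances)
        weighted = sum(m[j] / v[j] for m, v in zip(means, variances))
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted / precision)
    return fused_mean, fused_var

# Two clients, two parameters (toy values). Client 2 is less certain
# about the second parameter (variance 3.0 instead of 1.0).
client_means = [[0.0, 2.0], [1.0, 4.0]]
client_vars = [[1.0, 1.0], [1.0, 3.0]]

fused_mean, fused_var = gaussian_product_diag(client_means, client_vars)

# Plain averaging ignores the uncertainty of each client.
plain_average = [sum(m[j] for m in client_means) / len(client_means)
                 for j in range(2)]
```

For the first parameter both clients are equally confident, so the fused mean equals the plain average. For the second, the fused mean is pulled toward the more confident client, which is how discrepancies between local models are weighed rather than simply averaged away.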

Abstract:
Modeling non-Euclidean data is drawing extensive attention along with the unprecedented successes of deep neural networks in diverse fields. Particularly, a symmetric positive definite matrix is being actively studied in computer vision, signal processing, and medical image analysis, due to its ability to learn beneficial statistical representations. However, owing to its rigid constraints, it leads to challenging optimization problems and high computational costs, especially when incorporated into a deep learning framework. In this paper, we propose a framework to exploit a diffeomorphism mapping between Riemannian manifolds and a Cholesky space, by which it becomes feasible not only to efficiently solve optimization problems but also to greatly reduce computation costs. Further, for dynamic modeling of time-series data, we devise a continuous manifold learning method by systematically integrating a manifold ordinary differential equation and a gated recurrent neural network. It is worth noting that due to the nice parameterization of matrices in a Cholesky space, training our proposed network equipped with Riemannian geometric metrics is straightforward. We demonstrate through experiments over regular and irregular time-series datasets that our proposed model can be efficiently and reliably trained and outperforms existing manifold methods and state-of-the-art methods in various time-series tasks.
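
One way to see why a Cholesky parameterization removes the rigid constraints is the log-Cholesky map sketched below: an SPD matrix is factored as L Lᵀ, and taking the logarithm of the diagonal of L yields unconstrained real numbers. This particular map is a common choice assumed here for illustration; the abstract does not state which mapping the paper uses, and all function names are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with a = L L^T, for symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def to_free(l):
    # Lower-triangular rows with log applied to the (positive) diagonal.
    return [[math.log(l[i][j]) if i == j else l[i][j] for j in range(i + 1)]
            for i in range(len(l))]

def from_free(rows):
    # Inverse map: exp on the diagonal guarantees a valid Cholesky factor.
    n = len(rows)
    l = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            l[i][j] = math.exp(v) if i == j else v
    return l

def gram(l):
    # Rebuild the SPD matrix L L^T.
    n = len(l)
    return [[sum(l[i][k] * l[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

spd = [[4.0, 2.0], [2.0, 3.0]]
free = to_free(cholesky(spd))
restored = gram(from_free(free))

# Any real-valued free parameters map back to a valid SPD matrix.
arbitrary = gram(from_free([[0.0], [5.0, -1.0]]))
```

Because every choice of free parameters yields an SPD matrix, a gradient step taken on `free` never leaves the manifold, so no projection or eigendecomposition is needed after each update.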

Abstract:
In a wide range of dense prediction tasks, large-scale Vision Transformers have achieved state-of-the-art performance while requiring expensive computation. In contrast to most existing approaches accelerating Vision Transformers for image classification, we focus on accelerating Vision Transformers for dense prediction without any fine-tuning. We present two non-parametric operators specialized for dense prediction tasks, a token clustering layer to decrease the number of tokens for expediting and a token reconstruction layer to increase the number of tokens for recovering high-resolution. To accomplish this, the following steps are taken: i) token clustering layer is employed to cluster the neighboring tokens and yield low-resolution representations with spatial structures; ii) the following transformer layers are performed only to these clustered low-resolution tokens; and iii) reconstruction of high-resolution representations from refined low-resolution representations is accomplished using token reconstruction layer. The proposed approach shows promising results consistently on 6 dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, depth estimation, and video instance segmentation. Additionally, we validate the effectiveness of the proposed approach on the very recent state-of-the-art open-vocabulary recognition methods. Furthermore, a number of recent representative approaches are benchmarked and compared on dense prediction tasks.

Abstract:
The rapid advances in high-performance sensing have enabled gigapixel-level imaging/videography for large-scale scenes, yet the abundant details in gigapixel images have rarely been exploited in 3D reconstruction solutions. Bridging the gap between sensing capacity and reconstruction capacity requires tackling the large-baseline challenge imposed by large-scale scenes, while utilizing the high-resolution details provided by the gigapixel images. This paper introduces GiganticNVS for gigapixel large-scale novel view synthesis (NVS). Existing NVS methods suffer from excessively blurred artifacts and fail to fully exploit image resolution, due to their inability to recover a faithful underlying geometry and their dependence on dense observations to accurately interpolate radiance. Our key insight is that a highly-expressive implicit field with view-consistency is critical for synthesizing high-fidelity details from large-baseline observations. In light of this, we propose meta-deformed manifold, where meta refers to the locally defined surface manifold whose geometry and appearance are embedded into high-dimensional latent space. Technically, meta can be decoded as neural fields using an MLP (i.e., implicit representation). Upon this novel representation, multi-view geometric correspondence can be effectively enforced with featuremetric deformation and the reflectance field can be learned purely on the surface. Experimental results verify that the proposed method outperforms state-of-the-art methods both quantitatively and qualitatively, not only on the standard datasets containing complex real-world scenes with large baseline angles, but also on the challenging gigapixel-level ultra-large-scale benchmarks.

Abstract:
We present a complete classification of all minimal problems for generic arrangements of points and lines completely observed by calibrated perspective cameras. We show that there are only 30 minimal problems in total: no problems exist for more than 6 cameras, for more than 5 points, or for more than 6 lines. We present a sequence of tests for detecting minimality, starting with counting degrees of freedom and ending with full symbolic and numeric verification of representative examples. For all minimal problems discovered, we present their algebraic degrees, i.e., the number of solutions, which measure their intrinsic difficulty. This shows exactly how the difficulty of problems grows with the number of views. Importantly, several new minimal problems have small degrees that might be practical in image matching and 3D reconstruction.

Abstract:
Image segmentation is a fundamental task for medical image analysis, whose accuracy is improved by the development of neural networks. However, the existing algorithms that achieve high-resolution performance require high-resolution input, resulting in substantial computational expenses and limiting their applicability in the medical field. Several studies have proposed dual-stream learning frameworks incorporating a super-resolution task as an auxiliary task. In this paper, we rethink these frameworks and reveal that the feature similarity between tasks is insufficient to constrain vessel or lesion segmentation in the medical field, due to their small proportion in the image. To address this issue, we propose a DS2F (Dual-Stream Shared Feature) framework, including a Shared Feature Extraction Module (SFEM). Specifically, we present Multi-Scale Cross Gate (MSCG) utilizing multi-scale features as a novel example of SFEM. Then we define a proxy task and proxy loss to enable the features to focus on the targets, based on the assumption that a limited set of shared features between tasks is helpful for their performance. Extensive experiments on six publicly available datasets across three different scenarios are conducted to verify the effectiveness of our framework. Furthermore, various ablation studies are conducted to demonstrate the significance of our DS2F.

Abstract:
Self-supervised monocular depth estimation has shown impressive results in static scenes. It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions and occlusions. Consequently, existing methods show poor accuracy in dynamic scenes, and the estimated depth map is blurred at object boundaries because they are usually occluded in other training views. In this paper, we propose SC-DepthV3 to address these challenges. Specifically, we introduce an external pretrained monocular depth estimation model for generating a single-image depth prior, namely pseudo-depth, based on which we propose novel losses to boost self-supervised training. As a result, our model can predict sharp and accurate depth maps, even when training from monocular videos of highly dynamic scenes. We demonstrate the significantly superior performance of our method over previous methods on six challenging datasets, and we provide detailed ablation studies for the proposed terms.

Abstract:
3D Skeleton-based human action recognition has attracted increasing attention in recent years. Most of the existing work focuses on supervised learning which requires a large number of labeled action sequences that are often expensive and time-consuming to annotate. In this paper, we address self-supervised 3D action representation learning for skeleton-based action recognition. We investigate self-supervised representation learning and design a novel skeleton cloud colorization technique that is capable of learning spatial and temporal skeleton representations from unlabeled skeleton sequence data. We represent a skeleton action sequence as a 3D skeleton cloud and colorize each point in the cloud according to its temporal and spatial orders in the original (unannotated) skeleton sequence. Leveraging the colorized skeleton point cloud, we design an auto-encoder framework that can learn spatial-temporal features from the artificial color labels of skeleton joints effectively. Specifically, we design a two-stream pretraining network that leverages fine-grained and coarse-grained colorization to learn multi-scale spatial-temporal features. In addition, we design a Masked Skeleton Cloud Repainting task that can pretrain the designed auto-encoder framework to learn informative representations. We evaluate our skeleton cloud colorization approach with linear classifiers trained under different configurations, including unsupervised, semi-supervised, fully-supervised, and transfer learning settings. Extensive experiments on NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and UWA3D datasets show that the proposed method outperforms existing unsupervised and semi-supervised 3D action recognition methods by large margins and achieves competitive performance in supervised 3D action recognition as well.

Abstract:
Normalized-Cut (N-Cut) is a famous model of spectral clustering. The traditional N-Cut solvers are two-stage: 1) calculating the continuous spectral embedding of normalized Laplacian matrix; 2) discretization via $K$-means or spectral rotation. However, this paradigm brings two vital problems: 1) two-stage methods solve a relaxed version of the original problem, so they cannot obtain good solutions for the original N-Cut problem; 2) solving the relaxed problem requires eigenvalue decomposition, which has $\mathcal{O}(n^3)$ time complexity ($n$ is the number of nodes). To address the problems, we propose a novel N-Cut solver designed based on the famous coordinate descent method. Since the vanilla coordinate descent method also has $\mathcal{O}(n^3)$ time complexity, we design various accelerating strategies to reduce the time complexity to $\mathcal{O}(|E|)$ ($|E|$ is the number of edges). To avoid reliance on random initialization which brings uncertainties to clustering, we propose an efficient initialization method that gives deterministic outputs. Extensive experiments on several benchmark datasets demonstrate that the proposed solver can obtain larger objective values of N-Cut, meanwhile achieving better clustering performance compared to traditional solvers.
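
For readers unfamiliar with the discrete objective that such a solver targets, the following is a minimal sketch of the standard normalized-cut value of a given partition, written in the common minimization form (sum over clusters of cut divided by volume). It is not the proposed coordinate descent solver, and the abstract's "larger objective values" suggests the authors use an equivalent maximization form; the function name and the toy graph are assumptions for illustration.

```python
def ncut_value(weights, labels):
    """Normalized-cut value of a partition (minimization form).

    weights: symmetric n x n adjacency matrix as a list of lists.
    labels:  cluster label for each of the n nodes.
    Returns the sum over clusters of cut(A, rest) / vol(A).
    """
    n = len(weights)
    total = 0.0
    for c in set(labels):
        cut = 0.0  # weight of edges leaving cluster c
        vol = 0.0  # sum of degrees of the nodes in cluster c
        for i in range(n):
            if labels[i] != c:
                continue
            for j in range(n):
                vol += weights[i][j]
                if labels[j] != c:
                    cut += weights[i][j]
        if vol > 0:
            total += cut / vol
    return total


# Toy graph: two tightly linked pairs (0-1 and 2-3) joined by a weak edge 1-2.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
good = ncut_value(W, [0, 0, 1, 1])  # cuts only the weak edge
bad = ncut_value(W, [0, 1, 0, 1])   # cuts every edge
```

Evaluating this objective directly on a discrete labeling, rather than on a relaxed spectral embedding, is what allows a coordinate-wise method to improve the original problem one node assignment at a time.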

Abstract:
Human gaze provides valuable information on human focus and intentions, making it a crucial area of research. Recently, deep learning has revolutionized appearance-based gaze estimation. However, due to the unique features of gaze estimation research, such as the unfair comparison between 2D gaze positions and 3D gaze vectors and the different pre-processing and post-processing methods, there is a lack of a definitive guideline for developing deep learning-based gaze estimation algorithms. In this paper, we present a systematic review of the appearance-based gaze estimation methods using deep learning. First, we survey the existing gaze estimation algorithms along the typical gaze estimation pipeline: deep feature extraction, deep learning model design, personal calibration and platforms. Second, to fairly compare the performance of different approaches, we summarize the data pre-processing and post-processing methods, including face/eye detection, data rectification, 2D/3D gaze conversion and gaze origin conversion. Finally, we set up a comprehensive benchmark for deep learning-based gaze estimation. We characterize all the public datasets and provide the source code of typical gaze estimation algorithms. This paper serves not only as a reference to develop deep learning-based gaze estimation methods, but also a guideline for future gaze estimation research.

Abstract:
Video Coding for Machines (VCM) aims to compress visual signals for machine analysis. However, existing methods only consider a few machines, neglecting the majority. Moreover, the machine's perceptual characteristics are not leveraged effectively, resulting in suboptimal compression efficiency. To overcome these limitations, this paper introduces Satisfied Machine Ratio (SMR), a metric that statistically evaluates the perceptual quality of compressed images and videos for machines by aggregating satisfaction scores from them. Each score is derived from machine perceptual differences between original and compressed images. Targeting image classification and object detection tasks, we build two representative machine libraries for SMR annotation and create a large-scale SMR dataset to facilitate SMR studies. We then propose an SMR prediction model based on the correlation between deep feature differences and SMR. Furthermore, we introduce an auxiliary task to increase the prediction accuracy by predicting the SMR difference between two images in different quality. Extensive experiments demonstrate that SMR models significantly improve compression performance for machines and exhibit robust generalizability on unseen machines, codecs, datasets, and frame types.
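
As a rough illustration of the aggregation step described above, the sketch below computes a ratio of "satisfied" machines. The abstract does not define how a satisfaction score is derived from a perceptual difference, so the thresholding rule, the function name, and the parameter names are all assumptions.

```python
def satisfied_machine_ratio(perceptual_differences, threshold):
    """Fraction of machines whose perceptual difference is within a tolerance.

    perceptual_differences: one non-negative number per machine, measuring how
        much that machine's output changes between the original and the
        compressed image (hypothetical definition).
    threshold: largest difference at which a machine still counts as satisfied
        (hypothetical parameter).
    """
    if not perceptual_differences:
        raise ValueError("at least one machine is required")
    satisfied = [d <= threshold for d in perceptual_differences]
    return sum(satisfied) / len(satisfied)


# Four hypothetical machines; two of them tolerate this compression level.
smr = satisfied_machine_ratio([0.0, 0.1, 0.4, 0.9], threshold=0.2)
```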

Abstract:
Exploiting consistent structure from multiple graphs is vital for multi-view graph clustering. To achieve this goal, we propose an Efficient Balanced Multi-view Graph Clustering via Good Neighbor Fusion (EBMGC-GNF) model which comprehensively extracts credible consistent neighbor information from multiple views by designing a Cross-view Good Neighbors Voting module. Moreover, a novel balanced regularization term based on the $p$-power function is introduced to adjust the balance property of clusters, which helps the model adapt to data with different distributions. To solve the optimization problem of EBMGC-GNF, we transform EBMGC-GNF into an efficient form with a graph coarsening method and optimize it with an accelerated coordinate descent algorithm. In experiments, extensive results demonstrate that, in the majority of scenarios, our proposals outperform state-of-the-art methods in terms of both effectiveness and efficiency.

Abstract:
In healthcare, the acquisition of samples is usually restricted by multiple considerations, including cost, labor-intensive annotation, privacy concerns, and radiation hazards; synthesizing images of interest is therefore an important tool for data augmentation. Diffusion models have recently attained state-of-the-art results in various synthesis tasks, and embedding energy functions has been shown to effectively guide a pre-trained model to synthesize target samples. However, we notice that current method development and validation are still limited to improving indicators such as the Fréchet Inception Distance (FID) and Inception Score (IS), and have not provided deeper investigations on downstream tasks such as disease grading and diagnosis. Moreover, existing classifier guidance, which can be regarded as a special case of an energy function, has only a singular effect on altering the distribution of the synthetic dataset. This may lead to in-distribution synthetic samples that offer limited help to downstream model optimization. All these limitations indicate that we still have a long way to go to achieve controllable generation. In this work, we first analyze previous guidance and its contributions to further applications from the perspective of data distribution. To synthesize samples that can help downstream applications, we then introduce uncertainty guidance in each sampling step and design an uncertainty-guided diffusion model. Extensive experiments on four medical datasets, with ten classic networks trained on the augmented sample sets, provide a comprehensive evaluation of the practical contributions of our methodology. Furthermore, we provide a theoretical guarantee for general gradient guidance in diffusion models, which would benefit future research on investigating other forms of measurement guidance for specific generative tasks.

Abstract:
Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks; however, handcrafted kernel priors and learning-based kernel priors are typically required. In this paper, we propose a meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. Concretely, a lightweight network is adopted as the kernel generator and is optimized via learning from the MCMC simulation on random Gaussian distributions. This procedure provides an approximation for the rational blur kernel and introduces network-level Langevin dynamics into SISR optimization processes, which helps prevent bad local optimal solutions for kernel estimation. Meanwhile, a meta-learning based alternating optimization procedure is proposed to optimize the kernel generator and image restorer, respectively. In contrast to the conventional alternating minimization strategy, a meta-learning based framework is applied to learn an adaptive optimization strategy, which is less greedy and results in better convergence performance. These two procedures are iteratively processed in a plug-and-play fashion, realizing, for the first time, a learning-based but plug-and-play blind SISR solution in unsupervised inference. Extensive simulations demonstrate the superior performance and generalization ability of the proposed approach when compared with the State-of-the-Art solutions on synthetic and real-world datasets.

Abstract:
In this paper, we study the problem of 3D object segmentation from raw point clouds. Unlike existing methods which usually require a large amount of human annotations for full supervision, we propose the first unsupervised method, called OGC, to simultaneously identify multiple 3D objects in a single forward pass, without needing any type of human annotations. The key to our approach is to fully leverage the dynamic motion patterns over sequential point clouds as supervision signals to automatically discover rigid objects. Our method consists of three major components, 1) the object segmentation network to directly estimate multi-object masks from a single point cloud frame, 2) the auxiliary self-supervised scene flow estimator, and 3) our core object geometry consistency component. By carefully designing a series of loss functions, we effectively take into account the multi-object rigid consistency and the object shape invariance in both temporal and spatial scales. This allows our method to truly discover the object geometry even in the absence of annotations. We extensively evaluate our method on five datasets, demonstrating the superior performance for object part instance segmentation and general object segmentation in both indoor and the challenging outdoor scenarios.

Abstract:
Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control.

Abstract:
Weakly supervised object localization (WSOL), adopting only image-level annotations to learn the pixel-level localization model, can release human resources in the annotation process. Most one-stage WSOL methods learn the localization model with multi-instance learning, making them only activate discriminative object parts rather than the whole object. In our work, we attribute this problem to the domain shift between the training and test process of WSOL and provide a novel perspective that views WSOL as a domain adaption (DA) task. Under this perspective, a DA-WSOL pipeline is elaborated to better assist WSOL with DA approaches by considering the specificities for the adaption of WSOL. Our DA-WSOL pipeline can discern the source-related and the Universum samples from other target samples based on a proposed target sampling strategy and then utilize them to solve the sample unbalancing and label unmatching between the source and target domain of WSOL. Experiments show that our pipeline outperforms SOTA methods on three WSOL benchmarks and can improve the performance of downstream weakly supervised semantic segmentation tasks.

Abstract:
To overcome the restriction of identical distribution assumption, invariant representation learning for unsupervised domain adaptation (UDA) has made significant advances in computer vision and pattern recognition communities. In UDA scenario, the training and test data belong to different domains while the task model is learned to be invariant. Recently, empirical connections between transferability and discriminability have received increasing attention, which is the key to understand the invariant representations. However, theoretical study of these abilities and in-depth analysis of the learned feature structures are unexplored yet. In this work, we systematically analyze the essentials of transferability and discriminability from the geometric perspective. Our theoretical results provide insights into understanding the co-regularization relation and prove the possibility of learning these abilities. From methodology aspect, the abilities are formulated as geometric properties between domain/cluster subspaces (i.e., orthogonality and equivalence) and characterized as the relation between the norms/ranks of multiple matrices. Two optimization-friendly learning principles are derived, which also ensure some intuitive explanations. Moreover, a feasible range for the co-regularization parameters is deduced to balance the learning of geometric structures. Based on the theoretical results, a geometry-oriented model is proposed for enhancing the transferability and discriminability via nuclear norm optimization. Extensive experiment results validate the effectiveness of the proposed model in empirical applications, and verify that the geometric abilities can be sufficiently learned in the derived feasible range.

Abstract:
Filters and wrappers represent two mainstream approaches to feature selection (FS). Although evolutionary wrapper-based FS outperforms filters in addressing real-world classification problems, extending these methods to high-dimensional, many-objective optimization problems with imbalanced data poses substantial challenges. Overcoming computational costs and identifying suitable performance metrics are vital for navigating search operation complexities. Here, we propose using the Jaccard similarity (JS) in a set-based evolutionary many-objective (JSEMO) FS search, addressing both evolutionary FS and imbalanced classifier choice concurrently. This study highlights the mutual influence between these aspects, impacting overall algorithm performance. JSEMO integrates JS into population initialization, reproduction, and elitism steps, enhancing diversity and avoiding duplicate solutions. The set-based variation operator utilizes intersection and union operators for compatibility with binary coding. We also introduce a double-weighted KNN (KNN2W) classifier with four supportive objectives as a many-objective FS problem to handle imbalanced distributions. Compared with 20 methods on 15 benchmark problems, JSEMO produces distinct optimal features, significantly improving overall accuracy, balance accuracy, and g-mean metrics with comparable feature set size and computational cost. The ablation study underscores the positive impact of all JSEMO components, highlighting the set-based variation operation with JS and KNN2W with relevant evaluation metrics as the most influential aspects.
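
The Jaccard similarity between two feature subsets, and one way it could be used to avoid duplicate solutions, can be sketched as follows. The abstract does not give the exact rules JSEMO applies in initialization, reproduction, or elitism, so the `drop_duplicates` helper and its `threshold` parameter are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B| between two feature subsets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(union)


def drop_duplicates(population, threshold=1.0):
    """Keep a candidate only if it is less similar than `threshold`
    to every candidate already kept (hypothetical filtering rule)."""
    kept = []
    for candidate in population:
        if all(jaccard_similarity(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept


# Feature subsets are lists of selected feature indices.
population = [[0, 1], [1, 0], [2]]
unique = drop_duplicates(population)
```

Working on sets rather than bit strings makes the similarity independent of feature ordering, which is why `[0, 1]` and `[1, 0]` count as the same solution here. Lowering `threshold` below 1.0 would also discard near-duplicates.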

Abstract:
Source-free domain adaptation is a crucial machine learning topic, as it has numerous applications in the real world, particularly with respect to data privacy. Existing approaches predominantly focus on Euclidean data, such as images and videos, while the exploration of non-Euclidean graph data remains scarce. Recent graph neural network (GNN) approaches can suffer from serious performance decline due to domain shift and label scarcity in source-free adaptation scenarios. In this study, we propose a novel method named Graph Diffusion-based Alignment with Jigsaw (GALA), tailored for source-free graph domain adaptation. To achieve domain alignment, GALA employs a graph diffusion model to reconstruct source-style graphs from target data. Specifically, a score-based graph diffusion model is trained using source graphs to learn the generative source styles. Then, we introduce perturbations to target graphs via a stochastic differential equation instead of sampling from a prior, followed by the reverse process to reconstruct source-style graphs. We feed the source-style graphs into an off-the-shelf GNN and introduce class-specific thresholds with curriculum learning, which can generate accurate and unbiased pseudo-labels for target graphs. Moreover, we develop a simple yet effective graph-mixing strategy named graph jigsaw to combine confident graphs and unconfident graphs, which can enhance generalization capabilities and robustness via consistency learning. Extensive experiments on benchmark datasets validate the effectiveness of GALA.

Abstract:
Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance. However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL), a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. First, we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences. Second, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain.

Abstract:
Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating $L_0$-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance.
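
An $L_0$-norm bound limits how many entries of a perturbation may be non-zero, which matches patch-like physical attacks that change few pixels by a large amount. The abstract does not say how the bound is enforced during training; keeping the largest-magnitude entries, as in the sketch below, is one common projection and is used here only as an assumption.

```python
def project_l0(perturbation, k):
    """Project a flat perturbation onto the set of vectors with at most
    k non-zero entries by keeping the k largest-magnitude entries."""
    if k <= 0:
        return [0.0] * len(perturbation)
    # Indices sorted by decreasing absolute value.
    order = sorted(
        range(len(perturbation)),
        key=lambda i: abs(perturbation[i]),
        reverse=True,
    )
    keep = set(order[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(perturbation)]


delta = [0.1, -0.9, 0.3, 0.5]
sparse_delta = project_l0(delta, k=2)
```

In an adversarial training loop, a projection of this kind would typically be applied after each gradient step on the perturbation, so the attack stays within the allowed pixel budget.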

Abstract:
In the past decade, deep neural networks have achieved significant progress in point cloud learning. However, collecting large-scale precisely-annotated point clouds is extremely laborious and expensive, which hinders the scalability of existing point cloud datasets and poses a bottleneck for efficient exploration of point cloud data in various tasks and applications. Label-efficient learning offers a promising solution by enabling effective deep network training with much-reduced annotation efforts. This paper presents the first comprehensive survey of label-efficient learning of point clouds. We address three critical questions in this emerging research field: i) the importance and urgency of label-efficient learning in point cloud processing, ii) the subfields it encompasses, and iii) the progress achieved in this area. To this end, we propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels. We categorize four typical label-efficient learning approaches that significantly reduce point cloud annotation efforts: data augmentation, domain transfer learning, weakly-supervised learning, and pretrained foundation models. For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges. Finally, we share our views on the current research challenges and potential future directions.

Abstract:
Thin-plate spline (TPS) is a principal warp that allows for representing elastic, nonlinear transformation with control point motions. With the increase of control points, the warp becomes increasingly flexible but usually encounters a bottleneck caused by undesired issues, e.g., content distortion. In this paper, we explore generic applications of TPS in single-image-based warping tasks, such as rotation correction, rectangling, and portrait correction. To break this bottleneck, we propose the coupled thin-plate spline model (CoupledTPS), which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. Concretely, we first design an iterative search to predict new control points according to the current latent condition. Then, we present the warping flow as a bridge for the coupling of different TPS transformations, effectively eliminating interpolation errors caused by multiple warps. Besides, in light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. It is formulated through dual transformation between the searched control points of unlabeled data and its graphic augmentation, yielding an implicit correction consistency constraint. Finally, we collect massive unlabeled data to exhibit the benefit of our semi-supervised scheme in rotation correction. Extensive experiments demonstrate the superiority and universality of CoupledTPS over the existing State-of-the-Art (SoTA) solutions for rotation correction and beyond.

Abstract:
Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, i.e., zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and be adaptively deactivated and output all-zero results for static representations. They provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.

Abstract:
Due to data privacy issues, accelerating networks with tiny training sets has become a critical need in practice. Previous methods achieved promising results empirically by filter-level pruning. In this paper, we both study this problem theoretically and propose an effective algorithm aligning well with our theoretical results. First, we propose the finetune convexity hypothesis to explain why recent few-shot compression algorithms do not suffer from overfitting problems. Based on it, a theory is further established to explain these methods for the first time. Compared to naively finetuning a pruned network, feature mimicking is proved to achieve a lower variance of parameters and hence enjoys easier optimization. With our theoretical conclusions, we claim dropping blocks is a fundamentally superior few-shot compression scheme in terms of more convex optimization and a higher acceleration ratio. To choose which blocks to drop, we propose a new metric, recoverability, to effectively measure the difficulty of recovering the compressed network. Finally, we propose an algorithm named Practise to accelerate networks using only tiny sets of training images. Practise outperforms previous methods by a significant margin. For 22% latency reduction, Practise surpasses previous methods by on average 7 percentage points on ImageNet-1k. It also enjoys high generalization ability, working well under data-free or out-of-domain data settings, too.

Affiliations: Pen-Tung Sah Institute of Micro-Nano Science and Technology, and National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China; Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University, Xiamen, Fujian, China; School of Information Engineering, Jimei University, Xiamen, Fujian, China; Institute of Neuroinformatics, University of Zurich, Zurich, Switzerland; School of Computer Science and School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, Shaanxi, China

Abstract:
Principal Component Analysis (PCA) aims to acquire the principal component space containing the essential structure of data, instead of being used for mining and extracting the essential structure of data. In other words, the principal component space contains not only information related to the essential structure of data but also some unrelated information. This frequently occurs when the intrinsic dimensionality of data is unknown or when it has complex distribution characteristics such as multi-modalities, manifolds, etc. Therefore, it is unreasonable to identify noise and useful information based solely on reconstruction error. For this reason, PCA is unsuitable as a preprocessing technique for most applications, especially in noisy environments. To solve this problem, this paper proposes robust PCA based on fuzzy local information reservation (FLIPCA). By analyzing the impact of reconstruction error on sample discriminability, FLIPCA provides a theoretical basis for noise identification and processing. This not only greatly improves its robustness but also extends its applicability and effectiveness as a data preprocessing technique. Meanwhile, FLIPCA maintains consistent mathematical descriptions with traditional PCA while having few adjustable hyperparameters and low algorithmic complexity. Finally, we conduct comprehensive experiments on synthetic and real-world datasets, which substantiate the superiority of our proposed algorithm.

Abstract:
Federated learning has emerged as a promising paradigm for privacy-preserving collaboration among different parties. Recently, with the popularity of federated learning, a growing number of approaches have been proposed to address different realistic challenges. In this survey, we provide a systematic overview of the important and recent developments of research on federated learning. First, we introduce the research history and terminology of this area. Then, we comprehensively review three basic lines of research: generalization, robustness, and fairness, by introducing their respective background concepts, task settings, and main challenges. We also offer a detailed overview of representative literature on both methods and datasets. We further benchmark the reviewed methods on several well-known datasets. Finally, we point out several open issues in this field and suggest opportunities for further research.

Abstract:
In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing gradient at the extrapolation point. Then Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient complexity on the non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k.

Abstract:
Previous knowledge distillation (KD) methods mostly focus on compressing network architectures, which is not thorough enough in deployment as some costs like transmission bandwidth and imaging equipment are related to the image size. Therefore, we propose Pixel Distillation that extends knowledge distillation into the input level while simultaneously breaking architecture constraints. Such a scheme can achieve flexible cost control for deployment, as it allows the system to adjust both network architecture and image quality according to the overall requirement of resources. Specifically, we first propose an input spatial representation distillation (ISRD) mechanism to transfer spatial knowledge from large images to student's input module, which can facilitate stable knowledge transfer between CNN and ViT. Then, a Teacher-Assistant-Student (TAS) framework is further established to disentangle pixel distillation into the model compression stage and input compression stage, which significantly reduces the overall complexity of pixel distillation and the difficulty of distilling intermediate knowledge. Finally, we adapt pixel distillation to object detection via an aligned feature for preservation (AFP) strategy for TAS, which aligns output dimensions of detectors at each stage by manipulating features and anchors of the assistant. Comprehensive experiments on image classification and object detection demonstrate the effectiveness of our method.

Abstract:
Reconstructing a 3D shape based on a single sketch image is challenging due to the inherent sparsity and ambiguity present in sketches. Existing methods lose fine details when extracting features to predict 3D objects from sketches. Upon analyzing the 3D-to-2D projection process, we observe that the density map, characterizing the distribution of 2D point clouds, can serve as a proxy to facilitate the reconstruction process. In this work, we propose a novel sketch-based 3D reconstruction model named SketchSampler. It initiates the process by translating a sketch through an image translation network into a more informative 2D representation, which is then used to generate a density map. Subsequently, a two-stage probabilistic sampling process is employed to reconstruct a 3D point cloud: first, recovering the 2D points (i.e., the $x$ and $y$ coordinates) by sampling the density map; and second, predicting the depth (i.e., the $z$ coordinate) by sampling the depth values along the ray determined by each 2D point. Additionally, we convert the reconstructed point cloud into a 3D mesh for wider applications. To reduce ambiguity, we incorporate hidden lines in sketches. Experimental results demonstrate that our proposed approach significantly outperforms other baseline methods.
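
The first sampling stage described above, recovering 2D points from a density map, can be sketched as weighted sampling over grid cells. This is a minimal illustration under the assumption that the density map is a small 2D grid of non-negative weights; the function name and the use of pixel-center coordinates are hypothetical, and the depth-sampling stage is not covered.

```python
import random


def sample_points(density, num_points, rng=None):
    """Draw (x, y) cell coordinates with probability proportional to density.

    density: 2D list of non-negative weights, indexed as density[y][x].
    num_points: number of 2D points to draw (with replacement).
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random(0)
    height, width = len(density), len(density[0])
    cells = [(x, y) for y in range(height) for x in range(width)]
    weights = [density[y][x] for (x, y) in cells]
    # random.choices samples with replacement according to the weights.
    return rng.choices(cells, weights=weights, k=num_points)


# Toy density map: only two cells carry mass, so only they can be sampled.
density = [
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 1.0],
]
points = sample_points(density, num_points=100)
```

Cells with higher density receive proportionally more points, so regions where many 3D points project onto the same pixel are represented more densely in the reconstruction.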

Abstract:
Reconstruction of a continuous surface of two-dimensional manifold from its raw, discrete point cloud observation is a long-standing problem in computer vision and graphics research. The problem is technically ill-posed, and becomes more difficult considering that various sensing imperfections would appear in the point clouds obtained by practical depth scanning. In the literature, a rich set of methods has been proposed, and reviews of existing methods are also provided. However, existing reviews lack thorough investigations on a common benchmark. The present paper aims to review and benchmark existing methods in the new era of deep learning surface reconstruction. To this end, we contribute a large-scale benchmarking dataset consisting of both synthetic and real-scanned data; the benchmark includes object- and scene-level surfaces and takes into account various sensing imperfections that are commonly encountered in practical depth scanning. We conduct thorough empirical studies by comparing existing methods on the constructed benchmark, and pay special attention to the robustness of existing methods against various scanning imperfections; we also study how different methods generalize in terms of reconstructing complex surface shapes. Our studies help identify the best conditions under which different methods work, and suggest some empirical findings. For example, while deep learning methods are increasingly popular in the research community, our systematic studies suggest that, surprisingly, a few classical methods perform even better in terms of both robustness and generalization; our studies also suggest that the practical challenges of misalignment of point sets from multi-view scanning, missing surface points, and point outliers remain unsolved by all the existing surface reconstruction methods. We expect that the benchmark and our studies will be valuable both for practitioners and as guidance for new innovations in future research.

Abstract:
The computational complexity of video models grows quadratically with the number of frames. Thus, constrained by computational resources, training video models to learn long-term temporal semantics end-to-end is quite a challenge. Currently, the mainstream method is to split a raw video into clips, leading to an incomplete, fragmentary temporal information flow and a failure to model long-term semantics. In this paper, we design the Markov Progressive framework (MaPro), a theoretical framework consisting of the progressive modeling method and a paradigm model tailored for it. The core idea of MaPro is to find a paradigm model consisting of the proposed Markov operators, which can be trained in multiple sequential steps and ensures that the multi-step progressive modeling is equivalent to conventional end-to-end modeling. By training the paradigm model under the progressive method, we are able to model long videos end-to-end with limited resources and ensure the effective transmission of long-term temporal information. We provide implementations of this theoretical system on the mainstream CNN- and Transformer-based models, where they are modified to conform to the Markov paradigm. As a general and robust training method, we experimentally demonstrate that it yields significant performance improvements on different backbones and datasets. As an illustrative example, the proposed method improves the SlowOnly network by 4.1 mAP on Charades and 2.5 top-1 accuracy on Kinetics. For TimeSformer, MaPro improves its performance on Kinetics by 2.0 top-1 accuracy. Importantly, all these improvements are achieved with little parameter and computation overhead.

Abstract:
Learning-based single image super-resolution (SISR) for real-world images has been an active yet challenging research topic, due to the lack of paired low-resolution (LR) and high-resolution (HR) training images. Most of the existing unsupervised real-world SISR methods adopt a two-stage training strategy by synthesizing realistic LR images from their HR counterparts first, then training the super-resolution (SR) models in a supervised manner. However, the image degradation and SR models in this strategy are trained separately, ignoring the inherent mutual dependency between downscaling and its inverse upscaling process. Additionally, the ill-posed nature of image degradation is not fully considered. In this paper, we propose an image downscaling and SR model dubbed SDFlow, which simultaneously learns a bidirectional many-to-many mapping between real-world LR and HR images in an unsupervised manner. The main idea of SDFlow is to decouple image content and degradation information in the latent space, where the content information distributions of LR and HR images are matched in a common latent space. Degradation information of the LR images and the high-frequency information of the HR images are fitted to an easy-to-sample conditional distribution. Experimental results on real-world image SR datasets indicate that SDFlow can generate diverse realistic LR and SR images both quantitatively and qualitatively.

Abstract:
We propose a novel method called SHS-Net for point cloud normal estimation by learning signed hyper surfaces, which can accurately predict normals with globally consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation and normal orientation, and each step is implemented by a separate algorithm. However, previous methods are sensitive to parameter settings, resulting in poor results on point clouds with noise, density variations and complex geometries. In this work, we introduce signed hyper surfaces (SHS), which are parameterized by multi-layer perceptron (MLP) layers, to learn to estimate oriented normals from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where the local and global information is aggregated. Specifically, we introduce a patch encoding module and a shape encoding module to encode a 3D point cloud into a local latent code and a global latent code, respectively. Then, an attention-weighted normal prediction module is proposed as a decoder, which takes the local and global latent codes as input to predict oriented normals. Experimental results show that our algorithm outperforms the state-of-the-art methods in both unoriented and oriented normal estimation.

Abstract:
Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering is becoming an urgent issue to be addressed. We have revealed that the tampered region exhibits homogenous differences (the changes in metadata organization form and organization structure of the image) from the real region after manipulations such as splicing, copy-move, and removal. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. The HDF-Net is composed of RGB and SRM dual-stream networks, including three complementary modules, namely the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of homogeny difference features extracted by each module and preserve the specifics of tampering artifacts. These modules are gradually merged according to the strategy of “coarse-fine-finer”, which significantly improves the localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness.

Abstract:
Data distribution gaps often pose significant challenges to the use of deep segmentation models. However, retraining models for each distribution is expensive and time-consuming. In clinical contexts, device-embedded algorithms and networks, which typically cannot be retrained or accessed after manufacture, exacerbate this issue. Generative translation methods offer a solution to mitigate the gap by transferring data across domains. However, existing methods mainly focus on intensity distributions while ignoring the gaps due to structure disparities. In this paper, we formulate a new image-to-image translation task to reduce structural gaps. We propose a simple yet powerful Structure-Unbiased Adversarial (SUA) network which accounts for both intensity and structural differences between the training and test sets for segmentation. It consists of a spatial transformation block followed by an intensity distribution rendering module. The spatial transformation block is proposed to reduce the structural gaps between the two images. The intensity distribution rendering module then renders the deformed structure to an image with the target intensity distribution. Experimental results show that the proposed SUA method can transfer both intensity distribution and structural content between multiple pairs of datasets and is superior to prior art in closing the gaps for improving segmentation.

Abstract:
Transformers achieve great performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., handling novel combinations of known concepts, are unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve better or similar systematic generalization performance compared with conventional Transformers, even though NMNs’ modules are CNN-based. In order to address this shortcoming of Transformers with respect to NMNs, in this paper we investigate whether and how modularity can bring benefits to Transformers. Namely, we introduce Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers for novel compositions of sub-tasks. We show that not only the module composition but also the module specialization for each sub-task is key to this performance gain.

Abstract:
Causal partitioning is an effective approach for causal discovery based on the divide-and-conquer strategy. Up to now, various heuristic methods based on conditional independence (CI) tests have been proposed for causal partitioning. However, most of these methods fail to achieve satisfactory partitioning without violating $d$-separation, leading to poor inference performance. In this work, we transform causal partitioning into an alternative problem that can be more easily solved. Concretely, we first construct a superstructure $G$ of the true causal graph $G_{\mathcal{T}}$ by performing a set of low-order CI tests on the observed data $D$. Then, we leverage point-line duality to obtain a graph $G_{\mathcal{A}}$ adjoint to $G$. We show that the solution of minimizing edge-cut ratio on $G_{\mathcal{A}}$ can lead to a valid causal partitioning with smaller causal-cut ratio on $G$ and without violating $d$-separation. We design an efficient algorithm to solve this problem. Extensive experiments show that the proposed method can achieve significantly better causal partitioning without violating $d$-separation than the existing methods.

Abstract:
Dynamic graphs arise in various real-world applications, and it is often desirable to model their dynamics in the continuous time domain for flexibility. This paper aims to design an easy-to-use pipeline (EasyDGL, a name that also reflects its implementation with the DGL toolkit) composed of three modules with both strong fitting ability and interpretability, namely encoding, training and interpreting: i) a temporal point process (TPP) modulated attention architecture to endow the continuous-time resolution with the coupled spatiotemporal dynamics of the graph with edge-addition events; ii) a principled loss composed of task-agnostic TPP posterior maximization based on observed events, and a task-aware loss with a masking strategy over the dynamic graph, where the tasks include dynamic link prediction, dynamic node classification and node traffic forecasting; iii) interpretation of the outputs (e.g., representations and predictions) with scalable perturbation-based quantitative analysis in the graph Fourier domain, which could comprehensively reflect the behavior of the learned model. Empirical results on public benchmarks show our superior performance for time-conditioned predictive tasks, and in particular EasyDGL can effectively quantify the predictive power of frequency content that a model learns from evolving graph data.

Abstract:
Spectral clustering has been attracting increasing attention due to its well-defined framework and excellent performance. However, most traditional spectral clustering methods consist of two separate steps: 1) solving a relaxed optimization problem to learn the continuous clustering labels, and 2) rounding the continuous clustering labels into discrete ones. This relax-and-discretize strategy inevitably results in information loss and unsatisfactory clustering performance. Moreover, the similarity matrix constructed from the original data may not be optimal for clustering, since data usually contain noise and redundancy. To address these problems, we propose a novel and effective algorithm to directly optimize the original spectral clustering model, called Direct Spectral Clustering (DSC). We theoretically prove that the original spectral clustering model can be solved by simultaneously learning a weighted discrete indicator matrix and a structured similarity matrix whose number of connected components equals the number of clusters. Both of them can be used to directly obtain the final clustering results without any post-processing. Further, an effective iterative optimization algorithm is developed to solve the proposed model. Extensive experiments performed on synthetic and real-world datasets demonstrate the superiority and effectiveness of the proposed method compared to the state-of-the-art algorithms.
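
For orientation, the conventional two-step pipeline that this abstract criticizes can be sketched as below. This is a minimal NumPy illustration of the relax-then-round baseline for two clusters, not of DSC itself; the use of the unnormalized Laplacian and of sign thresholding as the rounding step are simplifying assumptions (k-means is a common alternative for rounding), and the function name is hypothetical.

```python
import numpy as np

def two_step_spectral_bipartition(W):
    """Relax-then-round spectral clustering for two clusters.

    W: (n, n) symmetric non-negative similarity matrix.
    Returns the continuous relaxed labels and the rounded discrete labels.
    """
    degrees = W.sum(axis=1)
    laplacian = np.diag(degrees) - W          # unnormalized graph Laplacian
    # Step 1 (relaxation): eigenvectors of the Laplacian, ascending eigenvalues.
    _, eigvecs = np.linalg.eigh(laplacian)
    relaxed = eigvecs[:, 1]                   # second-smallest eigenvector
    # Step 2 (rounding): threshold the continuous labels at zero.
    labels = (relaxed > 0).astype(int)
    return relaxed, labels

# Toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1
relaxed, labels = two_step_spectral_bipartition(W)
```

The continuous vector `relaxed` carries more information than the final `labels`; the rounding step discards it, which is the information loss the abstract refers to.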

Abstract:
Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst the tools is persistent homology, which summarizes birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enables the use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistent homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies.
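
As a concrete reminder of the functional representation mentioned above, the k-th persistence landscape at t is the k-th largest value of max(0, min(t - b, d - t)) over the (birth, death) pairs (b, d) of a diagram. The sketch below evaluates that standard definition with NumPy; it does not cover the elastic alignment proposed in the paper, and the function name and toy diagram are assumptions made for illustration.

```python
import numpy as np

def persistence_landscape(diagram, k, ts):
    """Evaluate the k-th persistence landscape (k = 1, 2, ...) on a grid ts.

    diagram: sequence of (birth, death) pairs
    ts:      1D array of evaluation points
    """
    diagram = np.asarray(diagram, dtype=float)
    ts = np.asarray(ts, dtype=float)
    births = diagram[:, 0][:, None]
    deaths = diagram[:, 1][:, None]
    # One "tent" function per diagram point, clipped at zero.
    tents = np.maximum(0.0, np.minimum(ts[None, :] - births, deaths - ts[None, :]))
    if k > tents.shape[0]:
        return np.zeros_like(ts)
    # Sort tent values in descending order at every t and take the k-th.
    tents_desc = -np.sort(-tents, axis=0)
    return tents_desc[k - 1]

ts = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
diagram = [(0.0, 4.0), (1.0, 3.0)]
landscape_1 = persistence_landscape(diagram, 1, ts)
landscape_2 = persistence_landscape(diagram, 2, ts)
```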

Abstract:
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features and things/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from pixel-region and part-whole perspectives. It also decouples the errors for part segmentation and panoptic segmentation. Third, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole interaction scheme using masked cross attention to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS.

Abstract:
Reinforcement Learning (RL) has achieved tremendous success in many complex decision-making tasks. However, safety concerns arise when deploying RL in real-world applications such as autonomous driving and robotics, leading to a growing demand for safe RL algorithms. While safe control has a long history, the study of safe RL algorithms is still in the early stages. To establish a good foundation for future safe RL research, in this paper, we provide a review of safe RL from the perspectives of methods, theories, and applications. First, we review the progress of safe RL from five dimensions and come up with five crucial problems for safe RL being deployed in real-world applications, termed “2H3W”. Second, we analyze the algorithm and theory progress from the perspective of answering the “2H3W” problems. In particular, the sample complexity of safe RL algorithms is reviewed and discussed, followed by an introduction to the applications and benchmarks of safe RL algorithms. Finally, we open the discussion of the challenging problems in safe RL, hoping to inspire future research on this thread. To advance the study of safe RL algorithms, we release an open-sourced repository containing major safe RL algorithms at the link.

Abstract:
While Graph Neural Networks (GNNs) have achieved enormous success in multiple graph analytical tasks, modern variants mostly rely on the strong inductive bias of homophily. However, real-world networks typically exhibit both homophilic and heterophilic linking patterns, wherein adjacent nodes may have dissimilar attributes and distinct labels. Therefore, GNNs that smooth node proximity holistically may aggregate both task-relevant and irrelevant (even harmful) information, limiting their ability to generalize to heterophilic graphs and potentially causing non-robustness. In this work, we propose a novel Edge Splitting GNN (ES-GNN) framework to adaptively distinguish between graph edges either relevant or irrelevant to learning tasks. This essentially transforms the original graph dynamically into two subgraphs with the same node set but complementary edge sets. Given that, information propagation separately on these subgraphs and edge splitting are alternately conducted, thus disentangling the task-relevant and irrelevant features. Theoretically, we show that our ES-GNN can be regarded as a solution to a disentangled graph denoising problem, which further illustrates our motivations and interprets the improved generalization beyond homophily. Extensive experiments over 11 benchmark and 1 synthetic datasets not only demonstrate the effective performance of ES-GNN but also highlight its robustness to adversarial graphs and mitigation of the over-smoothing problem.

Abstract:
This paper proposes an end-to-end deep learning approach for removing defocus blur from a single defocused image. Defocus blur is a common issue in digital photography that poses a challenge due to its spatially-varying and large blurring effect. The proposed approach addresses this challenge by employing a pixel-wise Gaussian kernel mixture (GKM) model to accurately yet compactly parameterize spatially-varying defocus point spread functions (PSFs), which is motivated by the isotropy in defocus PSFs. We further propose a grouped GKM (GGKM) model that decouples the coefficients in GKM, so as to improve the modeling accuracy in an economical manner. A deep neural network called GGKMNet is then developed by unrolling a fixed-point iteration process of GGKM-based image deblurring, which avoids the efficiency issues in existing unrolling DNNs. Using a lightweight scale-recurrent architecture with a coarse-to-fine estimation scheme to predict the coefficients in GGKM, the GGKMNet can efficiently recover an all-in-focus image from a defocused one. Such advantages are demonstrated with extensive experiments on five benchmark datasets, where the GGKMNet outperforms existing defocus deblurring methods in restoration quality, while also showing advantages in terms of model complexity and computational efficiency.

Abstract:
Policy diversity, encompassing the variety of policies an agent can adopt, enhances reinforcement learning (RL) success by fostering more robust, adaptable, and innovative problem-solving in the environment. The environment in which standard RL operates is usually modeled with a Markov Decision Process (MDP) as the theoretical foundation. However, in many real-world scenarios, the rewards depend on an agent's history of states and actions leading to a non-MDP. Under the premise of policy diffusion initialization, non-MDPs may have unstructured expanding solution space due to varying historical information and temporal dependencies. This results in solutions having non-equivalent closed forms in non-MDPs. In this paper, deriving diverse solutions for non-MDPs requires policies to break through the boundaries of the current solution space through gradual dispersion. The goal is to expand the solution space, thereby obtaining more diverse policies. Specifically, we first model the sequences of states and actions by a transformer-based method to learn policy embeddings for dispersion in the solution space, since the transformer has advantages in handling sequential data and capturing long-range dependencies for non-MDP. Then, we stack the policy embeddings to construct a dispersion matrix as the policy diversity measure to induce the policy dispersion in the solution space and obtain a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression for the original policy embedding distribution. Experimental results of both non-MDP and MDP environments show that this dispersion scheme can obtain more expressive diverse policies via expanding the solution space, showing more robust performance than the recent learning baselines.

Abstract:
Feature attribution explains Artificial Intelligence (AI) at the instance level by providing importance scores of input features’ contributions to model prediction. Integrated Gradients (IG) is a prominent path attribution method for deep neural networks, involving the integration of gradients along a path from the explained input (explicand) to a counterfactual instance (baseline). Current IG variants primarily focus on the gradient of the explicand's output. However, our research indicates that the gradient of the counterfactual output significantly affects feature attribution as well. To account for this, we propose Iterative Gradient path Integrated Gradients (IG2), which considers both gradients. IG2 incorporates the counterfactual gradient iteratively into the integration path, generating a novel path (GradPath) and a novel baseline (GradCF). These two novel IG components effectively address the issues of attribution noise and arbitrary baseline choice in earlier IG methods. IG2, as a path method, satisfies many desirable axioms, which are theoretically justified in the paper. Experimental results on an XAI benchmark, ImageNet, MNIST, TREC question answering, wafer-map failure patterns, and CelebA face attributes validate that IG2 delivers superior feature attributions compared to the state-of-the-art techniques.
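
To make the path-integration idea concrete, the sketch below implements plain Integrated Gradients along the straight line from baseline to explicand, approximated with a midpoint Riemann sum. It illustrates the standard IG that the abstract builds on, not IG2, GradPath, or GradCF; the toy model `f`, its hand-written gradient, and all function names are assumptions made for the example.

```python
import numpy as np

def f(x):
    """Toy differentiable model (hypothetical): f(x) = x0^2 + 3*x0*x1."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    """Analytic gradient of the toy model."""
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Straight-line IG: (x - baseline) * average gradient along the path."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        total += grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0])
baseline = np.zeros(2)
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, which is the completeness axiom that path methods satisfy; here the midpoint rule is exact because the gradient is linear along the path.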

Abstract:
Curriculum reinforcement learning (CRL) allows solving complex tasks by generating a tailored sequence of learning tasks, starting from easy ones and subsequently increasing their difficulty. Although the potential of curricula in RL has been clearly shown in various works, it is less clear how to generate them for a given learning environment, resulting in various methods aiming to automate this task. In this work, we focus on framing curricula as interpolations between task distributions, which has previously been shown to be a viable approach to CRL. Identifying key issues of existing methods, we frame the generation of a curriculum as a constrained optimal transport problem between task distributions. Benchmarks show that this way of curriculum generation can improve upon existing CRL methods, yielding high performance in various tasks with different characteristics.

Abstract:
In the open world, various label sets and domain configurations give rise to a variety of Domain Adaptation (DA) setups, including closed-set, partial-set, open-set, and universal DA, as well as multi-source and multi-target DA. It is notable that existing DA methods are generally designed only for a specific setup, and may under-perform in setups they are not tailored to. This paper shifts the common paradigm of DA to Versatile Domain Adaptation (VDA), where one method can handle several different DA setups without any modification. Towards this goal, we first delve into a general inductive bias: class confusion, and then uncover that reducing such pairwise class confusion leads to significant transfer gains. With this insight, we propose one general class confusion loss (CC-Loss) to learn many setups. We estimate class confusion based only on classifier predictions and minimize the class confusion to enable accurate target predictions. Further, we improve the loss by enforcing the consistency of confusion matrices under different data augmentations to encourage its invariance to distribution perturbations. Experiments on 2D vision and 3D vision benchmarks show that the CC-Loss performs competitively in different mainstream DA setups.
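
One way to picture a confusion measure computed only from classifier predictions is sketched below. This is a simplified NumPy illustration under stated assumptions, not the paper's CC-Loss: the class correlation is taken as the product of the prediction matrix with itself, rows are normalized, and the off-diagonal mass is averaged over classes. Any reweighting or augmentation-consistency term used in the paper is omitted, and the function name is hypothetical.

```python
import numpy as np

def class_confusion(probs, eps=1e-12):
    """Average off-diagonal mass of a row-normalized class correlation matrix.

    probs: (batch, num_classes) softmax predictions on target-domain samples.
    """
    num_classes = probs.shape[1]
    # Entry (j, k) is large when classes j and k are predicted together.
    correlation = probs.T @ probs
    correlation = correlation / (correlation.sum(axis=1, keepdims=True) + eps)
    off_diagonal = correlation.sum() - np.trace(correlation)
    return off_diagonal / num_classes

confident = np.eye(3)                   # each sample assigned to a single class
uncertain = np.full((3, 3), 1.0 / 3.0)  # every sample spread over all classes
loss_confident = class_confusion(confident)
loss_uncertain = class_confusion(uncertain)
```

Confident, mutually exclusive predictions give zero confusion, while uniform predictions give the maximum value of (C - 1) / C for C classes, so minimizing the quantity pushes predictions toward unambiguous classes.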

Abstract:
Explainable AI aims to overcome the black-box nature of complex ML models like neural networks by generating explanations for their predictions. Explanations often take the form of a heatmap identifying input features (e.g. pixels) that are relevant to the model's decision. These explanations, however, entangle the potentially multiple factors that enter into the overall complex decision strategy. We propose to disentangle explanations by extracting, at some intermediate layer of a neural network, subspaces that capture the multiple and distinct activation patterns (e.g. visual concepts) that are relevant to the prediction. To automatically extract these subspaces, we propose two new analyses, extending principles found in PCA or ICA to explanations. These novel analyses, which we call principal relevant component analysis (PRCA) and disentangled relevant subspace analysis (DRSA), maximize relevance instead of e.g. variance or kurtosis. This allows for a much stronger focus of the analysis on what the ML model actually uses for predicting, ignoring activations or concepts to which the model is invariant. Our approach is general enough to work alongside common attribution techniques such as Shapley Value, Integrated Gradients, or LRP. Our proposed methods prove to be practically useful and compare favorably to the state of the art, as demonstrated on benchmarks and three use cases.

Abstract:
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated “detect-then-describe” pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascade pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and messy 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple-yet-effective transformer framework that decouples the decoding process of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which could be challenging for a shared set of queries to capture. To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features. Additionally, we introduce the iterative spatial refinement strategy to vote queries for faster convergence and better localization performance. We also insert additional spatial information to the caption head for more accurate descriptions. Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional “detect-then-describe” methods by a large margin.

Abstract:
Audio-visual video recognition (AVVR) integrates audio and visual cues to accurately categorize videos. While current methods using provided datasets achieve satisfactory results, they face challenges in retaining historical class knowledge when new classes appear in real-world situations. There are no dedicated methods to address this issue, prompting this paper to explore Class Incremental Audio-Visual Video Recognition (CIAVVR). CIAVVR aims to preserve historical knowledge contained in stored data and learned models to prevent catastrophic forgetting. Audio-visual data and models inherently have hierarchical structures, where the model contains both low-level and high-level semantic information, and data includes snippet-level, video-level, and distribution-level spatial information. It is crucial to fully exploit these hierarchical structures for data knowledge preservation and model knowledge preservation. However, existing image class incremental learning methods do not explicitly consider these hierarchical structures. Therefore, we introduce Hierarchical Augmentation and Distillation (HAD), which includes the Hierarchical Augmentation Module (HAM) and Hierarchical Distillation Module (HDM). These modules efficiently utilize the hierarchical structure of data and models. Specifically, HAM uses a novel augmentation strategy, segmental feature augmentation, to preserve hierarchical model knowledge. Simultaneously, HDM employs newly designed hierarchical logical distillation (video-distribution) and hierarchical correlative distillation (snippet-video) to maintain intra-sample and inter-sample hierarchical knowledge. Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) show that HAD effectively captures hierarchical information, enhancing the preservation of historical class knowledge and performance. We also provide a theoretical analysis to support the segmental feature augmentation strategy.

Abstract:
Learning from crowds refers to the setting in which the annotations of training data are obtained through crowd-sourcing services. Multiple annotators each complete their own small part of the annotations, where labeling mistakes that depend on annotators occur frequently. Modeling the label-noise generation process by the noise transition matrix is a powerful tool to tackle the label noise. In real-world crowd-sourcing scenarios, noise transition matrices are both annotator- and instance-dependent. However, due to the high complexity of annotator- and instance-dependent transition matrices (AIDTM), annotation sparsity, which means each annotator only labels a tiny part of instances, makes modeling AIDTM very challenging. Without prior knowledge, existing works simplify the problem by assuming the transition matrix is instance-independent or by using simple parametric forms, which lose modeling generality. Motivated by this, we target a more realistic problem, estimating general AIDTM in practice. Without losing modeling generality, we parameterize AIDTM with deep neural networks. To alleviate the modeling challenge, we suppose every annotator shares its noise pattern with similar annotators, and estimate AIDTM via knowledge transfer. We hence first model the mixture of noise patterns by all annotators, and then transfer this modeling to individual annotators. Furthermore, considering that the transfer from the mixture of noise patterns to individuals may cause two annotators with highly different noise generations to perturb each other, we employ the knowledge transfer between identified neighboring annotators to calibrate the modeling. Theoretical analyses are derived to demonstrate that both the knowledge transfer from global to individuals and the knowledge transfer between neighboring individuals can effectively help mitigate the challenge of modeling general AIDTM. Experiments confirm the superiority of the proposed approach on synthetic and real-world crowd-sourcing data.

Abstract:
Deep learning models dealing with image understanding in real-world settings must be able to adapt to a wide variety of tasks across different domains. Domain adaptation and class incremental learning deal with domain and task variability separately, whereas their unified solution is still an open problem. We tackle both facets of the problem together, taking into account the semantic shift within both input and label spaces. We start by formally introducing continual learning under task and domain shift. Then, we address the proposed setup by using style transfer techniques to extend knowledge across domains when learning incremental tasks and a robust distillation framework to effectively recollect task knowledge under incremental domain shift. The devised framework (LwS, Learning with Style) is able to generalize incrementally acquired task knowledge across all the domains encountered, proving to be robust against catastrophic forgetting. Extensive experimental evaluation on multiple autonomous driving datasets shows how the proposed method outperforms existing approaches, which prove to be ill-equipped to deal with continual semantic segmentation under both task and domain shift.

Abstract:
Image fusion plays a key role in a variety of multi-sensor-based vision systems, especially for enhancing visual quality and/or extracting aggregated features for perception. However, most existing methods just consider image fusion as an individual task, thus ignoring its underlying relationship with these downstream vision problems. Furthermore, designing proper fusion architectures often requires huge engineering labor. It also lacks mechanisms to improve the flexibility and generalization ability of current fusion approaches. To mitigate these issues, we establish a Task-guided, Implicit-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario. Specifically, we first propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion. Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency. In addition, a pretext meta initialization technique is introduced to leverage divergence fusion data to support fast adaptation for different kinds of image fusion tasks. Qualitative and quantitative experimental results on different categories of image fusion problems and related downstream tasks (e.g., visual enhancement and semantic understanding) substantiate the flexibility and effectiveness of our TIM.

Abstract:
Despite providing high-performance solutions for computer vision tasks, deep neural network (DNN) models have been shown to be extremely vulnerable to adversarial attacks. Current defenses mainly focus on known attacks, while adversarial robustness to unknown attacks is seriously overlooked. Besides, commonly used adaptive learning and fine-tuning techniques are unsuitable for adversarial defense since it is essentially a zero-shot problem when deployed. Thus, to tackle this challenge, we propose an attack-agnostic defense method named Meta Invariance Defense (MID). Specifically, various combinations of adversarial attacks are randomly sampled from a manually constructed Attacker Pool to constitute different defense tasks against unknown attacks, in which a student encoder is supervised by multi-consistency distillation to learn the attack-invariant features via a meta principle. The proposed MID has two merits: 1) Full distillation from pixel-, feature- and prediction-level between benign and adversarial samples facilitates the discovery of attack-invariance. 2) The model simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration. Theoretical and empirical studies on numerous benchmarks such as ImageNet verify the generalizable robustness and superiority of MID under various attacks.

Abstract:
In the literature on deep neural networks, there is considerable interest in developing activation functions that can enhance neural network performance. In recent years, there has been renewed scientific interest in proposing activation functions that can be trained throughout the learning process, as they appear to improve network performance, especially by reducing overfitting. In this paper, we propose a trainable activation function whose parameters need to be estimated. A fully Bayesian model is developed to automatically estimate from the learning data both the model weights and activation function parameters. An MCMC-based optimization scheme is developed to build the inference. The proposed method aims to solve the aforementioned problems and improve convergence time by using an efficient sampling scheme that guarantees convergence to the global maximum. The proposed scheme has been tested across diverse datasets, encompassing both classification and regression tasks, and implemented in various CNN architectures to demonstrate its versatility and effectiveness. Promising results demonstrate the usefulness of our proposed approach in improving model accuracy due to the proposed activation function and Bayesian estimation of the parameters.

Affiliations: Huzhou Institute of Zhejiang University, Huzhou, China; Alibaba Group, China; School of Computer Science and Technology, Zhejiang Normal University, Jinhua, Zhejiang, China; College of Control Science and Engineering, Zhejiang University, Hangzhou, China; Faculty of Information Technology, Monash University, Clayton, VIC, Australia; Ant Group, Hangzhou, Zhejiang, China; INTR & DSA Thrust, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; School of Computing and Information Systems, Singapore Management University, Singapore; School of Computing, University of Connecticut, Storrs, CT, USA; School of Information and Communication Technology, Griffith University, Southport, Qld, Australia

Abstract:
Self-supervised learning (SSL) has recently achieved impressive performance on various time series tasks. The most prominent advantage of SSL is that it reduces the dependence on labeled data. Based on the pre-training and fine-tuning strategy, even a small amount of labeled data can achieve high performance. Compared with many published self-supervised surveys on computer vision and natural language processing, a comprehensive survey for time series SSL is still missing. To fill this gap, we review current state-of-the-art SSL methods for time series data in this article. To this end, we first comprehensively review existing surveys related to SSL and time series, and then provide a new taxonomy of existing time series SSL methods by summarizing them from three perspectives: generative-based, contrastive-based, and adversarial-based. These methods are further divided into ten subcategories with detailed reviews and discussions about their key intuitions, main frameworks, advantages and disadvantages. To facilitate the experiments and validation of time series SSL methods, we also summarize datasets commonly used in time series forecasting, classification, anomaly detection, and clustering tasks. Finally, we present the future directions of SSL for time series analysis.

Abstract:
Contextual information plays a core role in video semantic segmentation (VSS). This paper divides contexts for VSS into two types: local temporal contexts (LTC), which define the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. LTC includes static and motional contexts, corresponding to static and moving content in neighboring frames, respectively. Previously, both static and motional contexts have been studied. However, there is no research on simultaneously learning static and motional contexts, which are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified presentation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, and CFM mines useful information from nearby frames to enhance target features. To further exploit more temporal contexts, we propose CFFM++ by additionally learning GTC from the whole video. Specifically, we uniformly sample certain frames from the video and extract global contextual prototypes by k-means. The information within those prototypes is mined by CFM to refine target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods.

Abstract:
Existing multiple kernel clustering (MKC) algorithms have two ubiquitous problems. From the theoretical perspective, most MKC algorithms lack sufficient theoretical analysis, especially the consistency of learned parameters, such as the kernel weights. From the practical perspective, the high complexity makes MKC unable to handle large-scale datasets. This paper tries to address the above two issues. We first make a consistency analysis of an influential MKC method named Simple Multiple Kernel k-Means (SimpleMKKM). Specifically, suppose that $\hat{\boldsymbol{\gamma}}_n$ are the kernel weights learned by SimpleMKKM from the training samples. We also define the expected version of SimpleMKKM and denote its solution as $\boldsymbol{\gamma}^*$. We establish an upper bound of $\Vert \hat{\boldsymbol{\gamma}}_n - \boldsymbol{\gamma}^* \Vert_\infty$ in the order of $\widetilde{\mathcal{O}}(1/\sqrt{n})$, where $n$ is the sample number. Based on this result, we also derive its excess clustering risk calculated by a standard clustering loss function. For the large-scale extension, we replace the eigen decomposition of SimpleMKKM with singular value decomposition (SVD). Consequently, the complexity can be decreased to $\mathcal{O}(n)$ such that SimpleMKKM can be implemented on large-scale datasets. We then deduce several theoretical results to verify the approximation ability of the proposed SVD-based method. The results of comprehensive experiments demonstrate the superiority of the proposed method.

Abstract:
Estimating the rigid transformation with 6 degrees of freedom based on a putative 3D correspondence set is a crucial procedure in point cloud registration. Existing correspondence identification methods usually lead to large outlier ratios (>95% is common), underscoring the significance of robust registration methods. Many researchers turn to parameter search-based strategies (e.g., Branch-and-Bound) for robust registration. Although related methods show high robustness, their efficiency is limited by the high-dimensional search space. This paper proposes a heuristics-guided parameter search strategy to accelerate the search while maintaining high robustness. We first sample some correspondences (i.e., heuristics) and then just need to sequentially search the feasible regions that make each sample an inlier. Our strategy largely reduces the search space and can guarantee accuracy with only a few inlier samples, therefore enjoying an excellent trade-off between efficiency and robustness. Since directly parameterizing the 6-dimensional nonlinear feasible region for efficient search is intractable, we construct a three-stage decomposition pipeline to reparameterize the feasible region, resulting in three lower-dimensional sub-problems that are easily solvable via our strategy. Besides reducing the searching dimension, our decomposition enables the leverage of 1-dimensional interval stabbing at all three stages for searching acceleration. Moreover, we propose a valid sampling strategy to guarantee our sampling effectiveness, and a compatibility verification setup to further accelerate our search. Extensive experiments on both simulated and real-world datasets demonstrate that our approach exhibits comparable robustness with state-of-the-art methods while achieving a significant efficiency boost.
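
The 1-dimensional interval stabbing mentioned in the abstract above can be illustrated with a minimal sketch. It is the generic textbook sort-and-sweep version, not the authors' implementation: given closed intervals on a line, it returns a point covered by the largest number of them. The function name `max_stabbing` is hypothetical. In a registration setting, each interval would presumably be the range of a scalar parameter for which one correspondence counts as an inlier, which is an assumption made here for illustration.

```python
from typing import List, Tuple


def max_stabbing(intervals: List[Tuple[float, float]]) -> Tuple[float, int]:
    """Return (point, count): a point lying in the most closed intervals."""
    events = []
    for lo, hi in intervals:
        if lo > hi:
            raise ValueError("interval lower bound exceeds upper bound")
        events.append((lo, 0))  # 0 = opening; sorts before a closing at the same x
        events.append((hi, 1))  # 1 = closing
    events.sort()

    best_point, best_count, active = float("nan"), 0, 0
    for x, kind in events:
        if kind == 0:
            active += 1
            if active > best_count:
                best_point, best_count = x, active
        else:
            active -= 1
    return best_point, best_count


# Four hypothetical feasible intervals; three of them overlap around 1.5.
point, count = max_stabbing([(0.0, 2.0), (1.0, 3.0), (2.5, 4.0), (1.5, 2.2)])
```

The sweep costs O(m log m) for m intervals because of the sort, which is what makes a reduction to 1-dimensional sub-problems attractive compared with searching a high-dimensional space directly.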

Abstract:
The intellectual property of deep networks can be easily “stolen” by surrogate model attack. There has been significant progress in protecting the model IP in classification tasks. However, little attention has been devoted to the protection of image processing models. By utilizing consistent invisible spatial watermarks, the work (Zhang et al. 2020) first considered model watermarking for deep image processing networks and demonstrated its efficacy in many downstream tasks. Its success depends on the hypothesis that if a consistent watermark exists in all prediction outputs, that watermark will be learned into the attacker's surrogate model. However, when the attacker uses common data augmentation attacks (e.g., rotate, crop, and resize) during surrogate model training, it will fail because the underlying watermark consistency is destroyed. To mitigate this issue, we propose a new watermarking methodology, “structure consistency”, based on which a new deep structure-aligned model watermarking algorithm is designed. Specifically, the embedded watermarks are designed to be aligned with physically consistent image structures, such as edges or semantic regions. Experiments demonstrate that our method is more robust than the baseline in resisting data augmentation attacks. Besides that, we test the generalization ability and robustness of our method to a broader range of adaptive attacks.

Abstract:
We propose the gradient-weighted Object Detector Activation Maps (ODAM), a visual explanation technique for interpreting the predictions of object detectors. Utilizing the gradients of detector targets flowing into the intermediate feature maps, ODAM produces heat maps that show the influence of regions on the detector's decision for each predicted attribute. Compared to previous works on classification activation maps (CAM), ODAM generates instance-specific explanations rather than class-specific ones. We show that ODAM is applicable to one-stage, two-stage, and transformer-based detectors with different types of detector backbones and heads, and produces higher-quality visual explanations than the state-of-the-art in terms of both effectiveness and efficiency. We discuss two explanation tasks for object detection: 1) object specification: what is the important region for the prediction? 2) object discrimination: which object is detected? Aiming at these two aspects, we present a detailed analysis of the visual explanations of detectors and carry out extensive experiments to validate the effectiveness of the proposed ODAM. Furthermore, we investigate user trust in the explanation maps, how well the visual explanations of object detectors agree with human explanations, as measured through human eye gaze, and whether this agreement is related to user trust. Finally, we also propose two applications, ODAM-KD and ODAM-NMS, based on these two abilities of ODAM. ODAM-KD utilizes the object specification of ODAM to generate top-down attention for key predictions and instruct the knowledge distillation of object detection. ODAM-NMS considers the location of the model's explanation for each prediction to distinguish the duplicate detected objects. A training scheme, ODAM-Train, is proposed to improve the quality of object discrimination and help with ODAM-NMS.

Abstract:
We propose a real-time convolutional neural network (CNN) training and compression method for delivering high-quality live video even in a poor network environment. The server delivers a low-resolution video segment along with the corresponding CNN for super resolution (SR), after which the client applies the CNN to the segment in order to recover high-resolution video frames. To generate a trained CNN corresponding to a video segment in real-time, our method rapidly increases the training accuracy by promoting the overfitting property of the CNN while also using curriculum-based training. In addition, assuming that the pretrained CNN is already downloaded on the client side, we transfer only residual values between the updated and pretrained CNN parameters. These values can be quantized with low bits in real time while minimizing the amount of loss, as the distribution range is significantly narrower than that of the updated CNN. Quantitatively, our neural-enhanced adaptive live streaming pipeline (NEALS) achieves higher SR accuracy and a lower CNN compression loss rate within a constrained training time compared to the state-of-the-art CNN training and compression method. NEALS achieves 15 to 48% higher quality of the user experience compared to state-of-the-art neural-enhanced live streaming systems.
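
The residual-transfer idea in the abstract above can be illustrated with a small sketch: with a fixed bit budget, a uniform quantizer's step is proportional to the value range, so quantizing the narrow residuals between updated and pretrained parameters loses less than quantizing the updated parameters directly. This is a generic uniform quantizer under stated assumptions, not the NEALS codec; the function names and the sample weight values are hypothetical.

```python
from typing import List


def quantize(values: List[float], bits: int) -> List[float]:
    """Uniformly quantize to 2**bits levels over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def max_abs_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


pretrained = [-0.80, -0.10, 0.35, 0.90]  # hypothetical weights already on the client
updated = [-0.78, -0.13, 0.36, 0.93]     # hypothetical weights after per-segment training
bits = 4

# Option A: quantize the updated weights directly.
direct = quantize(updated, bits)

# Option B: quantize only the residuals; the client adds them to its pretrained copy.
residual = [u - p for u, p in zip(updated, pretrained)]
restored = [p + r for p, r in zip(pretrained, quantize(residual, bits))]

err_direct = max_abs_error(updated, direct)
err_residual = max_abs_error(updated, restored)
```

With these sample values the residual route reconstructs the updated weights with a smaller worst-case error than direct quantization at the same bit width, because the residual range is far narrower than the weight range.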

Abstract:
This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.

Abstract:
HD map reconstruction is crucial for autonomous driving. LiDAR-based methods are limited due to expensive sensors and time-consuming computation. Camera-based methods usually need to perform road segmentation and view transformation separately, which often causes distortion and missing content. To push the limits of the technology, we present a novel framework that reconstructs a local map formed by road layout and vehicle occupancy in the bird’s-eye view given a front-view monocular image only. We propose a front-to-top view projection (FTVP) module, which takes the constraint of cycle consistency between views into account and makes full use of their correlation to strengthen the view transformation and scene understanding. In addition, we apply multi-scale FTVP modules to propagate the rich spatial information of low-level features to mitigate spatial deviation of the predicted object location. Experiments on public benchmarks show that our method handles road layout estimation, vehicle occupancy estimation, and multi-class semantic estimation at a performance level comparable to the state of the art, while maintaining superior efficiency.

Abstract:
We introduce a novel approach to learn geometries such as depth and surface normal from images while incorporating geometric context. The difficulty of reliably capturing geometric context in existing methods impedes their ability to accurately enforce the consistency between the different geometric properties, thereby leading to a bottleneck of geometric estimation quality. We therefore propose the Adaptive Surface Normal (ASN) constraint, a simple yet efficient method. Our approach extracts geometric context that encodes the geometric variations present in the input image and correlates depth estimation with geometric constraints. By dynamically determining reliable local geometry from randomly sampled candidates, we establish a surface normal constraint, where the validity of these candidates is evaluated using the geometric context. Furthermore, our normal estimation leverages the geometric context to prioritize regions that exhibit significant geometric variations, which makes the predicted normals accurately capture intricate and detailed geometric information. Through the integration of geometric context, our method unifies depth and surface normal estimations within a cohesive framework, which enables the generation of high-quality 3D geometry from images. We validate the superiority of our approach over state-of-the-art methods through extensive evaluations and comparisons on diverse indoor and outdoor datasets, showcasing its efficiency and robustness.

Abstract:
Creating novel views from a single image has made tremendous strides with advanced autoregressive models, as unseen regions have to be inferred from the visible scene contents. Although recent methods generate high-quality novel views, synthesizing with only one explicit or implicit 3D geometry has a trade-off between two objectives that we call the “seesaw” problem: 1) preserving reprojected contents and 2) completing realistic out-of-view regions. Also, autoregressive models require a considerable computational cost. In this paper, we propose a single-image view synthesis framework for mitigating the seesaw problem while utilizing an efficient non-autoregressive model. Motivated by the characteristics that explicit methods well preserve reprojected pixels and implicit methods complete realistic out-of-view regions, we introduce a loss function to complement two renderers. Our loss function promotes that explicit features improve the reprojected area of implicit features and implicit features improve the out-of-view area of explicit features. With the proposed architecture and loss function, we can alleviate the seesaw problem, outperforming autoregressive-based state-of-the-art methods and generating an image ≈100 times faster. We validate the efficiency and effectiveness of our method with experiments on RealEstate10K and ACID datasets.

Abstract:
To cope with real-world dynamics, an intelligent system needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance drop of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of promising directions, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.

Abstract:
Source-free domain adaptation (SFDA) shows the potential to improve the generalizability of deep learning-based face anti-spoofing (FAS) while preserving the privacy and security of sensitive human faces. However, existing SFDA methods are significantly degraded without accessing source data due to the inability to mitigate domain and identity bias in FAS. In this paper, we propose a novel Source-free Domain Adaptation framework for FAS (SDA-FAS) that systematically addresses the challenges of source model pre-training, source knowledge adaptation, and target data exploration under the source-free setting. Specifically, we develop a generalized method for source model pre-training that leverages a causality-inspired PatchMix data augmentation to diminish domain bias and designs the patch-wise contrastive loss to alleviate identity bias. For source knowledge adaptation, we propose a contrastive domain alignment module to align conditional distribution across domains with a theoretical equivalence to adaptation based on source data. Furthermore, target data exploration is achieved via self-supervised learning with patch shuffle augmentation to identify unseen attack types, which are ignored in existing SFDA methods. To the best of our knowledge, this paper provides the first full-stack privacy-preserving framework to address the generalization problem in FAS. Extensive experiments on nineteen cross-dataset scenarios show our framework considerably outperforms state-of-the-art methods.

Abstract:
Estimating and synthesizing the hand's manipulation of objects is central to understanding human behaviour. To accurately model the interaction between the hand and object (referred to as the “hand-object”), we must not only focus on the pose of the hand and object, but also consider the contact between them. This contact provides valuable information for generating semantically and physically plausible grasps. In this paper, we propose an explicit contact representation called Contact Potential Field (CPF). In CPF, we model the contact between a pair of hand-object vertices as a spring-mass system. This system encodes the distance of the pair, as well as a likelihood of that contact being stable. Therefore, the system of multiple extended and compressed springs forms an elastic potential field with minimal energy at the optimal grasp position. We apply CPF to two relevant tasks, namely, hand-object pose estimation and grasping pose generation. Extensive experiments on the two challenging tasks and three commonly used datasets have demonstrated that our method can achieve state-of-the-art in several reconstruction metrics, allowing us to produce more physically plausible hand-object poses even when the ground-truth exhibits severe interpenetration or disjointedness.
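
The spring analogy in the abstract above can be made concrete with a generic Hooke's-law sketch: each hand-object vertex pair contributes an elastic energy that grows with the square of its stretch, and the total is lowest when the paired vertices meet. This is only an illustration of an elastic potential, not the CPF formulation itself; the function name `elastic_energy`, the per-pair stiffness (standing in for a contact likelihood), and the `rest_length` parameter are assumptions.

```python
import math
from typing import Iterable, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float], float]  # (hand vertex, object vertex, stiffness)


def elastic_energy(pairs: Iterable[Pair], rest_length: float = 0.0) -> float:
    """Sum of 0.5 * k * (distance - rest_length)**2 over all vertex pairs."""
    total = 0.0
    for hand_v, obj_v, stiffness in pairs:
        stretch = math.dist(hand_v, obj_v) - rest_length
        total += 0.5 * stiffness * stretch ** 2
    return total


# Hypothetical pairs: the first spring is stretched by 2 cm, the second is at rest.
far = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.02), 1.0),
       ((0.01, 0.0, 0.0), (0.01, 0.0, 0.0), 0.5)]
# Same pairs after moving the first hand vertex 1 cm closer to the object.
near = [((0.0, 0.0, 0.01), (0.0, 0.0, 0.02), 1.0),
        ((0.01, 0.0, 0.0), (0.01, 0.0, 0.0), 0.5)]

energy_far = elastic_energy(far)
energy_near = elastic_energy(near)
```

Minimizing such an energy over the hand pose pulls likely contact pairs together, which is the intuition behind treating the optimal grasp as the minimum of a potential field.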

Abstract:
Neural radiance fields (NeRF) achieve highly photo-realistic novel-view synthesis, but it's a challenging problem to edit the scenes modeled by NeRF-based methods, especially for dynamic scenes. We propose editable neural radiance fields that enable end-users to easily edit dynamic scenes and support topological changes. Input with an image sequence from a single camera, our network is trained automatically and models topologically varying dynamics using our picked-out surface key points. Then end-users can edit the scene by easily dragging the key points to desired new positions. To achieve this, we propose a scene analysis method to detect and initialize key points by considering the dynamics in the scene, and a weighted key points strategy to model topologically varying dynamics by joint key points and weights optimization. Our method supports intuitive multi-dimensional (up to 3D) editing and can generate novel scenes that are unseen in the input sequence. Experiments demonstrate that our method achieves high-quality editing on various dynamic scenes and outperforms the state-of-the-art.

Abstract:
To cost-effectively transmit high-quality dynamic 3D human images in immersive multimedia applications, efficient data compression is crucial. Unlike existing methods that focus on reducing signal-level reconstruction errors, we propose the first dynamic 3D human compression framework based on human priors. The layered coding architecture significantly enhances the perceptual quality while also supporting a variety of downstream tasks, including visual analysis and content editing. Specifically, a high-fidelity pose-driven Avatar is generated from the original frames as the basic structure layer to implicitly represent the human shape. Then, human movements between frames are parameterized via a commonly-used human prior model, i.e., the Skinned Multi-Person Linear Model (SMPL), to form the motion layer and drive the Avatar. Furthermore, the normals are also introduced as an enhancement layer to preserve fine-grained geometric details. Finally, the Avatar, SMPL parameters, and normal maps are efficiently compressed into layered semantic bitstreams. Extensive qualitative and quantitative experiments show that the proposed framework remarkably outperforms other state-of-the-art 3D codecs in terms of subjective quality with only a few bits. More notably, as the size or frame number of the 3D human sequence increases, the superiority of our framework in perceptual quality becomes more significant while saving more bitrates.

Abstract:
Image reconstruction from incomplete measurements is one basic task in imaging. While supervised deep learning has emerged as a powerful tool for image reconstruction in recent years, its applicability is limited by its prerequisite on a large number of latent images for model training. To extend the application of deep learning to the imaging tasks where acquisition of latent images is challenging, this article proposes an unsupervised deep learning method that trains a deep model for image reconstruction with the access limited to measurement data. We develop a Siamese network whose twin sub-networks perform reconstruction cooperatively on a pair of complementary spaces: the null space of the measurement matrix and the range space of its pseudo inverse. The Siamese network is trained by a self-supervised loss with three terms: a data consistency loss over available measurements in the range space, a data consistency loss between intermediate results in the null space, and a mutual consistency loss on the predictions of the twin sub-networks in the full space. The proposed method is applied to four imaging tasks from different applications, and extensive experiments have shown its advantages over existing unsupervised solutions.
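
The pair of complementary spaces named in the abstract above follows from the identity x = A⁺Ax + (I − A⁺A)x, where A is the measurement matrix and A⁺ its pseudo-inverse. The sketch below illustrates this for the simplest case, assumed here for illustration only: a sampling operator that keeps a subset of entries, whose pseudo-inverse is zero-filling. The function names and sample values are hypothetical, and the Siamese network itself is not modeled.

```python
from typing import List


def measure(x: List[float], mask: List[int]) -> List[float]:
    """A: keep only the entries where mask == 1 (a sampling operator)."""
    return [v for v, m in zip(x, mask) if m]


def pinv_apply(y: List[float], mask: List[int]) -> List[float]:
    """A^+: for a sampling operator, the pseudo-inverse is zero-filling."""
    it = iter(y)
    return [next(it) if m else 0.0 for m in mask]


x = [3.0, 1.0, 4.0, 1.5, 9.0]  # hypothetical latent signal
mask = [1, 0, 1, 1, 0]         # hypothetical sampling pattern

y = measure(x, mask)                                 # available measurements
range_part = pinv_apply(y, mask)                     # A^+ A x, fixed by the measurements
null_part = [v - r for v, r in zip(x, range_part)]   # (I - A^+ A) x, invisible to A
```

The range-space component is fully determined by the measurements, whereas the null-space component produces zero measurements and therefore has to be inferred, which is why a data consistency loss alone cannot supervise it.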

Abstract:
Previous work on video captioning aims to describe video content objectively, but the resulting captions lack human interest and attractiveness, limiting their practical application scenarios. The intention of video title generation (video titling) is to produce attractive titles, but there is a lack of benchmarks. This work offers CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration dataset, to assist research and applications in video titling, video captioning, and video retrieval in Chinese. CREATE comprises a high-quality labeled 210K dataset and two web-scale 3M and 10M pre-training datasets, covering 51 categories, 50K+ tags, 537K+ manually annotated titles and captions, and 10M+ short videos with original video information. This work presents ACTEr, a unique Attractiveness-Consensus-based Title Evaluation, to objectively evaluate the quality of video title generation. This metric measures the semantic correlation between the candidate (model-generated title) and references (manual-labeled titles) and introduces attractive consensus weights to assess the attractiveness and relevance of the video title. Accordingly, this work proposes a novel multi-modal ALignment WIth Generation model, ALWIG, as one strong baseline to aid future model development. With the help of a tag-driven video-text alignment module and a GPT-based generation module, this model achieves video titling, captioning, and retrieval simultaneously. We believe that the release of the CREATE dataset, ACTEr metric, and ALWIG model will encourage in-depth research on the analysis and creation of Chinese short videos.

Abstract:
This paper addresses the challenge of reconstructing an animatable human model from a multi-view video. Some recent works have proposed to decompose a non-rigidly deforming scene into a canonical neural radiance field and a set of deformation fields that map observation-space points to the canonical space, thereby enabling them to learn the dynamic scene from images. However, they represent the deformation field as translational vector field or SE(3) field, which makes the optimization highly under-constrained. Moreover, these representations cannot be explicitly controlled by input motions. Instead, we introduce blend weight fields to produce the deformation fields. Based on the skeleton-driven deformation, blend weight fields are used with 3D human skeletons to generate observation-to-canonical and canonical-to-observation correspondences. Since 3D human skeletons are more observable, they can regularize the learning of deformation fields. Moreover, the blend weight fields can be combined with input skeletal motions to generate new deformation fields to animate the human model. To improve the quality of human modeling, we further represent the human geometry as a signed distance field in the canonical space. Additionally, a neural point displacement field is introduced to enhance the capability of the blend weight field on modeling detailed human motions. Experiments show that our approach significantly outperforms recent human modeling methods.
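
The skeleton-driven deformation described above follows the general pattern of linear blend skinning. The sketch below assumes per-point blend weights and per-bone 4x4 rigid transforms are already given as arrays; the function names are hypothetical, and the learned blend weight fields, radiance field, and signed distance field of the paper are not reproduced.

```python
import numpy as np

def blended_transforms(weights, transforms):
    """Blend K bone transforms (K, 4, 4) per point using weights (N, K)."""
    return np.einsum("nk,kij->nij", weights, transforms)

def _apply(mats, points):
    """Apply per-point 4x4 matrices (N, 4, 4) to points (N, 3)."""
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    return np.einsum("nij,nj->ni", mats, homo)[:, :3]

def canonical_to_observation(points, weights, transforms):
    """Map canonical-space points into observation space."""
    return _apply(blended_transforms(weights, transforms), points)

def observation_to_canonical(points, weights, transforms):
    """Inverse mapping, assuming the same weights apply at the observed points."""
    return _apply(np.linalg.inv(blended_transforms(weights, transforms)), points)
```

Feeding a different set of bone transforms into the same functions is the sense in which new skeletal motions can drive the model.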

Abstract:
Existing studies on knowledge distillation typically focus on teacher-centered methods, in which the teacher network is trained according to its own standards before transferring the learned knowledge to a student network. However, due to differences in network structure between the teacher and the student, the knowledge learned by the former may not be desired by the latter. Inspired by human educational wisdom, this paper proposes a Student-Centered Distillation (SCD) method that enables the teacher network to adjust its knowledge transfer according to the student network's needs. We implement SCD following several principles of human education: for example, the teacher network identifies and learns the knowledge desired by the student network on the validation set, and then transfers it to the student through the training set. To address the problems of current knowledge deficiency, hard sample learning, and knowledge forgetting faced by a student network during learning, we introduce and improve Proportional-Integral-Derivative (PID) algorithms from the field of automation so that they can effectively identify the knowledge currently required by the student network. Furthermore, we propose a curriculum learning-based fuzzy strategy and apply it to the proposed PID control algorithm, such that the student network in SCD can actively attend to challenging samples after acquiring certain knowledge. The overall performance of SCD is verified on multiple tasks by comparing it with state-of-the-art methods. Experimental results show that our student-centered distillation method outperforms existing teacher-centered ones.
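
For readers unfamiliar with PID control, the following is a textbook discrete PID update on a scalar error signal. It is only a sketch: the abstract does not specify how SCD defines the error or how the authors modify the controller, so the class name and gains here are hypothetical.

```python
class PID:
    """Textbook discrete PID controller (not the improved variant from the paper)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        # Proportional term reacts to the current error,
        # integral term to its accumulated history,
        # derivative term to its most recent change.
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

In a distillation setting, one plausible reading is that the error is a per-sample gap between student and target, but that mapping is an assumption rather than something the abstract states.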

Abstract:
Nowadays, pre-training big models on large-scale datasets has achieved great success and dominated many downstream tasks in natural language processing and 2D vision, while pre-training in 3D vision is still under development. In this paper, we provide a new perspective of transferring the pre-trained knowledge from the 2D domain to the 3D domain with Point-to-Pixel Prompting in data space and Pixel-to-Point distillation in feature space, exploiting shared knowledge in images and point clouds that display the same visual world. Following the principle of prompt engineering, Point-to-Pixel Prompting transforms point clouds into colorful images with geometry-preserved projection and geometry-aware coloring. The pre-trained image models can then be directly applied to point cloud tasks without structural changes or weight modifications. With projection correspondence in feature space, Pixel-to-Point distillation further regards pre-trained image models as the teacher model and distills pre-trained 2D knowledge to student point cloud models, remarkably enhancing inference efficiency and model capacity for point cloud analysis. We conduct extensive experiments in both object classification and scene segmentation under various settings to demonstrate the superiority of our method. In object classification, we reveal the important scale-up trend of Point-to-Pixel Prompting and attain 90.3% accuracy on the ScanObjectNN dataset, surpassing previous literature by a large margin. In scene-level semantic segmentation, our method outperforms traditional 3D analysis approaches and shows competitive capacity in dense prediction tasks.

Abstract:
Label-noise learning (LNL) aims to increase the model's generalization given training data with noisy labels. To facilitate practical LNL algorithms, researchers have proposed different label noise types, ranging from class-conditional to instance-dependent noises. In this paper, we introduce a novel label noise type called BadLabel, which can significantly degrade the performance of existing LNL algorithms. BadLabel is crafted based on the label-flipping attack against standard classification, where specific samples are selected and their labels are flipped to other labels so that the loss values of clean and noisy labels become indistinguishable. To address the challenge posed by BadLabel, we further propose a robust LNL method that perturbs the labels in an adversarial manner at each epoch to make the loss values of clean and noisy labels distinguishable again. Once we select a small set of (mostly) clean labeled data, we can apply the techniques of semi-supervised learning to train the model accurately. Empirically, our experimental results demonstrate that existing LNL algorithms are vulnerable to the newly introduced BadLabel noise type, while our proposed robust LNL method can effectively improve the generalization performance of the model under various types of label noise. The new dataset of noisy labels and the source codes of robust LNL algorithms are available at https://github.com/zjfheart/BadLabels.

Abstract:
Transfer learning has been widely used in different scenarios, especially in those lacking enough labeled data. However, most of the existing transfer learning methods are based on the assumption that the source and target domains should share the label space entirely or partially, which greatly limits their application scopes. In this article, a Selective Random Walk (SRW) method for transfer learning in heterogeneous label spaces is proposed to make full use of unlabeled auxiliary data, which acts as a bridge for knowledge transfer from the source domain to the target domain. The proposed SRW method can explicitly identify transfer sequences between source and target instances via auxiliary instances based on random walk techniques. Since not all of the transfer sequences generated by random walk are credible for the target task, the SRW method can learn to weight transfer sequences adaptively. Based on the weights of the transfer sequences, the SRW method leverages knowledge by forcing adjacent data points in the transfer sequence to be similar and making the target data point in the sequence represented by other data points in the same sequence. Experiments show that the SRW method outperforms state-of-the-art models in plenty of transfer learning tasks with heterogeneous label spaces constructed within and across several benchmark datasets.

Abstract:
Graph neural networks (GNNs) are among the most powerful tools in deep learning. They routinely solve complex problems on unstructured networks, such as node classification, graph classification, or link prediction, with high accuracy. However, both inference and training of GNNs are complex, and they uniquely combine the features of irregular graph processing with dense and regular computations. This complexity makes it very challenging to execute GNNs efficiently on modern massively parallel architectures. To alleviate this, we first design a taxonomy of parallelism in GNNs, considering data and model parallelism, and different forms of pipelining. Then, we use this taxonomy to investigate the amount of parallelism in numerous GNN models, GNN-driven machine learning tasks, software frameworks, or hardware accelerators. We use the work-depth model, and we also assess communication volume and synchronization. We specifically focus on the sparsity/density of the associated tensors, in order to understand how to effectively apply techniques such as vectorization. We also formally analyze GNN pipelining, and we generalize the established Message-Passing class of GNN models to cover arbitrary pipeline depths, facilitating future optimizations. Finally, we investigate different forms of asynchronicity, navigating the path for future asynchronous parallel GNN pipelines. The outcomes of our analysis are synthesized in a set of insights that help to maximize GNN performance, and a comprehensive list of challenges and opportunities for further research into efficient GNN computations. Our work will help to advance the design of future GNNs.

Abstract:
Federated learning (FL) has emerged as a powerful machine learning technique that enables the development of models from decentralized data sources. However, the decentralized nature of FL makes it vulnerable to adversarial attacks. In this survey, we provide a comprehensive overview of the impact of malicious attacks on FL by covering various aspects such as attack budget, visibility, and generalizability, among others. Previous surveys have primarily focused on the multiple types of attacks and defenses but failed to consider the impact of these attacks in terms of their budget, visibility, and generalizability. This survey aims to fill this gap by providing a comprehensive understanding of the attacks’ effect by identifying FL attacks with low budgets, low visibility, and high impact. Additionally, we address the recent advancements in the field of adversarial defenses in FL and highlight the challenges in securing FL. The contribution of this survey is threefold: first, it provides a comprehensive and up-to-date overview of the current state of FL attacks and defenses. Second, it highlights the critical importance of considering the impact, budget, and visibility of FL attacks. Finally, we provide ten case studies and potential future directions towards improving the security and privacy of FL systems.

Abstract:
We present CG-NeRF, a cascade and generalizable neural radiance fields method for view synthesis. Recent generalizable view synthesis methods can render high-quality novel views using a set of nearby input views. However, the rendering speed is still slow due to the uniform point sampling used by neural radiance fields. Existing scene-specific methods can train and render novel views efficiently but cannot generalize to unseen data. Our approach addresses the problems of fast and generalizable view synthesis by proposing two novel modules: a coarse radiance fields predictor and a convolution-based neural renderer. This architecture infers consistent scene geometry based on the implicit neural fields and renders new views efficiently using a single GPU. We first train CG-NeRF on multiple 3D scenes of the DTU dataset, and the network can produce high-quality and accurate novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations and still maintains the high-speed rendering of the pre-trained model. Experimental results show that CG-NeRF outperforms state-of-the-art generalizable neural rendering methods on various synthetic and real datasets.

Abstract:
The message-passing paradigm has served as the foundation of graph neural networks (GNNs) for years, making them achieve great success in a wide range of applications. Despite its elegance, this paradigm presents several unexpected challenges for graph-level tasks, such as the long-range problem, information bottleneck, over-squashing phenomenon, and limited expressivity. In this study, we aim to overcome these major challenges and break the conventional “node- and edge-centric” mindset in graph-level tasks. To this end, we provide an in-depth theoretical analysis of the causes of the information bottleneck from the perspective of information influence. Building on the theoretical results, we offer unique insights to break this bottleneck and suggest extracting a skeleton tree from the original graph, followed by propagating information in a distinctive manner on this tree. Drawing inspiration from natural trees, we further propose to find trunks from graph skeleton trees to create powerful graph representations and develop the corresponding framework for graph-level tasks. Extensive experiments on multiple real-world datasets demonstrate the superiority of our model. Comprehensive experimental analyses further highlight its capability of capturing long-range dependencies and alleviating the over-squashing problem, thereby providing novel insights into graph-level tasks.

Abstract:
Predominant techniques for talking head generation largely depend on 2D information, including facial appearances and motions from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noises for generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this paper, first, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters and 3D geometry annotations in training. We further propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. Second, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer, to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that our proposed framework can generate highly realistic-looking reenacted talking videos, with new state-of-the-art performances established on these benchmarks.

Abstract:
Cross-domain generalizable depth estimation aims to estimate the depth of target domains (i.e., real-world) using models trained on the source domains (i.e., synthetic). Previous methods mainly use additional real-world domain datasets to extract depth-specific information for cross-domain generalizable depth estimation. Unfortunately, due to the large domain gap, adequate depth-specific information is hard to obtain and interference is difficult to remove, which limits the performance. To relieve these problems, we propose a domain generalizable feature extraction network with adaptive guidance fusion (AGDF-Net) to fully acquire essential features for depth estimation at multi-scale feature levels. Specifically, our AGDF-Net first separates the image into initial depth and weak-related depth components with reconstruction and contrary losses. Subsequently, an adaptive guidance fusion module is designed to sufficiently intensify the initial depth features, yielding domain-generalizable intensified depth features. Finally, taking intensified depth features as input, an arbitrary depth estimation network can be used for real-world depth estimation. Using only synthetic datasets, our AGDF-Net can be applied to various real-world datasets (i.e., KITTI, NYUDv2, NuScenes, DrivingStereo and CityScapes) with state-of-the-art performance. Furthermore, experiments with a small amount of real-world data in a semi-supervised setting also demonstrate the superiority of AGDF-Net over state-of-the-art approaches.

Abstract:
The α-tree algorithm is a useful hierarchical representation technique which facilitates comprehension of images such as remote sensing and medical images. Most α-tree algorithms make use of priority queues to process image edges in the correct order, but because traditional priority queues are inefficient in α-tree algorithms using extreme-dynamic-range pixel dissimilarities, they run slower than other related algorithms such as the component tree. In this paper, we propose a novel hierarchical heap priority queue algorithm that can process α-tree edges much more efficiently than other state-of-the-art priority queues. Experimental results using 48-bit Sentinel-2A remotely sensed images and randomly generated images show that replacing the heap priority queue with the proposed hierarchical heap priority queue speeds up the flooding α-tree algorithm by 1.68 times in 4-N and 2.41 times in 8-N on Sentinel-2A images, and by 2.56 times and 4.43 times on randomly generated images.
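
To make the role of the priority queue concrete, the sketch below builds the set of neighbouring-pixel edges for 4-N or 8-N connectivity and orders them by dissimilarity with Python's standard binary heap. This corresponds to the conventional heap baseline the abstract compares against, not the proposed hierarchical heap queue; the function name and the absolute-difference dissimilarity are assumptions for illustration.

```python
import heapq
import numpy as np

def edge_heap(image, connectivity=4):
    """Build a min-heap of (dissimilarity, pixel_a, pixel_b) over neighbouring pixels."""
    h, w = image.shape
    offsets = [(0, 1), (1, 0)]            # right and down neighbours (4-N)
    if connectivity == 8:
        offsets += [(1, 1), (1, -1)]      # add both diagonals (8-N)
    heap = []
    for r in range(h):
        for c in range(w):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    # Absolute intensity difference as a stand-in dissimilarity.
                    d = abs(int(image[r, c]) - int(image[rr, cc]))
                    heapq.heappush(heap, (d, r * w + c, rr * w + cc))
    return heap
```

Popping from the heap yields edges in order of increasing dissimilarity. With wide pixel value ranges such as 48-bit data, bucket-style queues indexed by dissimilarity become impractical, which is the setting the abstract targets.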

Abstract:
This study proposes a set of generic rules to revise existing neural networks for 3D point cloud processing into rotation-equivariant quaternion neural networks (REQNNs), so that the feature representations of the networks are rotation-equivariant and permutation-invariant. Rotation equivariance of features means that the feature computed on a rotated input point cloud is the same as the result of applying the same rotation to the feature computed on the original input point cloud. We find that the rotation equivariance of features is naturally satisfied if a neural network uses quaternion features. Interestingly, we prove that such a network revision also makes the gradients of features in the REQNN rotation-equivariant w.r.t. inputs, and the training of the REQNN rotation-invariant w.r.t. inputs. In addition, permutation invariance means that the intermediate-layer features are unchanged when the input points are reordered. We also evaluate the stability of knowledge representations of REQNNs, and the robustness of REQNNs to adversarial rotation attacks. Experiments have shown that REQNNs outperform traditional neural networks in terms of both classification accuracy and robustness on rotated testing samples.
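
The two properties defined above can be checked numerically on a toy example. The sketch below rotates points with unit quaternions and tests a hand-written feature; `toy_feature` is a hypothetical stand-in chosen because its equivariance is easy to verify, and it is not one of the REQNN layers from the paper.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def rotate(points, r):
    """Rotate (N, 3) points by unit quaternion r via r * u * conj(r)."""
    r_conj = r * np.array([1.0, -1.0, -1.0, -1.0])
    out = []
    for p in points:
        u = np.concatenate([[0.0], p])    # embed the point as a pure quaternion
        out.append(quat_mul(quat_mul(r, u), r_conj)[1:])
    return np.array(out)

def toy_feature(points):
    """Norm-weighted sum of points; the scalar weights do not change under rotation."""
    w = np.linalg.norm(points, axis=1, keepdims=True)
    return (w * points).sum(axis=0)
```

Equivariance here means `toy_feature(rotate(P, r))` equals `rotate(toy_feature(P)[None], r)[0]`, and permutation invariance means reordering the rows of `P` leaves the feature unchanged.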

Abstract:
Electrocardiography (ECG) is a non-invasive tool for predicting cardiovascular diseases (CVDs). Current ECG-based diagnosis systems show promising performance owing to the rapid development of deep learning techniques. However, the label scarcity problem, the co-occurrence of multiple CVDs and the poor performance on unseen datasets greatly hinder the widespread application of deep learning-based models. Addressing them in a unified framework remains a significant challenge. To this end, we propose a multi-label semi-supervised model (ECGMatch) to recognize multiple CVDs simultaneously with limited supervision. In the ECGMatch, an ECGAugment module is developed for weak and strong ECG data augmentation, which generates diverse samples for model training. Subsequently, a hyperparameter-efficient framework with neighbor agreement modeling and knowledge distillation is designed for pseudo-label generation and refinement, which mitigates the label scarcity problem. Finally, a label correlation alignment module is proposed to capture the co-occurrence information of different CVDs within labeled samples and propagate this information to unlabeled samples. Extensive experiments on four datasets and three protocols demonstrate the effectiveness and stability of the proposed model, especially on unseen datasets. As such, this model can pave the way for diagnostic systems that achieve robust performance on multi-label CVDs prediction with limited supervision.

Abstract:
Uncertainty quantification for inverse problems in imaging has drawn much attention lately. Existing approaches towards this task define uncertainty regions based on probable values per pixel, while ignoring spatial correlations within the image, resulting in an exaggerated volume of uncertainty. In this paper, we propose PUQ (Principal Uncertainty Quantification) – a novel definition and corresponding analysis of uncertainty regions that takes into account spatial relationships within the image, thus providing reduced volume regions. Using recent advancements in generative models, we derive uncertainty intervals around principal components of the empirical posterior distribution, forming an ambiguity region that guarantees the inclusion of true unseen values with a user-defined confidence probability. To improve computational efficiency and interpretability, we also guarantee the recovery of true unseen values using only a few principal directions, resulting in more informative uncertainty regions. Our approach is verified through experiments on image colorization, super-resolution, and inpainting; its effectiveness is shown through comparison to baseline methods, demonstrating significantly tighter uncertainty regions.
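
The idea of placing intervals along principal directions rather than per pixel can be sketched as follows. This is a simplified illustration that assumes posterior samples are already available as an array and uses plain empirical quantiles; the calibration procedure that provides the coverage guarantee in the paper is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def principal_intervals(samples, k, alpha=0.1):
    """Intervals along the top-k principal directions of posterior samples.

    samples: (M, D) array, one flattened posterior sample per row.
    Returns the sample mean, the (k, D) directions, and per-direction bounds.
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are the principal directions of the sample cloud.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:k]
    proj = centered @ axes.T                       # (M, k) coordinates
    lower = np.quantile(proj, alpha / 2, axis=0)
    upper = np.quantile(proj, 1 - alpha / 2, axis=0)
    return mean, axes, lower, upper
```

Because the directions follow the correlations in the samples, a few intervals along them can describe the spread more compactly than one interval per pixel.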

Abstract:
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we have observed that VLP models often fall short in terms of visual grounding and localization capabilities, which are crucial for many downstream tasks, such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP. In the VLP phase, PTP divides an image into N x N blocks and employs a widely-used object detector to identify objects within each block. PTP then reframes the visual grounding task as a fill-in-the-blank problem, encouraging the model to predict objects in given blocks or regress the blocks of a given object, exemplified by filling “[P]” or “[O]” in a PTP sentence such as “The block [P] has a [O].” This strategy enhances the visual grounding capabilities of VLP models, enabling them to better tackle various downstream tasks. Additionally, we integrate the second-order relationships between objects to further enhance the visual grounding capabilities of our proposed PTP paradigm. Incorporating PTP into several state-of-the-art VLP frameworks leads to consistently significant improvements across representative cross-modal learning model architectures and multiple benchmarks, such as zero-shot Flickr30K Retrieval (+5.6 in average recall@1) for the ViLT baseline, and COCO Captioning (+5.5 in CIDEr) for the state-of-the-art BLIP baseline. Furthermore, PTP attains comparable results with object-detector-based methods and a faster inference speed, as it discards its object detector during inference, unlike other approaches.
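
The block-based prompt construction can be sketched in a few lines. The example assumes detections are already available as pixel boxes with labels, assigns each object to the block containing its box centre, and numbers blocks in row-major order; those choices and the function names are assumptions for illustration rather than details given in the abstract.

```python
def block_index(cx, cy, width, height, n):
    """Row-major index of the N x N block containing the point (cx, cy)."""
    col = min(int(cx * n / width), n - 1)    # clamp points on the right edge
    row = min(int(cy * n / height), n - 1)   # clamp points on the bottom edge
    return row * n + col

def ptp_sentence(box, label, width, height, n=3):
    """Fill the PTP template for one detection; box is (x0, y0, x1, y1) in pixels."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    p = block_index(cx, cy, width, height, n)
    return f"The block {p} has a {label}."
```

During pre-training, either the block index or the object label in such a sentence would be masked and predicted, which is the fill-in-the-blank formulation described above.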

Abstract:
The effectiveness of active learning largely depends on the sampling efficiency of the acquisition function. Expected Loss Reduction (ELR) focuses on a Bayesian estimate of the reduction in classification error, and more general costs fit in the same framework. We propose Bayesian Estimate of Mean Proper Scores (BEMPS) to estimate the increase in strictly proper scores such as log probability or negative mean square error within this framework. We also prove convergence results for this general class of costs. To facilitate better experimentation with the new acquisition functions, we develop a complementary batch AL algorithm that encourages diversity in the vector of expected changes in scores for unlabeled data. To allow high-performance classifiers, we combine deep ensembles, and dynamic validation set construction on pretrained models, and further speed up the ensemble process with the idea of Monte Carlo Dropout. Extensive experiments on both texts and images show that the use of mean square error and log probability with BEMPS yields robust acquisition functions and well-calibrated classifiers, and consistently outperforms the others tested. The advantages of BEMPS over the others are further supported by a set of qualitative analyses, where we visualise their sampling behaviour using data maps and t-SNE plots.

Abstract:
With the rapid advances in autonomous driving, it becomes critical to equip its sensing system with more holistic 3D perception. However, widely explored tasks like 3D detection or point cloud semantic segmentation focus on parsing either the objects or scenes. In this work, we propose to address the challenging task of LiDAR-based Panoptic Segmentation, which aims to parse both objects and scenes in a unified manner. In particular, we propose Dynamic Shifting Network (DS-Net), which serves as an effective panoptic segmentation framework in the point cloud realm. DS-Net features a dynamic shifting module for complex LiDAR point cloud distributions. We present an efficient learnable clustering module, dynamic shifting, which adapts kernel functions for different instances. To further explore the temporal information, we extend the single-scan processing framework to its temporal version, 4D-DS-Net, for the task of 4D Panoptic Segmentation, where the same instance across multiple frames should be given the same ID prediction. Instead of naively appending a tracking module to DS-Net, we propose to solve the 4D panoptic segmentation in a more unified way. Specifically, 4D-DS-Net first constructs 4D data volume by aligning consecutive LiDAR scans, upon which the temporally unified instance clustering is performed to obtain the final results. Extensive experiments on two large-scale autonomous driving LiDAR datasets, SemanticKITTI and Panoptic nuScenes, are conducted to demonstrate the effectiveness and superior performance of the proposed solution.

Abstract:
3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. Particularly, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we also propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. In retrospect of what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.

Abstract:
In multi-view environments, observations are often missing due to limitations of the observation process. Most current representation learning methods struggle to explore complete information because they lack either cross-generative ability, when simply filling in missing view data, or solidative ability, when inferring a consistent representation among the existing views. To address this problem, we propose a deep generative model to learn a complete generative latent representation, namely Complete Multi-view Variational Auto-Encoders (CMVAE), which models the generation of the multiple views from a complete latent variable represented by a mixture of Gaussian distributions. Thus, the missing view can be fully characterized by the latent variables and is resolved by estimating its posterior distribution. Accordingly, a novel variational lower bound is introduced to integrate view-invariant information into posterior inference to enhance the solidity of the learned latent representation. The intrinsic correlations between views are mined to seek cross-view generality, and information leading to missing views is fused by view weights to reach solidity. Benchmark experimental results in clustering, classification, and cross-view image generation tasks demonstrate the superiority of CMVAE, while time complexity and parameter sensitivity analyses illustrate its efficiency and robustness. Additionally, application to bioinformatics data exemplifies its practical significance.

Abstract:
Appearance-based gaze estimation has garnered increasing attention in recent years. However, deep learning-based gaze estimation models still suffer from suboptimal performance when deployed in new domains, e.g., unseen environments or individuals. In our previous work, we were the first to take on this challenge, introducing a plug-and-play method (PnP-GA) to adapt the gaze estimation model to new domains. The core concept of PnP-GA is to leverage the diversity brought by a group of model variants to enhance the adaptability to diverse environments. In this article, we propose PnP-GA+ by extending our approach to explore the impact of assembling model variants from three additional perspectives: color space, data augmentation, and model structure. Moreover, we propose an intra-group attention module that dynamically optimizes pseudo-labeling during adaptation. Experimental results demonstrate that by directly plugging several existing gaze estimation networks into the PnP-GA+ framework, it outperforms state-of-the-art domain adaptation approaches on four standard gaze domain adaptation tasks on public datasets. Our method consistently enhances cross-domain performance, and its versatility is improved through various ways of assembling the model group.

Abstract:
We propose a novel generalization of constrained Markov decision processes (CMDPs) that we call the semi-infinitely constrained Markov decision process (SICMDP). In particular, we consider a continuum of constraints instead of a finite number of constraints as in the case of ordinary CMDPs. We also devise two reinforcement learning algorithms for SICMDPs that we refer to as SI-CMBRL and SI-CPO. SI-CMBRL is a model-based reinforcement learning algorithm. Given an estimate of the transition model, we first transform the reinforcement learning problem into a linear semi-infinite programming (LSIP) problem and then use the dual exchange method from the LSIP literature to solve it. SI-CPO is a policy optimization algorithm. Borrowing ideas from the cooperative stochastic approximation approach, we make alternating updates to the policy parameters to maximize the reward or minimize the cost. To the best of our knowledge, we are the first to apply tools from semi-infinite programming (SIP) to solve constrained reinforcement learning problems. We present theoretical analysis for SI-CMBRL and SI-CPO, identifying their iteration complexity and sample complexity. We also conduct extensive numerical experiments to illustrate the SICMDP model and demonstrate that our proposed algorithms are able to solve complex control tasks leveraging modern deep reinforcement learning techniques.

Abstract:
Real-time video perception tasks are often challenging on resource-constrained edge devices due to accuracy drops and hardware overhead, where saving computation is the key to performance improvement. Existing methods rely on either domain-specific neural chips or pre-searched models, which require specialized optimization according to different task properties. These limitations motivate us to design a general and task-independent methodology, called Patch Automatic Skip Scheme (PASS), which supports diverse video perception settings by decoupling acceleration and tasks. The gist is to capture inter-frame correlations and skip redundant computations at the patch level, where a patch is a non-overlapping square block of the visual input. PASS equips each convolution layer with a learnable gate to selectively determine which patches could be safely skipped without degrading model accuracy. Specifically, we are the first to construct a self-supervisory procedure for gate optimization, which learns to extract contrastive representations from frame sequences. The pre-trained gates can serve as plug-and-play modules to implement patch-skippable neural backbones, and automatically generate a proper skip strategy to accelerate different video-based downstream tasks, e.g., outperforming state-of-the-art MobileHumanPose in 3D pose estimation and FairMOT in multiple object tracking, by up to 9.43× and 12.19× speedups, respectively, on NVIDIA Jetson Nano devices.
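The patch-level skipping idea can be illustrated with a minimal sketch. Note that this is not the PASS method itself: PASS uses learnable gates trained with a self-supervisory procedure, whereas the sketch below swaps in a fixed threshold on the mean absolute inter-frame difference as a simple stand-in. The function name `patch_skip_mask`, the threshold value, and the toy frames are all hypothetical.

```python
def patch_skip_mask(prev_frame, cur_frame, patch, threshold):
    """Return a 2D list of booleans, one per non-overlapping square patch.

    True means the patch changed very little between the two frames, so its
    computation could be skipped (stand-in rule: mean absolute difference
    below a fixed threshold; the learnable gates of PASS are NOT modelled).
    """
    rows, cols = len(cur_frame), len(cur_frame[0])
    mask = []
    for r0 in range(0, rows, patch):
        mask_row = []
        for c0 in range(0, cols, patch):
            diff, count = 0.0, 0
            for r in range(r0, min(r0 + patch, rows)):
                for c in range(c0, min(c0 + patch, cols)):
                    diff += abs(cur_frame[r][c] - prev_frame[r][c])
                    count += 1
            mask_row.append(diff / count < threshold)
        mask.append(mask_row)
    return mask


# Toy 4x4 grayscale frames: only the top-left pixel changes between frames.
prev = [[0.0] * 4 for _ in range(4)]
cur = [row[:] for row in prev]
cur[0][0] = 1.0
mask = patch_skip_mask(prev, cur, patch=2, threshold=0.1)
```

In this toy case only the top-left 2×2 patch would be recomputed; the other three patches are marked as skippable.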

Abstract:
The temporal action localization research aims to discover action instances from untrimmed videos, representing a fundamental step in the field of intelligent video understanding. With the advent of deep learning, backbone networks have been instrumental in providing representative spatiotemporal features, while the end-to-end learning paradigm has enabled the development of high-quality models through data-driven training. Both supervised and weakly supervised learning approaches have contributed to the rapid progress of temporal action localization, resulting in a multitude of methods and a large body of literature, making a comprehensive survey a pressing necessity. This paper presents a thorough analysis of existing action localization works, offering a well-organized taxonomy that highlights the strengths and weaknesses of each strategy. In the realm of supervised learning, in addition to the anchor mechanism, we introduce a novel classification mechanism to categorize and summarize existing works. Similarly, for weakly supervised learning, we extend the traditional pre-classification and post-classification mechanisms by providing a fresh perspective on enhancement strategies. Furthermore, we shed light on the bottleneck of confidence estimation, a critical yet overlooked aspect of current works. By conducting detailed analyses, this survey serves as a valuable resource for researchers, providing beneficial guidance to newcomers and inspiring seasoned researchers alike.

Abstract:
In this paper, we study the problem of jointly estimating the optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an “early-fusion” or “late-fusion” manner. Such one-size-fits-all approaches suffer from a dilemma of failing to fully utilize the characteristic of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework, which consists of 2D and 3D branches with multiple bidirectional fusion connections between them in specific layers. Different from previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other one based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point-error from the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with much fewer parameters. Besides, our methods have strong generalization performance and the ability to handle non-rigid motion.

Abstract:
One-shot fine-grained visual recognition often suffers from the problem of having few training examples for new fine-grained classes. To alleviate this problem, off-the-shelf image generation techniques based on Generative Adversarial Networks (GANs) can potentially create additional training images. However, these GAN-generated images are often not helpful for actually improving the accuracy of one-shot fine-grained recognition. In this paper, we propose a meta-learning framework to combine generated images with original images, so that the resulting “hybrid” training images improve one-shot learning. Specifically, the generic image generator is updated by a few training instances of novel classes, and a Meta Image Reinforcing Network (MetaIRNet) is proposed to conduct one-shot fine-grained recognition as well as image reinforcement. Our experiments demonstrate consistent improvement over baselines on one-shot fine-grained image classification benchmarks. Furthermore, our analysis shows that the reinforced images have more diversity compared to the original and GAN-generated images.

Abstract:
As a challenging problem, few-shot class-incremental learning (FSCIL) continually learns a sequence of tasks, confronting the dilemma between slow forgetting of old knowledge and fast adaptation to new knowledge. In this paper, we concentrate on this “slow versus fast” (SvF) dilemma to determine which knowledge components to be updated in a slow fashion or a fast fashion, and thereby balance old-knowledge preservation and new-knowledge adaptation. We propose a multi-grained SvF learning strategy to cope with the SvF dilemma from two different grains: intra-space (within the same feature space) and inter-space (between two different feature spaces). The proposed strategy designs a novel frequency-aware regularization to boost the intra-space SvF capability, and meanwhile develops a new feature space composition operation to enhance the inter-space SvF learning performance. With the multi-grained SvF learning strategy, our method outperforms the state-of-the-art approaches by a large margin.

Abstract:
Modern facial age estimation systems can achieve high accuracy when training and test datasets are identically distributed and captured under similar conditions. However, domain shifts in data, encountered in practice, lead to a sharp drop in accuracy of most existing age estimation algorithms. In this article, we propose a novel method, namely RAgE, to improve the robustness and reduce the uncertainty of age estimates by leveraging unlabelled data through a subject anchoring strategy and a novel consistency regularisation term. First, we propose a similarity-preserving pseudo-labelling algorithm by which the model generates pseudo-labels for a cohort of unlabelled images belonging to the same subject, while taking into account the similarity among age labels. In order to improve the robustness of the system, a consistency regularisation term is then used to simultaneously encourage the model to produce invariant outputs for the images in the cohort with respect to an anchor image. We propose a novel consistency regularisation term whose noise-tolerant property effectively mitigates the so-called confirmation bias caused by incorrect pseudo-labels. Experiments on multiple benchmark ageing datasets demonstrate substantial improvements over the state-of-the-art methods and robustness to confounding external factors, including the subject's head pose, illumination variation and the appearance of expression in the face image.

Abstract:
This article studies the problem of learning weakly supervised semantic segmentation (WSSS) from image-level supervision only. Current popular solutions leverage object localization maps from classifiers as supervision for semantic segmentation learning, and struggle to make the localization maps capture more complete object content. Unlike previous efforts that primarily focus on intra-image information, we address the value of cross-image semantic relations for comprehensive object pattern mining. To achieve this, two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, given a pair of training images, one co-attention forces the classifier to recognize the common semantics from co-attentive objects, while the other one, called contrastive co-attention, drives the classifier to identify the unique semantics from the remaining, unshared objects. This helps the classifier discover more object patterns and better ground semantics in image regions. In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference, hence eventually benefiting semantic segmentation learning. More importantly, our algorithm provides a unified framework that handles different WSSS settings well, i.e., learning WSSS with 1) precise image-level supervision only, 2) extra simple single-label data, and 3) extra noisy web data. Without bells and whistles, it sets a new state of the art in all these settings. Moreover, our approach ranked 1st in the Weakly-Supervised Semantic Segmentation Track of the CVPR2020 Learning from Imperfect Data Challenge. The extensive experimental results clearly demonstrate the efficacy and high utility of our method.

Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss, and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5→Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.

Abstract:
We are concerned with a worst-case scenario in model generalization, in the sense that a model aims to perform well on many unseen domains while there is only one single domain available for training. We propose Meta-Learning based Adversarial Domain Augmentation to solve this Out-of-Domain generalization problem. The key idea is to leverage adversarial training to create “fictitious” yet “challenging” populations, from which a model can learn to generalize with theoretical guarantees. To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder to relax the widely used worst-case constraint. We further improve our method by integrating uncertainty quantification for efficient domain generalization. Extensive experiments on multiple benchmark datasets indicate its superior performance in tackling single domain generalization.

Abstract:
This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method relies on GAN-generated multi-view image datasets, which have a negligible annotation cost. However, they are not strictly multi-view consistent, and GANs sometimes output distorted images. This results in degraded reconstruction quality. In this work, to overcome these limitations of generated datasets, we make two main contributions that lead us to achieve state-of-the-art results on challenging objects: 1) a robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial learning pipeline with online pseudo-ground-truth generation to achieve fine details. Our work provides a bridge from the 2D supervision of GAN models to 3D reconstruction models and removes the expensive annotation effort. We show significant improvements over previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive annotations.

Abstract:
In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.

Abstract:
Capitalizing on the recent advances in image generation models, existing controllable face image synthesis methods are able to generate high-fidelity images with some levels of controllability, e.g., controlling the shapes, expressions, textures, and poses of the generated face images. However, previous methods focus on controllable 2D image generative models, which are prone to producing inconsistent face images under large expression and pose changes. In this paper, we propose a new NeRF-based conditional 3D face synthesis framework, which enables 3D controllability over the generated face images by imposing explicit 3D conditions from 3D face priors. At its core is a conditional Generative Occupancy Field (cGOF++) that effectively enforces the shape of the generated face to conform to a given 3D Morphable Model (3DMM) mesh, built on top of EG3D (Chan et al. 2022), a recent tri-plane-based generative model. To achieve accurate control over fine-grained 3D face shapes of the synthesized images, we additionally incorporate a 3D landmark loss as well as a volume warping loss into our synthesis framework. Experiments validate the effectiveness of the proposed method, which is able to generate high-fidelity face images and shows more precise 3D controllability than state-of-the-art 2D-based controllable face synthesis methods.

Abstract:
By introducing randomness on the environments, domain randomization (DR) imposes diversity to the policy training of deep reinforcement learning, and thus improves its capability of generalization. The randomization of environments, however, introduces another source of variability for the estimate of policy gradients, in addition to the already high variance incurred by trajectory sampling. Therefore, with standard state-dependent baselines, the policy gradient methods may still suffer high variance, causing a low sample efficiency during the training of DR. In this paper, we theoretically derive a bias-free and state/environment-dependent optimal baseline for DR, and analytically show its ability to achieve further variance reduction over the standard constant and state-dependent baselines for DR. Based on our theory, we further propose a variance reduced domain randomization (VRDR) approach for policy gradient methods, to strike a tradeoff between the variance reduction and computational complexity for the practical implementation. By dividing the entire space of environments into some subspaces and then estimating the state/subspace-dependent baseline, VRDR enjoys a theoretical guarantee of variance reduction and faster convergence than the state-dependent baselines. Empirical evaluations on six robot control tasks with randomized dynamics demonstrate that VRDR not only accelerates the convergence of policy training, but can consistently achieve a better eventual policy with improved training stability.

Abstract:
Multimodal transformers exhibit high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation with quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined Dynamic Multimodal detection transformer (DETR) (Dynamic MDETR), by decoupling the whole grounding process into encoding and decoding phases. The key observation is that there exists high spatial redundancy in images. Thus, we devise a new dynamic multimodal transformer decoder by exploiting this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module aims to select informative patches by predicting the offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we can reduce the GFLOPs of the multimodal transformer by ∼44%, but still get higher accuracy than the encoder-only counterpart. With the same number of encoder layers as TransVG, our Dynamic MDETR (ResNet-50) outperforms TransVG (ResNet-101) while bringing only marginal extra computational cost relative to TransVG. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework, and achieve state-of-the-art performance on these benchmarks.

Abstract:
Eye gaze analysis is an important research problem in the field of Computer Vision and Human-Computer Interaction. Even with notable progress in the last 10 years, automatic gaze analysis still remains challenging due to the uniqueness of eye appearance, eye-head interplay, occlusion, image quality, and illumination conditions. There are several open questions, including what are the important cues to interpret gaze direction in an unconstrained environment without prior knowledge and how to encode them in real-time. We review the progress across a range of gaze analysis tasks and applications to elucidate these fundamental questions, identify effective methods in gaze analysis, and provide possible future directions. We analyze recent gaze estimation and segmentation methods, especially in the unsupervised and weakly supervised domain, based on their advantages and reported evaluation metrics. Our analysis shows that the development of a robust and generic gaze analysis method still needs to address real-world challenges such as unconstrained setup and learning with less supervision. We conclude by discussing future research directions for designing a real-world gaze analysis system that can propagate to other domains including Computer Vision, Augmented Reality (AR), Virtual Reality (VR), and Human Computer Interaction (HCI).

Abstract:
Conventional frequentist learning is known to yield poorly calibrated models that fail to reliably quantify the uncertainty of their decisions. Bayesian learning can improve calibration, but formal guarantees apply only under restrictive assumptions about correct model specification. Conformal prediction (CP) offers a general framework for the design of set predictors with calibration guarantees that hold regardless of the underlying data generation mechanism. However, when training data are limited, CP tends to produce large, and hence uninformative, predicted sets. This paper introduces a novel meta-learning solution that aims at reducing the set prediction size. Unlike prior work, the proposed meta-learning scheme, referred to as meta-XB, i) builds on cross-validation-based CP, rather than the less efficient validation-based CP; and ii) preserves formal per-task calibration guarantees, rather than less stringent task-marginal guarantees. Finally, meta-XB is extended to adaptive non-conformal scores, which are shown empirically to further enhance marginal per-input calibration.
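As background for the abstract above, the following is a minimal sketch of standard validation-based (split) conformal prediction, i.e., the baseline the abstract describes as less efficient, not the proposed meta-XB scheme. The nonconformity scores, labels, and function names are hypothetical; the only rule taken as given is the usual split-CP threshold, the ⌈(n+1)(1−α)⌉-th smallest calibration score.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label must be kept.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, score in label_scores.items() if score <= threshold}


# Hypothetical nonconformity scores on held-out calibration data
# (e.g. 1 - predicted probability of the true label).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.80]
threshold = conformal_quantile(cal_scores, alpha=0.2)

# Hypothetical scores of each candidate label for one test input.
test_scores = {"cat": 0.15, "dog": 0.55, "bird": 0.90}
pred = prediction_set(test_scores, threshold)

# With only three calibration points and alpha = 0.1 the threshold is infinite,
# so the predicted set contains every label and is uninformative.
tiny_threshold = conformal_quantile([0.1, 0.2, 0.3], alpha=0.1)
tiny_pred = prediction_set(test_scores, tiny_threshold)
```

The last two lines illustrate the limitation the abstract points out: with limited calibration data, the predicted set grows to include all labels.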

Abstract:
Movie trailers perform multiple functions: they introduce viewers to the story, convey the mood and artistic style of the film, and encourage audiences to see the movie. These diverse functions make trailer creation a challenging endeavor. In this work, we focus on finding trailer moments in a movie, i.e., shots that could be potentially included in a trailer. We decompose this task into two subtasks: narrative structure identification and sentiment prediction. We model movies as graphs, where nodes are shots and edges denote semantic relations between them. We learn these relations using joint contrastive training which distills rich textual information (e.g., characters, actions, situations) from screenplays. An unsupervised algorithm then traverses the graph and selects trailer moments from the movie that human judges prefer to ones selected by competitive supervised approaches. A main advantage of our algorithm is that it uses interpretable criteria, which allows us to deploy it in an interactive tool for trailer creation with a human in the loop. Our tool allows users to select trailer shots in under 30 minutes that are superior to fully automatic methods and comparable to (exclusive) manual selection by experts.

Abstract:
Low-light raw image denoising is an essential task in computational photography, to which the learning-based method has become the mainstream solution. The standard paradigm of the learning-based method is to learn the mapping between the paired real data, i.e., the low-light noisy image and its clean counterpart. However, the limited data volume, complicated noise model, and underdeveloped data quality have constituted the learnability bottleneck of the data mapping between paired real data, which limits the performance of the learning-based method. To break through the bottleneck, we introduce a learnability enhancement strategy for low-light raw image denoising by reforming paired real data according to noise modeling. Our learnability enhancement strategy integrates three efficient methods: shot noise augmentation (SNA), dark shading correction (DSC) and a developed image acquisition protocol. Specifically, SNA promotes the precision of data mapping by increasing the data volume of paired real data, DSC promotes the accuracy of data mapping by reducing the noise complexity, and the developed image acquisition protocol promotes the reliability of data mapping by improving the data quality of paired real data. Meanwhile, based on the developed image acquisition protocol, we build a new dataset for low-light raw image denoising. Experiments on public datasets and our dataset demonstrate the superiority of the learnability enhancement strategy.

Abstract:
This paper proposes a scribble-based weakly supervised RGB-D salient object detection (SOD) method to relieve the annotation burden from pixel-wise annotations. In view of the ensuing performance drop, we summarize two natural deficiencies of the scribbles and try to alleviate them, which are the weak richness of the pixel training samples (WRPS) and the poor structural integrity of the salient objects (PSIO). WRPS hinders robust saliency perception learning, which can be alleviated via model design for robust feature learning and pseudo labels generation for training sample enrichment. Specifically, we first design a dynamic searching process module as a meta operation to conduct multi-scale and multi-modal feature fusion for the robust RGB-D SOD model construction. Then, a dual-branch consistency learning mechanism is proposed to generate enough pixel training samples for robust saliency perception learning. PSIO makes direct structural learning infeasible since scribbles can not provide integral structural supervision. Thus, we propose an edge-region structure-refinement loss to recover the structural information and make precise segmentation. We deploy all components and conduct ablation studies on two baselines to validate their effectiveness and generalizability. Experimental results on eight datasets show that our method outperforms other scribble-based SOD models and achieves comparable performance with fully supervised state-of-the-art methods.

Abstract:
The conventional approach to image recognition has been based on raster graphics, which can suffer from aliasing and information loss when scaled up or down. In this paper, we propose a novel approach that leverages the benefits of vector graphics for object localization and classification. Our method, called YOLaT (You Only Look at Text), takes the textual document of vector graphics as input, rather than rendering it into pixels. YOLaT builds multi-graphs to model the structural and spatial information in vector graphics and utilizes a dual-stream graph neural network (GNN) to detect objects from the graph. However, for real-world vector graphics, YOLaT models the data only with a flat GNN that uses vertices as nodes, ignoring the higher-level information in vector data. Therefore, we propose YOLaT++ to perform multi-level abstraction feature learning from a new perspective: from primitive shapes to curves and points. On the other hand, given that few public datasets focus on vector graphics, data-driven learning cannot exert its full power on this format. We provide a large-scale and challenging dataset for Chart-based Vector Graphics Detection and Chart Understanding, termed VG-DCU, with vector graphics, raster graphics, annotations, and the raw data used for creating these vector charts. Experiments show that the YOLaT series outperforms both vector graphics and raster graphics-based object detection methods on both subsets of VG-DCU in terms of both accuracy and efficiency, showcasing the potential of vector graphics for image recognition tasks.

Abstract:
Cross-modal hashing (CMH) has attracted considerable attention in recent years. Almost all existing CMH methods primarily focus on reducing the modality gap and semantic gap, i.e., aligning multi-modal features and their semantics in Hamming space, without taking into account the space gap, i.e., the difference between the real number space and the Hamming space. In fact, the space gap can affect the performance of CMH methods. In this paper, we analyze and demonstrate how the space gap affects the existing CMH methods, which raises two problems: solution space compression and loss function oscillation. These two problems eventually cause the retrieval performance to deteriorate. Based on these findings, we propose a novel algorithm, namely Semantic Channel Hashing (SCH). First, we classify sample pairs into fully semantic-similar, partially semantic-similar, and semantic-negative ones based on their similarity and impose different constraints on them, respectively, to ensure that the entire Hamming space is utilized. Then, we introduce a semantic channel to alleviate the issue of loss function oscillation. Experimental results on three public datasets demonstrate that SCH outperforms the state-of-the-art methods. Furthermore, experimental validations are provided to substantiate the conjectures regarding solution space compression and loss function oscillation, offering visual evidence of their impact on the CMH methods.
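The difference between the real number space and the Hamming space can be made concrete with a small, generic sketch. It is not the SCH algorithm: it only shows the common sign-based binarization used to obtain hash codes, plus the Hamming distance and the quantization error between real-valued features and their codes. The feature values and function names are hypothetical.

```python
def binarize(features):
    """Map real-valued features to a {-1, +1} hash code with the sign function."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def quantization_error(features):
    """Squared distance between real-valued features and their binary code."""
    return sum((x - b) ** 2 for x, b in zip(features, binarize(features)))


# Hypothetical image and text features that are close in the real number space.
image_feat = [0.9, -0.1, 0.4, -0.8]
text_feat = [0.7, 0.1, 0.5, -0.9]

image_code = binarize(image_feat)
text_code = binarize(text_feat)
distance = hamming_distance(image_code, text_code)
error = quantization_error(image_feat)
```

Here the second feature dimension differs by only 0.2 in the real number space, yet it flips a whole bit after binarization, which is one simple way to see that distances in the two spaces do not coincide.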

Abstract:
Vision Transformer (ViT) architectures are becoming increasingly popular and widely employed to tackle computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism, outperforming earlier convolutional neural networks. However, ViT deployment and performance have grown steadily with their size, number of trainable parameters, and operations. Furthermore, self-attention's computational and memory cost increases quadratically with the image resolution. Generally speaking, it is challenging to employ these architectures in real-world applications due to many hardware and environmental restrictions, such as processing and computational capabilities. Therefore, this survey investigates the most efficient methodologies to ensure sub-optimal estimation performances. In more detail, four categories of efficient methods are analyzed: compact architecture, pruning, knowledge distillation, and quantization strategies. Moreover, a new metric called Efficient Error Rate is introduced in order to normalize and compare models’ features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, describes and discusses state-of-the-art methodologies, and analyzes their performances over different application scenarios. Toward the end of this paper, we also discuss open challenges and promising research directions.
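The quadratic growth of self-attention cost can be checked with a few lines of arithmetic. The sketch below assumes a plain ViT that splits the image into non-overlapping 16×16 patches and counts only the entries of one attention matrix (one head, one layer); the patch size and the function name are assumptions, not details from the survey.

```python
def attention_matrix_entries(height, width, patch=16):
    """Return (number of tokens, entries of one token-by-token attention matrix)."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


tokens_224, entries_224 = attention_matrix_entries(224, 224)
tokens_448, entries_448 = attention_matrix_entries(448, 448)
growth = entries_448 // entries_224
```

Under these assumptions, doubling the side length from 224 to 448 pixels raises the token count from 196 to 784 (4×) and the attention matrix from 38,416 to 614,656 entries (16×), i.e., quadratic in the number of tokens.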

Abstract:
Sample selection approaches are popular in robust learning from noisy labels. However, how to control the selection process properly so that deep networks can benefit from the memorization effect is a hard problem. In this paper, motivated by the success of automated machine learning (AutoML), we propose to control the selection process by bi-level optimization. Specifically, we parameterize the selection process by exploiting the general patterns of the memorization effect in the upper level, and then update these parameters using the prediction accuracy obtained from model training in the lower level. We further introduce semi-supervised learning algorithms to utilize noisy-labeled data as unlabeled data. To solve the bi-level optimization problem efficiently, we consider more information from the validation curvature by the Newton method and the cubic regularization method. We provide convergence analysis for both optimization methods. Results show that while both methods can converge to an (approximately) stationary point, the cubic regularization method can find a better local optimum than the Newton method in less time. Experiments on both benchmark and real-world data sets demonstrate that the proposed searching method can lead to significant improvements upon existing methods. Compared with existing AutoML approaches, our method is much more efficient at finding a good selection schedule.

Abstract:
Scene Graph Generation (SGG) has achieved significant progress recently. However, most previous works rely heavily on fixed-size entity representations based on bounding box proposals, anchors, or learnable queries. As each representation's cardinality has different trade-offs between performance and computation overhead, extracting highly representative features efficiently and dynamically is both challenging and crucial for SGG. In this work, a novel architecture called RepSGG is proposed to address the aforementioned challenges, formulating a subject as queries, an object as keys, and their relationship as the maximum attention weight between pairwise queries and keys. With more fine-grained and flexible representation power for entities and relationships, RepSGG learns to sample semantically discriminative and representative points for relationship inference. Moreover, the long-tailed distribution also poses a significant challenge for generalization of SGG. A run-time performance-guided logit adjustment (PGLA) strategy is proposed such that the relationship logits are modified via affine transformations based on run-time performance during training. This strategy encourages a more balanced performance between dominant and rare classes. Experimental results show that RepSGG achieves the state-of-the-art or comparable performance on the Visual Genome and Open Images V6 datasets with fast inference speed, demonstrating the efficacy and efficiency of the proposed methods.

Abstract:
The state-of-the-art model for zero-shot cross-lingual spoken language understanding performs cross-lingual unsupervised contrastive learning to achieve the label-agnostic semantic alignment between each utterance and its code-switched data. However, it ignores the precious intent/slot labels, whose label information is promising to help capture the label-aware semantics structure and then leverage supervised contrastive learning to improve both source and target languages’ semantics. In this paper, we propose Hybrid and Cooperative Contrastive Learning to address this problem. Apart from cross-lingual unsupervised contrastive learning, we design a holistic approach that exploits source language supervised contrastive learning, cross-lingual supervised contrastive learning and multilingual supervised contrastive learning to perform label-aware semantics alignments in a comprehensive manner. Each kind of supervised contrastive learning mechanism includes both single-task and joint-task scenarios. In our model, the input of each contrastive learning mechanism is enhanced by the others. Thus, the four contrastive learning mechanisms cooperate to learn more consistent and discriminative representations in a virtuous cycle during the training process. Experiments show that our model obtains consistent improvements over 9 languages, achieving new state-of-the-art performance.

Abstract:
We introduce a novel exploratory technique, termed biarchetype analysis, which extends archetype analysis to simultaneously identify archetypes of both observations and features. This innovative unsupervised machine learning tool aims to represent observations and features through instances of pure types, or biarchetypes, which are easily interpretable as they embody mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which makes the structure of the data easier to understand. We propose an algorithm to solve biarchetype analysis. Although clustering is not the primary aim of this technique, biarchetype analysis is demonstrated to offer significant advantages over biclustering methods, particularly in terms of interpretability. This is attributed to biarchetypes being extreme instances, in contrast to the centroids produced by biclustering, which inherently enhances human comprehension. The application of biarchetype analysis across various machine learning challenges underscores its value.

Abstract:
Image denoising is a fundamental problem in computational photography, where achieving high perception with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising.

Abstract:
Human-oriented image communication should take the quality of experience (QoE) as an optimization goal, which requires effective image perceptual quality metrics. However, traditional user-based assessment metrics are limited by the deviation caused by human high-level cognitive activities. To tackle this issue, in this paper, we construct a brain response-based image perceptual quality metric and develop a brain-inspired network to assess the image perceptual quality based on it. Our method aims to establish the relationship between image quality changes and underlying brain responses in image compression scenarios using the electroencephalography (EEG) approach. We first establish EEG datasets by collecting the corresponding EEG signals when subjects watch distorted images. Then, we design a measurement model to extract EEG features that reflect human perception to establish a new image perceptual quality metric: EEG perceptual score (EPS). To use this metric in practical scenarios, we embed the brain perception process into a prediction model to generate the EPS directly from the input images. Experimental results show that our proposed measurement model and prediction model can achieve better performance. The proposed brain response-based image perceptual quality metric can measure the human brain's perceptual state more accurately, thus performing a better assessment of image perceptual quality.

Abstract:
As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years, the community has witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). By “open-vocabulary”, we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed.

Abstract:
Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples. Existing methods based on Region Proposal Networks (RPNs) face two issues: 1) overfitting suppresses novel class objects and 2) dual-branch models require complex spatial correlation strategies to prevent spatial information loss when generating class prototypes. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS and related tasks. Our three main contributions are: 1) a novel transformer-based baseline that avoids overfitting, offering a new direction for FSIS; 2) demonstrating that support object queries encode key factors after base training, allowing query features to be enhanced twice at both feature and query levels using simple cross-attention, thus avoiding complex spatial correlation interaction; and 3) introducing a class-enhanced base knowledge distillation loss to address the issue of DETR-like models struggling with incremental settings due to the input projection layer, enabling easy extension to incremental FSIS. Extensive experimental evaluations on the COCO dataset under three FSIS settings demonstrate that our method performs favorably against existing approaches across different shots, e.g., +8.2/+9.4 performance gain over state-of-the-art methods with 10/30-shots.

Abstract:
As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithms have been extensively studied to relax the identical distribution assumption of the classical learning setting. Among the different assumptions on the essence of shifting distributions, generalized label shift (GLS) is the most recently developed one and shows great potential for dealing with the complex factors within the shift. In this paper, we aim to explore the limitations of current dataset shift theory and algorithms, and further provide new insights by presenting a comprehensive understanding of GLS. From the theoretical aspect, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to the optimal target model from the Bayesian perspective. The main results show the insufficiency of invariant representation learning and prove the sufficiency and necessity of GLS correction for generalization, which provides theoretical support and innovations for exploring generalizable models under dataset shift. From the methodological aspect, we provide a unified view of existing shift correction frameworks and propose a kernel embedding-based correction algorithm (KECA) to minimize the generalization error and achieve successful knowledge transfer. Both theoretical results and extensive experimental evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of the proposed algorithm.

Abstract:
Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer significantly impacts the efficiency of MILP solvers as well. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, and (2) a lower-level module that formulates cut selection as a sequence/set-to-sequence learning problem and learns policies that select an ordered subset whose cardinality is determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two real problems from Huawei.

Abstract:
Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between labeled source domains and unlabeled target domains. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically originates from multiple sensors, each with its unique distribution. This property poses difficulties in adapting existing UDA techniques, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, thus limiting their effectiveness for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to address domain discrepancy at both local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based higher-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on six public MTS datasets for MTS-UDA.

Abstract:
Continual learning, also known as incremental learning or life-long learning, stands at the forefront of deep learning and AI systems. It breaks through the obstacle of one-way training on closed sets and enables continuous adaptive learning under open-set conditions. In the recent decade, continual learning has been explored and applied in multiple fields, especially in computer vision, covering classification, detection and segmentation tasks. Continual semantic segmentation (CSS) is a challenging, intricate and burgeoning task owing to its dense prediction nature. In this paper, we present a review of CSS, committing to building a comprehensive survey on problem formulations, primary challenges, universal datasets, neoteric theories and multifarious applications. Concretely, we begin by elucidating the problem definitions and primary challenges. Based on an in-depth investigation of relevant approaches, we sort out and categorize current CSS models into two main branches: data-replay and data-free sets. In each branch, the corresponding approaches are clustered by similarity and thoroughly analyzed, followed by qualitative comparison and quantitative reproductions on relevant datasets. Besides, we also introduce four CSS specialities with diverse application scenarios and development tendencies. Furthermore, we develop a benchmark for CSS encompassing representative references, evaluation results and reproductions. We hope this survey can serve as a reference-worthy and stimulating contribution to the advancement of the life-long learning field, while also providing valuable perspectives for related fields.

Abstract:
Recent advances in Neural Radiance Fields (NeRF) have provided a new geometric primitive for novel view synthesis. High Dynamic Range NeRF (HDR NeRF) can render novel views with a higher dynamic range. However, effectively displaying the scene contents of HDR NeRF on diverse devices with limited dynamic range poses a significant challenge. To address this, we present LTM-NeRF, a method designed to recover HDR NeRF and support 3D local tone mapping. LTM-NeRF allows for the synthesis of HDR views, tone-mapped views, and LDR views under different exposure settings, using only the multi-view multi-exposure LDR inputs for supervision. Specifically, we propose a differentiable Camera Response Function (CRF) module for HDR NeRF reconstruction, globally mapping the scene’s HDR radiance to LDR pixels. Moreover, we introduce a Neural Exposure Field (NeEF) to represent the spatially varying exposure time of an HDR NeRF to achieve 3D local tone mapping, for compatibility with various displays. Comprehensive experiments demonstrate that our method can not only synthesize HDR views and exposure-varying LDR views accurately but also render locally tone-mapped views naturally.

Abstract:
Interactive segmentation is a crucial research area in medical image analysis aiming to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output so as to efficiently guide the system towards the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level causing a rapid growth in the field with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find that there is a severe lack of comparison across methods which needs to be tackled by standardized baselines and benchmarks.

Abstract:
We propose a solution for Active Visual Search of objects in an environment, whose 2D floor map is the only known information. Our solution has three key features that make it more plausible and robust to detector failures compared to state-of-the-art methods: i) It is unsupervised, as it does not need any training sessions. ii) During the exploration, a probability distribution on the 2D floor map is updated according to an intuitive mechanism, while an improved belief update increases the effectiveness of the agent's exploration. iii) We incorporate the awareness that an object detector may fail into the aforementioned probability modelling by exploiting the success statistics of a specific detector. Our solution is dubbed POMP-BE-PD (Pomcp-based Online Motion Planning with Belief by Exploration and Probabilistic Detection). It uses the current pose of an agent and an RGB-D observation to learn an optimal search policy, exploiting a POMDP solved by a Monte-Carlo planning approach. On the Active Vision Dataset Benchmark, we increase the average success rate over all the environments by a significant 35% while decreasing the average path length by 4% with respect to competing methods. Thus, our results are state-of-the-art, even without any training procedure.
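
The idea of folding detector success statistics into a belief over the floor map can be illustrated with a generic Bayes update. This is a minimal sketch under stated assumptions, not the POMP-BE-PD update itself: the function name, the flat list of map cells, and the `recall` and `false_alarm` values are all hypothetical.

```python
def update_belief(belief, visible, detected, recall=0.8, false_alarm=0.05):
    """Bayes update of P(object in cell) after one observation.

    belief:   list of prior probabilities, one per map cell, summing to 1
    visible:  set of cell indices currently in the field of view
    detected: True if the detector fired on this observation
    recall / false_alarm: assumed detector statistics (hypothetical values)
    """
    posterior = []
    for cell, prior in enumerate(belief):
        # Probability that the detector fires if the object is in this cell.
        p_fire = recall if cell in visible else false_alarm
        likelihood = p_fire if detected else 1.0 - p_fire
        posterior.append(prior * likelihood)
    total = sum(posterior)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in posterior]


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    # The agent looks at cells 0 and 1 and the detector stays silent.
    print(update_belief(prior, visible={0, 1}, detected=False))
```

Because the assumed recall is below 1, a silent detector lowers the probability of the observed cells without driving it to zero, so a missed detection does not permanently rule out a location.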

Abstract:
Model intellectual property (IP) protection has gained attention due to the significance of safeguarding intellectual labor and computational resources. Ensuring IP safety for trainers and owners is critical, especially when ownership verification and applicability authorization are required. A notable approach involves preventing the transfer of well-trained models from authorized to unauthorized domains. We introduce a novel Compact Un-transferable Pyramid Isolation Domain (CUPI-Domain) which serves as a barrier against illegal transfers from authorized to unauthorized domains. Inspired by human transitive inference, the CUPI-Domain emphasizes distinctive style features of the authorized domain, leading to failure in recognizing irrelevant private style features on unauthorized domains. To this end, we propose CUPI-Domain generators, which select features from both authorized and CUPI-Domain as anchors. These generators fuse the style features and semantic features to create labeled, style-rich CUPI-Domain. Additionally, we design external Domain-Information Memory Banks (DIMB) for storing and updating labeled pyramid features to obtain stable domain class features and domain class-wise style features. Based on the proposed whole method, the novel style and discriminative loss functions are designed to effectively enhance the distinction in style and discriminative features between authorized and unauthorized domains. We offer two solutions for utilizing CUPI-Domain based on whether the unauthorized domain is known: target-specified CUPI-Domain and target-free CUPI-Domain. Comprehensive experiments on various public datasets demonstrate the effectiveness of our CUPI-Domain approach with different backbone models, providing an efficient solution for model intellectual property protection.

Abstract:
Single Image Super-Resolution (SISR) aims to reconstruct a high-resolution image from its corresponding low-resolution input. A common technique to enhance the reconstruction quality is Non-Local Attention (NLA), which leverages self-similar texture patterns in images. However, we have made a novel finding that challenges the prevailing wisdom. Our research reveals that NLA can be detrimental to SISR and even produce severely distorted textures. For example, when dealing with severely degraded textures, NLA may generate unrealistic results due to the inconsistency of non-local texture patterns. This problem is overlooked by existing works, which only measure the average reconstruction quality of the whole image, without considering the potential risks of using NLA. To address this issue, we propose a new perspective for evaluating the reconstruction quality of NLA, by focusing on the sub-pixel level that matches the pixel-wise fusion manner of NLA. From this perspective, we provide the approximate reconstruction performance upper bound of NLA, which guides us to design a concise yet effective Texture-Fidelity Strategy (TFS) to mitigate the degradation caused by NLA. Moreover, the proposed TFS can be conveniently integrated into existing NLA-based SISR models as a general building block. Based on the TFS, we develop a Deep Texture-Fidelity Network (DTFN), which achieves state-of-the-art performance for SISR. Our code and a pre-trained DTFN are available on GitHub for verification.

Abstract:
Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size very-high-resolution (VHR) RSIs. However, the lack of datasets with large-size VHR RSIs limits the deep learning algorithms’ performance on bridge detection. Due to the limitation of GPU memory in tackling large-size images, deep learning-based object detection methods commonly adopt the cropping strategy, which inevitably results in label fragmentation and discontinuous prediction. To ameliorate the scarcity of datasets, this paper proposes a large-scale dataset named GLH-Bridge comprising 6,000 VHR RSIs sampled from diverse geographic locations across the globe. These images encompass a wide range of sizes, varying from 2,048 × 2,048 to 16,384 × 16,384 pixels, and collectively feature 59,737 bridges. These bridges span diverse backgrounds, and each of them has been manually annotated, using both an oriented bounding box (OBB) and a horizontal bounding box (HBB). Furthermore, we present an efficient network for holistic bridge detection (HBD-Net) in large-size RSIs. The HBD-Net presents a separate detector-based feature fusion (SDFF) architecture and is optimized via a shape-sensitive sample re-weighting (SSRW) strategy. The SDFF architecture performs inter-layer feature fusion (IFF) to incorporate multi-scale context in the dynamic image pyramid (DIP) of the large-size image, and the SSRW strategy is employed to ensure an equitable balance in the regression weight of bridges with various aspect ratios. 
Based on the proposed GLH-Bridge dataset, we establish a bridge detection benchmark including the OBB and HBB tasks, and validate the effectiveness of the proposed HBD-Net. Additionally, cross-dataset generalization experiments on two publicly available datasets illustrate the strong generalization capability of the GLH-Bridge dataset.

Abstract:
With vigorous development in fields such as autonomous driving and remote sensing, oriented object detection has gradually gained prominence. The majority of existing methods directly perform regression on the rotation angle, which we argue has fundamental limitations of boundary discontinuity (even if using Gaussian or RotatedIoU-based losses). In this paper, a novel angle coder named phase-shifting coder (PSC) is proposed to address this issue. Unlike another well-explored alternative, i.e., angle classification, PSC removes the boundary discontinuity in a continuous and differentiable manner and thus can work together with Gaussian or RotatedIoU-based methods to further boost their performance. Moreover, by rethinking the boundary discontinuity of elongated and square-like objects as rotational symmetry of different cycles, a dual-frequency version (PSCD) is proposed to accurately predict the orientation of both types of objects. Visual analysis and extensive experiments on several popular backbone detectors and datasets demonstrate the effectiveness and potential of our approach. When facing scenarios requiring high-quality bounding boxes, the proposed methods are expected to give a competitive performance.
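
The way a phase-shifting code removes the angular boundary can be sketched with the generic N-step phase-shifting formulas. This is an illustration under stated assumptions, not necessarily the paper's exact formulation: the three-step setting, the angle range [-pi/2, pi/2), the doubling of the angle, and the function names are all assumed here.

```python
import math

N_STEPS = 3  # assumed number of phase-shifted components (at least 3 needed)


def encode(theta: float, n_steps: int = N_STEPS) -> list:
    """Map a box angle theta in [-pi/2, pi/2) to n_steps cosine values."""
    phase = 2.0 * theta  # a rectangle repeats every pi, so double the angle
    return [math.cos(phase + 2.0 * math.pi * n / n_steps) for n in range(n_steps)]


def decode(code: list) -> float:
    """Recover theta from the phase-shifted cosine values."""
    n_steps = len(code)
    s = sum(x * math.sin(2.0 * math.pi * n / n_steps) for n, x in enumerate(code))
    c = sum(x * math.cos(2.0 * math.pi * n / n_steps) for n, x in enumerate(code))
    phase = math.atan2(-s, c)  # phase lies in (-pi, pi]
    return phase / 2.0


if __name__ == "__main__":
    # Two nearly identical boxes on opposite sides of the angular boundary.
    a = encode(math.pi / 2 - 1e-3)
    b = encode(-math.pi / 2 + 1e-3)
    print(max(abs(x - y) for x, y in zip(a, b)))  # small gap in code space
    print(decode(encode(0.7)))
```

In this sketch, two angles on opposite sides of the boundary are far apart as raw regression targets but close in the encoded space, which is the property the abstract refers to as being free of boundary discontinuity.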

Abstract:
The deep unfolding approach has attracted significant attention in computer vision tasks, which well connects conventional image processing modeling manners with more recent deep learning techniques. Specifically, by establishing a direct correspondence between algorithm operators at each implementation step and network modules within each layer, one can rationally construct an almost “white box” network architecture with high interpretability. In this architecture, only the predefined component of the proximal operator, known as a proximal network, needs manual configuration, enabling the network to automatically extract intrinsic image priors in a data-driven manner. In current deep unfolding methods, such a proximal network is generally designed as a CNN architecture, whose necessity has been proven by a recent theory. That is, CNN structure substantially delivers the translational symmetry image prior, which is the most universally possessed structural prior across various types of images. However, standard CNN-based proximal networks have essential limitations in capturing the rotation symmetry prior, another universal structural prior underlying general images. This leaves a large room for further performance improvement in deep unfolding approaches. To address this issue, this study makes efforts to suggest a high-accuracy rotation equivariant proximal network that effectively embeds rotation symmetry priors into the deep unfolding framework. Especially, we deduce, for the first time, the theoretical equivariant error for such a designed proximal network with arbitrary layers under arbitrary rotation degrees. This analysis should be the most refined theoretical conclusion for such error evaluation to date and is also indispensable for supporting the rationale behind such networks with intrinsic interpretability requirements. 
Through experimental validation on different vision tasks, including blind image super-resolution, medical image reconstruction, and image de-raining, the proposed method is validated to be capable of directly replacing the proximal network in current deep unfolding architecture and readily enhancing their state-of-the-art performance. This indicates its potential usability in general vision tasks.

Abstract:
Nearly all existing scene graph generation (SGG) models have overlooked the ground-truth annotation qualities of mainstream SGG datasets, i.e., they assume: 1) all the manually annotated positive samples are equally correct; 2) all the un-annotated negative samples are absolutely background. In this article, we argue that neither of the assumptions applies to SGG: there are numerous “noisy” ground-truth predicate labels that break these two assumptions and harm the training of unbiased SGG models. To this end, we propose a novel NoIsy label CorrEction and Sample Training strategy for SGG: NICEST, which rules out these noisy label issues by generating high-quality samples and designing an effective training strategy. Specifically, it consists of: 1) NICE: it detects noisy samples and then reassigns higher-quality soft predicate labels to them. To achieve this goal, NICE contains three main steps: negative Noisy Sample Detection (Neg-NSD), positive NSD (Pos-NSD), and Noisy Sample Correction (NSC). First, in Neg-NSD, it is treated as an out-of-distribution detection problem, and the pseudo labels are assigned to all detected noisy negative samples. Then, in Pos-NSD, we use a density-based clustering algorithm to detect noisy positive samples. Lastly, in NSC, we use weighted KNN to reassign more robust soft predicate labels rather than hard labels to all noisy positive samples. 2) NIST: it is a multi-teacher knowledge distillation based training strategy, which enables the model to learn unbiased fusion knowledge. A dynamic trade-off weighting strategy in NIST is designed to penalize the bias of different teachers. Due to the model-agnostic nature of both NICE and NIST, NICEST can be seamlessly incorporated into any SGG architecture to boost its performance on different predicate categories. In addition, to better assess the generalization ability of SGG models, we propose a new benchmark, VG-OOD, by reorganizing the prevalent VG dataset. 
This reorganization deliberately makes the predicate distributions between the training and test sets as different as possible for each subject-object category pair. This new benchmark helps disentangle the influence of subject-object category biases. Extensive ablations and results on different backbones and tasks have attested to the effectiveness and generalization ability of each component of NICEST.
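
The Noisy Sample Correction step, which reassigns soft predicate labels with weighted KNN, can be sketched as a distance-weighted vote. This is a minimal sketch under stated assumptions: the inverse-distance weighting, the value of `k`, the toy 2D features, and the function name are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def soft_label(query, neighbors, k=3):
    """Distance-weighted KNN vote returning a soft label distribution.

    query:     feature vector of the noisy sample
    neighbors: list of (feature_vector, label) pairs from trusted samples
    """
    ranked = sorted(neighbors, key=lambda item: math.dist(query, item[0]))[:k]
    weights = defaultdict(float)
    for features, label in ranked:
        # Inverse-distance weighting; the epsilon avoids division by zero.
        weights[label] += 1.0 / (math.dist(query, features) + 1e-8)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


if __name__ == "__main__":
    # Toy 2D features with predicate labels (illustrative data only).
    trusted = [
        ((0.0, 0.0), "on"),
        ((1.0, 0.0), "on"),
        ((0.0, 2.0), "standing on"),
        ((5.0, 5.0), "near"),
    ]
    print(soft_label((0.0, 1.0), trusted))
```

The result is a probability distribution over predicates rather than a single hard label, so a sample whose neighbors disagree keeps that ambiguity in its training target.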

Abstract:
Motion mapping between characters with different structures but corresponding to homeomorphic graphs, meanwhile preserving motion semantics and perceiving shape geometries, poses significant challenges in skinned motion retargeting. We propose M-R²ET, a modular neural motion retargeting system to comprehensively address these challenges. The key insight driving M-R²ET is its capacity to learn residual motion modifications within a canonical skeleton space. Specifically, a cross-structure alignment module is designed to learn joint correspondences among diverse skeletons, enabling motion copy and forming a reliable initial motion for semantics and geometry perception. Besides, two residual modification modules, i.e., the skeleton-aware module and shape-aware module, preserving source motion semantics and perceiving target character geometries, effectively reduce interpenetration and contact-missing. Driven by our distance-based losses that explicitly model the semantics and geometry, these two modules learn residual motion modifications to the initial motion in a single inference without post-processing. To balance these two motion modifications, we further present a balancing gate to conduct linear interpolation between them. Extensive experiments on the public dataset Mixamo demonstrate that our M-R²ET achieves the state-of-the-art performance, enabling cross-structure motion retargeting, and providing a good balance among the preservation of motion semantics, as well as the attenuation of interpenetration and contact-missing.
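
The balancing gate described above, a linear interpolation between two residual modifications, can be sketched as follows. The per-joint gate values, the choice of which residual the gate weights, and the addition of the blended residual to the initial motion are assumptions made for illustration; all names are hypothetical.

```python
def balance(skeleton_delta, shape_delta, gate):
    """Blend two residual modifications with per-joint gate values in [0, 1].

    A gate of 1 keeps only the skeleton-aware residual and a gate of 0 keeps
    only the shape-aware residual (an assumed convention).
    """
    return [g * a + (1.0 - g) * b
            for a, b, g in zip(skeleton_delta, shape_delta, gate)]


def retarget(initial_motion, skeleton_delta, shape_delta, gate):
    """Add the blended residual to the initial (copied) motion."""
    blended = balance(skeleton_delta, shape_delta, gate)
    return [m + d for m, d in zip(initial_motion, blended)]


if __name__ == "__main__":
    # Two joints with scalar values, purely illustrative numbers.
    print(retarget([0.5, 0.5], [1.0, 2.0], [3.0, 4.0], [0.0, 1.0]))
```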

Abstract:
This paper proposes a general spectral analysis framework that thwarts a security risk in federated learning caused by groups of malicious Byzantine attackers or colluders, who conspire to upload vicious model updates to severely debase global model performance. The proposed framework delineates the strong consistency and temporal coherence between Byzantine colluders’ model updates through a spectral analysis lens, and formulates the detection of Byzantine misbehaviours as a community detection problem in weighted graphs. The modified normalized graph cut is then utilized to discern attackers from benign participants. Moreover, spectral heuristics are adopted to make the detection robust against various attacks. The proposed Byzantine colluder resilient method, i.e., FedCut, is guaranteed to converge with bounded errors. Extensive experimental results under a variety of settings justify the superiority of FedCut, which demonstrates extremely robust model accuracy (MA) under various attacks. FedCut's averaged MA is 2.1% to 16.5% better than that of the state-of-the-art Byzantine-resilient methods. In terms of the worst-case MA, FedCut is 17.6% to 69.5% better than these methods.
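
To make the graph-cut formulation concrete, the sketch below builds a weighted similarity graph over client updates and finds the two-way partition with the smallest normalized cut. It is not the FedCut algorithm: it brute-forces the cut on a tiny graph instead of using a spectral relaxation, and the cosine-based edge weights, the function names, and the toy updates are all assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ncut_partition(updates):
    """Brute-force the two-way normalized cut of a client similarity graph."""
    n = len(updates)
    # Edge weights in [0, 1]; mapping cosine similarity this way is an assumption.
    w = [[0.0 if i == j else (1.0 + cosine(updates[i], updates[j])) / 2.0
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    best_value, best_side = float("inf"), None
    for size in range(1, n // 2 + 1):
        for side in combinations(range(n), size):
            a = set(side)
            b = set(range(n)) - a
            cut = sum(w[i][j] for i in a for j in b)
            vol_a = sum(degree[i] for i in a)
            vol_b = sum(degree[i] for i in b)
            if vol_a == 0 or vol_b == 0:
                continue  # Ncut is undefined for an isolated side.
            value = cut / vol_a + cut / vol_b
            if value < best_value:
                best_value, best_side = value, a
    return best_side, best_value

# Four roughly aligned updates and two near-identical opposing ones (toy data).
updates = [(1.0, 0.9), (0.9, 1.1), (1.1, 1.0), (1.0, 1.2), (-2.0, -2.0), (-2.0, -2.1)]
smaller_side, value = ncut_partition(updates)
```

The cut separates clients 4 and 5 from the rest. Treating the smaller side as the suspected colluders is a further assumption that only holds when attackers are a minority.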

Abstract:
With the prevalent use of LiDAR sensors in autonomous driving, 3D point cloud object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve relevant points to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the Bird’s-Eye View (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and BEV representation to exploit their complementary effect in generating high-quality tracking results. PTTR++ substantially boosts the tracking performance on top of PTTR with low computational overhead. Extensive experiments over multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.

Abstract:
AI driven by deep learning is transforming many aspects of science and technology. The enormous success of deep learning stems from its unique capability of extracting essential features from Big Data for decision-making. However, the feature extraction and hidden representations in deep neural networks (DNNs) remain inexplicable, primarily because of lack of technical tools to comprehend and interrogate the feature space data. The main hurdle here is that the feature data are often noisy in nature, complex in structure, and huge in size and dimensionality, making it intractable for existing techniques to analyze the data reliably. In this work, we develop a computational framework named contrastive feature analysis (CFA) to facilitate the exploration of the DNN feature space and improve the performance of AI. By utilizing the interaction relations among the features and incorporating a novel data-driven kernel formation strategy into the feature analysis pipeline, CFA mitigates the limitations of traditional approaches and provides an urgently needed solution for the analysis of feature space data. The technique allows feature data exploration in unsupervised, semi-supervised and supervised formats to address different needs of downstream applications. The potential of CFA and its applications for pruning of neural network architectures are demonstrated using several state-of-the-art networks and well-annotated datasets across different disciplines.

Abstract:
Video-based remote physiological measurement utilizes facial videos to measure the blood volume change signal, which is also called remote photoplethysmography (rPPG). Supervised methods for rPPG measurements have been shown to achieve good performance. However, the drawback of these methods is that they require facial videos with ground truth (GT) physiological signals, which are often costly and difficult to obtain. In this paper, we propose Contrast-Phys+, a method that can be trained in both unsupervised and weakly-supervised settings. We employ a 3DCNN model to generate multiple spatiotemporal rPPG signals and incorporate prior knowledge of rPPG into a contrastive loss function. We further incorporate the GT signals into contrastive learning to adapt to partial or misaligned labels. The contrastive loss encourages rPPG/GT signals from the same video to be grouped together, while pushing those from different videos apart. We evaluate our methods on five publicly available datasets that include both RGB and Near-infrared videos. Contrast-Phys+ outperforms the state-of-the-art supervised methods, even when using partially available or misaligned GT signals, or no labels at all. Additionally, we highlight the advantages of our methods in terms of computational efficiency, noise robustness, and generalization.

Abstract:
Almost all digital videos are coded into compact representations before being transmitted. Such compact representations need to be decoded back to pixels before being displayed to humans and – as usual – before being enhanced/analyzed by machine vision algorithms. Intuitively, it is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. Therefore, we propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis, thereby being versatile for both human and machine vision. Our VNVC framework has a feature-based compression loop. In the loop, one frame is encoded into compact representations and decoded to an intermediate feature that is obtained before performing reconstruction. The intermediate feature can be used as reference in motion compensation and motion estimation through feature-based temporal context mining and cross-domain motion encoder-decoder to compress the following frames. The intermediate feature is directly fed into video reconstruction, video enhancement, and video analysis networks to evaluate its effectiveness. The evaluation shows that our framework with the intermediate feature achieves high compression efficiency for video reconstruction and satisfactory task performances with lower complexities.

Abstract:
The complexity of learning problems, such as Generative Adversarial Network (GAN) and its variants, multi-task and meta-learning, hyper-parameter learning, and a variety of real-world vision applications, demands a deeper understanding of their underlying coupling mechanisms. Existing approaches often address these problems in isolation, lacking a unified perspective that can reveal commonalities and enable effective solutions. Therefore, in this work, we propose a new framework, named Learning with Constraint Learning (LwCL), that can holistically examine challenges and provide a unified methodology to tackle all the above-mentioned complex learning and vision problems. Specifically, LwCL is designed as a general hierarchical optimization model that captures the essence of these diverse learning and vision problems. Furthermore, we develop a gradient-response based fast solution strategy to overcome optimization challenges of the LwCL framework. Our proposed framework efficiently addresses a wide range of applications in learning and vision, encompassing three categories and nine different problem types. Extensive experiments on synthetic tasks and real-world applications verify the effectiveness of our approach. The LwCL framework offers a comprehensive solution for tackling complex machine learning and computer vision problems, bridging the gap between theory and practice.

Abstract:
Tensor spectral clustering (TSC) is an emerging approach that explores multi-wise similarities to boost learning. However, two key challenges have yet to be well addressed in the existing TSC methods: (1) The construction and storage of high-order affinity tensors to encode the multi-wise similarities are memory-intensive and hamper their applicability, and (2) they mostly employ a two-stage approach that integrates multiple affinity tensors of different orders to learn a consensus tensor spectral embedding, thus often leading to a suboptimal clustering result. To this end, this paper proposes a tensor spectral clustering network (TSC-Net) to achieve one-stage learning of a consensus tensor spectral embedding, while reducing the memory cost. TSC-Net employs a deep neural network that learns to map the input samples to the consensus tensor spectral embedding, guided by a TSC objective with multiple affinity tensors. It uses stochastic optimization to calculate a small part of the affinity tensors, thereby avoiding loading the whole affinity tensors for computation and significantly reducing the memory cost. Through using an ensemble of multiple affinity tensors, the TSC can dramatically improve clustering performance. Empirical studies on benchmark datasets demonstrate that TSC-Net outperforms the recent baseline methods.

Abstract:
We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states from an uncurated set of videos from the Internet. The model is self-supervised by the causal ordering signal, i.e., initial object state → manipulating action → end state. Second, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions, such as pouring water and pouring coffee, together. Third, we collect a new dataset, named ChangeIt, with more than 2600 hours of video and 34 thousand changes of object states. We report results on an existing instructional video dataset COIN as well as our new large-scale ChangeIt dataset containing tens of thousands of long uncurated web videos depicting various interactions such as hole drilling, cream whisking, or paper plane folding. We show that our multi-task model achieves a relative improvement of 40% over the prior methods and significantly outperforms both image-based and video-based zero-shot models for this problem.

Abstract:
We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations in an end-to-end fashion for challenging graph-constrained architectural layout generation tasks. The proposed graph-Transformer-based generator includes a novel graph Transformer encoder that combines graph convolutions and self-attentions in a Transformer to model both local and global interactions across connected and non-connected graph nodes. Specifically, the proposed connected node attention (CNA) and non-connected node attention (NNA) aim to capture the global relations across connected nodes and non-connected nodes in the input graph, respectively. The proposed graph modeling block (GMB) aims to exploit local vertex interactions based on a house layout topology. Moreover, we propose a new node classification-based discriminator to preserve the high-level semantic and discriminative node features for different house components. To maintain the relative spatial relationships between ground truth and predicted graphs, we also propose a novel graph-based cycle-consistency loss. Finally, we propose a novel self-guided pre-training method for graph representation learning. This approach involves simultaneous masking of nodes and edges at an elevated mask ratio (i.e., 40%) and their subsequent reconstruction using an asymmetric graph-centric autoencoder architecture. This method markedly improves the model's learning proficiency and expediency. Experiments on three challenging graph-constrained architectural layout generation tasks (i.e., house layout generation, house roof generation, and building layout generation) with three public datasets demonstrate the effectiveness of the proposed method in terms of objective quantitative scores and subjective visual realism. New state-of-the-art results are established by large margins on these three tasks.

Abstract:
Time series remains one of the most challenging modalities in machine learning research. Out-of-distribution (OOD) detection and generalization on time series often face difficulties due to their non-stationary nature, wherein the distribution changes over time. The dynamic distributions within time series present significant challenges for existing algorithms, especially in identifying invariant distributions, as most focus on scenarios where domain information is provided as prior knowledge. This paper aims to address the issues induced by non-stationarity in time series through the exploration of subdomains within a complete dataset for generalized representation learning. We propose Diversify, a general framework, for OOD detection and generalization on dynamic distributions of time series. Diversify operates through an iterative process: first identifying the ’worst-case’ latent distribution scenario, then working to minimize the gaps between these latent distributions. We implement Diversify by combining existing OOD detection methods according to either extracted features or outputs of models for detection while we also directly utilize outputs for classification. Theoretical insights support the framework's validity. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that Diversify learns more generalized features and significantly outperforms other baselines.

Abstract:
Modern methods mainly regard lane detection as a problem of pixel-wise segmentation, which struggles with efficiency and with challenging scenarios like severe occlusions and extreme lighting conditions. In human perception, the recognition of lanes under severe occlusions and extreme lighting conditions is mainly based on contextual and global information. Motivated by this observation, we propose a novel, simple, yet effective formulation aiming at ultra-fast speed and robustness in challenging scenarios. Specifically, we treat the process of lane detection as an anchor-driven ordinal classification problem using global features. First, we represent lanes with sparse coordinates on a series of hybrid (row and column) anchors. With the help of the anchor-driven representation, we then reformulate the lane detection task as an ordinal classification problem to get the coordinates of lanes. Our method can significantly reduce the computational cost with the anchor-driven representation. Using the large receptive field property of the ordinal classification formulation, it can also handle challenging scenarios. Extensive experiments on four lane detection datasets show that our method achieves state-of-the-art performance in terms of both speed and accuracy. A lightweight version can even achieve 300+ frames per second (FPS). Our code is at https://github.com/cfzd/Ultra-Fast-Lane-Detection-v2.
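
One way to read "classification to get the coordinates of lanes" is that each row anchor is divided into cells, the network scores every cell, and a coordinate is decoded from those scores. The sketch below decodes by taking the expectation over a softmax. The function name `decode_row_anchor`, the cell layout, and the choice of expectation-based decoding are assumptions for illustration, not details confirmed by the abstract.

```python
import math

def decode_row_anchor(logits, image_width):
    """Turn per-cell classification logits on one row anchor into an x-coordinate.

    The expectation over the softmax distribution yields a sub-cell position.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expected_cell = sum(k * p for k, p in enumerate(probs))
    cell_width = image_width / len(logits)
    # Map the (fractional) cell index to the centre of that cell in pixels.
    return (expected_cell + 0.5) * cell_width

# Two equally likely neighbouring cells: the decoded position lies between them.
x = decode_row_anchor([0.0, 5.0, 5.0, 0.0], image_width=400)
```

Because only a handful of anchors and cells are classified, the output is far smaller than a per-pixel segmentation map, which is where the computational saving comes from.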

Abstract:
Typical approaches that learn crowd density maps are limited to extracting the supervisory information from the loosely organized spatial information in the crowd dot/density maps. This paper tackles this challenge by performing the supervision in the frequency domain. More specifically, we devise a new loss function for crowd analysis called generalized characteristic function loss (GCFL). This loss carries out two steps: 1) transforming the spatial information in density or dot maps to the frequency domain; 2) calculating a loss value between their frequency contents. For step 1, we establish a series of theoretical foundations by extending the definition of the characteristic function for probability distributions to density maps, as well as proving some vital properties of the extended characteristic function. After taking the characteristic function of the density map, its information in the frequency domain is well-organized and hierarchically distributed, while in the spatial domain it is loosely organized and dispersed everywhere. In step 2, we design a loss function that can fit the information organization in the frequency domain, allowing the exploitation of the well-organized frequency information for the supervision of crowd analysis tasks. The loss function can be adapted to various crowd analysis tasks through the specification of its window functions. In this paper, we demonstrate its power in three tasks: Crowd Counting, Crowd Localization and Noisy Crowd Counting. We show the advantages of our GCFL compared to other SOTA losses and its competitiveness to other SOTA methods by theoretical analysis and empirical results on benchmark datasets. Our codes are available at https://github.com/wbshu/Crowd_Counting_in_the_Frequency_Domain.
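
The two steps above can be sketched for dot maps, where the characteristic function reduces to a sum of complex exponentials over the annotated points. This is a simplified illustration rather than the GCFL itself: the window functions mentioned in the abstract are omitted, and the function names, the plain L1 comparison, and the frequency grid are assumptions.

```python
import cmath

def char_function(points, t):
    """Characteristic function of a dot map: sum over points of exp(i * <t, x>)."""
    return sum(cmath.exp(1j * (t[0] * x + t[1] * y)) for x, y in points)

def frequency_loss(pred_points, gt_points, freqs):
    """L1 distance between two characteristic functions over sampled frequencies."""
    return sum(
        abs(char_function(pred_points, t) - char_function(gt_points, t))
        for t in freqs
    )

gt = [(10.0, 20.0), (30.0, 5.0), (12.0, 18.0)]
shifted = [(x + 2.0, y) for x, y in gt]  # same count, displaced heads
freqs = [(0.1 * i, 0.1 * j) for i in range(-3, 4) for j in range(-3, 4)]

count = char_function(gt, (0.0, 0.0))          # value at zero frequency
loss_same = frequency_loss(gt, gt, freqs)
loss_shifted = frequency_loss(shifted, gt, freqs)
```

The value at zero frequency equals the number of points, so the count is carried by the lowest frequency while higher frequencies encode where the points lie; the shifted prediction has the right count but a nonzero loss.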

Abstract:
Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2mm on the Human3.6M dataset, improving upon the state-of-the-art approach (Lin et al., 2021) by more than 10% with fewer than one-third of the parameters.

Abstract:
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.

Abstract:
Lossless and near-lossless image compression is of paramount importance to professional users in many technical fields, such as medicine, remote sensing, precision engineering and scientific research. But despite rapidly growing research interests in learning-based image compression, no published method offers both lossless and near-lossless modes. In this paper, we propose a unified and powerful deep lossy plus residual (DLPR) coding framework for both lossless and near-lossless image compression. In the lossless mode, the DLPR coding system first performs lossy compression and then lossless coding of residuals. We solve the joint lossy and residual compression problem in the approach of VAEs, and add autoregressive context modeling of the residuals to enhance lossless compression performance. In the near-lossless mode, we quantize the original residuals to satisfy a given ℓ∞ error bound, and propose a scalable near-lossless compression scheme that works for variable ℓ∞ bounds instead of training multiple networks. To expedite the DLPR coding, we increase the degree of algorithm parallelization by a novel design of coding context, and accelerate the entropy coding with adaptive residual interval. Experimental results demonstrate that the DLPR coding system achieves both the state-of-the-art lossless and near-lossless image compression performance with competitive coding speed.
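
The near-lossless mode relies on quantizing residuals so that every reconstructed value stays within a given ℓ∞ bound. A minimal sketch of the standard uniform residual quantizer with bin size 2τ + 1 is shown below; the function name `quantize_residual` is hypothetical, and whether the DLPR system uses exactly this quantizer is an assumption.

```python
def quantize_residual(r, tau):
    """Quantize an integer residual so that the reconstruction error is at most tau."""
    step = 2 * tau + 1
    sign = -1 if r < 0 else 1
    # Round |r| to the nearest multiple of the step size.
    return sign * step * ((abs(r) + tau) // step)

# Maximum absolute error over the full residual range of 8-bit images, per bound.
max_errors = {
    tau: max(abs(r - quantize_residual(r, tau)) for r in range(-255, 256))
    for tau in range(0, 6)
}
```

With `tau = 0` the quantizer is the identity, which recovers the lossless mode; larger bounds shrink the residual alphabet and therefore the bitrate, at the price of a bounded per-pixel error.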

Abstract:
Supervised person re-identification (re-id) methods require expensive manual labeling. Although unsupervised re-id methods can reduce the requirement for labeled datasets, the performance of these methods is lower than that of the supervised alternatives. Recently, some weakly supervised learning-based person re-id methods have been proposed, which strike a balance between supervised and unsupervised learning. Nevertheless, most of these models require auxiliary fully supervised datasets or ignore the interference of noisy tracklets. To address this problem, in this work, we formulate a weakly supervised tracklet association learning (WS-TAL) model only leveraging the video labels. Specifically, we first propose an intra-bag tracklet discrimination learning (ITDL) term. It can capture the associations between person identities and images by assigning pseudo labels to each person image in a bag. Then, the discriminative feature for each person is learned by utilizing the obtained associations after filtering the noisy tracklets. Based on that, a cross-bag tracklet association learning (CTAL) term is presented to explore the potential tracklet associations between bags by mining reliable positive tracklet pairs and hard negative pairs. Finally, these two complementary terms are jointly optimized to train our re-id model. Extensive experiments on the weakly labeled datasets demonstrate that WS-TAL achieves 88.1% and 90.3% rank-1 accuracy on the MARS and DukeMTMC-VideoReID datasets respectively. The performance of our model surpasses the state-of-the-art weakly supervised models by a large margin, and even outperforms some fully supervised re-id models.

Abstract:
Ground Penetrating Radar (GPR) has been widely used in pipeline detection and underground diagnosis. In practical applications, the characteristics of the GPR data of the detected area and the likely underground anomalous structures are rarely known before the obtained GPR data are fully analyzed, which makes it challenging to identify the underground structures or anomalies automatically. In this article, a GPR B-scan image diagnosis method based on learning in the model space is proposed. The idea of learning in the model space is to use models fitted on parts of data as more stable and parsimonious representations of the data. For the GPR image, a 2-Direction Echo State Network (2D-ESN) is proposed to fit the image segments through the next item prediction. By building the connections between the points on the image in both the horizontal and vertical directions, the 2D-ESN regards the GPR image segment as a whole and can effectively capture the dynamic characteristics of the GPR image. Then, semi-supervised and supervised learning methods can be further implemented on the 2D-ESN models for underground diagnosis. Experiments on real-world datasets are conducted, and the results demonstrate the effectiveness of the proposed model.

Abstract:
We propose a new image level weakly supervised segmentation approach for datasets with a single object class of interest. Our approach is based on a regularized loss function inspired by the classical Conditional Random Field (CRF) modeling. Our loss models properties of generic objects, and we use it to guide CNN towards segments that are more likely to correspond to the object, thus avoiding the need for pixel precise annotations. Training CNN with regularized loss is a difficult task for gradient descent. We develop an annealing algorithm which is crucial for a successful training. Furthermore, we develop an approach for hyperparameter setting for the most important components of our regularized loss. This is far from trivial, since there is no pixel precise ground truth for guidance. The advantage of our method is that we use a standard CNN architecture and an easy to interpret loss function, derived from classical CRF models. Furthermore, we apply the same loss function for any task/dataset. We first evaluate our approach for salient object segmentation and co-segmentation. These tasks naturally involve one object class of interest. Then we adapt our approach to image level weakly supervised multi-class semantic segmentation. We obtain state-of-the-art results.

Abstract:
The amount of face images has been witnessing an explosive increase in the last decade, where various distortions inevitably exist on transmitted or stored face images. The distortions lead to visible and undesirable degradation on face images, affecting their quality of experience (QoE). To address this issue, this paper proposes a novel Transformer-based method for quality assessment on face images (named as TransFQA). Specifically, we first establish a large-scale face image quality assessment (FIQA) database, which includes 42,125 face images with diversifying content at different distortion types. Through an extensive crowdsource study, we obtain 712,808 subjective scores, which to the best of our knowledge contribute to the largest database for assessing the quality of face images. Furthermore, by investigating the established database, we comprehensively analyze the impacts of distortion types and facial components (FCs) on the overall image quality. Accordingly, we propose the TransFQA method, in which the FC-guided Transformer network (FT-Net) is developed to integrate the global context, face region and FC detailed features via a new progressive attention mechanism. Then, a distortion-specific prediction network (DP-Net) is designed to weight different distortions and accurately predict final quality scores. Finally, the experiments comprehensively verify that our TransFQA method significantly outperforms other state-of-the-art methods for quality assessment on face images.

Abstract:
Bayesian Neural Networks (BNNs) have long been considered an ideal, yet unscalable solution for improving the robustness and the predictive uncertainty of deep neural networks. While they could capture more accurately the posterior distribution of the network parameters, most BNN approaches are either limited to small networks or rely on constraining assumptions, e.g., parameter independence. These drawbacks have enabled prominence of simple, but computationally heavy approaches such as Deep Ensembles, whose training and testing costs increase linearly with the number of networks. In this work we aim for efficient deep BNNs amenable to complex computer vision architectures, e.g., ResNet-50 DeepLabv3+, and tasks, e.g., semantic segmentation and image classification, with fewer assumptions on the parameters. We achieve this by leveraging variational autoencoders (VAEs) to learn the interaction and the latent distribution of the parameters at each network layer. Our approach, called Latent-Posterior BNN (LP-BNN), is compatible with the recent BatchEnsemble method, leading to highly efficient (in terms of computation and memory during both training and testing) ensembles. LP-BNNs attain competitive results across multiple metrics in several challenging benchmarks for image classification, semantic segmentation, and out-of-distribution detection.

Abstract:
Multi-view learning is dedicated to integrating information from different views and improving the generalization performance of models. However, in most current works, learning under different views is largely independent, overlooking common information mapping patterns that exist between these views. This paper proposes a Structure Mapping Generative adversarial network (SM-GAN) framework, which utilizes the consistency and complementarity of multi-view data from the innovative perspective of information mapping. Specifically, based on network-structured multi-view data, a structural information mapping model is proposed to capture hierarchical interaction patterns among views. Subsequently, three different types of graph convolutional operations are designed in SM-GAN based on the model. Compared with regular GAN, we add a structural information mapping module between the encoder and decoder within the generator, completing the structural information mapping from the micro-view to the macro-view. This paper conducts extensive validation experiments using public imaging genetics data in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. It is shown that SM-GAN outperforms baseline and advanced methods in multi-label classification and evolution prediction tasks.

Abstract:
Physical adversarial attacks pose a severe threat to DNN-based object detectors. To enhance security, a combination of visible and infrared sensors is deployed in various scenarios, which has proven effective in disabling existing single-modal physical attacks. To further demonstrate the potential risks in such cases, we design a unified adversarial patch that can perform cross-modal physical attacks, achieving evasion in both modalities simultaneously with a single patch. Given the different imaging mechanisms of visible and infrared sensors, our work manipulates patches’ shape features, which can be captured in different modalities when they undergo changes. To address the resulting challenges, we propose a novel boundary-limited shape optimization approach that aims to achieve compact and smooth shapes for the adversarial patch, making it easy to implement in the physical world. A score-aware iterative evaluation method is also introduced to balance the fooling degree between visible and infrared detectors during optimization, which guides the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. Furthermore, we propose an Affine-Transformation-based enhancement strategy that makes the learnable shape robust to various angles, thus mitigating the issue of shape deformation caused by different shooting angles in the real world. Our method is evaluated against several state-of-the-art object detectors, achieving an Attack Success Rate (ASR) of over 80%. We also demonstrate the effectiveness of our approach in physical-world scenarios under various settings, including different angles, distances, postures, and scenes for both visible and infrared sensors.

Abstract:
How can one analyze detailed 3D biological objects, such as neuronal and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects – each subtree has a main branch with some side branches attached – and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neuronal and botanical trees. We also demonstrate its application to various shape analysis tasks such as (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.

Abstract:
This article aims to use graphic engines to simulate a large amount of training data that has free annotations and possibly strongly resembles real-world data. Between synthetic and real data, a two-level domain gap exists, involving the content level and the appearance level. While the latter is concerned with appearance style, the former problem arises from a different mechanism, i.e., content mismatch in attributes such as camera viewpoint, object placement and lighting conditions. In contrast to the widely-studied appearance-level gap, the content-level discrepancy has not been broadly studied. To address the content-level misalignment, we propose an attribute descent approach that automatically optimizes engine attributes to enable synthetic data to approximate real-world data. We verify our method on object-centric tasks, wherein an object takes up a major portion of an image. In these tasks, the search space is relatively small, and the optimization of each attribute yields sufficiently obvious supervision signals. We collect a new synthetic asset VehicleX, and reformat and reuse the existing synthetic assets ObjectX and PersonX. Extensive experiments on image classification and object re-identification confirm that adapted synthetic data can be effectively used in three scenarios: training with synthetic data only, training data augmentation and numerically understanding dataset content.

Abstract:
Obtaining accurate pixel-level localization from class labels is a crucial process in weakly supervised semantic segmentation and object localization. Attribution maps from a trained classifier are widely used to provide pixel-level localization, but their focus tends to be restricted to a small discriminative region of the target object. An AdvCAM is an attribution map of an image that is manipulated to increase the classification score produced by a classifier before the final softmax or sigmoid layer. This manipulation is realized in an anti-adversarial manner, so that the original image is perturbed along pixel gradients in directions opposite to those used in an adversarial attack. This process enhances non-discriminative yet class-relevant features, which make an insufficient contribution to previous attribution maps, so that the resulting AdvCAM identifies more regions of the target object. In addition, we introduce a new regularization procedure that inhibits the incorrect attribution of regions unrelated to the target object and the excessive concentration of attributions on a small region of the target object. Our method achieves a new state-of-the-art performance in weakly and semi-supervised semantic segmentation, on both the PASCAL VOC 2012 and MS COCO 2014 datasets. In weakly supervised object localization, it achieves a new state-of-the-art performance on the CUB-200-2011 and ImageNet-1K datasets.
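
The anti-adversarial manipulation described in this abstract can be illustrated with a minimal sketch. It assumes a toy linear score function in place of a trained classifier; the names `anti_adversarial_climb`, `score_grad`, `step`, and `iters` are hypothetical, and the regularization procedure and the attribution-map computation are omitted.

```python
import numpy as np

def anti_adversarial_climb(x, score_grad, step=0.1, iters=5):
    """Perturb x along the gradient of the class score.

    An adversarial attack moves the input so the score drops; here the
    input moves in the opposite direction, so the score rises.
    """
    x = x.copy()
    for _ in range(iters):
        x = x + step * score_grad(x)  # gradient ascent on the pre-softmax score
    return x

# Toy stand-in for a classifier (assumption): s(x) = w . x, so grad s = w.
w = np.array([0.5, -1.0, 2.0])

def score(x):
    return float(w @ x)

x0 = np.zeros(3)
x1 = anti_adversarial_climb(x0, lambda x: w)
```

In a real setting the gradient would come from automatic differentiation of the classifier's logit with respect to the image pixels, rather than from a fixed vector.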

Abstract:
This paper presents a novel unsupervised domain adaptation method for semantic segmentation. We argue that a good representation of the target-domain data should keep both the knowledge from the source domain and the target-domain-specific information. To obtain the knowledge from the source domain, we first learn a set of bases to characterize the feature distribution of the source domain, then features from both the source and the target domain are re-represented as a weighted summation of the source bases. A discriminator is additionally introduced to make the re-representation responsibilities of both domain features under the same bases indistinguishable. In this way, the domain gap between the source re-representation and target re-representation is minimized, and the re-represented target domain features contain the source domain information. Then we combine the feature re-representation with the original domain-specific feature together for subsequent pixel-wise classification. To further make the re-represented target features semantically meaningful, a Reliable Pseudo Label Retraining (RPLR) strategy is proposed, which utilizes the consistency of the prediction by the networks trained with multi-view source images to select the clean pseudo labels on unlabeled target images for re-training. Extensive experiments demonstrate the competitive performance of our approach for unsupervised domain adaptation on the semantic segmentation benchmarks.

Abstract:
Most state-of-the-art object detection methods, which are trained with high-definition images, have achieved impressive performance on several public benchmarks. However, existing detectors are often sensitive to visual variations and out-of-distribution data due to the domain gap caused by various confounders, e.g., adverse weather conditions. To bridge the gap, previous methods have mainly explored domain alignment, which requires collecting domain-specific training samples. In this paper, we introduce a novel domain adaptation model to discover a weather condition invariant feature representation. Specifically, we first employ a memory network to develop a confounder dictionary, which stores prototypes of object features under various scenarios. To guarantee the representativeness of each prototype in the dictionary, a dynamic item extraction strategy is used to update the memory dictionary. After that, we introduce a causal intervention reasoning module to explore the invariant representation of a specific object under different weather conditions. Finally, a categorical consistency regularization is used to constrain the similarities between categories in order to automatically search for the aligned instances among distinct domains. Experiments are conducted on several public benchmarks (RTTS, Foggy-Cityscapes, RID, and BDD 100K) with state-of-the-art performance achieved under multiple weather conditions.

Abstract:
The objective of this paper is to learn dense 3D shape correspondence for topology-varying generic objects in an unsupervised manner. Conventional implicit functions estimate the occupancy of a 3D point given a shape latent code. Instead, our novel implicit function produces a probabilistic embedding to represent each 3D point in a part embedding space. Assuming the corresponding points are similar in the embedding space, we implement dense correspondence through an inverse function mapping from the part embedding vector to a corresponded 3D point. Both functions are jointly learned with several effective and uncertainty-aware loss functions to realize our assumption, together with the encoder generating the shape latent code. During inference, if a user selects an arbitrary point on the source shape, our algorithm can automatically generate a confidence score indicating whether there is a correspondence on the target shape, as well as the corresponding semantic point if there is one. Such a mechanism inherently benefits man-made objects with different part constitutions. The effectiveness of our approach is demonstrated through unsupervised 3D semantic correspondence and shape segmentation.

Abstract:
Light field disparity estimation is an essential task in computer vision. Currently, supervised learning-based methods have achieved better performance than both unsupervised and optimization-based methods. However, the generalization capacity of supervised methods on real-world data, where no ground truth is available for training, remains limited. In this paper, we argue that unsupervised methods can achieve not only much stronger generalization capacity on real-world data but also more accurate disparity estimation results on synthetic datasets. To fulfill this goal, we present the Occlusion Pattern Aware Loss, named OPAL, which successfully extracts and encodes general occlusion patterns inherent in the light field for calculating the disparity loss. OPAL enables: i) accurate and robust disparity estimation by teaching the network how to handle occlusions effectively and ii) significantly reduced network parameters required for accurate and efficient estimation. We further propose an EPI transformer and a gradient-based refinement module for achieving more accurate and pixel-aligned disparity estimation results. Extensive experiments demonstrate our method not only significantly improves the accuracy compared with SOTA unsupervised methods, but also possesses stronger generalization capacity on real-world data compared with SOTA supervised methods. Last but not least, the network training and inference efficiency are much higher than existing learning-based methods. Our code will be made publicly available.

Abstract:
This work targets designing a principled and unified training-free framework for Neural Architecture Search (NAS), with high performance, low cost, and in-depth interpretation. NAS has been explosively studied to automate the discovery of top-performer neural networks, but suffers from heavy resource consumption and often incurs search bias due to truncated training or approximations. Recent NAS works (Mellor et al. 2021; Chen et al. 2021; Abdelfattah et al. 2021) have started to explore indicators that can predict a network's performance without training. However, they either leveraged limited properties of deep networks, or the benefits of their training-free indicators were not applied to more extensive search methods. By rigorous correlation analysis, we present a unified framework to understand and accelerate NAS, by disentangling “TEG” characteristics of searched networks – Trainability, Expressivity, Generalization – all assessed in a training-free manner. The TEG indicators could be scaled up and integrated with various NAS search methods, including both supernet and single-path NAS approaches. Extensive studies validate the effective and efficient guidance from our TEG-NAS framework, leading to both improved search accuracy and over 56% reduction in search time cost. Moreover, we visualize search trajectories on three landscapes of “TEG” characteristics, observing that a good local minimum is easier to find on NAS-Bench-201 given its simple topology, whereas balancing “TEG” characteristics is much harder on the DARTS space due to its complex landscape geometry.

Abstract:
This paper proposes a novel pipeline to estimate a non-parametric environment map with high dynamic range from a single human face image. Lighting-independent and -dependent intrinsic images of the face are first estimated separately in a cascaded network. The influence of face geometry on the two lighting-dependent intrinsics, diffuse shading and specular reflection, are further eliminated by distributing the intrinsics pixel-wise onto spherical representations using the surface normal as indices. This results in two representations simulating images of a diffuse sphere and a glossy sphere under the input scene lighting. Taking into account the distinctive nature of light sources and ambient terms, we further introduce a two-stage lighting estimator to predict both accurate and realistic lighting from these two representations. Our model is trained supervisedly on a large-scale and high-quality synthetic face image dataset. We demonstrate that our method allows accurate and detailed lighting estimation and intrinsic decomposition, outperforming state-of-the-art methods both qualitatively and quantitatively on real face images.

Abstract:
Large-scale Gaussian process (GP) modeling is becoming increasingly important in machine learning. However, the standard modeling method of GPs, which uses the maximum likelihood method and the best linear unbiased predictor, is designed to run on a single computer, which often has limited computing power. Therefore, there is a growing demand for approximate alternatives, such as composite likelihood methods, that can take advantage of the power of multiple computers. However, these alternative methods in the literature offer limited options for practitioners because most methods focus more on computational efficiency rather than statistical efficiency. Limited accurate solutions to the parameter estimation and prediction for fast GP modeling are available in the literature for supercomputing practitioners. Therefore, this study develops an optimal composite likelihood (OCL) scheme for distributed GP modeling that can minimize information loss in parameter estimation and model prediction. The proposed predictor, called the best linear unbiased block predictor (BLUBP), has the minimum prediction variance given the partitioned data. Numerical examples illustrate that both the proposed composite likelihood estimation and prediction methods provide more accurate performance than their traditional counterparts under various cases, and an extremely close approximation to the standard modeling method is observed.

Abstract:
In this paper, we propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with position-level annotations (i.e. annotations of object centers and categories). In order to remedy the information loss from box annotations to centers, our method makes use of synthetic 3D shapes to convert the position-level annotations into virtual scenes with box-level annotations, and in turn utilizes the fully-annotated virtual scenes to complement the real labels. Specifically, we first present a shape-guided label-enhancement method, which assembles 3D shapes into physically reasonable virtual scenes according to the coarse scene layout extracted from position-level annotations. Then we transfer the information contained in the virtual scenes back to real ones by applying a virtual-to-real domain adaptation method, which refines the annotated object centers and additionally supervises the training of detector with the virtual scenes. Since the shape-guided label enhancement method generates virtual scenes by human-heuristic physical constraints, the layout of the fixed virtual scenes may be unreasonable with varied object combinations. To address this, we further present differentiable label enhancement to optimize the virtual scenes including object scales, orientations and locations in a data-driven manner. Moreover, we further propose a label-assisted self-training strategy to fully exploit the capability of detector. By reusing the position-level annotations and virtual scenes, we fuse the information from both domains and generate box-level pseudo labels on the real scenes, which enables us to directly train a detector in fully-supervised manner. Extensive experiments on the widely used ScanNet and Matterport3D datasets show that our approach surpasses current weakly-supervised and semi-supervised methods by a large margin, and achieves comparable detection performance with some popular fully-supervised methods with less than 5% of the labeling labor.

Abstract:
In this work, we revisit the prior mask guidance proposed in “Prior Guided Feature Enrichment Network for Few-Shot Segmentation”. The prior mask serves as an indicator that highlights the regions of interest of unseen categories, and it is effective in achieving better performance on different frameworks of recent studies. However, the current method directly takes the maximum element-to-element correspondence between the query and support features to indicate the probability of belonging to the target class, thus the broader contextual information is seldom exploited during the prior mask generation. To address this issue, first, we propose the Context-aware Prior Mask (CAPM) that leverages additional nearby semantic cues for better locating the objects in query images. Second, since the maximum correlation value is vulnerable to noisy features, we take one step further by incorporating a lightweight Noise Suppression Module (NSM) to screen out the unnecessary responses, yielding high-quality masks for providing the prior knowledge. Both contributions are experimentally shown to have substantial practical merit, and the new model named PFENet++ significantly outperforms the baseline PFENet as well as all other competitors on three challenging benchmarks: PASCAL-5$^i$, COCO-20$^i$ and FSS-1000. The new state-of-the-art performance is achieved without compromising the efficiency, manifesting the potential for being a new strong baseline in few-shot semantic segmentation.

Abstract:
Scene flow describes the 3D motion in a scene. It can be modeled as a single task or as a composite of the auxiliary tasks of depth, camera motion, and optical flow estimation. Deep learning's emergence in recent years has broadened the horizons for new methodologies in estimating these tasks, either as separate tasks or as joint tasks to reconstruct the scene flow. The sequence of images that are either synthesized or captured by a camera is used as input for these methods, which face the challenge of dealing with various situations in images to provide the most accurate motion, such as image quality. Nowadays, images have been superseded by point clouds, which provide 3D information, thereby expediting and enhancing the estimated motion. In this paper, we dig deeply into scene flow estimation in the deep learning era. We provide a comprehensive overview of the important topics regarding both image-based and point-cloud-based methods. In addition, we cover the methodologies for each category, highlighting the network architecture. Furthermore, we provide a comparison between these methods in terms of performance and efficiency. Finally, we conclude this survey with insights and discussions on the open issues and future research directions.

Abstract:
We address the problem of establishing accurate correspondences between two images. We present a flexible framework that can easily adapt to both geometric and semantic matching. Our contribution consists of three parts. Firstly, we propose an end-to-end trainable framework that uses the coarse-to-fine matching strategy to accurately find the correspondences. We generate feature maps in two levels of resolution, enforce the neighbourhood consensus constraint on the coarse feature maps by 4D convolutions and use the resulting correlation map to regulate the matches from the fine feature maps. Secondly, we present three variants of the model with different focuses. Namely, a universal correspondence model named DualRC that is suitable for both geometric and semantic matching, an efficient model named DualRC-L tailored for geometric matching with a lightweight neighbourhood consensus module that significantly accelerates the pipeline for high-resolution input images, and the DualRC-D model in which we propose a novel dynamically adaptive neighbourhood consensus module (DyANC) that dynamically selects the most suitable non-isotropic 4D convolutional kernels with the proper neighbourhood size to account for the scale variation. Last, we thoroughly experiment on public benchmarks for both geometric and semantic matching, showing superior performance in both cases.

Abstract:
Deep convolutional neural networks (CNNs) can be easily tricked to give incorrect outputs by adding tiny perturbations to the input that are imperceptible to humans. This makes them susceptible to adversarial attacks, and poses significant security risks to deep learning systems, and presents a great challenge in making CNNs robust against such attacks. An influx of defense strategies have thus been proposed to improve the robustness of CNNs. Current attack methods, however, may fail to accurately or efficiently evaluate the robustness of defending models. In this paper, we thus propose a unified $\ell_p$ white-box attack strategy, LAFIT, to harness the defender's latent features in its gradient descent steps, and further employ a new loss function to normalize logits to overcome floating-point-based gradient masking. We show that not only is it more efficient, but it is also a stronger adversary than the current state-of-the-art when examined across a wide range of defense mechanisms. This suggests that adversarial attacks/defenses could be contingent on the effective use of the defender's hidden components, and robustness evaluation should no longer view models holistically.

Abstract:
The task of Open-World Compositional Zero-Shot Learning (OW-CZSL) is to recognize novel state-object compositions in images from all possible compositions, where the novel compositions are absent during the training stage. The performance of conventional methods degrades significantly due to the large cardinality of possible compositions. Some recent works consider simple primitives (i.e., states and objects) independent and separately predict them to reduce cardinality. However, it ignores the heavy dependence between states, objects, and compositions. In this paper, we model the dependence via feasibility and contextuality. Feasibility-dependence refers to the unequal feasibility of compositions, e.g., hairy is more feasible with cat than with building in the real world. Contextuality-dependence represents the contextual variance in images, e.g., cat shows diverse appearances when it is dry or wet. We design Semantic Attention (SA) to capture the feasibility semantics to alleviate impossible predictions, driven by the visual similarity between simple primitives. We also propose a generative Knowledge Disentanglement (KD) to disentangle images into unbiased representations, easing the contextual bias. Moreover, we complement the independent compositional probability model with the learned feasibility and contextuality compatibly. In the experiments, we demonstrate our superior or competitive performance, SA-and-kD-guided Simple Primitives (SAD-SP), on three benchmark datasets.

Abstract:
Event cameras respond to scene dynamics and provide signals naturally suitable for motion estimation with advantages, such as high dynamic range. The emerging field of event-based vision motivates a revisit of fundamental computer vision tasks related to motion, such as optical flow and depth estimation. However, state-of-the-art event-based optical flow methods tend to originate in frame-based deep-learning methods, which require several adaptations (data conversion, loss function, etc.) as they have very different properties. We develop a principled method to extend the Contrast Maximization framework to estimate dense optical flow, depth, and ego-motion from events alone. The proposed method sensibly models the space-time properties of event data and tackles the event alignment problem. It designs the objective function to prevent overfitting, deals better with occlusions, and improves convergence using a multi-scale approach. With these key elements, our method ranks first among unsupervised methods on the MVSEC benchmark and is competitive on the DSEC benchmark. Moreover, it allows us to simultaneously estimate dense depth and ego-motion, exposes the limitations of current flow benchmarks, and produces remarkable results when it is transferred to unsupervised learning settings. Along with various downstream applications shown, we hope the proposed method becomes a cornerstone on event-based motion-related tasks.

Abstract:
Tensor networks developed in the context of condensed matter physics try to approximate order-$N$ tensors with a reduced number of degrees of freedom that is only polynomial in $N$ and arranged as a network of partially contracted smaller tensors. As we have recently demonstrated in the context of quantum many-body physics, computation costs can be further substantially reduced by imposing constraints on the canonical polyadic (CP) rank of the tensors in such networks. Here, we demonstrate how tree tensor networks (TTN) with CP rank constraints and tensor dropout can be used in machine learning. The approach is found to outperform other tensor-network-based methods in Fashion-MNIST image classification. A low-rank TTN classifier with branching ratio $b=4$ reaches a test set accuracy of 90.3% with low computation costs. Consisting of mostly linear elements, tensor network classifiers avoid the vanishing gradient problem of deep neural networks. The CP rank constraints have additional advantages: The number of parameters can be decreased and tuned more freely to control overfitting, improve generalization properties, and reduce computation costs. They allow us to employ trees with large branching ratios, substantially improving the representation power.
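
To make the effect of a CP rank constraint on the parameter count concrete, the following minimal sketch builds an order-3 tensor from CP factors and compares the number of stored values. The dimensions, the rank `r`, and the helper name `cp_to_full` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def cp_to_full(factors):
    """Rebuild an order-3 tensor from CP factors A (I x r), B (J x r), C (K x r).

    T[i, j, k] = sum over s of A[i, s] * B[j, s] * C[k, s]
    """
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

rng = np.random.default_rng(0)
I, J, K, r = 4, 4, 4, 2  # illustrative sizes (assumption)
factors = [rng.normal(size=(n, r)) for n in (I, J, K)]
T = cp_to_full(factors)

full_params = I * J * K      # values stored by an unconstrained tensor
cp_params = r * (I + J + K)  # values stored by the CP factors
```

The full tensor's parameter count is the product of its dimensions, while the CP form grows with their sum, which is why larger branching ratios become affordable under such a constraint.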

Abstract:
In this article, we investigate self-supervised 3D scene flow estimation and class-agnostic motion prediction on point clouds. A realistic scene can be well modeled as a collection of rigidly moving parts, therefore its scene flow can be represented as a combination of rigid motion of these individual parts. Building upon this observation, we propose to generate pseudo scene flow labels for self-supervised learning through piecewise rigid motion estimation, in which the source point cloud is decomposed into local regions and each region is treated as rigid. By rigidly aligning each region with its potential counterpart in the target point cloud, we obtain a region-specific rigid transformation to generate its pseudo flow labels. To mitigate the impact of potential outliers on label generation, when solving the rigid registration for each region, we alternately perform three steps: establishing point correspondences, measuring the confidence for the correspondences, and updating the rigid transformation based on the correspondences and their confidence. As a result, confident correspondences will dominate label generation, and a validity mask will be derived for the generated pseudo labels. By using the pseudo labels together with their validity mask for supervision, models can be trained in a self-supervised manner. Extensive experiments on FlyingThings3D and KITTI datasets demonstrate that our method achieves new state-of-the-art performance in self-supervised scene flow learning, without any ground truth scene flow for supervision, even performing better than some supervised counterparts. Additionally, our method is further extended to class-agnostic motion prediction and significantly outperforms previous state-of-the-art self-supervised methods on nuScenes dataset.
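
The last of the three alternating steps, updating a rigid transformation from correspondences and their confidence, can be sketched with the standard weighted Kabsch (SVD-based) solution. This is a generic technique swapped in for illustration, not necessarily the solver used in the paper; the correspondence search and the confidence estimation are omitted, and all names are hypothetical.

```python
import numpy as np

def weighted_rigid_fit(src, dst, w):
    """Weighted least-squares rigid transform (R, t) so that dst ~= src @ R.T + t."""
    w = w / w.sum()
    mu_s = w @ src                      # weighted centroid of the source region
    mu_d = w @ dst                      # weighted centroid of the matched targets
    S = (src - mu_s).T @ (w[:, None] * (dst - mu_d))  # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(S)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Synthetic check: recover a known rotation about z and a known translation.
rng = np.random.default_rng(0)
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([1.0, 2.0, 3.0])
src = rng.normal(size=(20, 3))
dst = src @ R_true.T + t_true
conf = rng.uniform(0.1, 1.0, size=20)  # stand-in confidence scores (assumption)
R_est, t_est = weighted_rigid_fit(src, dst, conf)
pseudo_flow = src @ R_est.T + t_est - src  # per-point pseudo flow for the region
```

Low-confidence correspondences contribute little to the covariance matrix, so confident matches dominate the estimated transformation.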

Abstract:
The widespread usage of high-definition screens on edge devices stimulates a strong demand for efficient image restoration algorithms. The way of caching deep learning models in a look-up table (LUT) is recently introduced to respond to this demand. However, the size of a single LUT grows exponentially with the increase of its indexing capacity, which restricts its receptive field and thus the performance. To overcome this intrinsic limitation of the single-LUT solution, we propose a universal method to construct multiple LUTs like a neural network, termed MuLUT. First, we devise novel complementary indexing patterns, as well as a general implementation for arbitrary patterns, to construct multiple LUTs in parallel. Second, we propose a re-indexing mechanism to enable hierarchical indexing between cascaded LUTs. Finally, we introduce channel indexing to allow cross-channel interaction, enabling LUTs to process color channels jointly. In these principled ways, the total size of MuLUT is linear to its indexing capacity, yielding a practical solution to obtain superior performance with the enlarged receptive field. We examine the advantage of MuLUT on various image restoration tasks, including super-resolution, demosaicing, denoising, and deblocking. MuLUT achieves a significant improvement over the single-LUT solution, e.g., up to 1.1 dB PSNR for super-resolution and up to 2.8 dB PSNR for grayscale denoising, while preserving its efficiency, which is 100× less in energy cost compared with lightweight deep neural networks.
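
The exponential-versus-linear size argument can be checked with simple arithmetic. The sketch below assumes each indexed pixel is quantized to 17 levels and compares one LUT indexing 8 pixels with two LUTs indexing 4 pixels each; both the quantization and the pixel counts are illustrative assumptions, and `lut_entries` is a hypothetical helper.

```python
def lut_entries(levels, n_pixels):
    """Number of entries in a LUT indexed by n_pixels values with `levels` levels each."""
    return levels ** n_pixels

LEVELS = 17  # assumed quantization of each input pixel

# One LUT whose index covers 8 pixels: size is exponential in the pixel count.
single = lut_entries(LEVELS, 8)

# Two LUTs of 4 pixels each, jointly covering 8 pixels: sizes add up instead.
multi = 2 * lut_entries(LEVELS, 4)

ratio = single / multi
```

Under these assumptions the pair of small LUTs needs far fewer entries than the single large one, at the cost of combining the outputs of the two tables.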

Abstract:
We introduce a novel Dual Input Stream Transformer (DIST) for the challenging problem of assigning fixation points from eye-tracking data collected during passage reading to the line of text that the reader was actually focused on. This post-processing step is crucial for analysis of the reading data due to the presence of noise in the form of vertical drift. We evaluate DIST against eleven classical approaches on a comprehensive suite of nine diverse datasets. We demonstrate that combining multiple instances of the DIST model in an ensemble achieves high accuracy across all datasets. Further combining the DIST ensemble with the best classical approach yields an average accuracy of 98.17%. Our approach presents a significant step towards addressing the bottleneck of manual line assignment in reading research. Through extensive analysis and ablation studies, we identify key factors that contribute to DIST's success, including the incorporation of line overlap features and the use of a second input stream. Via rigorous evaluation, we demonstrate that DIST is robust to various experimental setups, making it a safe first choice for practitioners in the field.

Abstract:
In recent years, the neural implicit surface has emerged as a powerful representation for multi-view surface reconstruction due to its simplicity and State-of-the-Art performance. However, reconstructing smooth and detailed surfaces in indoor scenes from multi-view images presents unique challenges. Indoor scenes typically contain large texture-less regions, making the photometric loss unreliable for optimizing the implicit surface. Previous work utilizes monocular geometry priors to improve the reconstruction in indoor scenes. However, monocular priors often contain substantial errors in thin structure regions due to domain gaps and the inherent inconsistencies when derived independently from different views. This paper presents DebSDF to address these challenges, focusing on the utilization of uncertainty in monocular priors and the bias in SDF-based volume rendering. We propose an uncertainty modeling technique that associates larger uncertainties with larger errors in the monocular priors. High-uncertainty priors are then excluded from optimization to prevent bias. This uncertainty measure also informs an importance-guided ray sampling and adaptive smoothness regularization, enhancing the learning of fine structures. We further introduce a bias-aware signed distance function to density transformation that takes into account the curvature and the angle between the view direction and the SDF normals to reconstruct fine details better. Our approach has been validated through extensive experiments on several challenging datasets, demonstrating improved qualitative and quantitative results in reconstructing thin structures in indoor scenes, thereby outperforming previous work.

Abstract:
Adversarial attacks have been proven to be potential threats to Deep Neural Networks (DNNs), and many methods have been proposed to defend against them. However, while robustness is enhanced, the accuracy on clean examples declines to a certain extent, implying that a trade-off exists between accuracy and adversarial robustness. In this paper, to address this trade-off, we theoretically explore the underlying reason for the difference in the filters’ weight distribution between standard-trained and robust-trained models, and then argue that this is an intrinsic property of static neural networks, which makes it difficult for them to fundamentally improve accuracy and adversarial robustness at the same time. Based on this analysis, we propose a sample-wise dynamic network architecture named Adversarial Weight-Varied Network (AW-Net), which focuses on dealing with clean and adversarial examples with a “divide and rule” weight strategy. The AW-Net adaptively adjusts the network's weights based on regulation signals generated by an adversarial router, which is directly influenced by the input sample. Benefiting from the dynamic network architecture, clean and adversarial examples can be processed with different network weights, which provides the potential to enhance both accuracy and adversarial robustness. A series of experiments demonstrate that our AW-Net is architecture-friendly, handles both clean and adversarial examples, and achieves a better trade-off than state-of-the-art robust models.

Abstract:
We explore clustering the softmax predictions of deep neural networks and introduce a novel probabilistic clustering method, referred to as k-sBetas. In the general context of clustering discrete distributions, existing methods have focused on exploring distortion measures tailored to simplex data, such as the KL divergence, as alternatives to the standard Euclidean distance. We provide a general maximum a posteriori (MAP) perspective of clustering distributions, emphasizing that the statistical models underlying the existing distortion-based methods may not be descriptive enough. Instead, we optimize a mixed-variable objective measuring data conformity within each cluster to the introduced sBeta density function, whose parameters are constrained and estimated jointly with binary assignment variables. Our versatile formulation approximates various parametric densities for modeling simplex data and enables the control of the cluster-balance bias. This yields highly competitive performance for the unsupervised adjustment of black-box model predictions in various scenarios.

Abstract:
Training with more data has always been the most stable and effective way of improving performance in the deep learning era. The Open Images dataset, the largest object detection dataset, presents significant opportunities and challenges for general and sophisticated scenarios. However, its semi-automatic collection and labeling process, designed to manage the huge data scale, leads to label-related problems, including explicit or implicit multiple labels per object and highly imbalanced label distribution. In this work, we quantitatively analyze the major problems in large-scale object detection and provide a detailed yet comprehensive demonstration of our solutions. First, we design a concurrent softmax to handle the multi-label problems in object detection and propose a soft-balance sampling method with a hybrid training scheduler to address the label imbalance. This approach yields a notable improvement of 3.34 points, achieving the best single-model performance with a mAP of 60.90% on the public object detection test set of Open Images. Then, we introduce a well-designed ensemble mechanism that substantially enhances the performance of the single model, achieving an overall mAP of 67.17%, which is 4.29 points higher than the best result from the Open Images public test 2018.

Abstract:
We introduce eFFT, an efficient method for the calculation of the exact Fourier transform of an asynchronous event stream. It is based on keeping the matrices involved in the Radix-2 FFT algorithm in a tree data structure and updating them with the new events, extensively reusing computations, and avoiding unnecessary calculations while preserving exactness. eFFT can operate event-by-event, requiring for each event only a partial recalculation of the tree since most of the stored data are reused. It can also operate with event packets, using the tree structure to detect and avoid unnecessary and repeated calculations when integrating the different events within each packet to further reduce the number of operations. eFFT has been extensively evaluated with public datasets and experiments, validating its exactness, low processing time, and feasibility for online execution on resource-constrained hardware. We release a C++ implementation of eFFT to the community.
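As a rough illustration of the building block named in the abstract above, the following sketch implements a plain recursive Radix-2 FFT for a 1-D signal and checks it against a naive DFT. It is not the eFFT tree algorithm. The helper `apply_event` is a hypothetical name added here; it only uses the linearity of the transform to show why a single changed sample does not require recomputing everything from scratch.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_naive(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def apply_event(spectrum, index, delta):
    """Hypothetical helper: update a spectrum after sample `index` changes by `delta`.

    Relies only on linearity of the DFT; this is an O(n) illustration,
    not the tree-based update described in the abstract.
    """
    n = len(spectrum)
    return [
        s + delta * cmath.exp(-2j * cmath.pi * k * index / n)
        for k, s in enumerate(spectrum)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
spectrum = fft_radix2(signal)
```

Calling `apply_event(spectrum, 3, 2.0)` gives the same result, up to rounding, as recomputing `fft_radix2` on the signal with `signal[3]` increased by 2.0.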

Abstract:
The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type). Previous works on VQG fall short in two aspects: i) They suffer from the one-image-to-many-questions mapping problem, which leads to the failure of generating referential and meaningful questions from an image. ii) They fail to model complex implicit relations among the visual objects in an image and also overlook potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference. Concretely, we aim to ask the right visual questions with Double Hints - textual answers and visual regions of interest, which could effectively mitigate the existing one-to-many mapping issue. Particularly, we develop a simple methodology to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework, which first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Experimental results demonstrate the superiority of our proposed method.

Abstract:
Learning-enabled spectroscopic analysis, promising for automated real-time analysis of chemicals, faces several challenges. First, a typical machine learning model requires a large number of training samples that physical systems cannot provide. Second, it requires the testing samples to lie within the range of the training samples, which is often not the case in the real world. Further, a spectroscopy device is limited by its memory size, computing power, and battery capacity, which calls for highly efficient learning models for on-site analysis. In this paper, by analyzing multi-gas mixtures and multi-molecule suspensions, we first show that orders of magnitude reduction of data dimension can be achieved, as the number of principal components that need to be retained is the same as the number of independent constituents in the mixture. From this principle, we designed highly compact models in which the essential principal components can be directly extracted from the interrelations between the individual chemical properties and principal components; and only a few training samples are required. Our model can predict constituent concentrations that have not been seen in the training dataset and provide estimates of measurement noise. This approach can be extended as an effectively standardized method for principal component extraction.
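The dimension-reduction claim in the abstract above can be illustrated with a small numeric sketch. It assumes noise-free linear mixing, i.e. each measured spectrum is a concentration-weighted sum of pure-constituent spectra; this assumption, the toy spectra, and the helper names `mix` and `matrix_rank` are ours and are not taken from the paper. Under that assumption the data matrix has rank equal to the number of constituents, which bounds the number of principal components with non-zero variance.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row (at or below `rank`) with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # column is numerically zero below the current rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def mix(concentrations, pure_spectra):
    """Linear mixing: one spectrum per sample, summed over constituents."""
    n_channels = len(pure_spectra[0])
    return [
        [sum(c * s[j] for c, s in zip(sample, pure_spectra)) for j in range(n_channels)]
        for sample in concentrations
    ]


# Two hypothetical constituents measured on six spectral channels.
pure_spectra = [
    [1, 3, 5, 3, 1, 0],
    [0, 1, 2, 4, 2, 1],
]
# Five mixtures with different concentrations of the two constituents.
concentrations = [[0.1, 0.9], [0.3, 0.7], [0.5, 0.2], [0.8, 0.4], [1.0, 1.0]]

spectra = mix(concentrations, pure_spectra)
n_components_needed = matrix_rank(spectra)
```

Here `spectra` has five rows and six columns, yet `n_components_needed` equals the number of constituents. With measurement noise the remaining components would carry small but non-zero variance.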

Abstract:
Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global and uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing because it acts as a general framework for the LLIE task. We further present a refined framework, SKF++, with two new techniques: (a) an extra convolutional branch for intra-class illumination and color recovery through extracting local information and (b) an equalization-based histogram transformation for contrast enhancement and high dynamic range adjustment. Extensive experiments on various benchmarks for the LLIE task and other image processing tasks show that models equipped with SKF/SKF++ significantly outperform the baselines and that SKF/SKF++ generalizes well to different models and scenes. In addition, the potential benefits of our method for face detection and semantic segmentation in low-light conditions are discussed.

Abstract:
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results as the compression process often muddles the semantics between various planes. Besides, these data-driven approaches impose an urgent demand for massive data annotations, which are laborious and time-consuming. For the first problem, we propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics. DOPNet consists of three modules that are integrated to deliver distortion-free, semantics-clean, and detail-sharp disentangled representations, which benefit the subsequent layout recovery. For the second problem, we present an unsupervised adaptation technique tailored for horizon-depth and ratio representations. Concretely, we introduce an optimization strategy for decision-level layout analysis and a 1D cost volume construction method for feature-level multi-view aggregation, both of which are designed to fully exploit the geometric consistency across multiple perspectives. The optimizer provides a reliable set of pseudo-labels for network training, while the 1D cost volume enriches each view with comprehensive scene information derived from other perspectives. Extensive experiments demonstrate that our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks.

Abstract:
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models in resource-constrained environments and to accelerate inference, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. More than three thousand pruning papers have been published from 2020 to 2024. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of eight pairs of contrast settings for pruning (e.g., unstructured/structured, one-shot/iterative, data-free/data-driven, initialized/pre-trained weights, etc.) and explore several emerging topics, including pruning for large language models, vision transformers, diffusion models, and large multimodal models, post-training pruning, and different levels of supervision for pruning to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. Finally, we provide some recommendations on selecting pruning methods and outline several promising research directions for neural network pruning. To facilitate future research on deep neural network pruning, we summarize broad pruning applications (e.g., adversarial robustness, natural language understanding, etc.) and build a curated collection of datasets, networks, and evaluations on different applications. We maintain a repository on https://github.com/hrcheng1066/awesome-pruning that serves as a comprehensive resource for neural network pruning papers and corresponding open-source code. We will keep updating this repository to include the latest advancements in the field.

Abstract:
We present the EuroCity Persons (ECP) 2.0 dataset, a novel image dataset for person detection, tracking and prediction in traffic. The dataset was collected on-board a vehicle driving through 29 cities in 11 European countries. It contains more than 250K unique person trajectories, in more than 2.0M images and comes with a size of 11 TB. ECP2.0 is about one order of magnitude larger than previous state-of-the-art person datasets in automotive context. It offers remarkable diversity in terms of geographical coverage, time of day, weather and seasons. We discuss the novel semi-supervised approach that was used to generate the temporally dense pseudo ground-truth (i.e., 2D bounding boxes, 3D person locations) from sparse, manual annotations at keyframes. Our approach leverages auxiliary LiDAR data for 3D uplifting and vehicle inertial sensing for ego-motion compensation. It incorporates keyframe information in a three-stage approach (tracklet generation, tracklet merging into tracks, track smoothing) for obtaining accurate person trajectories. We validate our pseudo ground-truth generation approach in ablation studies, and show that it significantly outperforms existing methods. Furthermore, we demonstrate its benefits for training and testing of state-of-the-art tracking methods. Our approach provides a speed-up factor of about 34 compared to frame-wise manual annotation. The ECP2.0 dataset is made freely available for non-commercial research use.

Abstract:
Recent studies on contrastive learning have achieved remarkable performance in medical image segmentation while leveraging only a few labels. Existing methods mainly focus on instance discrimination and invariant mapping. However, they face three common pitfalls: (1) tailness: medical image data usually follows an implicit long-tail class distribution. Blindly leveraging all pixels in training can hence lead to data imbalance issues and cause deteriorated performance; (2) consistency: it remains unclear whether a segmentation model has learned meaningful and yet consistent anatomical features due to the intra-class variations between different anatomical features; and (3) diversity: the intra-slice correlations within the entire dataset have received significantly less attention. This motivates us to seek a principled approach for strategically making use of the dataset itself to discover similar yet distinct samples from different anatomical views. In this paper, we introduce a novel semi-supervised medical image segmentation framework termed Mine yOur owN Anatomy (MONA), and make three contributions. First, prior work argues that every pixel matters equally to the training; we observe empirically that this alone is unlikely to define meaningful anatomical features, mainly due to the lack of a supervision signal. We show two simple solutions towards learning invariances. Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features in an unsupervised manner. Lastly, we demonstrate, both empirically and theoretically, the efficacy of MONA on three benchmark datasets with different labeled settings, achieving new state-of-the-art results under different labeled semi-supervised settings.

Abstract:
The concept of integrating physics-based and data-driven approaches has become popular for modeling sustainable energy systems. However, the existing literature mainly focuses on the data-driven surrogates generated to replace physics-based models. These models often trade accuracy for speed but lack the generalizability, adaptability, and interpretability inherent in physics-based models, which are often indispensable in modeling real-world dynamic systems for optimization and control purposes. We propose a novel machine learning architecture, termed model-integrated neural networks (MINN), that can learn the physics-based dynamics of general autonomous or non-autonomous systems consisting of partial differential-algebraic equations (PDAEs). The obtained architecture systematically solves an unsettled research problem in control-oriented modeling, i.e., how to obtain optimally simplified models that are physically insightful, numerically accurate, and computationally tractable simultaneously. We apply the proposed neural network architecture to model the electrochemical dynamics of lithium-ion batteries and show that MINN is extremely data-efficient to train while being sufficiently generalizable to previously unseen input data, owing to its underlying physical invariants. The MINN battery model has an accuracy comparable to the first-principles-based model in predicting both the system outputs and any locally distributed electrochemical behaviors but achieves two orders of magnitude reduction in the solution time.

Abstract:
3D neural rendering enables photo-realistic reconstruction of a specific scene by encoding discontinuous inputs into a neural representation. Despite the remarkable rendering results, the storage of network parameters is not transmission-friendly and not extendable to metaverse applications. In this paper, we propose an invertible neural rendering approach that enables generating an interactive 3D model from a single image (i.e., 3D Snapshot). Our idea is to distill a pre-trained neural rendering model (e.g., NeRF) into a visualizable image form that can then be easily inverted back to a neural network. To this end, we first present a neural image distillation method to optimize three neural planes for representing the original neural rendering model. However, this representation is noisy and visually meaningless. We thus propose a dynamic invertible neural network to embed this noisy representation into a plausible image representation of the scene. We demonstrate promising reconstruction quality quantitatively and qualitatively, by comparing to the original neural rendering model, as well as video-based invertible methods. On the other hand, our method can store dozens of NeRFs with a compact restoration network (5 MB), and embedding each 3D scene takes up only 160 KB of storage. More importantly, our approach is the first solution that allows embedding a neural rendering model into image representations, which enables applications like creating an interactive 3D model from a printed image in the metaverse.

Abstract:
Moving object detection in satellite videos (SVMOD) is a challenging task because the targets are extremely dim and small. Current learning-based methods tackle SVMOD by extracting spatio-temporal information from multi-frame dense representations with labor-intensive manual labels, which incurs high annotation costs and tremendous computational redundancy due to the severe imbalance between foreground and background regions. In this paper, we propose a highly efficient unsupervised framework for SVMOD. Specifically, the framework is generic: pseudo labels generated by a traditional method can evolve with the training process to promote detection performance. Furthermore, we propose a highly efficient and effective sparse convolutional anchor-free detection network by sampling the dense multi-frame image form into a sparse spatio-temporal point cloud representation and skipping the redundant computation on background regions. Coupling these two designs, we can achieve both high efficiency (label and computation efficiency) and effectiveness. Extensive experiments demonstrate that our method can not only process 98.8 frames per second on 1024 × 1024 images but also achieve state-of-the-art performance.

Abstract:
Artificial lights commonly leave strong lens flare artifacts on the images captured at night, degrading both the visual quality and performance of vision algorithms. Existing flare removal approaches mainly focus on removing daytime flares and fail in nighttime cases. Nighttime flare removal is challenging due to the unique luminance and spectrum of artificial lights, as well as the diverse patterns and image degradation of the flares. The scarcity of the nighttime flare removal dataset constrains the research on this crucial task. In this paper, we introduce Flare7K++, the first comprehensive nighttime flare removal dataset, consisting of 962 real-captured flare images (Flare-R) and 7000 synthetic flares (Flare7K). Compared to Flare7K, Flare7K++ is particularly effective in eliminating complicated degradation around the light source, which is intractable by using synthetic flares alone. Besides, the previous flare removal pipeline relies on the manual threshold and blur kernel settings to extract light sources, which may fail when the light sources are tiny or not overexposed. To address this issue, we additionally provide the annotations of light sources in Flare7K++ and propose a new end-to-end pipeline to preserve the light source while removing lens flares. Our dataset and pipeline offer a valuable foundation and benchmark for future investigations into nighttime flare removal studies. Extensive experiments demonstrate that Flare7K++ supplements the diversity of existing flare datasets and pushes the frontier of nighttime flare removal toward real-world scenarios.

Abstract:
Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users’ viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by comparing our model generalization capabilities on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models.

Abstract:
Scene Graph Generation (SGG) aims to detect visual relationships in an image. However, due to long-tailed bias, SGG is far from practical. Most methods depend heavily on co-occurrence statistics to generate a balanced dataset, so they are dataset-specific and easily affected by noise. The fundamental cause is that SGG is simplified as a classification task instead of a reasoning task, so the ability to capture fine-grained details is limited and handling ambiguity becomes more difficult. By imitating the dual process theory in cognitive psychology, a Visual-Textual Semantics Consistency Network (VTSCN) is proposed to model the SGG task as a reasoning process and relieve the long-tailed bias significantly. In VTSCN, as the rapid autonomous process (Type1 process), we design a Hybrid Union Representation (HUR) module, which is divided into two steps for spatial awareness and working memory modeling. In addition, as the higher-order reasoning process (Type2 process), a Global Textual Semantics Modeling (GTS) module is designed to individually model the textual contexts with the word embeddings of pairwise objects. As the final associative process of cognition, a Heterogeneous Semantics Consistency (HSC) module is designed to balance the Type1 process and the Type2 process. Lastly, our VTSCN opens a new direction for SGG model design by fully considering the human cognitive process. Experiments on the Visual Genome, GQA and PSG datasets show that our method is superior to state-of-the-art methods, and ablation studies validate the effectiveness of our VTSCN.

Abstract:
In this article we propose a conceptual framework to study ensembles of conformal predictors (CP), which we call Ensemble Predictors (EP). Our approach is inspired by the application of imprecise probabilities in information fusion. Based on the proposed framework, we study, for the first time in the literature, the theoretical properties of CP ensembles in a general setting, by focusing on simple and commonly used possibilistic combination rules. We also illustrate the applicability of the proposed methods in the setting of multivariate time-series classification, showing that these methods provide better performance (in terms of robustness, conservativeness, accuracy and running time) than both standard classification algorithms and other combination rules proposed in the literature, on a large set of benchmarks from the UCR time series archive.
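To make the idea of combining conformal predictors concrete, here is a minimal sketch under our own assumptions. The abstract does not say which combination rules are studied; the minimum (conjunctive) and maximum (disjunctive) rules used below are standard possibilistic operators chosen purely for illustration, and all function names and numbers are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of nonconformity scores at least as large as the test score.

    The test example itself is counted, hence the +1 in numerator and denominator.
    """
    n = len(calibration_scores)
    return (sum(s >= test_score for s in calibration_scores) + 1) / (n + 1)


def combine(p_values_per_predictor, rule):
    """Combine per-label p-values from several predictors with `rule` (e.g. min or max)."""
    labels = p_values_per_predictor[0].keys()
    return {label: rule(p[label] for p in p_values_per_predictor) for label in labels}


def prediction_set(p_values, epsilon):
    """Labels whose (combined) p-value exceeds the significance level epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}


# Hypothetical per-label p-values from two conformal predictors for one test input.
predictor_a = {"walk": 0.60, "run": 0.05}
predictor_b = {"walk": 0.30, "run": 0.20}

conjunctive = combine([predictor_a, predictor_b], min)  # label kept only if all agree
disjunctive = combine([predictor_a, predictor_b], max)  # label kept if any supports it
```

Because the minimum never exceeds the maximum, `prediction_set(conjunctive, eps)` is always a subset of `prediction_set(disjunctive, eps)` for the same `eps`. Whether either rule preserves the validity guarantee of the individual predictors is exactly the kind of question the abstract says the framework addresses, and the sketch makes no claim about it.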

Abstract:
Graph neural networks (GNNs) suffer from severe inefficiency due to the exponential growth of node dependency as the number of layers increases. This severely limits the application of stochastic optimization algorithms, so training GNNs is usually time-consuming. To address this problem, we propose to decouple a multi-layer GNN into multiple simple modules for more efficient training, which comprises classical forward training (FT) and a designed backward training (BT). Under the proposed framework, each module can be trained efficiently in FT by stochastic algorithms without distortion of graph information owing to its simplicity. To avoid the purely unidirectional information delivery of FT and to sufficiently train shallow modules together with the deeper ones, we develop a backward training mechanism that makes the earlier modules perceive the later modules, inspired by the classical backpropagation algorithm. The backward training introduces reversed information delivery into the decoupled modules in addition to the forward information delivery. To investigate how the decoupling and greedy training affect the representational capacity, we theoretically prove that the error produced by linear modules will not accumulate on unsupervised tasks in most cases. The theoretical and experimental results show that the proposed framework is highly efficient with reasonable performance, which may deserve more investigation.

Abstract:
Human-Object Interaction (HOI) detection aims to understand human activities by detecting interaction triplets. Previous HOI detection methods adopt a two-stage instance-driven paradigm. Unfortunately, many non-interactive human-object pairs generated by the first stage are the main obstacle impeding HOI detectors from high efficiency and promising performance. To remedy this, we propose a novel top-down interaction-driven paradigm, detecting interactions first and bridging interactive human-object pairs through interactions. We formulate HOI as a point triplet <human point, interaction point, object point> and design a Parallel Point Detection and Matching (PPDM) framework. We further take advantage of two-stage methods and propose a novel framework, PPDM++, that detects the interactive human-object pairs by PPDM, then extracts region features for each pair to predict actions. The core of PPDM/PPDM++ is to convert the instance-driven bottom-up paradigm to an interaction-driven top-down paradigm, thus avoiding additional computation costs from traversing a tremendous number of non-interactive pairs. Benefiting from the advanced paradigm, PPDM/PPDM++ has achieved significant performance gains with high efficiency. PPDM-DLA-34 has achieved 19.94 mAP with 42 FPS as the first real-time HOI detector, and PPDM++-SwinB achieves 30.1 mAP with 17 FPS on the HICO-DET dataset. We also built an application-oriented database named HOI-A, a supplement to the existing datasets.

Abstract:
Many complex social, biological, or physical systems are characterized as networks, and recovering the missing links of a network could shed important light on its structure and dynamics. A good topological representation is crucial to accurate link modeling and prediction, yet how to account for the kaleidoscopic changes in link formation patterns remains a challenge, especially for analysis in cross-domain studies. We propose a new link representation scheme by projecting the local environment of a link into a “dipole plane”, where neighboring nodes of the link are positioned via their relative proximity to the two anchors of the link, like a dipole. By doing this, the complex and discrete topology arising from link formation is turned into a differentiable point-cloud distribution, opening up new possibilities for topological feature-engineering with desired expressiveness, interpretability and generalization. Our approach achieves results comparable or even superior to state-of-the-art GNNs, with a model up to hundreds of times smaller that runs much faster. Furthermore, it provides a universal platform to systematically profile, study, and compare link-patterns from miscellaneous real-world networks. This allows building a global link-pattern atlas, based on which we have uncovered interesting common patterns of link formation, i.e., the bridge-style, the radiation-style, and the community-style across a wide collection of networks of highly different natures.

Abstract:
PSNR-oriented models are a critical class of super-resolution models with applications across various fields. However, these models tend to generate over-smoothed images, a problem that has been analyzed previously from the perspectives of models or loss functions, but without taking into account the impact of data properties. In this paper, we present a novel phenomenon that we term the center-oriented optimization (COO) problem, where a model's output converges towards the center point of similar high-resolution images, rather than towards the ground truth. We demonstrate that the strength of this problem is related to the uncertainty of data, which we quantify using entropy. We prove that as the entropy of high-resolution images increases, their center point will move further away from the clean image distribution, and the model will generate over-smoothed images. Perceptual-driven approaches such as perceptual loss, model structure optimization, or GAN-based methods can be viewed as implicitly optimizing the COO problem. We propose an explicit solution to the COO problem, called Detail Enhanced Contrastive Loss (DECLoss). DECLoss utilizes the clustering property of contrastive learning to directly reduce the variance of the potential high-resolution distribution and thereby decrease the entropy. We evaluate DECLoss on multiple super-resolution benchmarks and demonstrate that it improves the perceptual quality of PSNR-oriented models. Moreover, when applied to GAN-based methods, such as RaGAN, DECLoss helps to achieve state-of-the-art performance, such as 0.093 LPIPS with 24.51 PSNR on 4× downsampled Urban100, validating the effectiveness and generalization of our approach.
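The center-seeking behaviour described in the abstract above follows from a standard property of the squared error, which the toy sketch below illustrates. It assumes that several sharp high-resolution signals are equally plausible for the same low-resolution input; the 1-D "edges" and the names `center` and `mse` are hypothetical, and the entropy analysis and DECLoss itself are not reproduced here.

```python
def center(targets):
    """Element-wise mean of equally likely target signals."""
    n = len(targets)
    return [sum(column) / n for column in zip(*targets)]


def mse(output, targets):
    """Squared error between one output and each target, averaged over targets."""
    return sum(
        sum((o - t) ** 2 for o, t in zip(output, target)) for target in targets
    ) / len(targets)


# Three hypothetical sharp 1-D edges that could all explain the same blurry input.
candidates = [
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

blurred = center(candidates)  # a ramp: sharper than none of the candidates
best_sharp_error = min(mse(c, candidates) for c in candidates)
center_error = mse(blurred, candidates)
```

In this example `center_error` is lower than `best_sharp_error`, so a model trained purely on squared error is rewarded for predicting the ramp even though no candidate contains one. The more the candidates spread out, the further the mean moves from any sharp signal, which is the intuition behind reducing the variance of the candidate distribution.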

Abstract:
This paper studies how to flexibly integrate reconstructed 3D models into practical 3D modeling pipelines such as 3D scene creation and rendering. Due to the technical difficulty, one can only obtain rough 3D models (R3DMs) for most real objects using existing 3D reconstruction techniques. As a result, physically-based rendering (PBR) would render low-quality images or videos for scenes that are constructed by R3DMs. One promising solution would be representing real-world objects as Neural Fields such as NeRFs, which are able to generate photo-realistic renderings of an object under desired viewpoints. However, a drawback is that the synthesized views through Neural Fields Rendering (NFR) cannot reflect the simulated lighting details on R3DMs in PBR pipelines, especially when object interactions in the 3D scene creation cause local shadows. To solve this dilemma, we propose a lighting transfer network (LighTNet) to bridge NFR and PBR, such that they can benefit from each other. LighTNet reasons about a simplified image composition model, remedies the uneven surface issue caused by R3DMs, and is empowered by several perceptual-motivated constraints and a new Lab angle loss which enhances the contrast between lighting strength and colors. Comparisons demonstrate that LighTNet is superior in synthesizing impressive lighting, and is promising in pushing NFR further in practical 3D modeling workflows.

Abstract:
Demographic biases in source datasets have been shown to be one of the causes of unfairness and discrimination in the predictions of Machine Learning models. One of the most prominent types of demographic bias is statistical imbalance in the representation of demographic groups in the datasets. In this article, we study the measurement of these biases by reviewing the existing metrics, including those that can be borrowed from other disciplines. We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics. To illustrate the utility of our framework, and to further understand the practical characteristics of the metrics, we conduct a case study of 20 datasets used in Facial Emotion Recognition (FER), analyzing the biases present in them. Our experimental results show that many metrics are redundant and that a reduced subset of metrics may be sufficient to measure the amount of demographic bias. The article provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models.

Abstract:
A new paradigm of syntactic pattern recognition (SPR), which uses multi-derivational parsing of vague languages, is introduced in the paper. The proposed methodology addresses the recognition of vague/distorted patterns, which is one of the important open problems in the area. The concept of the vague language of patterns and an efficient parsing method based on the class of dynamically programmed grammars are introduced. A vague language is defined with vague primitives, which are vectors of “neighboring” primitives associated with measures of distance, probability, fuzziness, etc. The use of vague primitives allows us to identify the b best structural templates during multi-derivational parsing, which can be used to obtain a more adequate final result. The generic architecture of an SPR system based on the proposed approach is presented, together with the system's applications to short-term electrical load forecasting and to the analysis of ultrasound images in order to diagnose congenital defects of fetal palates. The results of the experimental studies are discussed.

Abstract:
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance, while utilizing substantially less fine-tuning data compared to other methods.
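As a rough illustration of what weight interpolation means in general, the sketch below linearly blends two sets of parameters. This is only the generic operation, under the assumption that both models share the same parameter names and shapes; it is not the paper's Interpolated Weight Optimization procedure, and the function name `interpolate_weights` and the toy parameter dictionaries are hypothetical.

```python
def interpolate_weights(weights_a, weights_b, lam):
    """Return (1 - lam) * weights_a + lam * weights_b, parameter by parameter.

    Both arguments map parameter names to flat lists of floats.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("both models must have the same parameter names")
    blended = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - lam) * x + lam * y for x, y in zip(a, b)]
    return blended


# Assumption: toy stand-ins for original and fine-tuned parameters.
original = {"proj": [0.0, 2.0]}
finetuned = {"proj": [1.0, 4.0]}
print(interpolate_weights(original, finetuned, 0.5))
```

With `lam = 0` the result equals the first model and with `lam = 1` the second, so intermediate values trade off the behavior of the two.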

Abstract:
By observing a scene and listening to corresponding audio cues, humans can easily recognize where a sound comes from. To achieve such cross-modal perception on machines, existing methods take advantage of the maps obtained by interpolation operations to localize the sound source. As semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods only offer a coarse-grained and indirect description of the sound source. Additionally, these methods utilize a single audio-visual tuple at a time during self-supervised learning, causing the model to lose the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although the introduction of Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, the contrastive set constructed by random sampling is based on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we believe that this assumption is too crude. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source, without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter those instances corresponding to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast into a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) to mine the contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance when compared to several baselines on multiple SSL datasets with diverse scenarios.

Abstract:
Fusing a low-resolution hyperspectral image (HSI) with a high-resolution (HR) multi-spectral image has provided an effective way for HSI super-resolution (SR). The key lies in inferring the posterior of the latent (i.e., HR) HSI using an appropriate image prior and the likelihood determined by the degeneration between the latent HSI and the observed images. However, in scenarios with complex imaging environments and various imaging scenes, the prior of HSIs can be prohibitively complicated and the degeneration is often unknown, which makes it difficult to accurately infer the posterior of each latent HSI. To tackle this problem, we present an unsupervised test-time adaptation learning (UTAL) framework for HSI SR under unknown degeneration. Instead of directly modeling the complicated image prior, it first implicitly learns a content-agnostic prior shared across different images through supervised pre-training of a mutual-guiding fusion module on extensive synthetic data. Then, it adapts the shared prior to the private characteristics of the latent HSI for posterior inference through unsupervised learning of a self-guiding adaptation module and a degeneration estimation network on the two observed images in the test phase. Such a two-stage learning scheme models the complicated image prior in a divide-and-conquer manner, which eases the modeling difficulty and improves the prior accuracy. Moreover, the unknown degeneration can be estimated properly. These two advantages enable us to accurately infer the posterior of the latent HSI, thereby increasing the generalization performance in real applications. Additionally, to further mitigate over-fitting in more challenging cases (e.g., when degenerations in both the spectral and spatial domains are unknown) and to speed up adaptation, we propose to meta-train UTAL on extensive synthetic SR tasks and solve it using an alternating optimization strategy, such that UTAL learns to generalize well in challenging real cases with a small number of gradient descent steps. To verify the efficacy of UTAL, we evaluate it on HSI SR tasks with different unknown degenerations as well as some other HSI restoration tasks (e.g., compressive sensing), and report strong results superior to those of existing competitors.

Abstract:
As an emerging research practice leveraging recent advanced AI techniques, e.g., deep-model-based prediction and generation, Video Coding for Machines (VCM) is committed to bridging the largely separate research tracks of video/image compression and feature compression, and attempts to optimize compactness and efficiency jointly from a unified perspective of high-accuracy machine vision and full-fidelity human vision. With the rapid advances of deep feature representation and visual data compression in mind, in this paper, we summarize VCM methodology and philosophy based on existing academic and industrial efforts. The development of VCM follows a general rate-distortion optimization, and the categorization of key modules or techniques is established, including feature-assisted coding, scalable coding, intermediate feature compression/optimization, and machine vision targeted codec, from broader perspectives of vision tasks, analytics resources, etc. Previous works show that, although existing methods attempt to reveal the nature of scalable representation in bits when dealing with machine and human vision tasks, the generality of low-bit-rate representations, and accordingly how to support a variety of visual analytic tasks, has rarely been studied. Therefore, we investigate a novel visual information compression for the analytics taxonomy problem to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective of task relationships versus compression is revisited. By keeping in mind the transferability among different machine vision tasks (e.g., high-level semantic and mid-level geometry-related), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural network-generated features extracted from pixels and a variety of machine vision features/labels (e.g., scene class, segmentation labels), a codebook hyperprior is designed to compress the neural network-generated features. As demonstrated in our experiments, this new hyperprior model is expected to improve feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features among different tasks.

Abstract:
Structured light illumination is an active 3D scanning technique based on projecting and capturing a set of striped patterns and measuring the warping of the patterns as they reflect off a target object's surface. As designed, each pixel in the camera sees exactly one pixel from the projector; however, there are multi-path situations where a camera pixel sees light from multiple projector positions. In the case of bimodal multi-path, the camera pixel receives light from exactly two positions, which occurs along a step edge where the edge slices through a pixel which, therefore, sees both a foreground and background surface. In this paper, we present a general mathematical model to address this bimodal multi-path issue in a phase-shifting or so-called phase-measuring-profilometry scanner to measure the constructive and destructive interference between the two light paths, and by taking advantage of this interference, separate the paths and make two decoupled depth measurements. We validate our algorithm with both simulations and a number of challenging real-world scenarios, significantly outperforming the state-of-the-art methods.

Abstract:
Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.

Abstract:
Recently, brain-inspired spiking neural networks (SNNs) have demonstrated promising capabilities in solving pattern recognition tasks. However, these SNNs are grounded on homogeneous neurons that utilize a uniform neural coding for information representation. Given that each neural coding scheme possesses its own merits and drawbacks, these SNNs encounter challenges in achieving optimal performance such as accuracy, response time, efficiency, and robustness, all of which are crucial for practical applications. In this study, we argue that SNN architectures should be holistically designed to incorporate heterogeneous coding schemes. As an initial exploration in this direction, we propose a hybrid neural coding and learning framework, which encompasses a neural coding zoo with diverse neural coding schemes discovered in neuroscience. Additionally, it incorporates a flexible neural coding assignment strategy to accommodate task-specific requirements, along with novel layer-wise learning methods to effectively implement hybrid coding SNNs. We demonstrate the superiority of the proposed framework on image classification and sound localization tasks. Specifically, the proposed hybrid coding SNNs achieve comparable accuracy to state-of-the-art SNNs, while exhibiting significantly reduced inference latency and energy consumption, as well as high noise robustness. This study yields valuable insights into hybrid neural coding designs, paving the way for developing high-performance neuromorphic systems.

Abstract:
Shadow detection is a basic task of remote sensing image analysis, but it is often seriously disturbed by vegetation, water bodies, and black objects. Vegetation and dark objects often appear dark in visible bands but brighter in the near-infrared (NIR), and the reflection of inland water bodies in the green band is stronger than that in the blue band. Taking advantage of these physical properties and combining them with the bluish and dark appearance of shadows, we propose a simple but effective shadow detection method for multispectral remote sensing images. These physical properties are used to create transformation models that suppress features such as vegetation, water bodies, etc., but at the same time enhance shadows. Then, we transform the shadow representation into a color space to generate candidate shadows using dominant color components. To separate shadows from the others, we propose two indexes, the normalized Color Difference Composite Index (CDCI) and Color Purity Index (CPI), and fuse them to obtain shadows and their confidence. The experimental results indicate that the proposed method can effectively detect the shadows in multispectral images and outperforms the state-of-the-art approaches.

Abstract:
Accurate 3D object detection in large-scale outdoor scenes, characterized by considerable variations in object scales, necessitates features rich in both long-range and fine-grained information. While recent detectors have utilized window-based transformers to model long-range dependencies, they tend to overlook fine-grained details. To bridge this gap, we propose MsSVT++, an innovative Mixed-scale Sparse Voxel Transformer that simultaneously captures both types of information through a divide-and-conquer approach. This approach involves explicitly dividing attention heads into multiple groups, each responsible for attending to information within a specific range. The outputs of these groups are subsequently merged to obtain final mixed-scale features. To mitigate the computational complexity associated with applying a window-based transformer in 3D voxel space, we introduce a novel Chessboard Sampling strategy and implement voxel sampling and gathering operations sparsely using a hash map. Moreover, an important challenge stems from the observation that non-empty voxels are primarily located on the surface of objects, which impedes the accurate estimation of bounding boxes. To overcome this challenge, we introduce a Center Voting module that integrates newly voted voxels enriched with mixed-scale contextual information towards the centers of the objects, thereby improving precise object localization. Extensive experiments demonstrate that our single-stage detector, built upon the foundation of MsSVT++, consistently delivers exceptional performance across diverse datasets.

Abstract:
Monocular depth inference is a fundamental problem for scene perception of robots. Specific robots may be equipped with a camera plus an optional depth sensor of any type and operate in various scenes of different scales, whereas recent advances have split the problem into multiple individual sub-tasks. This imposes the additional burden of fine-tuning models for specific robots and thereby leads to high-cost customization in large-scale industrialization. This article investigates a unified task of monocular depth inference, which infers high-quality depth maps from all kinds of input raw data from various robots in unseen scenes. A basic benchmark G2-MonoDepth is developed for this task, which comprises four components: (a) a unified data representation RGB+X to accommodate RGB plus raw depth with diverse scene scale/semantics, depth sparsity ([0%, 100%]) and errors (holes/noises/blurs), (b) a novel unified loss to adapt to diverse depth sparsity/errors of input raw data and diverse scales of output scenes, (c) an improved network that propagates diverse scene scales well from input to output, and (d) a data augmentation pipeline to simulate all types of real artifacts in raw depth maps for training. G2-MonoDepth is applied in three sub-tasks including depth estimation, depth completion with different sparsity, and depth enhancement in unseen scenes, and it always outperforms SOTA baselines on both real-world data and synthetic data.

Abstract:
Motion prediction is crucial for autonomous driving systems to understand complex driving scenarios and make informed decisions. However, this task is challenging due to the diverse behaviors of traffic participants and complex environmental contexts. In this paper, we propose Motion TRansformer (MTR) frameworks to address these challenges. The initial MTR framework utilizes a transformer encoder-decoder structure with learnable intention queries, enabling efficient and accurate prediction of future trajectories. By customizing intention queries for distinct motion modalities, MTR improves multimodal motion prediction while reducing reliance on dense goal candidates. The framework comprises two essential processes: global intention localization, identifying the agent's intent to enhance overall efficiency, and local movement refinement, adaptively refining predicted trajectories for improved accuracy. Moreover, we introduce an advanced MTR++ framework, extending the capability of MTR to simultaneously predict multimodal motion for multiple agents. MTR++ incorporates symmetric context modeling and mutually-guided intention querying modules to facilitate future behavior interaction among multiple agents, resulting in scene-compliant future trajectories. Extensive experimental results demonstrate that the MTR framework achieves state-of-the-art performance on the highly-competitive motion prediction benchmarks, while the MTR++ framework surpasses its precursor, exhibiting enhanced performance and efficiency in predicting accurate multimodal future trajectories for multiple agents.

Affiliations: School of Computer Science and Engineering and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China; School of Instrument Science and Engineering, Southeast University, Nanjing, China; Wangxuan Institute of Computer Technology and National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Abstract:
Semi-Supervised Few-Shot Learning (SSFSL) aims to train a classifier that can adapt to new tasks using limited labeled data and a fixed amount of unlabeled data. Various sophisticated methods have been proposed to tackle the challenges associated with this problem. In this paper, we present a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective. We leverage these pseudo-labels to augment the support set, which is typically limited in few-shot tasks, e.g., 1-shot classification. In such label-constrained scenarios, our approach can offer highly accurate negative pseudo-labels. By iteratively excluding negative pseudo-labels one by one, we ultimately derive a positive pseudo-label for each unlabeled sample in our approach. The integration of negative and positive pseudo-labels complements the limited support set, resulting in significant accuracy improvements for SSFSL. Our approach can be implemented in just a few lines of code using only off-the-shelf operations, yet it outperforms state-of-the-art methods on four benchmark datasets. Furthermore, our approach exhibits good adaptability and generalization capabilities when used as a plug-and-play counterpart alongside existing SSFSL methods and when extended to generalized linear models.

Abstract:
Explainable AI (XAI) is widely viewed as a sine qua non for ever-expanding AI research. A better understanding of the needs of XAI users, as well as human-centered evaluations of explainable models are both a necessity and a challenge. In this paper, we explore how human-computer interaction (HCI) and AI researchers conduct user studies in XAI applications based on a systematic literature review. After identifying and thoroughly analyzing 97 core papers with human-based XAI evaluations over the past five years, we categorize them along the measured characteristics of explanatory methods, namely trust, understanding, usability, and human-AI collaboration performance. Our research shows that XAI is spreading more rapidly in certain application domains, such as recommender systems than in others, but that user evaluations are still rather sparse and incorporate hardly any insights from cognitive or social sciences. Based on a comprehensive discussion of best practices, i.e., common models, design choices, and measures in user studies, we propose practical guidelines on designing and conducting user studies for XAI researchers and practitioners. Lastly, this survey also highlights several open research directions, particularly linking psychological science and human-centered XAI.

Abstract:
Learning powerful representations in bird’s-eye-view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important. BEV perception has several advantages: representing surrounding scenes in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review the most recent works on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of BEV approaches from industry are described as well. Furthermore, we introduce a full suite of practical guidelines to improve the performance of BEV perception tasks, covering camera, LiDAR and fusion inputs. Finally, we point out the future research directions in this area. We hope this report will shed some light on the community and encourage more research effort on BEV perception.

Abstract:
This paper delves into the problem of correlated time-series forecasting in practical applications, an area of growing interest in a multitude of fields such as stock price prediction and traffic demand analysis. Current methodologies primarily represent data using conventional graph structures, yet these fail to capture intricate structures with non-pairwise relationships. To address this challenge, we adopt dynamic hypergraphs in this study to better illustrate complex interactions, and introduce a novel hypergraph neural network model named CHNN for correlated time series forecasting. In more detail, CHNN leverages both semantic and topological similarities via an interaction model and hypergraph diffusion process, thereby constructing comprehensive collaborative correlation scores that effectively guide spatial message propagation. In addition, it incorporates short-term temporal information to generate efficient spatio-temporal feature maps. Lastly, a long-term temporal module is proposed to generate future predictions utilizing both temporal attention and a gated recurrent network. Comprehensive experiments conducted on four real-world datasets, i.e., Tiingo, Stocktwits, NYC-Taxi, and Social Network demonstrate that the proposed CHNN markedly outperforms a range of benchmark methods.

Abstract:
To improve user experience, recommender systems have been widely used on many online platforms. In these systems, recommendation models are typically learned from positive/negative feedback that is collected automatically. Notably, recommender systems are a little different from general supervised learning tasks. In recommender systems, there are some factors (e.g., previous recommendation models or operation strategies of an online platform) that determine which items can be exposed to each individual user. Normally, the previous exposure results are not only relevant to the instances’ features (i.e., user or item), but also affect their feedback ratings, thus leading to confounding bias in the recommendation models. To mitigate this bias, researchers have already provided a variety of strategies. However, there are still two issues that are underappreciated: 1) previous debiased RS approaches cannot effectively capture recommendation-specific, exposure-specific and their common knowledge simultaneously; 2) the true exposure results of the user-item pairs are partially inaccessible, so noise is introduced if, as in existing approaches, their observability is used to approximate them. Motivated by this, we develop a novel debiasing recommendation approach. More specifically, we first propose a mutual information-based counterfactual learning framework based on the causal relationship among the instance features, exposure status, and ratings. This framework can 1) capture recommendation-specific, exposure-specific and their common knowledge by explicitly modeling the relationship among the causal factors, and 2) achieve robustness towards partially inaccessible exposure results by a pairwise learning strategy. Under such a framework, we implement an optimizable loss function with theoretical analysis. By minimizing this loss, we expect to obtain an unbiased recommendation model that reflects the users’ real interests. Meanwhile, we also prove that our loss function is robust to the partial inaccessibility of the exposure status. Finally, extensive experiments on public datasets demonstrate the superiority of our proposed method in boosting the recommendation performance.

Abstract:
3-D point clouds facilitate 3-D visual applications with detailed information of objects and scenes but bring about enormous challenges to design efficient compression technologies. The irregular signal statistics and high-order geometric structures of 3-D point clouds cannot be fully exploited by existing sparse representation and deep learning based point cloud attribute compression schemes and graph dictionary learning paradigms. In this paper, we propose a novel p-Laplacian embedding graph dictionary learning framework that jointly exploits the varying signal statistics and high-order geometric structures for 3-D point cloud attribute compression. The proposed framework formulates a nonconvex minimization constrained by p-Laplacian embedding regularization to learn a graph dictionary varying smoothly along the high-order geometric structures. An efficient alternating optimization paradigm is developed by harnessing ADMM to solve the nonconvex minimization. To the best of our knowledge, this paper proposes the first graph dictionary learning framework for point cloud compression. Furthermore, we devise an efficient layered compression scheme that integrates the proposed framework to exploit the correlations of 3-D point clouds in a structured fashion. Experimental results demonstrate that the proposed framework is superior to state-of-the-art transform-based methods in M-term approximation and point cloud attribute compression and outperforms recent MPEG G-PCC reference software.

Abstract:
Density peaks clustering detects modes as points with high density and large distance to points of higher density. Each non-mode point is assigned to the same cluster as its nearest neighbor of higher density. Density peaks clustering has proved capable in applications, yet little work has been done to understand its theoretical properties or the characteristics of the clusterings it produces. Here, we prove that it consistently estimates the modes of the underlying density and correctly clusters the data with high probability. However, noise in the density estimates can lead to erroneous modes and incoherent cluster assignments. A novel clustering algorithm, Component-wise Peak-Finding (CPF), is proposed to remedy these issues. The improvements are twofold: 1) the assignment methodology is improved by applying the density peaks methodology within level sets of the estimated density; 2) the algorithm is not affected by spurious maxima of the density and hence is competent at automatically deciding the correct number of clusters. We present novel theoretical results, proving the consistency of CPF, as well as extensive experimental results demonstrating its exceptional performance. Finally, a semi-supervised version of CPF is presented, integrating clustering constraints to achieve excellent performance for an important problem in computer vision.
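
The baseline procedure described in the first two sentences of this abstract can be sketched as follows. This is a minimal illustration of plain density peaks clustering, not of CPF. The density estimate (a neighbor count within `radius`), the mode score (density times distance to the nearest denser point), and the names `density_peaks`, `radius`, and `n_clusters` are assumptions chosen for the example.

```python
import numpy as np

def density_peaks(points, radius, n_clusters):
    """Plain density peaks clustering (illustrative; not the CPF algorithm)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Assumed density estimate: number of other points within `radius`.
    density = (dist < radius).sum(axis=1) - 1
    # Visit points from highest to lowest density (ties broken by index).
    order = np.argsort(-density, kind="stable")
    parent = np.full(n, -1)
    # The densest point has no denser neighbor; give it the largest distance.
    delta = np.full(n, dist.max())
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                      # points ranked denser than i
        j = higher[np.argmin(dist[i, higher])]     # nearest such point
        parent[i] = j
        delta[i] = dist[i, j]
    # Modes: high density AND large distance to any point of higher density.
    score = density * delta
    modes = np.argsort(-score, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[modes] = np.arange(n_clusters)
    # Every non-mode point inherits the label of its nearest denser neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

On two well-separated groups of points, the two highest-scoring points become the modes and every other point follows its chain of denser neighbors back to one of them.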

Abstract:
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker M^2-Track. At the 1st-stage, M^2-Track localizes the target within successive frames via motion transformation. Then it refines the target box through motion-assisted shape completion at the 2nd-stage. Due to the motion-centric nature, our method shows its impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully-supervised setting, extensive experiments confirm that M^2-Track significantly outperforms previous state-of-the-arts on three large-scale datasets while running at 57FPS (~3%, ~11% and ~22% precision gains on KITTI, NuScenes, and Waymo Open Dataset respectively). While under the semi-supervised setting, our method performs on par with or even surpasses its fully-supervised counterpart using fewer than half labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.

Abstract:
Estimating depth from images nowadays yields outstanding results, both in terms of in-domain accuracy and generalization. However, we identify two main challenges that remain open in this field: dealing with non-Lambertian materials and effectively processing high-resolution images. To this end, we propose a novel dataset that includes accurate and dense ground-truth labels at high resolution, featuring scenes containing several specular and transparent surfaces. Our acquisition pipeline leverages a novel deep space-time stereo framework, enabling easy and accurate labeling with sub-pixel precision. The dataset is composed of 606 samples collected in 85 different scenes; each sample includes both a high-resolution pair (12 Mpx) and an unbalanced stereo pair (Left: 12 Mpx, Right: 1.1 Mpx), typical of modern mobile devices that mount sensors with different resolutions. Additionally, we provide manually annotated material segmentation masks and 15K unlabeled samples. The dataset is composed of a train set and two test sets, the latter devoted to the evaluation of stereo and monocular depth estimation networks. Our experiments highlight the open challenges and future research directions in this field.

Abstract:
Weight learning forms a basis for machine learning, and numerous algorithms have been adopted to date. Most of the algorithms were either developed in the stochastic framework or aimed at minimization of loss or regret functions. Asymptotic convergence of weight learning, vital for good output prediction, was seldom guaranteed for online applications. Since linear regression is the most fundamental component in machine learning, we focus on this model in this paper. Aiming at online applications, a deterministic analysis method is developed based on LaSalle’s invariance principle. Convergence conditions are derived for both the first-order and the second-order learning algorithms, without resorting to any stochastic argument. Moreover, the deterministic approach makes it easy to analyze the noise influence. Specifically, adaptive hyperparameters are derived in this framework and their tuning rules are disclosed for the compensation of measurement noise. Comparison with four of the most popular algorithms validates that this approach has a higher learning capability and is quite promising in enhancing the weight learning performance.
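
To make the setting concrete, the sketch below shows a generic first-order online update for linear regression weights. It is a standard normalized LMS-style rule, offered as an assumption for illustration. It is not the algorithm or the adaptive hyperparameter rule analyzed in the paper, and the names `online_first_order` and `step` are hypothetical.

```python
import numpy as np

def online_first_order(xs, ys, step=0.5):
    """Normalized first-order online update for the model y = w^T x.

    Generic textbook rule used for illustration only.
    """
    w = np.zeros(xs.shape[1])
    for x, y in zip(xs, ys):
        error = y - w @ x                      # prediction error on this sample
        # Move w along x, scaled by the error and normalized by ||x||^2.
        w = w + step * error * x / (1e-12 + x @ x)
    return w

# Noise-free data generated from known weights.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
xs = rng.normal(size=(500, 3))
ys = xs @ w_true
w_hat = online_first_order(xs, ys)
```

With noise-free measurements and inputs that keep exciting every direction, the weight estimate approaches `w_true`. This asymptotic behavior is the kind the abstract is concerned with guaranteeing.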

Abstract:
Recently, a novel multimodal reasoning task named Explanatory Visual Question Answering (EVQA) has been introduced, which combines answering visual questions with multimodal explanation generation to expound upon the underlying reasoning processes. In contrast to conventional Visual Question Answering (VQA) that merely concentrates on providing answers, EVQA aims to improve the explainability and verifiability of reasoning by providing user-friendly explanations. Despite the improved explainability of inferred results, the existing EVQA models still adopt black-box neural networks to infer results, lacking the explainability of the reasoning process. Moreover, existing EVQA models commonly predict answers and explanations in isolation, overlooking the inherent causal correlation between them. To handle these challenges, we propose a Program-guided Variational Causal Inference Network (Pro-VCIN) that integrates neural-symbolic reasoning with variational causal inference and constructs causal correlations between the predicted answers and explanations. First, we utilize pretrained models to extract visual features and convert questions into the corresponding programs. Second, we propose a multimodal program Transformer to translate programs and the related visual features into coherent and rational explanations of the reasoning processes. Finally, we propose a variational causal inference to construct the target structural causal model and predict answers based on the causal correlation to explanations. Comprehensive experiments conducted on EVQA benchmark datasets reveal the superiority of Pro-VCIN in terms of both performance and explainability over state-of-the-art EVQA methods.

Abstract:
Continual semantic segmentation (CSS) based on incremental learning (IL) is a great endeavour in developing human-like segmentation models. However, current CSS approaches encounter challenges in the trade-off between preserving old knowledge and learning new knowledge, where they still need large-scale annotated data for incremental training and lack interpretability. In this paper, we present Learning at a Glance (LAG), an efficient, robust, human-like and interpretable approach for CSS. Specifically, LAG is a simple and model-agnostic architecture, yet it achieves competitive CSS efficiency with limited incremental data. Inspired by human-like recognition patterns, we propose a semantic-invariance modelling approach via semantic features decoupling that simultaneously reconciles solid knowledge inheritance and new-term learning. Concretely, the proposed decoupling takes two forms, i.e., channel-wise decoupling and spatial-level neuron-relevant semantic consistency. Our approach preserves semantic-invariant knowledge as solid prototypes to alleviate catastrophic forgetting, while also constraining sample-specific contents through an asymmetric contrastive learning method to enhance model robustness during IL steps. Experimental results on multiple datasets validate the effectiveness of the proposed method. Furthermore, we introduce a novel CSS protocol that better reflects realistic data-limited CSS settings, and LAG achieves superior performance under multiple data-limited conditions.

Abstract:
Generative systems for graphical assets have the potential to provide users with high quality assets at the push of a button. However, there are many forms of assets, and many approaches for producing them. Quantitative evaluation of these methods is necessary if practitioners wish to validate or compare their implementations, and it also provides benchmarks for new methods to strive for or surpass. While most methods are validated using tried-and-tested metrics within their own domains, there is no unified method of finding the most appropriate metrics. We present a framework, based on a literature pool of close to 200 papers, that provides guidance in selecting metrics to evaluate the validity and quality of artefacts produced, and the operational capabilities of the method.

Abstract:
Graph Neural Networks (GNNs) have gained increasing attention in representation learning for graph-structured data. However, labels are typically limited in graphs, which easily leads to overfitting and poor performance. To solve this problem, we propose a new framework called IGCN, short for Informative Graph Convolutional Network, whose objective is designed to obtain informative embeddings by discarding the task-irrelevant information of the graph data based on mutual information. As the mutual information for irregular data is intractable to compute, our framework is optimized via a surrogate objective, where two terms are derived to approximate the original objective. The former term requires that the mutual information between the learned embeddings and the ground truth be high; we optimize it with the semi-supervised classification loss and the prototype-based supervised contrastive learning loss. The latter term requires that the mutual information between the learned node embeddings and the initial embeddings be high; we maximize it by minimizing the reconstruction loss between them at the feature level and the layer level, using a graph encoder-decoder module and a novel architecture, GCN_Info. Moreover, we provably show that the designed GCN_Info can better alleviate the information loss and preserve as much useful information of the initial embeddings as possible. Experimental results show that IGCN outperforms the state-of-the-art methods on 7 popular datasets.

Abstract:
Superpixel aggregation is a powerful tool for automated neuron segmentation from electron microscopy (EM) volume. However, existing graph partitioning methods for superpixel aggregation still involve two separate stages—model estimation and model solving, and therefore model error is inherent. To address this issue, we integrate the two stages and propose an end-to-end aggregation framework based on deep learning of the minimum cost multicut problem called DeepMulticut. The core challenge lies in differentiating the NP-hard multicut problem, whose constraint number is exponential in the problem size. With this in mind, we resort to relaxing the combinatorial solver—the greedy additive edge contraction (GAEC)—to a continuous Soft-GAEC algorithm, whose limit is shown to be the vanilla GAEC. Such relaxation thus allows the DeepMulticut to integrate edge cost estimators, Edge-CNNs, into a differentiable multicut optimization system and allows a decision-oriented loss to feed decision quality back to the Edge-CNNs for adaptive discriminative feature learning. Hence, the model estimators, Edge-CNNs, can be trained to improve partitioning decisions directly while beyond the NP-hardness. Also, we explain the rationale behind the DeepMulticut framework from the perspective of bi-level optimization. Extensive experiments on three public EM datasets demonstrate the effectiveness of the proposed DeepMulticut.
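
For reference, the vanilla (non-differentiable) greedy additive edge contraction that the abstract relaxes can be sketched as below. The sketch assumes the common multicut convention that positive edge costs are attractive and negative costs are repulsive. The function name `gaec` and the dictionary-based graph representation are illustrative choices, and the Soft-GAEC relaxation itself is not shown.

```python
def gaec(num_nodes, edge_costs):
    """Vanilla greedy additive edge contraction (illustrative sketch).

    edge_costs maps (u, v) -> cost; positive costs are assumed attractive.
    Returns one cluster label per node.
    """
    # Symmetric adjacency between current clusters, with summed costs.
    adj = {u: {} for u in range(num_nodes)}
    for (u, v), cost in edge_costs.items():
        if u == v:
            continue
        adj[u][v] = adj[u].get(v, 0.0) + cost
        adj[v][u] = adj[v].get(u, 0.0) + cost
    label = list(range(num_nodes))
    while True:
        # Find the most attractive remaining edge.
        best = None
        for u in adj:
            for v, cost in adj[u].items():
                if u < v and cost > 0 and (best is None or cost > best[0]):
                    best = (cost, u, v)
        if best is None:
            break  # no attractive edge left: stop contracting
        _, u, v = best
        # Contract v into u; costs of parallel edges are added together.
        for w, cost in adj.pop(v).items():
            del adj[w][v]
            if w == u:
                continue
            adj[u][w] = adj[u].get(w, 0.0) + cost
            adj[w][u] = adj[u][w]
        label = [u if l == v else l for l in label]
    return label

clusters = gaec(4, {(0, 1): 2.0, (1, 2): -3.0, (2, 3): 1.5, (0, 2): 0.5})
```

Each contraction is a hard argmax decision, which is what prevents gradients from flowing back to the edge cost estimators and motivates a continuous relaxation.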

Abstract:
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text, with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.

Abstract:
Knowledge graph reasoning (KGR), aiming to deduce new facts from existing facts based on mined logic rules underlying knowledge graphs (KGs), has become a fast-growing research direction. It has been proven to significantly benefit the usage of KGs in many AI applications, such as question answering and recommendation systems. According to the graph types, existing KGR models can be roughly divided into three categories, i.e., static models, temporal models, and multi-modal models. Early works in this domain mainly focus on static KGR, and recent works try to leverage the temporal and multi-modal information, which are more practical and closer to real-world settings. However, no survey papers and open-source repositories comprehensively summarize and discuss models in this important direction. To fill the gap, we conduct a first survey for knowledge graph reasoning, tracing from static to temporal and then to multi-modal KGs. Concretely, the models are reviewed based on a bi-level taxonomy, i.e., top-level (graph types) and base-level (techniques and scenarios). Besides, the performances, as well as datasets, are summarized and presented. Moreover, we point out the challenges and potential opportunities to enlighten the readers.

Abstract:
Despite acceleration in the use of 3D meshes, it is difficult to find effective mesh quality assessment algorithms that can produce predictions highly correlated with human subjective opinions. Defining mesh quality features is challenging due to the irregular topology of meshes, which are defined on vertices and triangles. To address this, we propose a novel 3D projective structural similarity index (3D-PSSIM) for meshes that is robust to differences in mesh topology. We address topological differences between meshes by introducing multi-view and multi-layer projections that can densely represent the mesh textures and geometrical shapes irrespective of mesh topology. It also addresses occlusion problems that occur during projection. We propose visual sensitivity weights that capture the perceptual sensitivity to the degree of mesh surface curvature. 3D-PSSIM computes perceptual quality predictions by aggregating quality-aware features that are computed in multiple projective spaces onto the mesh domain, rather than on 2D spaces. This allows 3D-PSSIM to determine which parts of a mesh surface are distorted by geometric or color impairments. Experimental results show that 3D-PSSIM can predict mesh quality with high correlation against human subjective judgments, across the presence of noise, even when there are large topological differences, outperforming existing mesh quality assessment models.

Abstract:
Detecting coronary stenosis accurately in X-ray angiography (XRA) is important for diagnosing and treating coronary artery disease (CAD). However, challenges arise from factors like breathing and heart motion, poor imaging quality, and the complex vascular structures, making it difficult to identify stenosis quickly and precisely. In this study, we propose a Quantum Diffusion Model with Spatio-Temporal Feature Sharing to Real-time detect Stenosis (STQD-Det). Our framework consists of two modules: a Sequential Quantum Noise Boxes module and a spatio-temporal feature module. To evaluate the effectiveness of the method, we conducted a 4-fold cross-validation using a dataset consisting of 233 XRA sequences. Our approach achieved an F1 score of 92.39% with a real-time processing speed of 25.08 frames per second. These results outperform 17 state-of-the-art methods. The experimental results show that the proposed method can accomplish stenosis detection quickly and accurately.

Abstract:
Generative Adversarial Networks have achieved significant advancements in generating and editing high-resolution images. However, most methods suffer from either requiring extensive labeled datasets or strong prior knowledge. It is also challenging for them to disentangle correlated attributes with few-shot data. In this paper, we propose FEditNet++, a GAN-based approach to explore latent semantics. It aims to enable attribute editing with limited labeled data and disentangle the correlated attributes. We propose a layer-wise feature contrastive objective, which takes into consideration content consistency and facilitates the invariance of the unrelated attributes before and after editing. Furthermore, we harness the knowledge from the pretrained discriminative model to prevent overfitting. In particular, to solve the entanglement problem between the correlated attributes from data and semantic latent correlation, we extend our model to jointly optimize multiple attributes and propose a novel decoupling loss and cross-assessment loss to disentangle them from both latent and image space. We further propose a novel-attribute disentanglement strategy to enable editing of novel attributes with unknown entanglements. Finally, we extend our model to accurately edit the fine-grained attributes. Qualitative and quantitative assessments demonstrate that our method outperforms state-of-the-art approaches across various datasets, including CelebA-HQ, RaFD, Danbooru2018 and LSUN Church.

Abstract:
Understanding emotions from diverse contexts has received widespread attention in computer vision communities. The core philosophy of Context-Aware Emotion Recognition (CAER) is to provide valuable semantic cues for recognizing the emotions of target persons by leveraging rich contextual information. Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts. Nevertheless, a long-neglected dilemma is that a severe context bias in existing datasets results in an unbalanced distribution of emotional states among different contexts, causing biased visual representation learning. From a causal demystification perspective, the harmful bias is identified as a confounder that misleads existing models to learn spurious correlations based on likelihood estimation, limiting the models’ performance. To address the issue, we embrace causal inference to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a customized causal graph. Subsequently, we present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder, which is built upon backdoor adjustment theory to facilitate seeking approximate causal effects during model training. As a plug-and-play component, CCIM can easily integrate with existing approaches and bring significant improvements. Systematic experiments on three datasets demonstrate the effectiveness of our CCIM.
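
For readers unfamiliar with the backdoor adjustment that the module builds on, the standard identity is reproduced below. The symbol assignment is an assumption made for illustration: X denotes the input, Y the predicted emotion, and Z the context confounder. How CCIM approximates the sum during training is not stated in the abstract and is not shown here.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z)
```

In contrast, the ordinary likelihood P(Y | X) weights each context by P(Z = z | X), which is how an unbalanced distribution of emotions across contexts can introduce spurious correlations.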

Abstract:
This work proposes a LiDAR-inertial-visual fusion framework termed R^3LIVE++ to achieve robust and accurate state estimation while simultaneously reconstructing the radiance map on the fly. R^3LIVE++ consists of a LiDAR-inertial odometry (LIO) and a visual-inertial odometry (VIO), both running in real-time. The LIO subsystem utilizes the measurements from a LiDAR for reconstructing the geometric structure, while the VIO subsystem simultaneously recovers the radiance information of the geometric structure from the input images. R^3LIVE++ is developed based on R^3LIVE and further improves the accuracy in localization and mapping by accounting for the camera photometric calibration and the online estimation of camera exposure time. We conduct more extensive experiments on public and self-collected datasets to compare our proposed system against other state-of-the-art SLAM systems. Quantitative and qualitative results show that R^3LIVE++ has significant improvements over others in both accuracy and robustness. Moreover, to demonstrate the extendability of R^3LIVE++, we developed several applications based on our reconstructed maps, such as high dynamic range (HDR) imaging, virtual environment exploration, and 3D video gaming.

Abstract:
Node representation learning on attributed graphs—whose nodes are associated with rich attributes (e.g., texts and protein sequences)—plays a crucial role in many important downstream tasks. To encode the attributes and graph structures simultaneously, recent studies integrate pre-trained models with graph neural networks (GNNs), where pre-trained models serve as node encoders (NEs) to encode the attributes. As jointly training large NEs and GNNs on large-scale graphs suffers from severe scalability issues, many methods propose to train NEs and GNNs separately. Consequently, they do not take feature convolutions in GNNs into consideration in the training phase of NEs, leading to a significant learning bias relative to the joint training. To address this challenge, we propose an efficient label regularization technique, namely Label Deconvolution (LD), to alleviate the learning bias by a novel and highly scalable approximation to the inverse mapping of GNNs. The inverse mapping leads to an objective function that is equivalent to that by the joint training, while it can effectively incorporate GNNs in the training phase of NEs against the learning bias. More importantly, we show that LD converges to the optimal objective function values by the joint training under mild assumptions. Experiments demonstrate LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.

Abstract:
The tracking-by-detection paradigm currently dominates multiple target tracking algorithms. It usually includes three tasks: target detection, appearance feature embedding, and data association. Carrying out these three tasks successively usually leads to lower tracking efficiency. In this paper, we propose a one-stage anchor-free multiple task learning framework which carries out target detection and appearance feature embedding in parallel to substantially increase the tracking speed. This framework simultaneously predicts a target detection and produces a feature embedding for each location, by sharing a pyramid of feature maps. We propose a deformable local attention module which utilizes the correlations between features at different locations within a target to obtain more discriminative features. We further propose a task-aware prediction module which utilizes deformable convolutions to select the most suitable locations for the different tasks. At the selected locations, classification of samples into foreground or background, appearance feature embedding, and target box regression are carried out. Two effective training strategies, regression range overlapping and sample reweighting, are proposed to reduce missed detections in dense scenes. Ambiguous samples whose identities are difficult to determine are effectively dealt with to obtain more accurate feature embedding of target appearance. An appearance-enhanced non-maximum suppression is proposed to reduce over-suppression of true targets in crowded scenes. Based on the one-stage anchor-free network with the deformable local attention module and the task-aware prediction module, we implement a new online multiple target tracker. Experimental results show that our tracker achieves a very fast speed while maintaining a high tracking accuracy.

Abstract:
Recently, there has been a trend of designing neural data structures to go beyond handcrafted data structures by leveraging patterns of data distributions for better accuracy and adaptivity. Sketches are widely used data structures in real-time web analysis, network monitoring, and self-driving to estimate item frequencies of data streams within limited space. However, existing sketches have not fully exploited the patterns of the data stream distributions, making it challenging to tightly couple them with neural networks that excel at memorizing pattern information. Starting from this premise, we envision a pure neural data structure as a base sketch, which we term the meta-sketch, to reinvent the base structure of conventional sketches. The meta-sketch learns basic sketching abilities from meta-tasks constituted with synthetic datasets following Zipf distributions in the pre-training phase and can be quickly adapted to real (skewed) distributions in the adaptation phase. The meta-sketch not only surpasses its competitors in sketching conventional data streams but also holds good potential in supporting more complex streaming data, such as multimedia and graph stream scenarios. Extensive experiments demonstrate the superiority of the meta-sketch and offer insights into its working mechanism.
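
As background on what a conventional, handcrafted sketch looks like, a Count-Min sketch is shown below as one well-known example. It is chosen here for illustration and is not named in the abstract. The class name, the default `width` and `depth`, and the use of salted BLAKE2b hashes are assumptions of this sketch. The neural meta-sketch itself is not shown.

```python
import hashlib

class CountMinSketch:
    """Conventional frequency sketch: fixed memory, never underestimates."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, obtained by salting with the row id.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch(width=64, depth=3)
for token in ["a", "b", "a", "c", "a"]:
    cms.add(token)
```

The hash functions here are fixed in advance and ignore the data distribution, which is the limitation the abstract points to when motivating a learned base sketch.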

Abstract:
Radiography imaging protocols focus on particular body regions, therefore producing images of great similarity and yielding recurrent anatomical structures across patients. Exploiting this structured information could potentially ease the detection of anomalies from radiography images. To this end, we propose a Simple Space-Aware Memory Matrix for In-painting and Detecting anomalies from radiography images (abbreviated as SimSID). We formulate anomaly detection as an image reconstruction task, consisting of a space-aware memory matrix and an in-painting block in the feature space. During training, SimSID can taxonomize the ingrained anatomical structures into recurrent visual patterns, and at inference, it can identify anomalies (unseen/modified visual patterns) from the test image. Our SimSID surpasses the state of the art in unsupervised anomaly detection by +8.0%, +5.0%, and +9.9% AUC scores on ZhangLab, COVIDx, and CheXpert benchmark datasets, respectively.

Abstract:
Skeleton-based exercise assessment focuses on evaluating the correctness or quality of an exercise performed by a subject. Skeleton data provide two groups of features (i.e., position and orientation), which existing methods have not fully harnessed. We previously proposed an ensemble-based graph convolutional network (EGCN) that considers both position and orientation features to construct a model-based approach. Integrating these types of features achieved better performance than available methods. However, EGCN lacked a fusion strategy across the data, feature, decision, and model levels. In this paper, we present an advanced framework, EGCN++, for rehabilitation exercise assessment. Based on EGCN, a new fusion strategy called MLE-PO is proposed for EGCN++; this technique considers fusion at the data and model levels. We conduct extensive cross-validation experiments and investigate the consistency between machine and human evaluations on three datasets: UI-PRMD, KIMORE, and EHE. Results demonstrate that MLE-PO outperforms other EGCN ensemble strategies and representative baselines. Furthermore, the MLE-PO's model evaluation scores are more quantitatively consistent with clinical evaluations than other ensemble strategies.

Affiliations: College of Artificial Intelligence, Dalian Maritime University, Dalian, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Safety, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Cyber Science and Engineering, Southeast University, Nanjing, China

Abstract:
Human parsing has attracted considerable research interest due to its broad potential applications in the computer vision community. In this paper, we explore several useful properties, including high-resolution representation, auxiliary guidance, and model robustness, which collectively contribute to a novel method for accurate human parsing in both simple and complex scenes. Starting from simple scenes, we propose the boundary-aware hybrid resolution network (BHRN), an advanced human parsing network. BHRN utilizes deconvolutional layers and multi-scale supervision to generate rich high-resolution representations. Additionally, it includes an edge perceiving branch designed to enhance the fineness of part boundaries. Building on BHRN, we construct a dual-task mutual learning (DTML) framework. It not only provides implicit guidance to assist the parser by incorporating boundary features, but also explicitly maintains the high-order consistency between the parsing prediction and the ground truth. Toward complex scenes, we develop a domain transform method to enhance the model robustness. By transforming the input space from the spatial domain to the polar harmonic Fourier moment domain, the mapping relationship to the output semantic space is highly stable. This transformation yields robust representations for both clean and corrupted data. When evaluated on standard benchmark datasets, our method achieves superior performance compared to state-of-the-art human parsing methods. Furthermore, our domain transform strategy significantly improves the robustness of DTML in most complex scenes.

Abstract:
Deploying models on target domain data subject to distribution shift requires adaptation. Test-time training (TTT) emerges as a solution to this adaptation under a realistic scenario where access to full source domain data is not available, and instant inference on the target domain is required. Despite many efforts on TTT, there is confusion over the experimental settings, leading to unfair comparisons. In this work, we first revisit TTT assumptions and categorize TTT protocols by two key factors, i.e., whether testing data is sequentially streamed and whether the source model is allowed to be trained with a modified loss function. Among the multiple protocols, we adopt a realistic sequential test-time training (sTTT) protocol, under which we develop a test-time anchored clustering (TTAC) approach to enable stronger test-time feature learning. TTAC discovers clusters in both source and target domains and matches the target clusters to the source ones to improve adaptation. When source domain information is strictly absent (i.e., source-free) we further develop an efficient method to infer source domain distributions for anchored clustering. Finally, self-training (ST) has demonstrated great success in learning from unlabeled data, and we find empirically that applying ST alone to TTT is prone to confirmation bias. Therefore, a more effective TTT approach is introduced by regularizing self-training with anchored clustering, and the improved model is referred to as TTAC++. We demonstrate that, under all TTT protocols, TTAC++ consistently outperforms the state-of-the-art methods on five TTT datasets, including corrupted target domain, selected hard samples, synthetic-to-real adaptation and adversarially attacked target domain. We hope this work will provide a fair benchmarking of TTT methods, and future research should be compared within respective protocols.

Abstract:
Negative sampling has swiftly risen to prominence as a focal point of research, with wide-ranging applications spanning machine learning, computer vision, natural language processing, data mining, and recommender systems. This surge in interest prompts us to question the fundamental impact of negative sampling: Does negative sampling really matter? Is there a general framework that can incorporate all negative sampling methods? In what fields is it applied? Addressing these questions, we propose a general framework that leverages negative sampling. Delving into the history of negative sampling, we chart its evolution across five distinct trajectories. We dissect and categorize the strategies used to select negative sample candidates, detailing global, local, mini-batch, hop, and memory-based approaches. Our comprehensive review extends to an analysis of current negative sampling methodologies, systematically grouping them into five classifications: static, hard, GAN-based, auxiliary-based, and in-batch. Beyond detailed categorization, we explore the practical application of negative sampling across various fields. Finally, we briefly discuss open problems and future directions for negative sampling.

Abstract:
Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. As a result, it is necessary to collect and label data-text pairs for training, which is both costly and time-consuming. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (imaging or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. 
We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and “believable” outputs and significantly outperforms existing zero-shot methods.

Abstract:
6-DoF object pose estimation from a monocular image is a challenging problem, where a post-refinement procedure is generally needed for high-precision estimation. In this paper, we propose a framework, dubbed RNNPose, based on a recurrent neural network (RNN) for object pose refinement, which is robust to erroneous initial poses and occlusions. During the recurrent iterations, object pose refinement is formulated as a non-linear least squares problem based on the estimated correspondence field (between a rendered image and the observed image). The problem is then solved by a differentiable Levenberg-Marquardt (LM) algorithm enabling end-to-end training. The correspondence field estimation and pose refinement are conducted alternately in each iteration to improve the object poses. Furthermore, to improve the robustness against occlusion, we introduce a consistency-check mechanism based on the learned descriptors of the 3D model and observed 2D images, which downweights the unreliable correspondences during pose optimization. We evaluate RNNPose on several public datasets, including LINEMOD, Occlusion-LINEMOD, YCB-Video and TLESS. We demonstrate state-of-the-art performance and strong robustness against severe clutter and occlusion in the scenes. Extensive experiments validate the effectiveness of our proposed method. Besides, the extended system based on RNNPose successfully generalizes to multi-instance scenarios and achieves top-tier performance on the TLESS dataset.
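
As a reminder of what a Levenberg-Marquardt step looks like, the block below writes out the generic damped least-squares update. This is the textbook form, not the paper's exact formulation: the symbols $\xi$ (a vector of pose parameters), $r$ (the residuals derived from the correspondence field), $J$ and $\lambda$ are illustrative assumptions, and the abstract does not state how RNNPose parameterizes the pose.

```latex
% Generic Levenberg-Marquardt step; all symbols are illustrative assumptions
\min_{\xi}\ \tfrac{1}{2}\,\lVert r(\xi)\rVert_2^2,
\qquad
\bigl(J^\top J + \lambda I\bigr)\,\Delta\xi = -\,J^\top r(\xi),
\qquad
\xi \leftarrow \xi + \Delta\xi,
\qquad
J = \frac{\partial r}{\partial \xi}.
```

With $\lambda \to 0$ the step approaches Gauss-Newton, and with large $\lambda$ it approaches a short gradient-descent step. Each update is a linear solve, so gradients can be propagated through it, which is consistent with the end-to-end training mentioned above.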

Abstract:
The introduction of Transformer architectures – with the self-attention mechanism – in automatic Natural Language Generation (NLG) is a breakthrough in solving general task-oriented problems, such as the simple production of long text excerpts that resemble ones written by humans. While the performance of GPT-X architectures is there for all to see, many efforts are underway to penetrate the secrets of these black-boxes in terms of intelligent information processing whose output statistical distributions resemble that of natural language. In this work, through the complexity science framework, a comparative study of the stochastic processes underlying the texts produced by the English version of GPT-2 with respect to texts produced by human beings, notably novels in English and programming codes, is offered. The investigation, of a methodological nature, consists first of all of an analysis phase in which the Multifractal Detrended Fluctuation Analysis and the Recurrence Quantification Analysis – together with Zipf's law and approximate entropy – are adopted to characterize long-term correlations, regularities and recurrences in human and machine-produced texts. Results show several peculiarities and trends in terms of long-range correlations and recurrences in the latter case. The synthesis phase, on the other hand, uses the complexity measures to build synthetic text descriptors – hence a suitable text embedding – which serve to constitute the features for feeding a machine learning system designed to operate feature selection through an evolutionary technique. Using multivariate analysis, the grouping tendency of the three analyzed text types is then shown, allowing GPT-2 texts to be placed in between natural language texts and computer codes. Similarly, the classification task demonstrates that, given the high accuracy obtained in the automatic discrimination of text classes, the proposed set of complexity measures is highly informative.
These interesting results allow us to add another piece to the theoretical understanding of the surprising results obtained by NLG systems based on deep learning and let us improve the design of new informetrics or text mining systems for text classification, fake news detection, or even plagiarism detection.
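
Of the complexity measures named in this abstract, Zipf's law is the simplest to illustrate: token frequency falls off roughly as a power of frequency rank. The sketch below is not the authors' pipeline. It estimates the exponent with an ordinary least-squares fit in log-log space; the function name `zipf_exponent` and the synthetic corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate s in frequency ~ C / rank**s by least squares on log-log data.

    Hypothetical helper for illustration; needs at least two distinct tokens.
    """
    # Frequencies sorted from most to least common give the rank ordering.
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying law, so negate it.
    return -sxy / sxx


# Synthetic corpus in which the token of rank r occurs exactly 1200 / r times.
tokens = []
for rank in range(1, 7):
    tokens += [f"w{rank}"] * (1200 // rank)

exponent = zipf_exponent(tokens)
print(f"estimated Zipf exponent: {exponent:.3f}")
```

On real text the fit is usually restricted to a range of ranks, since the head and tail of the distribution tend to deviate from a pure power law.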

Affiliations: School of Software, Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China; School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China; College of Computer and Information, Hohai University, Nanjing, Jiangsu, China; Engineering Research Center of Digital Forensics, Ministry of Education, the School of Computer Science, Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China

Abstract:
As a fundamental mathematical problem in the field of machine learning, the linear separability test still lacks a theoretically complete and computationally efficient method. This paper proposes and proves a sufficient and necessary condition for the linear separability test based on a sphere model. The advantage of this test method is two-fold: (1) it provides not only a qualitative test of linear separability but also a quantitative analysis of the separability of linearly separable instances; (2) it has low time cost and is more efficient than existing test methods. The proposed method is validated through a large number of experiments on benchmark datasets and artificial datasets, demonstrating both its correctness and efficiency.

Affiliations: School of Software Engineering, University of Science and Technology of China, Hefei, China; School of Computing, National University of Singapore, Singapore; Key Laboratory for Embedded and Network Computing of Hunan Province, Hunan University, Changsha, China; Visual Computing Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia; Electrical Engineering and Computer Science, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia; Computer Science, National University of Singapore, Singapore; School of Computer Science, University of Science and Technology of China, Hefei, China; West China Biomedical Big Data Center, Sichuan University, Chengdu, China; School of Data Science, University of Science and Technology of China, Hefei, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China

Abstract:
The training and inference of Graph Neural Networks (GNNs) are costly when scaling up to large-scale graphs. Graph Lottery Ticket (GLT) has presented the first attempt to accelerate GNN inference on large-scale graphs by jointly pruning the graph structure and the model weights. Though promising, GLT encounters robustness and generalization issues when deployed in real-world scenarios, which are also long-standing and critical problems in deep learning. In real-world scenarios, the distribution of unseen test data is typically diverse. We attribute the failures on out-of-distribution (OOD) data to the incapability of discerning causal patterns, which remain stable amidst distribution shifts. In traditional sparse graph learning, the model performance deteriorates dramatically as the graph/network sparsity exceeds a certain high level. Worse still, the pruned GNNs are hard to generalize to unseen graph data due to the limited training set at hand. To tackle these issues, we propose the Resilient Graph Lottery Ticket (RGLT) to find more robust and generalizable GLT in GNNs. Concretely, we reactivate a fraction of weights/edges by instantaneous gradient information at each pruning point. After sufficient pruning, we conduct environmental interventions to extrapolate potential test distribution. Finally, we perform model averaging over the last several rounds to further improve generalization. We provide multiple examples and theoretical analyses that underpin the universality and reliability of our proposal. Further, RGLT has been experimentally verified across various independent and identically distributed (IID) and out-of-distribution (OOD) graph benchmarks.

Affiliations: School of Biomedical Engineering and Imaging Sciences, King's College London, London, U.K.; Department of Radiology, University Hospitals Leuven, Leuven, Belgium; Department of Informatics, Technical University Munich, Munich, Germany; Department of Neuroradiology and Clinical Neuroscience Center, University Hospital Zurich and University of Zurich, Zurich, Switzerland; Institute for Women's Health, University College London, London, U.K.; Center for MR Research, University Children's Hospital Zurich, University of Zurich, Zurich, Switzerland; Department of Biomedical Imaging and Image-guided Therapy, Medical University of Vienna, Vienna, Austria

Abstract:
Deep learning models for medical image segmentation can fail unexpectedly and spectacularly for pathological cases and images acquired at different centers than training images, with labeling errors that violate expert knowledge. Such errors undermine the trustworthiness of deep learning models for medical image segmentation. Mechanisms for detecting and correcting such failures are essential for safely translating this technology into clinics and are likely to be a requirement of future regulations on artificial intelligence (AI). In this work, we propose a trustworthy AI theoretical framework and a practical system that can augment any backbone AI system using a fallback method and a fail-safe mechanism based on Dempster-Shafer theory. Our approach relies on an actionable definition of trustworthy AI. Our method automatically discards the voxel-level labels predicted by the backbone AI that violate expert knowledge and relies on a fallback for those voxels. We demonstrate the effectiveness of the proposed trustworthy AI approach on the largest reported annotated dataset of fetal MRI consisting of 540 manually annotated fetal brain 3D T2w MRIs from 13 centers. Our trustworthy AI method improves the robustness of four backbone AI models for fetal brain MRIs acquired across various centers and for fetuses with various brain abnormalities.

Abstract:
Monocular depth estimation has been widely studied, and significant improvements in performance have been recently reported. However, most previous works are evaluated on a few benchmark datasets, such as the KITTI datasets, and none of the works provide an in-depth analysis of the generalization performance of monocular depth estimation. In this paper, we investigate in depth various backbone networks (e.g., CNN and Transformer models) with respect to the generalization of monocular depth estimation. First, we evaluate state-of-the-art models on both in-distribution and out-of-distribution datasets, which have never been seen during network training. Then, we investigate the internal properties of the representations from the intermediate layers of CNN-/Transformer-based models using synthetic texture-shifted datasets. Through extensive experiments, we observe that the Transformers exhibit a strong shape bias, whereas CNNs exhibit a strong texture bias. We also discover that texture-biased models exhibit worse generalization performance for monocular depth estimation than shape-biased models. We demonstrate that similar aspects are observed in real-world driving datasets captured under diverse environments. Lastly, we conduct a dense ablation study with various backbone networks which are utilized in modern strategies. The experiments demonstrate that the intrinsic locality of the CNNs and the self-attention of the Transformers induce texture bias and shape bias, respectively.

Abstract:
Unsupervised domain adaptation without accessing expensive annotation processes of target data has achieved remarkable successes in semantic segmentation. However, most existing state-of-the-art methods cannot explore whether semantic representations across domains are transferable or not, which may result in the negative transfer brought by irrelevant knowledge. To tackle this challenge, in this paper, we develop a novel Knowledge Aggregation-induced Transferability Perception (KATP) module for unsupervised domain adaptation, which is a pioneering attempt to distinguish transferable or untransferable knowledge across domains. Specifically, the KATP module is designed to quantify which semantic knowledge across domains is transferable, by incorporating the transferability information propagation from constructed global category-wise prototypes. Based on KATP, we design a novel KATP Adaptation Network (KATPAN) to determine where and how to transfer. The KATPAN contains a transferable appearance translation module $\mathcal{T}_A(\cdot)$ and a transferable representation augmentation module $\mathcal{T}_R(\cdot)$, where both modules construct a virtuous circle of performance promotion. $\mathcal{T}_A(\cdot)$ develops a transferability-aware information bottleneck to highlight where to adapt transferable visual characterizations and modality information; $\mathcal{T}_R(\cdot)$ explores how to augment transferable representations while abandoning untransferable information, and promotes the translation performance of $\mathcal{T}_A(\cdot)$ in return. Comprehensive experiments on several representative benchmark datasets and a medical dataset support the state-of-the-art performance of our model.

Abstract:
Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware multimodal fusion approach that quantifies modality-wise aleatoric or data uncertainty towards emotion prediction. We propose a novel fusion framework, in which latent distributions over unimodal temporal context are learned by constraining their variance. These variance constraints, Calibration and Ordinal Ranking, are designed such that the variance estimated for a modality can represent how informative the temporal context of that modality is w.r.t. emotion recognition. When well-calibrated, modality-wise uncertainty scores indicate how much their corresponding predictions are likely to differ from the ground truth labels. Well-ranked uncertainty scores allow the ordinal ranking of different frames across different modalities. To jointly impose both these constraints, we propose a softmax distributional matching loss. Our evaluation on AVEC 2019 CES, CMU-MOSEI, and IEMOCAP datasets shows that the proposed multimodal fusion method not only improves the generalisation performance of emotion recognition models and their predictive uncertainty estimates, but also makes the models robust to novel noise patterns encountered at test time.

Abstract:
The British landscape painter John Constable is considered foundational for the Realist movement in 19th-century European painting. Constable's painted skies, in particular, were seen as remarkably accurate by his contemporaries, an impression shared by many viewers today. Yet, assessing the accuracy of realist paintings like Constable's is subjective or intuitive, even for professional art historians, making it difficult to say with certainty what set Constable's skies apart from those of his contemporaries. Our goal is to contribute to a more objective understanding of Constable's realism. We propose a new machine-learning-based paradigm for studying pictorial realism in an explainable way. Our framework assesses realism by measuring the similarity between clouds painted by artists noted for their skies, like Constable, and photographs of clouds. The experimental results of cloud classification show that Constable approximates more consistently than his contemporaries the formal features of actual clouds in his paintings. The study, as a novel interdisciplinary approach that combines computer vision and machine learning, meteorology, and art history, is a springboard for broader and deeper analyses of pictorial realism.

Abstract:
Unsupervised domain adaptation (UDA) and domain generalization (DG) enable machine learning models trained on a source domain to perform well on unlabeled or even unseen target domains. As previous UDA&DG semantic segmentation methods are mostly based on outdated networks, we benchmark more recent architectures, reveal the potential of Transformers, and design the DAFormer network tailored for UDA&DG. It is enabled by three training strategies to avoid overfitting to the source domain: While (1) Rare Class Sampling mitigates the bias toward common source domain classes, (2) a Thing-Class ImageNet Feature Distance and (3) a learning rate warmup promote feature transfer from ImageNet pretraining. As UDA&DG are usually GPU memory intensive, most previous methods downscale or crop images. However, low-resolution predictions often fail to preserve fine details while models trained with cropped images fall short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution framework for UDA&DG, that combines the strengths of small high-resolution crops to preserve fine segmentation details and large low-resolution crops to capture long-range context dependencies with a learned scale attention. DAFormer and HRDA significantly improve the state-of-the-art UDA&DG by more than 10 mIoU on 5 different benchmarks.

Abstract:
Texture recognition is a challenging visual task since its multiple primitives or attributes can be perceived from the texture image under different spatial contexts. Existing approaches predominantly built upon CNN incorporate rich local descriptors with orderless aggregation to capture invariance to the spatial layout. However, these methods ignore the inherent structure relation organized by primitives and the semantic concept described by attributes, which are critical cues for texture representation. In this paper, we propose a novel Multiple Primitives and Attributes Perception network (MPAP) that extracts features by modeling the relation of bottom-up structure and top-down attribute in a multi-branch unified framework. A bottom-up process is first proposed to capture the inherent relation of various primitive structures by leveraging structure dependency and spatial order information. Then, a top-down process is introduced to model the latent relation of multiple attributes by transferring attribute-related features between adjacent branches. Moreover, an augmentation module is devised to bridge the gap between high-level attributes and low-level structure features. MPAP learns representations by combining the bottom-up and top-down processes in a mutually reinforcing manner. Experimental results on six challenging texture datasets demonstrate the superiority of MPAP over state-of-the-art methods in terms of accuracy, robustness, and efficiency.

Abstract:
Surface reconstruction for point clouds is an important task in 3D computer vision. Most of the latest methods resolve this problem by learning signed distance functions from point clouds, which are limited to reconstructing closed surfaces. Some other methods have tried to represent open surfaces using unsigned distance functions (UDF) which are learned from ground truth distances. However, the learned UDF struggles to provide smooth distance fields due to the discontinuous character of point clouds. In this paper, we propose CAP-UDF, a novel method to learn consistency-aware UDF from raw point clouds. We achieve this by learning to move queries onto the surface with a field consistency constraint, which also enables us to progressively estimate a more accurate surface. Specifically, we train a neural network to gradually infer the relationship between queries and the approximated surface by searching for the moving target of queries in a dynamic way. Meanwhile, we introduce a polygonization algorithm to extract surfaces using the gradients of the learned UDF. We conduct comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and further explore our performance in unsupervised point normal estimation, which demonstrate non-trivial improvements of CAP-UDF over the state-of-the-art methods.
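
The idea of moving queries onto the surface can be illustrated with a toy example: a query is displaced against the normalized gradient of the unsigned distance field by its own distance value. The sketch below is an assumption-laden illustration rather than the CAP-UDF implementation. It replaces the learned network with an analytic UDF of a unit sphere, uses finite differences instead of automatic differentiation, and all function names are hypothetical.

```python
import numpy as np


def udf_sphere(q, radius=1.0):
    """Analytic unsigned distance from points q of shape (N, 3) to a sphere
    centred at the origin. Stand-in for a learned UDF (assumption)."""
    return np.abs(np.linalg.norm(q, axis=1) - radius)


def numerical_gradient(f, q, eps=1e-5):
    """Central-difference gradient of the scalar field f at each row of q."""
    grad = np.zeros_like(q)
    for k in range(q.shape[1]):
        step = np.zeros(q.shape[1])
        step[k] = eps
        grad[:, k] = (f(q + step) - f(q - step)) / (2 * eps)
    return grad


def move_to_surface(f, q):
    """Move each query by its distance value against the unit UDF gradient."""
    dist = f(q)
    grad = numerical_gradient(f, q)
    norm = np.linalg.norm(grad, axis=1, keepdims=True)
    unit = grad / np.maximum(norm, 1e-12)  # guard against zero gradients
    return q - dist[:, None] * unit


# Two queries outside the sphere and one inside it.
queries = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 1.0, 1.0]])
moved = move_to_surface(udf_sphere, queries)
print(np.linalg.norm(moved, axis=1))  # distances of moved queries from the origin
```

With an exact distance field a single step lands on the surface. With a learned, imperfect field the step is only approximate, which is where a consistency constraint and progressive refinement become relevant.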

Abstract:
In the real world, data distributions often exhibit multiple granularities. However, the majority of existing neighbor-based machine-learning methods rely on manually setting a single-granularity for neighbor relationships. These methods typically handle each data point using a single-granularity approach, which severely affects their accuracy and efficiency. This paper adopts a dual-pronged approach: it constructs a multi-granularity representation of the data using the granular-ball computing model, thereby boosting the algorithm’s time efficiency. It leverages the multi-granularity representation of the data to create tailored, multi-granularity neighborhood relationships for different task scenarios, resulting in improved algorithmic accuracy. The experimental results convincingly demonstrate that the proposed multi-granularity neighbor relationship effectively enhances KNN classification and clustering methods.

Abstract:
An image line segment is a fundamental low-level visual feature that delineates straight, slender, and uninterrupted portions of objects and scenarios within images. Detection and description of line segments lay the basis for numerous vision tasks. Although many studies have aimed to detect and describe line segments, a comprehensive review is lacking, obstructing their progress. This study fills the gap by comprehensively reviewing related studies on detecting and describing two-dimensional image line segments to provide researchers with an overall picture and deep understanding. Based on their mechanisms, two taxonomies for line segment detection and description are presented to introduce, analyze, and summarize these studies, facilitating researchers to learn about them quickly and extensively. The key issues, core ideas, advantages and disadvantages of existing methods, and their potential applications for each category are analyzed and summarized, including previously unknown findings. The challenges in existing methods and corresponding insights for potentially solving them are also provided to inspire researchers. In addition, some state-of-the-art line segment detection and description algorithms are evaluated without bias, and the evaluation code will be publicly available. The theoretical analysis, coupled with the experimental results, can guide researchers in selecting the best method for their intended vision applications. Finally, this study provides insights for potentially interesting future research directions to attract more attention from researchers to this field.

Abstract:
Accurately capturing dynamic scenes with wide-ranging motion and light intensity is crucial for many vision applications. However, acquiring high-speed high dynamic range (HDR) video is challenging because the camera's frame rate restricts its dynamic range. Existing methods sacrifice speed to acquire multi-exposure frames. Yet, misaligned motion in these frames can still pose complications for HDR fusion algorithms, resulting in artifacts. Instead of frame-based exposures, we sample the videos using individual pixels at varying exposures and phase offsets. Implemented on a monochrome pixel-wise programmable image sensor, our sampling pattern captures fast motion at a high dynamic range. We then transform pixel-wise outputs into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring. We demonstrate aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds — both challenging conditions for conventional cameras. By combining the versatility of pixel-wise sampling patterns with the strength of deep neural networks at decoding complex scenes, our method greatly enhances the vision system's adaptability and performance in dynamic conditions.

Abstract:
The production of food, feed, fiber, and fuel is a key task of agriculture, which has to cope with many challenges in the upcoming decades, e.g., a higher demand, climate change, lack of workers, and the availability of arable land. Vision systems can support making better and more sustainable field management decisions, but also support the breeding of new crop varieties by allowing temporally dense and reproducible measurements. Recently, agricultural robotics has attracted increasing interest in the vision and robotics communities since it is a promising avenue for coping with the aforementioned lack of workers and enabling more sustainable production. While large datasets and benchmarks in other domains are readily available and enable significant progress, agricultural datasets and benchmarks are comparatively rare. We present an annotated dataset and benchmarks for the semantic interpretation of real agricultural fields. Our dataset, recorded with a UAV, provides high-quality, pixel-wise annotations of crops and weeds, as well as crop leaf instances. Furthermore, we provide benchmarks for various tasks on a hidden test set comprised of different fields: known fields covered by the training data and a completely unseen field.

Abstract:
Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies have explored various aspects of VQA but have largely overlooked this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. Additionally, we propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the dataset available to facilitate further exploration in this area.

Abstract:
Brain network analysis plays an increasingly important role in studying brain function and exploring disease mechanisms. However, existing brain network construction tools have some limitations, including dependency on empirical users, weak consistency in repeated experiments and time-consuming processes. In this work, a diffusion-based brain network pipeline, DGCL, is designed for end-to-end construction of brain networks. Initially, the brain region-aware module (BRAM) precisely determines the spatial locations of brain regions by the diffusion process, avoiding subjective parameter selection. Subsequently, DGCL employs graph contrastive learning to optimize brain connections by eliminating individual differences in redundant connections unrelated to diseases, thereby enhancing the consistency of brain networks within the same group. Finally, the node-graph contrastive loss and classification loss jointly constrain the learning process of the model to obtain the reconstructed brain network, which is then used to analyze important brain connections. Validation on two datasets, ADNI and ABIDE, demonstrates that DGCL surpasses traditional methods and other deep learning models in predicting disease development stages. Significantly, the proposed model improves the efficiency and generalization of brain network construction. In summary, the proposed DGCL can serve as a universal brain network construction scheme, which can effectively identify important brain connections through generative paradigms and has the potential to provide disease interpretability support for neuroscience research.

Affiliations: School of Computer Science and Technology, Tiangong University, Tianjin, China; College of Computer Science, Chongqing University, Chongqing, China; Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China; Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China

Abstract:
Multimodal summarization (MS) for videos aims to generate summaries from multi-source information (e.g., video and text transcript), showing promising progress recently. However, existing works are limited to monolingual scenarios, neglecting non-native viewers' needs to understand videos in other languages. This motivates us to introduce multimodal cross-lingual summarization for videos (MCLS), which aims to generate cross-lingual summaries from multimodal input of videos. Considering the challenge of high annotation cost and resource constraints in MCLS, we propose a knowledge distillation (KD) induced triple-stage training method to assist MCLS by transferring knowledge from abundant monolingual MS data to those data with insufficient volumes. In the triple-stage training method, a video-guided dual fusion network (VDF) is designed as the backbone network to integrate multimodal and cross-lingual information through diverse fusion strategies in the encoder and decoder. Moreover, we propose two cross-lingual knowledge distillation strategies: adaptive pooling distillation and language-adaptive warping distillation (LAWD), designed for encoder-level and vocab-level distillation objects to facilitate effective knowledge transfer across cross-lingual sequences of varying lengths between MS and MCLS models. Specifically, to tackle the challenge of unequal length of parallel cross-language sequences in KD, LAWD can directly conduct cross-language distillation while keeping the language feature shape unchanged to reduce potential information loss. We meticulously annotated the How2-MCLS dataset based on the How2 dataset to simulate MCLS scenarios. Experimental results show that the proposed method achieves competitive performance compared to strong baselines, and can bring substantial performance improvements to MCLS models by transferring knowledge from the MS model.

Abstract:
Due to the costliness of labelled data in real-world applications, semi-supervised learning, underpinned by pseudo labelling, is an appealing solution. However, handling confusing samples is nontrivial: discarding valuable confusing samples would compromise the model generalisation while using them for training would exacerbate the issue of confirmation bias caused by the resulting inevitable mislabelling. To solve this problem, this paper proposes to use confusing samples proactively without label correction. Specifically, a Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation even without a concrete label. This provides an upper bound for inter-class information sharing capacity, which eventually leads to a better embedding space. Extensive experiments on two mainstream dense prediction tasks — semantic segmentation and object detection, demonstrate that the proposed VC learning significantly surpasses the state-of-the-art, especially when only very few labels are available. Our intriguing findings highlight the usage of VC learning in dense vision tasks.

Abstract:
How to effectively explore the colors of exemplars and propagate them to colorize each frame is vital for exemplar-based video colorization. In this article, we present BiSTNet, which explores the colors of exemplars and uses them to help video colorization through bidirectional temporal feature fusion guided by a semantic image prior. We first establish the semantic correspondence between each frame and the exemplars in deep feature space to explore color information from exemplars. Then, we develop a simple yet effective bidirectional temporal feature fusion module to propagate the colors of exemplars into each frame and avoid inaccurate alignment. We note that there usually exist color-bleeding artifacts around the boundaries of important objects in videos. To overcome this problem, we develop a mixed expert block to extract semantic information for modeling the object boundaries of frames so that the semantic image prior can better guide the colorization process. In addition, we develop a multi-scale refinement block to progressively colorize frames in a coarse-to-fine manner. Extensive experimental results demonstrate that the proposed BiSTNet performs favorably against state-of-the-art methods on the benchmark datasets and real-world scenes. Moreover, BiSTNet won one championship in the NTIRE 2023 video colorization challenge (Kang et al. 2023).

Abstract:
As an effective technique to extend the depth-of-field (DOF) of optical lenses, multi-focus image fusion has recently become an active topic in the image processing community. However, a major unsolved problem in this field is the lack of universal criteria for selecting objective evaluation metrics. Consequently, the metrics utilized in different studies often vary significantly, which makes unbiased evaluation difficult. To address this problem, this paper proposes a statistic-based approach for verifying the effectiveness of objective metrics in multi-focus image fusion. The core idea is to adopt statistical correlation measures to evaluate the performance consistency between a certain fusion metric and some popular full-reference image quality assessment models. In addition, a convolutional neural network (CNN)-based fusion metric is presented to measure the similarity between the source images and the fused image based on the semantic features at multiple abstraction levels. A comparative study is conducted to evaluate 20 existing fusion metrics using the proposed statistic-based approach on a recently released large-scale, realistic multi-focus image fusion dataset with ground truth. Experimental results demonstrate the feasibility of the proposed approach in evaluating the effectiveness of objective metrics and the advantage of our CNN-based metric.
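
The consistency check described above can be illustrated with a rank correlation. The following is a minimal sketch, assuming Spearman's coefficient as the statistical correlation measure (the abstract does not name a specific one); the function names and the sample scores are hypothetical and used only for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five fused images: a no-reference fusion metric
# versus a full-reference quality model computed against the ground truth.
fusion_metric = [0.61, 0.72, 0.55, 0.90, 0.83]
full_reference = [0.58, 0.75, 0.50, 0.95, 0.80]
rho = spearman(fusion_metric, full_reference)
```

A coefficient close to 1 would indicate that the fusion metric orders the images the same way as the full-reference model does.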

Affiliations: Harbin Institute of Technology, Harbin, China; SKLCCSE, Institute of Artificial Intelligence, Beihang University, Beijing, China; School of Mathematics and Statistics, MOE Key Laboratory for Complexity Science in Aerospace, and Xi'an-Budapest Joint Research Center for Combinatorics, Northwestern Polytechnical University, Xi’an, China; University of Amsterdam, Amsterdam, Netherlands; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, UAE

Abstract:
One fundamental problem in deep learning is understanding the excellent performance of deep Neural Networks (NNs) in practice. An explanation for the superiority of NNs is that they can realize a large family of complicated functions, i.e., they have powerful expressivity. The expressivity of a Neural Network with Piecewise Linear activations (PLNN) can be quantified by the maximal number of linear regions it can separate its input space into. In this paper, we provide several mathematical results needed for studying the linear regions of Convolutional Neural Networks with Piecewise Linear activations (PLCNNs), and use them to derive the maximal and average numbers of linear regions for one-layer PLCNNs. Furthermore, we obtain upper and lower bounds for the number of linear regions of multi-layer PLCNNs. Our results suggest that deeper PLCNNs have more powerful expressivity than shallow PLCNNs, while PLCNNs have more expressivity than fully-connected PLNNs per parameter, in terms of the number of linear regions.
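
To make the notion of linear regions concrete, here is a minimal sketch for the simplest possible case: a one-layer ReLU network with a scalar input, which is far simpler than the convolutional setting studied in the paper. It counts distinct activation patterns, which is an assumption about how regions are identified; the function names are hypothetical.

```python
def activation_pattern(x, weights, biases):
    """Which ReLU units are active at input x (unit j computes w_j * x + b_j)."""
    return tuple(w * x + b > 0 for w, b in zip(weights, biases))


def count_linear_regions_1d(weights, biases):
    """Count activation regions of a one-layer ReLU net on the real line."""
    # A unit switches on or off where w * x + b = 0, i.e. at x = -b / w.
    breakpoints = sorted({-b / w for w, b in zip(weights, biases) if w != 0})
    if not breakpoints:
        return 1
    # Probe one point strictly inside every interval between breakpoints.
    probes = [breakpoints[0] - 1.0]
    probes += [(a + c) / 2 for a, c in zip(breakpoints, breakpoints[1:])]
    probes.append(breakpoints[-1] + 1.0)
    return len({activation_pattern(x, weights, biases) for x in probes})


# Three units with distinct breakpoints at x = 0, 1 and 1.5.
regions = count_linear_regions_1d([1.0, -1.0, 2.0], [0.0, 1.0, -3.0])
```

With n units and n distinct breakpoints, the real line splits into n + 1 pieces, so the example yields four regions. In higher dimensions the count grows with the arrangement of hyperplanes rather than points.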

Abstract:
The perception of drones, also known as Unmanned Aerial Vehicles (UAVs), particularly in infrared videos, is crucial for effective anti-UAV tasks. However, existing datasets for UAV tracking have limitations in terms of target size and attribute distribution characteristics, which do not fully represent complex realistic scenes. To address this issue, we introduce a generalized infrared UAV tracking benchmark called Anti-UAV410. The benchmark comprises a total of 410 videos with over 438 K manually annotated bounding boxes. To tackle the challenges of UAV tracking in complex environments, we propose a novel method called Siamese drone tracker (SiamDT). SiamDT incorporates a dual-semantic feature extraction mechanism that explicitly models targets in dynamic background clutter, enabling effective tracking of small UAVs. The SiamDT method consists of three key steps: Dual-Semantic RPN Proposals (DS-RPN), Versatile R-CNN (VR-CNN), and Background Distractors Suppression. These steps are responsible for generating candidate proposals, refining prediction scores based on dual-semantic features, and enhancing the discriminative capacity of the trackers against dynamic background clutter, respectively. Extensive experiments conducted on the Anti-UAV410 dataset and three other large-scale benchmarks demonstrate the superior performance of the proposed SiamDT method compared to recent state-of-the-art trackers.

Abstract:
With the increasing attention in various 3D safety-critical applications, point cloud learning models have been shown to be vulnerable to adversarial attacks. Although existing 3D attack methods achieve high success rates, they delve into the data space with point-wise perturbation, which may neglect the geometric characteristics. Instead, we propose point cloud attacks from a new perspective—the graph spectral domain attack, aiming to perturb graph transform coefficients in the spectral domain that correspond to varying certain geometric structures. Specifically, leveraging on graph signal processing, we first adaptively transform the coordinates of points onto the spectral domain via graph Fourier transform (GFT) for compact representation. Then, we analyze the influence of different spectral bands on the geometric structure, based on which we propose to perturb the GFT coefficients via a learnable graph spectral filter. Considering the low-frequency components mainly contribute to the rough shape of the 3D object, we further introduce a low-frequency constraint to limit perturbations within imperceptible high-frequency components. Finally, the adversarial point cloud is generated by transforming the perturbed spectral representation back to the data domain via the inverse GFT. Experimental results demonstrate the effectiveness of the proposed attack in terms of both the imperceptibility and attack success rates.

Abstract:
Various correlations hidden in crowdsourcing annotation tasks bring opportunities to further improve the accuracy of label aggregation. However, these relationships are usually extremely difficult to model. Most existing methods can merely make use of one or two correlations. In this paper, we propose a novel graph neural network model, namely LAGNN, which models five different correlations in crowdsourced annotation tasks by utilizing deep graph neural networks with convolution operations and achieves high label aggregation performance. Utilizing the group of high-quality workers through labeling similarity, LAGNN can efficiently revise the preference among workers. Moreover, by injecting a small amount of ground truth in its training stage, the label aggregation performance of LAGNN can be further significantly improved. We evaluate LAGNN on a large number of simulated datasets generated through varying six degrees of freedom and on eight real-world crowdsourcing datasets in both supervised and unsupervised (agnostic) modes. Experiments on data leakage are also included. Experimental results consistently show that the proposed LAGNN significantly outperforms six state-of-the-art models in terms of label aggregation accuracy.

Abstract:
Most artificial neural networks used for object recognition are trained in a fully supervised setup. This is not only resource consuming, as it requires large data sets of labeled examples, but also quite different from how humans learn. We use a setup in which an artificial agent first learns in a simulated world through self-supervised, curiosity-driven exploration. Following this initial learning phase, the learned representations can be used to quickly associate semantic concepts such as different types of doors using one or more labeled examples. To do this, we use a method we call fast concept mapping which uses correlated firing patterns of neurons to define and detect semantic concepts. This association works instantaneously with very few labeled examples, similar to what we observe in humans in a phenomenon called fast mapping. Strikingly, we can already identify objects with as few as one labeled example, which highlights the quality of the encoding learned in a self-supervised manner through interaction with the world. It therefore presents a feasible strategy for learning concepts without much supervision and shows that through pure interaction meaningful representations of an environment can be learned that work better for few-shot learning than non-interactive methods.

Abstract:
Inductive bias in machine learning (ML) is the set of assumptions describing how a model makes predictions. Different ML-based methods for protein-ligand binding affinity (PLA) prediction have different inductive biases, leading to different levels of generalization capability and interpretability. Intuitively, the inductive bias of an ML-based model for PLA prediction should fit in with biological mechanisms relevant for binding to achieve good predictions with meaningful reasons. To this end, we propose an interaction-based inductive bias to restrict neural networks to functions relevant for binding with two assumptions: 1) A protein-ligand complex can be naturally expressed as a heterogeneous graph with covalent and non-covalent interactions; 2) The predicted PLA is the sum of pairwise atom-atom affinities determined by non-covalent interactions. The interaction-based inductive bias is embodied by an explainable heterogeneous interaction graph neural network (EHIGN) for explicitly modeling pairwise atom-atom interactions to predict PLA from 3D structures. Extensive experiments demonstrate that EHIGN achieves better generalization capability than other state-of-the-art ML-based baselines in PLA prediction and structure-based virtual screening. More importantly, comprehensive analyses of distance-affinity, pose-affinity, and substructure-affinity relations suggest that the interaction-based inductive bias can guide the model to learn atomic interactions that are consistent with physical reality. As a case study to demonstrate practical usefulness, our method is tested for predicting the efficacy of Nirmatrelvir against SARS-CoV-2 variants. EHIGN successfully recognizes the changes in the efficacy of Nirmatrelvir for different SARS-CoV-2 variants with meaningful reasons.

Abstract:
In this paper, we address panoramic semantic segmentation which is under-explored due to two critical challenges: (1) image distortions and object deformations on panoramas; (2) lack of semantic annotations in the 360° imagery. To tackle these problems, first, we propose the upgraded Transformer for Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with Deformable Patch Embedding (DPE) and Deformable MLP (DMLPv2) modules for handling object deformations and image distortions whenever (before or after adaptation) and wherever (shallow or deep levels). Second, we enhance the Mutual Prototypical Adaptation (MPA) strategy via pseudo-label rectification for unsupervised domain adaptive panoramic segmentation. Third, aside from Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images, facilitating Synthetic-to-Real (Syn2Real) adaptation scheme in 360° imagery. Extensive experiments are conducted, which cover indoor and outdoor scenarios, and each of them is investigated with Pin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art performances on four domain adaptive panoramic semantic segmentation benchmarks.

Abstract:
Learning signed distance functions (SDFs) from point clouds is an important task in 3D computer vision. However, without ground truth signed distances, point normals or clean point clouds, current methods still struggle to learn SDFs from noisy point clouds. To overcome this challenge, we propose to learn SDFs via a noise to noise mapping, which does not require any clean point cloud or ground truth supervision. Our novelty lies in the noise to noise mapping which can infer a highly accurate SDF of a single object or scene from its multiple or even single noisy observations. We achieve this by a novel loss which enables statistical reasoning on point clouds and maintains geometric consistency although point clouds are irregular, unordered and have no point correspondence among noisy observations. To accelerate training, we use multi-resolution hash encodings implemented in CUDA in our framework, which reduces our training time by a factor of ten, achieving convergence within one minute. We further introduce a novel schema to improve multi-view reconstruction by estimating SDFs as a prior. Our evaluations under widely-used benchmarks demonstrate our superiority over the state-of-the-art methods in surface reconstruction from point clouds or multi-view images, point cloud denoising and upsampling.

Abstract:
Graph Neural Networks (GNNs) have gained momentum in graph representation learning and boosted the state of the art in a variety of areas, such as data mining (e.g., social network analysis and recommender systems), computer vision (e.g., object detection and point cloud learning), and natural language processing (e.g., relation extraction and sequence learning), to name a few. With the emergence of Transformers in natural language processing and computer vision, graph Transformers embed a graph structure into the Transformer architecture to overcome the limitations of local neighborhood aggregation while avoiding strict structural inductive biases. In this paper, we present a comprehensive review of GNNs and graph Transformers in computer vision from a task-oriented perspective. Specifically, we divide their applications in computer vision into five categories according to the modality of input data, i.e., 2D natural images, videos, 3D data, vision + language, and medical images. In each category, we further divide the applications according to a set of vision tasks. Such a task-oriented taxonomy allows us to examine how each task is tackled by different GNN-based approaches and how well these approaches perform. Based on the necessary preliminaries, we provide the definitions and challenges of the tasks, in-depth coverage of the representative approaches, as well as discussions regarding insights, limitations, and future directions.

Abstract:
The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g. clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting score, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
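
One way a softmax-based approach can turn a language model's output into a quantifiable rating is sketched below. This is an illustration under stated assumptions, not the benchmark's exact procedure: it assumes the model exposes next-token logits for two opposing answer words (here called "good" and "poor", a hypothetical choice), and the function name is invented for this example.

```python
import math


def softmax_quality_score(logit_good, logit_poor):
    """Map two answer-token logits to a score in (0, 1).

    The score is the softmax probability of the positive token when only
    the two candidate tokens are considered.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


# Hypothetical logits read off the model after a prompt such as
# "Rate the quality of the image. The quality of the image is".
score = softmax_quality_score(2.3, 0.4)
```

Because the score is continuous, it can be compared against human opinion scores with a rank or linear correlation, whereas a plain text answer could not.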

Abstract:
Time series are the primary data type used to record dynamic system measurements and are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With the recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners to understand, build applications, and advance research of GNN4TS. First, we provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting foundations, practical applications, and opportunities of graph neural networks for time series analysis.

Abstract:
Point cloud registration is challenging in the presence of heavy outlier correspondences. This paper focuses on addressing the robust correspondence-based registration problem with gravity prior that often arises in practice. The gravity directions are typically obtained by inertial measurement units (IMUs) and can reduce the degree of freedom (DOF) of rotation from 3 to 1. We propose a novel transformation decoupling strategy by leveraging the screw theory. This strategy decomposes the original 4-DOF problem into three sub-problems with 1-DOF, 2-DOF, and 1-DOF, respectively, enhancing computation efficiency. Specifically, the first 1-DOF represents the translation along the rotation axis, and we propose an interval stabbing-based method to solve it. The second 2-DOF represents the pole which is an auxiliary variable in screw theory, and we utilize a branch-and-bound method to solve it. The last 1-DOF represents the rotation angle, and we propose a global voting method for its estimation. The proposed method solves three consensus maximization sub-problems sequentially, leading to efficient and deterministic registration. In particular, it can even handle the correspondence-free registration problem due to its significant robustness. Extensive experiments on both synthetic and real-world datasets demonstrate that our method is more efficient and robust than state-of-the-art methods, even when dealing with outlier rates exceeding 99%.
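
The 1-DOF sub-problem above can be illustrated with the classic interval stabbing procedure: given one interval of admissible values per correspondence, find a value contained in the largest number of intervals. The sketch below shows only this generic step. How the paper derives an interval from each correspondence is not stated in the abstract, so the input intervals and the function name are assumptions.

```python
def interval_stabbing(intervals):
    """Return (point, count) for a point lying in the most closed intervals."""
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))  # 0 = interval opens; sorts before a close
        events.append((hi, 1))  # 1 = interval closes
    events.sort()

    best_point, best_count, count = None, 0, 0
    for x, kind in events:
        if kind == 0:
            count += 1
            if count > best_count:
                best_point, best_count = x, count
        else:
            count -= 1
    return best_point, best_count


# Hypothetical intervals: each candidate correspondence constrains the
# translation along the rotation axis to a small range; the last one is
# an outlier that agrees with nothing else.
candidates = [(0.0, 1.0), (0.5, 2.0), (0.8, 0.9), (3.0, 4.0)]
point, support = interval_stabbing(candidates)
```

Sorting dominates the cost, so the procedure runs in O(n log n) and is deterministic, which matches the efficiency argument made for solving the sub-problems sequentially.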

Abstract:
We present CO-Net++, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains with a two-stage feature rectification strategy. The core of CO-Net++ lies in optimizing task-shared parameters to capture universal features across various tasks while discerning task-specific parameters tailored to encapsulate the unique characteristics of each task. Specifically, CO-Net++ develops a two-stage feature rectification strategy (TFRS) that distinctly separates the optimization processes for task-shared and task-specific parameters. At the first stage, TFRS configures all parameters in backbone as task-shared, which encourages CO-Net++ to thoroughly assimilate universal attributes pertinent to all tasks. In addition, TFRS introduces a sign-based gradient surgery to facilitate the optimization of task-shared parameters, thus alleviating conflicting gradients induced by various dataset domains. In the second stage, TFRS freezes task-shared parameters and flexibly integrates task-specific parameters into the network for encoding specific characteristics of each dataset domain. CO-Net++ prominently mitigates conflicting optimization caused by parameter entanglement, ensuring the sufficient identification of universal and specific features. Extensive experiments reveal that CO-Net++ realizes exceptional performances on both 3D object detection and 3D semantic segmentation tasks. Moreover, CO-Net++ delivers an impressive incremental learning capability and prevents catastrophic amnesia when generalizing to new point cloud tasks.

Abstract:
We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method, aimed at accurately and efficiently creating a dynamic digital world, containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices and pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capturing in various environments. Because IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations, HiSC4D employs a joint optimization method, harmonizing all sensors and utilizing environment cues, yielding promising results for long-term capture in large scenes. To promote research of egocentric human interaction in large scenes and facilitate downstream tasks, we also present a dataset, containing 8 sequences in 4 large scenes (200 to 5,000 m²), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene mesh of the environment. A variety of scenarios, such as the basketball gym and commercial street, alongside challenging human motions, such as daily greeting, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and the generalization ability of HiSC4D. The dataset and code will be publicly available for research purposes.

Abstract:
Fusing features from different sources is a critical aspect of many computer vision tasks. Existing approaches can be roughly categorized as parameter-free or learnable operations. However, parameter-free modules are limited in their ability to benefit from offline learning, leading to poor performance in some challenging situations. Learnable fusing methods are often space-consuming and time-consuming, particularly when fusing features with different shapes. To address these shortcomings, we conducted an in-depth analysis of the limitations associated with both fusion methods. Based on our findings, we propose a generalized module named Asymmetric Convolution Module (ACM). This module can learn to encode effective priors during offline training and efficiently fuse feature maps with different shapes in specific tasks. Specifically, we propose a mathematically equivalent method for replacing costly convolutions on concatenated features. This method can be widely applied to fuse feature maps across different shapes. Furthermore, distinguished from parameter-free operations that can only fuse two features of the same type, our ACM is general, flexible, and can fuse multiple features of different types. To demonstrate the generality and efficiency of ACM, we integrate it into several state-of-the-art models on three representative vision tasks. Extensive experimental results on three tasks and several datasets demonstrate that our new module can bring significant improvements and noteworthy efficiency.

Abstract:
Model explainability is one of the crucial ingredients for building trustable AI systems, especially in applications requiring reliability such as automated driving and diagnosis. Many explainability methods have been studied in the literature. This article focuses on a research line that tries to visually explain a pre-trained image classification model, such as a Convolutional Neural Network, by discovering concepts learned by the model, the so-called concept-based explanation. Previous concept-based explanation methods rely on the human definition of concepts (e.g., the Broden dataset) or semantic segmentation techniques like SLIC (Simple Linear Iterative Clustering). However, we argue that the concepts identified by those methods may show image parts which are more in line with a human perspective or cropped by a segmentation method, rather than purely reflecting the model's own perspective. We propose Model-Oriented Concept Extraction (MOCE), a novel approach to extracting key concepts based solely on a model itself, thereby being able to capture its unique perspectives which are not affected by any external factors. Experimental results on various pre-trained models confirmed the advantages of extracting concepts by truly representing the model's point of view.

Abstract:
Integrating information from vision and language modalities has sparked interesting applications in the fields of computer vision and natural language processing. Existing methods, though promising in tasks like image captioning and visual question answering, face challenges in understanding real-life issues and offering step-by-step solutions. In particular, they typically limit their scope to solutions with a sequential structure, thus ignoring complex inter-step dependencies. To bridge this gap, we propose a graph-based approach to vision-language problem solving. It leverages a novel integrated attention mechanism that jointly considers the importance of features within each step as well as across multiple steps. Together with a graph neural network method, this attention mechanism can be progressively learned to predict sequential and non-sequential solution graphs depending on the characterization of the problem-solving process. To tightly couple attention with the problem-solving procedure, we further design new learning objectives with attention metrics that quantify this integrated attention, which better aligns visual and language information within steps, and more accurately captures information flow between steps. Experimental results on VisualHow, a comprehensive dataset of varying solution structures, show significant improvements in predicting steps and dependencies, demonstrating the effectiveness of our approach in tackling various vision-language problems.

Abstract:
As a result of Shadow NeRF and Sat-NeRF, it is possible to take the solar angle into account in a NeRF-based framework for rendering a scene from a novel viewpoint using satellite images for training. Our work extends those contributions and shows how one can make the renderings season-specific. Our main challenge was creating a Neural Radiance Field (NeRF) that could render seasonal features independently of viewing angle and solar angle while still being able to render shadows. We teach our network to render seasonal features by introducing one more input variable — time of the year. However, the small training datasets typical of satellite imagery can introduce ambiguities in cases where shadows are present in the same location for every image of a particular season. We add additional terms to the loss function to discourage the network from using seasonal features for accounting for shadows. We show the performance of our network on eight Areas of Interest containing images captured by the Maxar WorldView-3 satellite. This evaluation includes tests measuring the ability of our framework to accurately render novel views, generate height maps, predict shadows, and specify seasonal features independently from shadows. Our ablation studies justify the choices made for network design parameters.

Abstract:
In this work, we tackle the task of estimating the 6D pose of an object from point cloud data. While recent learning-based approaches have shown remarkable success on synthetic datasets, we have observed them to fail in the presence of real-world data. We investigate the root causes of these failures and identify two main challenges: the sensitivity of the widely used SVD-based loss function to the range of rotation between the two point clouds, and the difference in feature distributions between the source and target point clouds. We address the first challenge by introducing a directly supervised loss function that does not utilize the SVD operation. To tackle the second, we introduce a new normalization strategy, Match Normalization. Our two contributions are general and can be applied to many existing learning-based 3D object registration frameworks, which we illustrate by implementing them in two such frameworks, DCP and IDAM. Our experiments on the real-scene TUD-L (Hodan et al. 2018), LINEMOD (Hinterstoisser et al. 2012), and Occluded-LINEMOD (Brachmann et al. 2014) datasets evidence the benefits of our strategies. They allow, for the first time, learning-based 3D object registration methods to achieve meaningful results on real-world data. We therefore expect them to be key to the future development of point cloud registration methods.

Abstract:
A challenge of channel pruning is designing efficient and effective criteria to select channels to prune. A widely used criterion is minimal performance degeneration, e.g., loss changes before and after pruning being the smallest. Accurately evaluating the true performance degeneration requires retraining the surviving weights to convergence, which is prohibitively slow. Hence, existing pruning methods settle for using the previous weights (without retraining) to evaluate the performance degeneration. However, we observe that the loss changes differ significantly with and without retraining. This motivates us to develop a technique that evaluates true loss changes without retraining, and to use it to select channels to prune with more reliability and confidence. We first derive a closed-form estimator of the true loss change per mask change, using influence functions without retraining. The influence function is a classic technique from robust statistics that reveals the impact of a training sample on the model's prediction; we repurpose it to assess impacts on true loss changes. We then show how to assess the importance of all channels simultaneously and develop a novel global channel pruning algorithm accordingly. We conduct extensive experiments to verify the effectiveness of the proposed algorithm, which significantly outperforms the competing channel pruning methods on both image classification and object detection tasks. To the best of our knowledge, we are the first to show that evaluating true loss changes for pruning without retraining is possible. This finding will open up opportunities for a series of new paradigms to emerge that differ from existing pruning methods.

Abstract:
We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, and that can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, it ranks first in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.

Abstract:
Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV-v2, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) some nuisance factors have a much stronger negative effect on performance than others, also depending on the vision task; 2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; 3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area.

Abstract:
In recent years, deep learning has shown potential and efficiency in a wide range of areas, including computer vision and image and signal processing. Yet translational challenges remain for user applications due to a lack of interpretability of algorithmic decisions and results. This black-box problem is particularly problematic for high-risk applications such as medical decision-making. The goal of the current study was to design an interpretable deep learning system for time series classification of electroencephalogram (EEG) signals for sleep stage scoring, as a step toward designing a transparent system. We have developed an interpretable deep neural network that includes a kernel-based layer guided by a set of principles used for sleep scoring by human experts in the visual analysis of polysomnographic records. A kernel-based convolutional layer was defined, used as the first layer of the system, and made available for user interpretation. The trained system and its results were interpreted at four levels, from the microstructure of EEG signals, such as trained kernels and the effect of each kernel on the detected stages, to macrostructures, such as transitions between stages. The proposed system demonstrated greater performance than prior studies, and the system learned information consistent with expert knowledge.

Abstract:
Random-walk-based network embedding algorithms like DeepWalk and node2vec are widely used to obtain Euclidean representations of the nodes in a network prior to performing downstream inference tasks. However, despite their impressive empirical performance, there is a lack of theoretical results explaining their large-sample behavior. In this paper, we study node2vec and DeepWalk through the perspective of matrix factorization. In particular, we analyze these algorithms in the setting of community detection for stochastic blockmodel graphs (and their degree-corrected variants). By exploiting the row-wise uniform perturbation bound for leading singular vectors, we derive high-probability error bounds between the matrix factorization-based node2vec/DeepWalk embeddings and their true counterparts, uniformly over all node embeddings. Based on strong concentration results, we further show perfect membership recovery by node2vec/DeepWalk followed by K-means/medians algorithms. Specifically, as the network becomes sparser, our results guarantee that with large enough window size and vertex number, applying K-means/medians on the matrix factorization-based node2vec embeddings can, with high probability, correctly recover the memberships of all vertices in a network generated from the stochastic blockmodel (or its degree-corrected variants). The theoretical justifications are mirrored in the numerical experiments and real data applications, for both the original node2vec and its matrix factorization variant.
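
The pipeline described in this abstract (matrix-factorization embedding followed by K-means on a stochastic blockmodel graph) can be illustrated with a small sketch. The abstract does not give the factorized matrix, so the code assumes the commonly cited DeepWalk closed form, log(vol(G)/(bT) · Σ_{r=1..T} (D⁻¹A)^r D⁻¹), truncated elementwise at 1 before the logarithm. The graph size, edge probabilities, window size, function names, and the plain two-cluster Lloyd iteration are all illustrative choices, not the paper's settings or implementation.

```python
import numpy as np

def deepwalk_matrix(adj, window=3, neg=1.0):
    """Assumed DeepWalk factorization target: log(max(vol/(b*T) * sum_r P^r D^-1, 1))."""
    deg = adj.sum(axis=1)
    vol = deg.sum()
    trans = adj / deg[:, None]            # random-walk transition matrix D^-1 A
    power = np.eye(adj.shape[0])
    acc = np.zeros_like(trans)
    for _ in range(window):               # accumulate P^1 + ... + P^T
        power = power @ trans
        acc += power
    m = (vol / (neg * window)) * acc / deg[None, :]
    return np.log(np.maximum(m, 1.0))     # truncate so the log stays finite

def embed(mat, dim):
    """Embedding from the leading singular vectors, scaled by sqrt of singular values."""
    u, s, _ = np.linalg.svd(mat)
    return u[:, :dim] * np.sqrt(s[:dim])

def two_means(points, iters=50):
    """Plain Lloyd iterations for K = 2 with a deterministic farthest-point start."""
    c0 = points[0]
    c1 = points[np.argmax(np.linalg.norm(points - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels

# Two-block stochastic blockmodel with illustrative edge probabilities.
rng = np.random.default_rng(0)
n = 120
truth = np.repeat([0, 1], n // 2)
prob = np.where(truth[:, None] == truth[None, :], 0.5, 0.05)
upper = np.triu(rng.random((n, n)) < prob, k=1)
adj = (upper | upper.T).astype(float)     # symmetric, no self-loops

labels = two_means(embed(deepwalk_matrix(adj), 2))
agree = np.mean(labels == truth)
accuracy = max(agree, 1 - agree)          # cluster labels are defined up to a swap
print(f"membership recovery accuracy: {accuracy:.3f}")
```

The blocks here are dense and well separated, so this only shows the mechanics. It does not exercise the sparse regime that the theoretical results address.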

Abstract:
The key challenges in cloud computing encompass dynamic resource scaling, load balancing, and power consumption. Accurate workload prediction is identified as a crucial strategy to address these challenges. Despite numerous methods proposed to tackle this issue, existing approaches fall short of capturing the high-variance nature of volatile and dynamic cloud workloads. To address this limitation, this paper presents a novel Multiple Controlled Toffoli-driven Adaptive Quantum Neural Network (MCT-AQNN) model that establishes an empirical solution to complex, elastic, and challenging workload prediction problems by optimizing the exploration, adaptation, and exploitation proficiencies through quantum learning. The computational adaptability of quantum computing is ingrained with machine learning algorithms to derive more precise correlations from dynamic and complex workloads. The furnished input data point and hatched neural weights are refitted in the form of qubits, while the controlling effects of Multiple Controlled Toffoli (MCT) gates are operated at the hidden and output layers of the Quantum Neural Network (QNN) to enhance learning capabilities. Complementarily, a Uniformly Adaptive Quantum Machine Learning (UAQL) algorithm is developed to train the QNN effectively. Extensive experiments are conducted, and comparisons are performed with state-of-the-art methods using four real-world benchmark datasets. Experimental results evince that MCT-AQNN has up to 32%–96% higher accuracy than the existing approaches.

Abstract:
Bharadwaj et al. (2023) present a comments paper evaluating the classification accuracy of several state-of-the-art methods using EEG data averaged over random class samples. According to the results, some of the methods achieve above-chance accuracy, while the method proposed in (Palazzo et al. 2020), which is the target of their analysis, does not. In this rebuttal, we address these claims and explain why they are not grounded in the cognitive neuroscience literature, and why the evaluation procedure is ineffective and unfair.

Abstract:
Explosive volcanic blasts can occur suddenly and without any clear precursors. Many volcanoes have erupted in recent years with no evident change in the eruptive parameters and with dramatic consequences for the population living near the volcano and the tourists visiting the active areas. In recent years, a major effort has been made to develop Early Warning systems that issue timely alerts to the population. At Stromboli volcano, the development of sensitive instruments to measure the deformation (tilt) of the ground has revealed that the volcano edifice inflates tens of minutes before an explosion, following a recurrent exponential ramp-like pattern. This scale-invariant pattern of ground deformation has allowed the development of a quasi-deterministic Early Warning system, which has been operational since 2019. In this article, we show how Artificial Intelligence and Machine Learning can be successfully applied to improve the efficiency and the sensitivity of Early Warning systems, provided that a comprehensive experimental data set on past explosive events is available. The approach presented here for the Stromboli case also demonstrates promising results in forecasting the intensity of explosive events, offering valuable insights and new perspectives into the potential risks associated with volcanic activity.

Abstract:
When the locations of non-zero samples are known, the Moore-Penrose inverse (MPI) can be used for the data recovery of compressive sensing (CS). First, the prior from the locations is used to shrink the measurement matrix in CS. Then the data can be recovered by applying the MPI to the shrunken matrix. We also prove that the results of data recovery from the original CS and from our MPI-based method are mathematically the same. Based on this finding, a novel sidelobe-reduction method for synthetic aperture radar (SAR) and polarimetric SAR (POLSAR) images is studied. The aim of sidelobe reduction is to recover the samples within the mainlobes and suppress the ones within the sidelobes. In our study, the prior from spatially variant apodization (SVA) is used to determine the locations of the mainlobes and the sidelobes. With CS, the mainlobe area can be well recovered. Samples within the sidelobe areas are also recovered using background fusion. Our method is suitable for large acquired data sets. The performance of the proposed algorithm is evaluated with acquired space-borne SAR and air-borne POLSAR data. In our experiments, we use 1 m space-borne SAR data with a size of 10000 (samples) × 10000 (samples) and 0.3 m POLSAR data with a size of 10000 (samples) × 26000 (samples) for sidelobe suppression. Furthermore, we verified that our method does not affect the polarization signatures. The effectiveness of the sidelobe suppression is qualitatively examined, and the results were satisfactory.
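
The first two steps of this abstract (shrinking the measurement matrix with the known locations, then applying the Moore-Penrose inverse) can be sketched on a toy problem. The Gaussian measurement matrix, the problem sizes, the support indices, and the function name `recover_with_known_support` are illustrative assumptions. The sketch does not model SAR data, SVA, or background fusion.

```python
import numpy as np

def recover_with_known_support(phi, y, support):
    """Recover x from y = phi @ x when the indices of the non-zero samples are known."""
    phi_s = phi[:, support]                      # shrink: keep only the support columns
    x_hat = np.zeros(phi.shape[1], dtype=np.result_type(phi, y))
    x_hat[support] = np.linalg.pinv(phi_s) @ y   # Moore-Penrose inverse of the shrunken matrix
    return x_hat

# Toy problem: 50 unknowns, 20 measurements, 5 non-zero samples at known locations.
rng = np.random.default_rng(0)
n, m = 50, 20
support = np.array([3, 11, 19, 27, 42])
x_true = np.zeros(n)
x_true[support] = rng.normal(size=support.size)
phi = rng.normal(size=(m, n))                    # hypothetical measurement matrix
y = phi @ x_true                                 # noiseless compressed measurements

x_hat = recover_with_known_support(phi, y, support)
print(np.allclose(x_hat, x_true))
```

In this noiseless setting, recovery is exact up to floating-point error whenever the shrunken matrix has full column rank, which requires at least as many measurements as non-zero samples.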