TPAMI2025

Abstract:
Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that will affect the decision of the best-focused frame in a focal stack. We then explore bridging this domain gap through aberration-aware training (AAT). Our approach involves a lightweight network that models lens aberrations at different positions and focus distances, which is then integrated into the conventional network training pipeline. We evaluate the generality of network models on both synthetic and real-world data. The experimental results demonstrate that the proposed AAT scheme can improve depth estimation accuracy without fine-tuning the model for different datasets.

Abstract:
Existing clustering ensemble methods typically fuse all base clusterings in one shot under unsupervised settings, making it difficult to distinguish the quality of individual base clusterings and to exploit latent prior knowledge; consequently, their adaptability to data distributions and overall performance are limited. To address these issues, this paper proposes the Self-Constrained Clustering Ensemble (SCCE) algorithm. SCCE treats the pseudo-labels automatically generated from current clustering results as self-supervised signals and performs metric learning to obtain a linear transformation that enlarges inter-class distances while compressing intra-class distances. The base clusterings are then re-clustered in the new metric space to enhance separability and consistency. Afterward, ensemble updating is iteratively applied, forming a self-driven closed loop that continuously improves model performance. Theoretical analysis shows that the model converges efficiently via alternating optimization, with computational complexity on the same order as mainstream methods. Experiments on public datasets demonstrate that the proposed algorithm significantly outperforms representative clustering ensemble approaches, validating its effectiveness and robustness in scenarios lacking external supervision.

Abstract:
Learning-based image reconstruction models, such as those based on the U-Net, require a large set of labeled images if good generalization is to be guaranteed. In some imaging domains, however, labeled data with pixel- or voxel-level label accuracy are scarce due to the cost of acquiring them. This problem is exacerbated further in domains like medical imaging, where there is no single ground truth label, resulting in large amounts of repeat variability in the labels. Therefore, training reconstruction networks to generalize better by learning from both labeled and unlabeled examples (called semi-supervised learning) is a problem of practical and theoretical interest. However, traditional semi-supervised learning methods for image reconstruction often necessitate handcrafting a differentiable regularizer specific to some given imaging problem, which can be extremely time-consuming. In this work, we propose “supervision by denoising” (SUD), a framework to supervise reconstruction models using their own denoised output as labels. SUD unifies stochastic averaging and spatial denoising techniques under a spatio-temporal denoising framework and alternates denoising and model weight update steps in an optimization framework for semi-supervision. As example applications, we apply SUD to two problems from biomedical imaging, anatomical brain reconstruction (3D) and cortical parcellation (2D), to demonstrate a significant improvement in reconstruction over supervised-only and ensembling baselines.

Abstract:
The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as “prompt modules” and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has good few-shot learning ability, and (iii) is domain-generalizable.

Abstract:
Transformer-based models have recently shown success in representation learning on graph-structured data beyond natural language processing and computer vision. However, the success is limited to small-scale graphs due to the drawbacks of full dot-product attention on graphs, such as the quadratic complexity with respect to the number of nodes and message aggregation from numerous irrelevant nodes. To address these issues, we propose Deformable Graph Transformer (DGT), which performs sparse attention via dynamically selected relevant nodes to handle large-scale graphs efficiently with linear complexity in the number of nodes. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity. Then, combined with our learnable Katz Positional Encodings, the sparse attention is applied to the node sequences for learning node representations with a significantly reduced computational cost. Extensive experiments demonstrate that our DGT achieves superior performance on 7 graph benchmark datasets with 2.5 to 449 times less computational cost compared to transformer-based graph models with full attention.
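
As background for the Katz Positional Encodings mentioned above, the following is a minimal sketch of the classical Katz index, truncated to a fixed number of hops, on a small adjacency matrix. It does not reproduce DGT's learnable encodings; the helper name `truncated_katz`, the decay factor `beta`, and the cutoff `max_hops` are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_katz(adj, beta=0.5, max_hops=2):
    """Classical Katz index truncated at max_hops: sum_{k=1..max_hops} beta^k * A^k.

    Hypothetical helper for illustration only; not the paper's learnable encoding.
    """
    n = len(adj)
    katz = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds A^k, starting at k = 1
    weight = beta                    # holds beta^k
    for _ in range(max_hops):
        for i in range(n):
            for j in range(n):
                katz[i][j] += weight * power[i][j]
        power = matmul(power, adj)
        weight *= beta
    return katz


# Path graph 0 - 1 - 2: node 1 is adjacent to both ends.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = truncated_katz(path, beta=0.5, max_hops=2)
```

In this toy graph, directly connected nodes receive a larger score than nodes two hops apart, which is the kind of structural proximity signal a positional encoding can build on.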

Abstract:
Due to the wide existence of unlabeled graph-structured data (e.g., molecular structures), graph-level clustering, whose goal is to divide the input graphs into several disjoint groups, has recently attracted increasing attention. However, existing methods habitually focus on learning graph embeddings with different graph regularizations, and seldom address the obvious differences in the data distributions of distinct graph-level datasets. How to handle multiple graph-level datasets in a general, well-designed model without prior knowledge is still challenging. In view of this, we propose a novel Graph Prompt Clustering (GPC) method. The model has two main modules: graph model pretraining, and prompt and finetuning. In the graph model pretraining module, the graph model is pretrained on a selected source graph-level dataset with mutual information maximization and self-supervised clustering regularization. In the prompt and finetuning module, the network parameters of the pretrained graph model are frozen, and a group of learnable prompt vectors assigned to each graph-level representation is trained to adapt to different target graph-level datasets with various data distributions. Experimental results across six benchmark datasets demonstrate the impressive generalization capability and effectiveness of GPC compared with the state-of-the-art methods.

Abstract:
Evaluating the performance of low-light image enhancement (LLE) is highly subjective, thus making integrating human preferences into LLE a necessity. Existing methods fail to consider this and present a series of potentially valid heuristic criteria for training LLE models. In this paper, we propose a new paradigm, i.e., aesthetics-guided low-light image enhancement (ALL-E), which introduces aesthetic preferences to LLE and motivates training in a reinforcement learning framework with an aesthetic reward. Each pixel, functioning as an agent, refines itself by recursive actions. We further present ALL-E+, an extended version of ALL-E, which casts a two-stage aesthetics-guided enhancement and denoising. ALL-E+ achieves low-light enhancement and denoising compensation sequentially in a unified framework, resulting in significant improvements in both subjective visual experience and objective evaluation. Extensive experiments show that integrating aesthetic preferences can further improve the visual experience of enhanced images. Our results on various benchmarks also demonstrate the superiority of our method over state-of-the-art methods.

Abstract:
Zeroth-order optimization algorithms have recently emerged as a popular research theme in optimization and machine learning, playing important roles in many deep-learning-related tasks such as black-box adversarial attack, deep reinforcement learning, and hyper-parameter tuning. Mainstream zeroth-order optimization algorithms, however, concentrate on exploiting zeroth-order-estimated first-order gradient information of the objective landscape. In this paper, we propose a novel meta-algorithm called Hessian-Aware Zeroth-Order (ZOHA) optimization, which utilizes several canonical variants of zeroth-order-estimated second-order Hessian information of the objective, including power-method-based and Gaussian-smoothing-based variants. We show theoretically that ZOHA enjoys an improved convergence rate compared with existing work that does not incorporate second-order Hessian information into zeroth-order optimization. Empirical studies on logistic regression and black-box adversarial attack validate the effectiveness of ZOHA and its improved success rates with reduced query complexity of the zeroth-order oracle.
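
To make the notion of a zeroth-order-estimated gradient concrete, here is a minimal sketch of the standard two-point Gaussian-smoothing estimator, which uses only function evaluations. It illustrates the first-order baseline that the abstract contrasts against, not ZOHA's Hessian estimation; the function name `zo_gradient` and the parameters `mu`, `num_samples`, and `seed` are assumptions made for this example.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_samples=20000, seed=0):
    """Two-point Gaussian-smoothing estimate of the gradient of f at x.

    Only values of f are queried; no analytic derivative is used.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference slope of f along the random direction u.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def quadratic(x):
    """Test objective with known gradient 2 * x."""
    return sum(xi * xi for xi in x)


estimate = zo_gradient(quadratic, [1.0, -2.0])
```

The estimate is noisy and approaches the true gradient `[2.0, -4.0]` only as the number of queries grows, which is why query complexity is a central concern for this family of methods.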

Abstract:
Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide marginal gains in condition-level adaptation, being outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization in performing feature-level adaptation by aligning the internal network features extracted by the encoder of the network from the original and the stylized view of each input image with a novel feature invariance loss. In this way, we encourage the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features and not on further abstracting from the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve outstanding results on condition-level adaptation. In particular, CISS sets the new state of the art in the popular daytime-to-nighttime Cityscapes → Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes → ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night.

Abstract:
Machine unlearning (MU) aims to eliminate information that has been learned from specific training data, namely forgetting data, from a pretrained model. Currently, the mainstream of relabeling-based MU methods involves modifying the forgetting data with incorrect labels and subsequently fine-tuning the model. While learning such incorrect information can indeed remove knowledge, the process is quite unnatural as the unlearning process undesirably reinforces the incorrect information and leads to over-forgetting. Towards more natural machine unlearning, we inject correct information from the remaining data to the forgetting samples when changing their labels. Through pairing these adjusted samples with their labels, the model tends to use the injected correct information and naturally suppresses the information meant to be forgotten. Albeit straightforward, such a first step towards natural machine unlearning can significantly outperform current state-of-the-art approaches. In particular, our method substantially reduces the over-forgetting problem and leads to strong robustness across different unlearning tasks, making it a promising candidate for practical machine unlearning.

Abstract:
Manifold learning and K-means are two powerful techniques for data analysis in the field of artificial intelligence. When used for label learning, a promising strategy is to combine them directly and optimize both models simultaneously. However, a significant drawback of this approach is that it represents a naive and crude integration, requiring the optimization of all variables in both models without achieving a truly essential combination. Additionally, it introduces an extra hyperparameter and cannot ensure cluster balance. These challenges motivate us to explore whether a meaningful integration can be developed for dimensionality reduction clustering. In this paper, we propose a novel self-supervised manifold clustering framework that reformulates the two models into a unified framework, eliminating the need for additional hyperparameters while achieving dimensionality reduction clustering. Specifically, by analyzing the relationship between K-means and manifold learning, we construct a meaningful low-dimensional manifold clustering model that directly produces the label matrix of the data. The label information is then used to guide the learning of the manifold structure, ensuring consistency between the manifold structure and the labels. Notably, we identify a valuable role of $\ell_{2,p}$-norm regularization in clustering: maximizing the $\ell_{2,p}$-norm naturally maintains class balance during clustering, and we provide a theoretical proof of this property. Extensive experimental results demonstrate the efficiency of our proposed model.
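
The class-balance claim above can be illustrated with a small numerical sketch. It assumes the norm is taken over the columns of a one-hot label matrix and that 0 < p < 2; the paper's exact formulation may differ, and the helper names `l2p_power` and `one_hot` are hypothetical. Under these assumptions, a column for a cluster of size n_k has Euclidean norm sqrt(n_k), so the quantity becomes the sum of n_k^(p/2), which is concave in the cluster sizes and largest when they are equal.

```python
def l2p_power(label_matrix, p=1.0):
    """Return sum_k ||y_k||_2^p over the columns y_k of label_matrix.

    This is the l2,p-norm raised to the power p (column convention assumed).
    """
    num_cols = len(label_matrix[0])
    total = 0.0
    for k in range(num_cols):
        col_sq = sum(row[k] ** 2 for row in label_matrix)
        total += col_sq ** (p / 2.0)
    return total


def one_hot(assignments, num_clusters):
    """Build an n-by-c one-hot label matrix from cluster assignments."""
    return [[1 if a == k else 0 for k in range(num_clusters)]
            for a in assignments]


balanced = one_hot([0, 0, 1, 1, 2, 2], 3)    # cluster sizes 2, 2, 2
unbalanced = one_hot([0, 0, 0, 0, 1, 2], 3)  # cluster sizes 4, 1, 1

balanced_value = l2p_power(balanced, p=1.0)
unbalanced_value = l2p_power(unbalanced, p=1.0)
```

With six points and three clusters, the balanced assignment yields the larger value, so maximizing this quantity pushes the label matrix toward equal cluster sizes.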

Abstract:
Visual vibrometry is a highly useful tool for remote capture of audio, as well as the physical properties of materials, human heart rate, and more. While visually-observable vibrations can be captured directly with a high-speed camera, minute imperceptible object vibrations can be optically amplified by imaging the displacement of a speckle pattern created by shining a laser beam on the vibrating surface. In this paper, we propose a novel method for sensing vibrations at high speeds (up to 63 kHz), for multiple scene sources at once, using sensors rated for only 130 Hz operation. Our method relies on simultaneously capturing the scene with two cameras equipped with rolling and global shutter sensors, respectively. The rolling shutter camera captures distorted speckle images that encode the high-speed object vibrations. The global shutter camera captures undistorted reference images of the speckle pattern, helping to decode the source vibrations. We demonstrate our method by capturing vibration caused by audio sources (e.g., speakers, human voice, and musical instruments) and analyzing the vibration modes of a tuning fork.

Abstract:
Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines, then aggregates features along reference views to produce the color of a target ray. Additionally, we propose modifications that allow the model to generalize to scenes without any fine-tuning. Our model outperforms the state of the art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations.

Abstract:
In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach—representation ensembling—demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, Food1k, and CORe50), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget.

Abstract:
A network-based intrusion detection system (NIDS) monitors network traffic for malicious activities, forming the frontline defense against increasing attacks on information infrastructures. Although existing methods are promising, our quantitative analysis shows that they perform inconsistently across attacks (e.g., 18% F1 for MITM and 93% F1 for DDoS by a GCN-based state-of-the-art method) and perform poorly in few-shot intrusion detection (e.g., dropping dramatically from 91% to 36% in 3D-IDS and from 89% to 20% in E-GraphSAGE). We reveal that the underlying cause is entangled distributions of flow features. This motivates us to propose DIDS-MFL, a disentangled intrusion detection approach for various scenarios. DIDS-MFL involves two key components: a double Disentanglement-based Intrusion Detection System (DIDS) and a plug-and-play Multi-scale Few-shot Learning-based (MFL) intrusion detection module. Specifically, the proposed DIDS first disentangles traffic features by a non-parameterized optimization, automatically differentiating tens and hundreds of complex features. Such differentiated features will be further disentangled to highlight the attack-specific features. Our DIDS additionally uses a novel graph diffusion method that dynamically fuses the network topology for spatial-temporal aggregation in evolving data streams. Furthermore, the proposed MFL involves an alternating optimization framework to address the entangled representations in few-shot traffic threats with rigorous derivation. MFL first captures multi-scale information in latent space to distinguish attack-specific information and then optimizes the disentanglement term to highlight the attack-specific information. Finally, MFL fuses and alternately solves them in an end-to-end way. To the best of our knowledge, DIDS-MFL takes the first step toward disentangled dynamic intrusion detection under various attack scenarios.
Equipped with DIDS-MFL, administrators can effectively identify various attacks in encrypted traffic, including known, unknown, and few-shot threats that are not easily detected. Comprehensive experiments show the superiority of our proposed DIDS-MFL. For few-shot NIDS, our DIDS-MFL achieves a 71.91% to 125.19% improvement in average F1-score over 14 baselines and shows versatility in multiple baselines and multiple tasks.

Abstract:
Zero-shot image captioning can harness the knowledge of pre-trained visual language models (VLMs) and language models (LMs) to generate captions for target domain images without paired sample training. Existing methods attempt to establish high-quality connections between visual and textual modalities in text-only pre-training tasks. These methods can be divided into two perspectives: sentence-level and entity-level. Although they achieve effective performance on some metrics, they suffer from hallucinations due to biased associations during training. In this paper, we propose a scene-relation-level pre-training task by considering relations as more valuable modal connection bridges. Based on this, we construct a novel Visual-Language Scene Relation Aware Captioner (SRACap), which expands the ability to predict scene relations while generating captions for images. In addition, SRACap possesses excellent cross-domain zero-shot generalization capability, which is driven by a well-designed scene reinforcement switching pipeline. We introduce a scene policy network to dynamically crop salient regions from images and feed them into a language model to generate captions. We integrate multiple expert CLIP models to form a mixture-of-rewards (MoR) module as a reward source, and deeply optimize SRACap through the policy gradient algorithm in the zero-shot inference stage. With the iteration of scene reinforcement switching, SRACap can gradually refine the generated caption details while maintaining high semantic consistency across visual-linguistic modalities. We conduct extensive experiments on multiple standard image captioning benchmarks, showing that SRACap can accurately understand scene structures and generate high-quality text, significantly outperforming other zero-shot inference methods.

Abstract:
Hypergraph Neural Networks (HGNNs) have attracted much attention for high-order structural data learning. Existing methods mainly focus on simple mean-based aggregation or manually combining multiple aggregations to capture multiple information on hypergraphs. However, those methods inherently lack continuous non-linear modeling ability and are sensitive to varied distributions. Although some kernel-based aggregations on GNNs and CNNs can capture non-linear patterns to some degree, those methods are restricted in the low-order correlation and may cause unstable computation in training. In this work, we introduce Kernelized Hypergraph Neural Networks (KHGNN) and its variant, Half-Kernelized Hypergraph Neural Networks (H-KHGNN), which synergize mean-based and max-based aggregation functions to enhance representation learning on hypergraphs. KHGNN’s kernelized aggregation strategy adaptively captures both semantic and structural information via learnable parameters, offering a mathematically grounded blend of kernelized aggregation approaches for comprehensive feature extraction. H-KHGNN addresses the challenge of overfitting in less intricate hypergraphs by employing non-linear aggregation selectively in the vertex-to-hyperedge message-passing process, thus reducing model complexity. Our theoretical contributions reveal a bounded gradient for kernelized aggregation, ensuring stability during training and inference. Empirical results demonstrate that KHGNN and H-KHGNN outperform state-of-the-art models across 10 graph/hypergraph datasets, with ablation studies demonstrating the effectiveness and computational stability of our method.

Abstract:
We tackle the problem of bundle adjustment (i.e., simultaneous refinement of camera poses and scene map) for a purely rotating event camera. Starting from first principles, we formulate the problem as a classical non-linear least squares optimization. The photometric error is defined using the event generation model directly in the camera rotations and the semi-dense scene brightness that triggers the events. We leverage the sparsity of event data to design a tractable Levenberg-Marquardt solver that handles the very large number of variables involved. To the best of our knowledge, our method, which we call Event-based Photometric Bundle Adjustment (EPBA), is the first event-only photometric bundle adjustment method that works on the brightness map directly and exploits the space-time characteristics of event data, without having to convert events into image-like representations. Comprehensive experiments on both synthetic and real-world datasets demonstrate EPBA’s effectiveness in decreasing the photometric error (by up to 90%), yielding results of unparalleled quality. The refined maps reveal details that were hidden using prior state-of-the-art rotation-only estimation methods. The experiments on modern high-resolution event cameras show the applicability of EPBA to panoramic imaging in various scenarios (without map initialization, at multiple resolutions, and in combination with other methods, such as IMU dead reckoning or previous event-based rotation estimation methods). We make the source code publicly available.

Abstract:
In this paper, we propose a post-training pruning framework that jointly optimizes layerwise pruning to minimize model output distortion. Through theoretical and empirical analysis, we discover an important additivity property of output distortion from pruning weights/channels in DNNs. Leveraging this property, we reformulate pruning optimization as a combinatorial problem and solve it with dynamic programming, achieving linear time complexity and making the algorithm very fast on CPUs. Furthermore, we optimize additivity-derived distortions using Hessian-based Taylor approximation to enhance pruning efficiency, accompanied by fine-grained complexity reduction techniques. Our method is evaluated on various DNN architectures, including CNNs, ViTs, and object detectors, and on vision tasks such as image classification on CIFAR-10 and ImageNet and 3D object detection on various datasets. We achieve state-of-the-art results with significant FLOPs reductions without accuracy loss. Specifically, on CIFAR-10, we achieve up to 27.9×, 29.2×, and 14.9× FLOPs reductions on ResNet-32, VGG-16, and DenseNet-121, respectively. On ImageNet, we observe no accuracy loss with 1.69× and 2× FLOPs reductions on ResNet-50 and DeiT-Base, respectively. For 3D object detection, we achieve 3.89× and 3.72× FLOPs reductions on CenterPoint and PVRCNN models, respectively. These results demonstrate the effectiveness and practicality of our approach for improving model performance through layer-adaptive weight pruning.
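
To illustrate why additive distortion makes the layerwise allocation amenable to dynamic programming, the following is a minimal sketch of a multiple-choice-knapsack formulation. It is not the authors' algorithm and makes no claim about matching its complexity; the function `allocate_pruning`, the integer cost units, and the toy numbers are assumptions introduced for this example.

```python
import math


def allocate_pruning(layers, budget):
    """Choose one (cost, distortion) option per layer.

    Minimizes the sum of distortions subject to total cost <= budget.
    Summing per-layer distortions is valid only if distortion is additive.
    Returns (min_total_distortion, chosen_option_indices), or (inf, None)
    when no combination fits the budget.
    """
    # best[c] = (lowest distortion, option indices) reaching total cost c
    best = {0: (0.0, [])}
    for options in layers:
        nxt = {}
        for used, (dist, picks) in best.items():
            for idx, (cost, d) in enumerate(options):
                total = used + cost
                if total > budget:
                    continue
                cand = dist + d
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, picks + [idx])
        best = nxt
    if not best:
        return math.inf, None
    return min(best.values(), key=lambda item: item[0])


# Toy example: each layer offers unpruned, moderately pruned, heavily pruned.
toy_layers = [
    [(4, 0.0), (2, 0.1), (1, 0.5)],
    [(6, 0.0), (3, 0.2), (2, 0.9)],
]
total_distortion, choices = allocate_pruning(toy_layers, budget=6)
```

In the toy example, the moderate option is selected for both layers, since any combination that leaves a layer unpruned either exceeds the budget or forces heavy pruning of the other layer.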

Abstract:
Inverse problems in scientific imaging often seek physical characterization of heterogeneous scene materials. The scene is thus represented by physical quantities, such as the density and sizes of particles (microphysics) across a domain. Moreover, the forward image formation model is physical. An important case is that of clouds, where microphysics in three dimensions (3D) dictate the cloud dynamics, lifetime and albedo, with implications for Earth’s energy balance, sustainable energy and rainfall. Current methods, however, recover very degenerate representations of microphysics. To enable 3D volumetric recovery of all the required microphysical parameters, we introduce the neural microphysics field (NeMF). It is based on a deep neural network, whose input is multi-view polarization images. NeMF is pre-trained through supervised learning. Training relies on polarized radiative transfer and on noise modeling in polarization-sensitive sensors. The results offer unprecedented recovery, including droplet effective variance. We test NeMF in rigorous simulations and demonstrate it using real-world polarization-image data.

Abstract:
Using millimeter wave (mmWave) signals for imaging has an important advantage in that they can penetrate through poor environmental conditions such as fog, dust, and smoke that severely degrade optical-based imaging systems. However, mmWave radars, contrary to cameras and LiDARs, suffer from low angular resolution because of small physical apertures and conventional signal processing techniques. Sparse radar imaging, on the other hand, can increase the aperture size while minimizing power consumption and read-out bandwidth. This article presents CoIR, an analysis-by-synthesis method that leverages the implicit neural network bias in convolutional decoders and compressed sensing to perform high-accuracy sparse radar imaging. The proposed system is dataset-agnostic and does not require any auxiliary sensors for training or testing. We introduce a sparse array design that allows for a 5.5× reduction in the number of antenna elements needed compared to conventional MIMO array designs. We demonstrate our system's improved imaging performance over standard mmWave radars and other competitive untrained methods on both simulated and experimental mmWave radar data.

Abstract:
Optical blur is an inherent property of any lens system and is challenging to model in modern cameras because of their complex optical elements. To tackle this challenge, we introduce a high-dimensional neural representation of blur—the lens blur field—and a practical method for acquiring it. The lens blur field is a multilayer perceptron (MLP) designed to (1) accurately capture variations of the lens 2D point spread function over image plane location, focus setting and, optionally, depth and (2) represent these variations parametrically as a single, sensor-specific function. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses. To learn the real-world blur field of a given device, we formulate a generalized non-blind deconvolution problem that directly optimizes the MLP weights using a small set of focal stacks as the only input. We also provide a first-of-its-kind dataset of 5D blur fields—for smartphone cameras, camera bodies equipped with a variety of lenses, etc. Lastly, we show that acquired 5D blur fields are expressive and accurate enough to reveal, for the first time, differences in optical behavior of smartphone devices of the same make and model.

Abstract:
Taylor series expansion (TSE) is a mathematical theorem stating that, in most cases, the first few terms of a Taylor series are a good approximation of a nonlinear function. Inspired by the TSE theorem, a brand-new TSE-based vision transformer is designed. The TSE-based vision transformer uses the weight of a shared first-order TSE transformer block (in analogy with the first-order Taylor series term), a finite number of repeated multiplications of that block (in analogy with the higher-order Taylor series terms), and the corresponding learnable TSE coefficients to approximate the naive vision transformer. In this manner, the TSE-based vision model reduces the memory burden while keeping accuracy similar to that of the naive counterpart. Because a Taylor skip mechanism is added in training, the TSE-based vision transformer has good dynamic expansion capability. Experimental results show that TSE-based models can improve actual deployment latency by 1.30-1.36× on an A100 GPU and 1.34-1.45× on AGX Orin with negligible accuracy degradation on ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. Moreover, TSE-based optimization is orthogonal to model compression. Combined with a state-of-the-art vision transformer compression method, it improves latency and throughput by 1.70-1.87× and 3.29-3.61×, respectively, on an A100 GPU, and by 1.67-1.74× and 2.76-2.94×, respectively, on AGX Orin.
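
The mathematical intuition behind the abstract, that a few low-order Taylor terms can already approximate a nonlinear function well, can be checked with a short sketch. It uses the exponential function as a stand-in and says nothing about how the TSE transformer block itself is built; the function name `taylor_exp` is an assumption for this example.

```python
import math


def taylor_exp(x, order):
    """Truncated Taylor series of exp around 0: sum_{k=0..order} x^k / k!."""
    term, total = 1.0, 1.0
    for k in range(1, order + 1):
        term *= x / k  # turns x^(k-1)/(k-1)! into x^k/k!
        total += term
    return total


# Approximation error at x = 0.5 for truncation orders 1 through 4.
errors = [abs(taylor_exp(0.5, n) - math.exp(0.5)) for n in range(1, 5)]
```

Each added term shrinks the error at this point, and four terms beyond the constant already bring it below one part in a thousand.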

Abstract:
Variational Autoencoders (VAEs) can achieve remarkable results in single tasks such as learning data representations, image generation, or image-to-image translation, among others. However, VAEs suffer from loss of information when aiming to continuously learn a sequence of different data domains. This is caused by catastrophic forgetting, which affects all machine learning methods. This paper addresses the problem of catastrophic forgetting by developing a new theoretical framework which derives an upper bound to the negative sample log-likelihood when continuously learning sequences of tasks. These theoretical derivations provide new insights into the forgetting behavior of learning models, showing that their optimal performance is achieved when a dynamic mixture expansion model adds new components whenever learning new tasks. In our approach, we optimize the model size by introducing the Dynamic Expansion Graph Model (DEGM), which dynamically builds a graph structure promoting positive knowledge transfer when learning new tasks. In addition, we propose a Dynamic Expansion Graph Adaptive Mechanism (DEGAM) that generates adaptive weights to regulate the graph structure, further improving the effectiveness of positive knowledge transfer. Experimental results show that the proposed methodology performs better than other baselines in continual learning.

Abstract:
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications, including digital humans, virtual agents, and social robots. However, existing research primarily focuses on talking head generation (one-way interaction), which hinders the creation of digital humans for conversational (two-way) interaction because the listening and interaction parts are absent. In this work, we construct two datasets to address this issue: “ViCo”, for independent talking and listening head generation tasks at the sentence level, and “ViCo-X”, for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting interaction modeling during face-to-face conversation: 1) responsive listening head generation, making listeners respond actively to the speaker with non-verbal signals, 2) expressive talking head generation, guiding speakers to be aware of listeners’ behaviors, and 3) conversational head generation, integrating the talking and listening abilities in one interlocutor. Along with the datasets, we also propose corresponding baseline solutions to the three aforementioned tasks. Experimental results show that our baseline method can generate responsive and vivid agents that can collaborate with a real person to fulfil the whole conversation.

Affiliations: School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Artificial Intelligence, Beijing Normal University, Beijing, China; Momenta, Suzhou, China; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Abstract:
The fairness of face recognition (FR) is a challenging issue to numerous FR algorithms in the modern pluralistic and egalitarian society. In this work, we propose an instance-consistent fair face recognition (IC-FFR) method by fulfilling complete instance fairness on false positive rate (FPR) and true positive rate (TPR). In view of the misalignment of testing and training metrics, not yet considered by the current fair FR algorithms, in theory, we inspect the correlation between the testing metrics (FPR and TPR) and the label classification loss, and we derive a high-probability consistency of unfairness penalties from FPR and TPR to the softmax loss. According to the theoretical analysis, we further develop an instance-consistent fairness solution by introducing customized instance margins, which well preserve consistent FPR and TPR of all instances during the label classification in training. To encourage more fine-grained fairness evaluation, we contribute a dataset called national faces in the world (NFW) to measure the fairness of individuals and countries. Extensive experiments on our NFW as well as the RFW and BFW benchmarks demonstrate the effectiveness and superiority of our method compared to those state-of-the-art fair FR methods.

Abstract:
Prior research in video object segmentation (VOS) predominantly relies on videos with dense annotations. However, obtaining pixel-level annotations is both costly and time-intensive. In this work, we highlight the potential of effectively training a VOS model using remarkably sparse video annotations—specifically, as few as one or two labeled frames per training video, yet maintaining near equivalent performance levels. We introduce this innovative training methodology as low-shot video object segmentation, abbreviated as low-shot VOS. Central to this method is the generation of reliable pseudo labels for unlabeled frames during the training phase, which are then used in tandem with labeled frames to optimize the model. Notably, our strategy is extremely simple and can be incorporated into the vast majority of current VOS models. For the first time, we propose a universal method for training VOS models on one-shot and two-shot VOS datasets. In the two-shot configuration, utilizing just 7.3% and 2.9% of labeled data from the YouTube-VOS and DAVIS benchmarks respectively, our model delivers results on par with those trained on completely labeled datasets. It is also worth noting that in the one-shot setting, a minor performance decrement is observed in comparison to models trained on fully annotated datasets.

Abstract:
Supervised Cross-Modal Retrieval (SCMR) achieves significant performance with the supervision provided by substantial label annotations of multi-modal data. However, the requirement for large annotated multi-modal datasets restricts the use of supervised cross-modal retrieval in many practical scenarios. Active Learning (AL) has been proposed to reduce labeling costs while improving performance in various label-dependent tasks, in which the most informative unlabeled samples are selected for labeling and training. Directly exploiting the existing AL methods for supervised cross-modal retrieval may not be a good idea since they only focus on the uncertainty within each modality, ignoring the inter-modality relationship within the text-image pairs. Furthermore, existing methods focus exclusively on the informativeness of data during sample selection, leading to a biased, homogenized set where selected samples often contain nearly identical semantics and are densely distributed in a region of the feature space. Persistent training with such biased data selections can disturb multi-modal representation learning and substantially degrade the retrieval performance of SCMR. In this work, we propose an Active Supervised Cross-Modal Retrieval (ASCMR) framework, which effectively identifies informative multi-modal samples and generates unbiased sample selections. In particular, we propose a probabilistic multi-modal informativeness estimation that captures both the intra-modality and inter-modality uncertainty of multi-modal pairs within a unified representation. To ensure unbiased sample selection, we introduce a density-aware budget allocation strategy that constrains the active learning objective of maximizing the informativeness of selection with a novel semantic density regularization term. 
The proposed methods are evaluated on three widely used benchmark datasets, MS-COCO, NUS-WIDE, and MIRFlickr, demonstrating our effectiveness in significantly reducing the annotation cost while outperforming other baselines of active learning strategies. We could achieve over 95% of the fully supervised model’s performance by only utilizing 6%, 3%, and 4% active selected samples for MS-COCO, NUS-WIDE, and MIRFlickr, respectively.

Abstract:
The Fuzzy C-Means (FCM) algorithm is one of the most commonly used fuzzy clustering algorithms. It uses alternating optimization to update the membership matrix and the cluster center matrix. FCM achieves effective results in clustering tasks. However, due to its many constraints, the objective function is inconvenient to optimize directly and is prone to converging to a suboptimal local minimum, which affects the clustering performance. In this paper, we propose a minimization problem equivalent to FCM. First, we replace the membership matrix with its optimal solution for a fixed cluster center matrix, transforming the original constrained optimization problem into an unconstrained one and thus reducing the number of variables. We then use gradient descent instead of alternating optimization to solve the model, and we call this model UC-FCM. Extensive experimental results show that UC-FCM can obtain better local minima and achieve superior clustering performance compared to FCM under the same initialization. Moreover, UC-FCM is also competitive with other advanced clustering algorithms.
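As a rough illustration of the elimination step described in this abstract, the sketch below substitutes the standard closed-form FCM memberships (for fixed centers) into the standard FCM objective, which leaves a function of the centers only, and then runs plain gradient descent on it. This is a minimal sketch under assumptions, not the paper's UC-FCM implementation: the function names, the fuzzifier `m`, the learning rate, the step count, and the small `eps` guard are all hypothetical choices made for this example.

```python
import numpy as np

def sq_dists(X, V, eps=1e-12):
    """Squared Euclidean distances, shape (n_samples, n_clusters)."""
    diff = X[:, None, :] - V[None, :, :]
    return (diff ** 2).sum(axis=-1) + eps  # eps avoids 0 raised to a negative power

def memberships(X, V, m=2.0):
    """Closed-form optimal memberships for fixed centers V (rows sum to 1)."""
    W = sq_dists(X, V) ** (-1.0 / (m - 1.0))
    return W / W.sum(axis=1, keepdims=True)

def reduced_objective(X, V, m=2.0):
    """FCM objective with the memberships eliminated: a function of V only."""
    S = (sq_dists(X, V) ** (-1.0 / (m - 1.0))).sum(axis=1)
    return (S ** (1.0 - m)).sum()

def reduced_gradient(X, V, m=2.0):
    """Gradient of reduced_objective w.r.t. V: 2 * sum_i u_ik^m (v_k - x_i)."""
    Um = memberships(X, V, m) ** m
    return 2.0 * (Um.sum(axis=0)[:, None] * V - Um.T @ X)

def fit(X, V0, m=2.0, lr=0.05, steps=200):
    """Plain gradient descent on the centers; lr and steps are illustrative."""
    V = V0.copy()
    for _ in range(steps):
        V -= lr * reduced_gradient(X, V, m)
    return V, memberships(X, V, m)
```

Because the memberships are recomputed from the centers whenever needed, the simplex constraints on each membership row never have to be enforced explicitly during optimization.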

Abstract:
We consider the problem of referring camouflaged object detection (Ref-COD), a new task that aims to segment specified camouflaged objects based on a small set of referring images with salient target objects. We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios. Then, we develop a simple but strong dual-branch framework, dubbed R2CNet, with a reference branch embedding the common representations of target objects from referring images and a segmentation branch identifying and segmenting camouflaged objects under the guidance of the common representations. In particular, we design a Referring Mask Generation module to generate pixel-level prior masks and a Referring Feature Enrichment module to enhance the capability of identifying specified camouflaged objects. Extensive experiments show the superiority of our Ref-COD methods over their COD counterparts in segmenting specified camouflaged objects and identifying the main body of target objects.

Abstract:
Deep Neural Networks (DNNs) are vulnerable to adversarial attacks. Existing methods are devoted to developing various robust training strategies or regularizations to update the weights of the neural network. But beyond the weights, the overall structure and information flow in the network are explicitly determined by the neural architecture, which remains unexplored. Thus, this paper aims to improve the adversarial robustness of the network from the architectural perspective. We explore the relationship among adversarial robustness, Lipschitz constant, and architecture parameters and show that an appropriate constraint on architecture parameters could reduce the Lipschitz constant to further improve the robustness. The importance of architecture parameters could vary from operation to operation or connection to connection. We approximate the Lipschitz constant of the entire network through a univariate log-normal distribution, whose mean and variance are related to architecture parameters. The confidence can be fulfilled through formulating a constraint on the distribution parameters based on the cumulative function. Compared with adversarially trained neural architectures searched by various NAS algorithms as well as efficient human-designed models, our algorithm empirically achieves the best performance among all the models under various attacks on different datasets.

Abstract:
The growing influence of the Internet of Things (IoT) has made massive volumes of Big Data available in many fields. Generally, these Big Data may be non-independent and identically distributed (non-IID). In this paper, we enable multi-view k-means (MVKM) clustering to maintain the privacy of each database by allowing MVKM to operate locally on clients’ multi-view data. This work uses the exponential distance to transform the weighted Euclidean distance in MVKM so that the MVKM clustering algorithm can make full use of developments in federated learning. The proposed algorithm, called federated MVKM (Fed-MVKM), brings new ideas to MVKM that produce better clustering output. The proposed Fed-MVKM is highly suitable for clustering large data sets. To demonstrate its efficiency and applicability, we run experiments on one synthetic and six real multi-view data sets, and we follow the Federated Peter-Clark setting for causal inference in Huang et al. 2023 to split the data instances efficiently over multiple clients. The results show that sharing models based on data-driven local cluster centers in the federated environment can generate a satisfying final pattern of multi-view data while also improving on the clustering performance of (non-federated) MVKM clustering algorithms.

Abstract:
Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises $\langle \text{condition}, \text{question}, \text{answer} \rangle$ instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.

Abstract:
Geometric deep learning (GDL) models have demonstrated great potential for the analysis of non-Euclidean data. They are developed to incorporate the geometric and topological information of non-Euclidean data into end-to-end deep learning architectures. Motivated by the recent success of discrete Ricci curvature in graph neural networks (GNNs), we propose TorGNN, an analytic Torsion enhanced Graph Neural Network model. The essential idea is to characterize graph local structures with an analytic-torsion-based weight formula. Mathematically, analytic torsion is a topological invariant that can distinguish spaces which are homotopy equivalent but not homeomorphic. In our TorGNN, for each edge, a corresponding local simplicial complex is identified, then the analytic torsion (for this local simplicial complex) is calculated, and further used as a weight (for this edge) in the message-passing process. Our TorGNN model is validated on link prediction tasks from sixteen different types of networks and node classification tasks from four types of networks. It has been found that our TorGNN can achieve superior performance on both tasks, and outperform various state-of-the-art models. This demonstrates that analytic torsion is a highly efficient topological invariant in the characterization of graph structures and can significantly boost the performance of GNNs.

Abstract:
This work develops and analyzes a class of adaptive biased stochastic optimization (ABSO) algorithms from the perspective of the GEneralized Adaptive gRadient (GEAR) method that contains Adam, AdaGrad, RMSProp, etc. Particularly, two preferred biased stochastic optimization (BSO) algorithms, the biased stochastic variance reduction gradient (BSVRG) algorithm and the stochastic recursive gradient algorithm (SARAH), equipped with GEAR, are first considered in this work, leading to two ABSO algorithms: BSVRG-GEAR and SARAH-GEAR. We present a uniform analysis of ABSO algorithms for minimizing strongly convex (SC) and Polyak-Łojasiewicz (PŁ) composite objective functions. Second, we also use our framework to develop another novel BSO algorithm, adaptive biased stochastic conjugate gradient (coined BSCG-GEAR), which achieves the well-known oracle complexity. Specifically, under mild conditions, we prove that the resulting ABSO algorithms attain a linear convergence rate on both PŁ and SC cases. Moreover, we show that the complexity of the resulting ABSO algorithms is comparable to that of advanced stochastic gradient-based algorithms. Finally, we demonstrate the empirical superiority and the numerical stability of the resulting ABSO algorithms by conducting numerical experiments on different applications of machine learning.

Abstract:
Although numerous clustering algorithms have been developed, many existing methods still rely on the K-means technique to identify clusters of data points. However, the performance of K-means is highly dependent on the accurate estimation of cluster centers, which is challenging to achieve optimally. Furthermore, it struggles to handle linearly non-separable data. To address these limitations, from the perspective of manifold learning, we reformulate multi-view K-means into a manifold-based multi-view clustering formulation that eliminates the need for computing centroid matrix. This reformulation ensures consistency between the manifold structure and the data labels. Building on this, we propose a novel multi-view K-means model incorporating the tensor rank constraint. Our model employs the indicator matrices from different views to construct a third-order tensor, whose rank is minimized via the tensor Schatten p-norm. This approach effectively leverages the complementary information across views. By utilizing different distance functions, our proposed model can effectively handle linearly non-separable data. Extensive experimental results on multiple databases demonstrate the superiority of our proposed model.

Abstract:
The task of partial scene text retrieval involves localizing and searching for text instances that are the same or similar to a given query text from an image gallery. However, existing methods can only handle text-line instances, leaving the problem of searching for partial patches within these text-line instances unsolved due to a lack of patch annotations in the training data. To address this issue, we propose a network that can simultaneously retrieve both text-line instances and their partial patches. Our method embeds the two types of data (query text and scene text instances) into a shared feature space and measures their cross-modal similarities. To handle partial patches, our proposed approach adopts a Multiple Instance Learning (MIL) approach to learn their similarities with query text, without requiring extra annotations. However, constructing bags, which is a standard step of conventional MIL approaches, can introduce numerous noisy samples for training, and lower inference speed. To address this issue, we propose a Ranking MIL (RankMIL) approach to adaptively filter those noisy samples. Additionally, we present a Dynamic Partial Match Algorithm (DPMA) that can directly search for the target partial patch from a text-line instance during the inference stage, without requiring bags. This greatly improves the search efficiency and the performance of retrieving partial patches. We evaluate the proposed method on both English and Chinese datasets in two tasks: retrieving text-line instances and partial patches. For English text retrieval, our method outperforms state-of-the-art approaches by 8.04% mAP and 12.71% mAP on average, respectively, among three datasets for the two tasks. For Chinese text retrieval, our approach surpasses state-of-the-art approaches by 24.45% mAP and 38.06% mAP on average, respectively, among three datasets for the two tasks.

Abstract:
Vision-Language Models (VLMs), such as CLIP, excel in zero-shot image-level visual understanding but struggle with object-based tasks requiring precise localization and recognition. Visual prompts, like colorful boxes or circles, are suggested to enhance local perception. However, these methods often include irrelevant and noisy pixels, leading to suboptimal performance. The design of better visual prompts and their collaboration with text prompting remains underexplored. This paper introduces Fine-Grained Visual Text Prompting (FGVTP), a new zero-shot framework for object-based tasks using precise semantic masks and reinforced image-text alignment. FGVTP comprises Fine-Grained Visual Prompting (FGVP) and Consistency-Enhanced Text Prompting (CETP). Specifically, we carefully study visual prompting designs by exploring more visual markings that vary in shape and form. FGVP uses semantic masks from a segmenter like the Segment Anything Model (SAM) and employs background blurring (Blur Reverse Mask) to highlight targets while maintaining spatial coherence. Further, CETP enhances image-text alignment by prompting captions based on FGVP-processed images. As a result, FGVTP achieves superior zero-shot referring expression comprehension on RefCOCO/+/g benchmarks, outperforming previous SOTA methods by 5.8% on average. Part detection experiments conducted on the PACO dataset further validate the preponderance of FGVTP over existing works.
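The background-blurring idea mentioned in this abstract (keep the masked target sharp, blur everything else) can be sketched as a simple composite. This is only an illustrative sketch under assumptions: the box blur is a stand-in chosen to keep the example dependency-free, the function names and the `radius` parameter are hypothetical, and the mask is assumed to be a binary array supplied by some external segmenter.

```python
import numpy as np

def box_blur(img, radius):
    """Box blur of an (H, W, C) float image using edge padding."""
    k = 2 * radius + 1
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    H, W = img.shape[:2]
    out = np.zeros(img.shape, dtype=float)
    # Sum the k*k shifted copies of the image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + H, dx:dx + W]
    return out / (k * k)

def blur_reverse_mask(img, mask, radius=5):
    """Keep pixels inside `mask` sharp and blur the rest of the image.

    img:  (H, W, C) float array
    mask: (H, W) boolean array, True on the target object
    """
    blurred = box_blur(img.astype(float), radius)
    m = mask.astype(float)[..., None]  # broadcast the mask over channels
    return m * img + (1.0 - m) * blurred
```

Compared with cropping or drawing a box, this composite leaves the target pixels untouched while the blurred surroundings still indicate where the object sits in the scene.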

Affiliations: School of Physics, Mathematics and Computing, University of Western Australia, Crawley, WA, Australia; National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, National Biomedical Imaging Center, Peking University, Beijing, China; School of Computer Science, Faculty of Engineering, University of Sydney, Darlington, NSW, Australia; Center for Research in Computer Vision, University of Central Florida, Orlando, FL, USA

Abstract:
Understanding long-form videos requires precise temporal action segmentation. While existing studies typically employ multi-stage models that follow an iterative refinement process, we present a novel framework based on the denoising diffusion model that retains this core iterative principle. Within this framework, the model iteratively produces action predictions starting with random noise, conditioned on the features of the input video. To effectively capture three key characteristics of human actions, namely the position prior, the boundary ambiguity, and the relational dependency, we propose a cohesive masking strategy for the conditioning features. Moreover, a consistency gradient guidance technique is proposed, which maximizes the similarity between outputs with or without the masking, thereby enriching conditional information during the inference process. Extensive experiments are performed on four datasets, i.e., GTEA, 50Salads, Breakfast, and Assembly101. The results indicate that our proposed method outperforms or is on par with existing state-of-the-art techniques, underscoring the potential of generative approaches for action segmentation.

Abstract:
The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.

Abstract:
Burst Image Restoration aims to reconstruct a high-quality image by efficiently combining complementary inter-frame information. However, it is quite challenging since individual burst images often have inter-frame misalignments that usually lead to ghosting and zipper artifacts. To mitigate this, we develop a novel approach for burst image processing named BIPNet that focuses solely on the information exchange between burst frames and filters out the inherent degradations while preserving and enhancing the actual scene details. Our central idea is to generate a set of pseudo-burst features that combine complementary information from all the burst frames to exchange information seamlessly. However, due to inter-frame misalignment, the information cannot be effectively combined in the pseudo-burst. Thus, we first align the incoming burst features with respect to the reference frame using the proposed edge-boosting feature alignment. Lastly, we progressively upscale the pseudo-burst features in multiple stages while adaptively combining the complementary information. Unlike existing works that usually deploy single-stage up-sampling with a late fusion scheme, we first deploy a pseudo-burst mechanism followed by adaptive-progressive feature up-sampling. The proposed BIPNet significantly outperforms the existing methods on burst super-resolution, low-light image enhancement, low-light image super-resolution, and denoising tasks.

Abstract:
How can agents infer the intentions of others by simply observing their behavior? And how can they generate fast and accurate actions such as grasping a moving object on the fly? Recent advances in Bayesian model reduction have led to innovative, biologically plausible approaches to actively infer the state of affairs of the world and perform planning with continuous signals. However, reducing the surrounding environment into a small set of simpler hypotheses remains a challenge in highly dynamic contexts. In this study, we propose an approach, based on active inference, that employs dynamic priors sampled from reduced versions of a generative model. Each dynamic prior corresponds to an alternative evolution of the world, which the agent can evaluate by accumulating continuous data. We test our approach on two everyday tasks: inferring a trajectory and grasping a moving object. Our findings reveal how agents can smoothly infer and enact dynamic intentions, and emphasize the key role of intentional gain or precision in motor learning.

Abstract:
In this paper, we present UISE, a unified image segmentation framework that achieves efficient performance across various segmentation tasks, eliminating the need for multiple specialized pipelines. UISE employs dynamic convolutions between universal segmentation kernels and image feature maps, enabling a single pipeline for different tasks such as panoptic, instance, semantic, and video instance segmentation. To address computational requirements, we introduce a feature pyramid aggregator for image feature extraction and a separable dynamic decoder for generating segmentation kernels. The aggregator re-parameterizes interpolation-first modules in a convolution-first manner, resulting in a significant acceleration of the pipeline without incurring additional costs. The decoder incorporates multi-head cross-attention through separable dynamic convolution, enhancing both efficiency and accuracy. Extensive experiments are conducted to validate UISE’s performance across different segmentation tasks. To the best of our knowledge, UISE is the first universal segmentation framework that delivers competitive performance in terms of both speed and accuracy when compared to current state-of-the-art models.

Abstract:
We introduce HOLI-1-to-3, a novel technique for holistic 3D shape recovery from a single-viewpoint input, by effectively combining line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. We leverage advancements in ultrafast time-of-flight (ToF) sensors and learning-based 3D shape inference techniques, such as diffusion models. HOLI-1-to-3 employs a new neural plenoptic representation, which unifies radiance fields (for LOS RGB images) and transient fields (for NLOS transients). HOLI-1-to-3 is optimized through a two-stage pipeline involving diffusion priors and transients prior. Our technique allows for accurate and continuous reconstruction of both visible and invisible parts of objects from a single view. Comprehensive experiments on both simulated and real-world datasets demonstrate the effectiveness of HOLI-1-to-3 in resolving ambiguities in invisible parts of objects and significantly improving overall generation quality. The datasets used in our experiments will be made available to the research community to facilitate further achievements in holistic 3D shape recovery.

Abstract:
While recent image warping approaches achieved remarkable success on existing benchmarks, they still require training separate models for each specific task and cannot generalize well to different camera models or customized manipulations. To address diverse types of warping in practice, we propose a Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To further enable dynamic task-aware image warping, we introduce a lightweight point-based classifier that predicts the task type, serving as prompts to modulate the feature maps for more accurate estimation. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model. Extensive experiments demonstrate that our MOWA, which is trained on six tasks for multiple-in-one single image warping, outperforms state-of-the-art task-specific models across most tasks. Moreover, MOWA also exhibits promising potential to generalize into unseen scenes, as evidenced by cross-domain and zero-shot evaluations.

Abstract:
This paper presents a Spatial Re-parameterization (SpRe) method for the N:M sparsity. SpRe stems from an observation regarding the restricted variety in spatial sparsity of convolution kernels presented in N:M sparsity compared with unstructured sparsity. Particularly, N:M sparsity exhibits a fixed sparsity rate within the spatial domains due to its distinctive pattern that mandates N non-zero components among M successive weights in the input channel dimension of convolution filters. On the contrary, we observe that conventional unstructured sparsity displays a substantial divergence in sparsity across the spatial domains, which we experimentally verify to be very crucial for its robust performance retention compared with N:M sparsity. Therefore, SpRe employs the spatial-sparsity distribution of unstructured sparsity by assigning an extra branch in conjunction with the original N:M branch at training time, which allows the N:M sparse network to sustain a similar distribution of spatial sparsity with unstructured sparsity. During inference, the extra branch can be further re-parameterized into the main N:M branch, without exerting any distortion on the sparse pattern or additional computation costs. SpRe has achieved a commendable feat by matching the performance of N:M sparsity methods with state-of-the-art unstructured sparsity methods across various benchmarks. Our project is available at https://github.com/zyxxmu/SpRE.
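To make the N:M pattern in this abstract concrete, the sketch below builds an N:M mask over a convolution weight by keeping N entries in every group of M consecutive input channels, and contrasts its per-position spatial density with that of a global unstructured mask. It is a hedged illustration only: selecting entries by magnitude is an assumption (a common choice, not stated in the abstract), the function names are hypothetical, and the extra branch and re-parameterization of SpRe are not modeled.

```python
import numpy as np

def nm_mask(weight, n=2, m=4):
    """N:M mask for a conv weight of shape (out_ch, in_ch, kh, kw).

    Keeps the n largest-magnitude weights in each group of m consecutive
    input channels (magnitude selection is an assumption of this sketch).
    """
    out_ch, in_ch, kh, kw = weight.shape
    if in_ch % m != 0:
        raise ValueError("in_ch must be divisible by m")
    # Move input channels last and split them into groups of m.
    w = np.transpose(weight, (0, 2, 3, 1)).reshape(out_ch, kh, kw, in_ch // m, m)
    keep = np.argsort(-np.abs(w), axis=-1)[..., :n]
    mask = np.zeros(w.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(out_ch, kh, kw, in_ch).transpose(0, 3, 1, 2)

def unstructured_mask(weight, density):
    """Global magnitude mask keeping roughly `density` of all weights."""
    k = max(1, int(round(density * weight.size)))
    threshold = np.sort(np.abs(weight), axis=None)[-k]
    return np.abs(weight) >= threshold

def spatial_density(mask):
    """Fraction of kept weights at each (kh, kw) kernel position."""
    return mask.mean(axis=(0, 1))
```

For any weight tensor, `spatial_density(nm_mask(w, 2, 4))` is exactly 0.5 at every kernel position, whereas `spatial_density(unstructured_mask(w, 0.5))` generally differs from position to position. That gap is the contrast in spatial sparsity the abstract refers to.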

Abstract:
Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high-resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost, i.e., FLOPs. Our approach involves computing self-attention in a fixed low-resolution space, regardless of the input image’s resolution, with additional 3 × 3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models.

Abstract:
The light field camera has significantly advanced conventional imaging methods and microscopy over the past decades, providing high-dimensional information in 2D images and enabling a variety of applications. However, inherent shortcomings persist, mainly due to the complex optical setup and the resolution trade-off. In this work, we propose a Neural Defocus Light Field (NDLF) rendering method, which constructs the light field without a micro-lens array but achieves the same resolution as the original image. The basic unit of NDLF is the 3D point spread function (3D-PSF), which extends the 2D-PSF by incorporating the focus depth axis. NDLF can directly solve the distribution of PSFs in 3D space, enabling direct manipulation of the PSF in 3D and enhancing our understanding of the defocus process. NDLF renders focused images by redefining them as slices of the NDLF, which are superpositions of cross-sections of the 3D-PSFs. NDLF modulates the 3D-PSFs using three multilayer perceptron modules, corresponding to three Gaussian-based models from coarse to fine. NDLF is trained on 20 high-resolution (1024 × 1024) images at different focus depths, enabling it to render focused images at any given focus depth. The structural similarity index between the predicted and measured focused images is 0.9794. Moreover, we developed a hardware system to collect high-resolution focused images with their corresponding focus depths and depth maps. NDLF achieves high-resolution light field imaging with a single-lens camera and also resolves the distribution of 3D-PSFs in 3D space, paving the way for novel light field synthesis techniques and deeper insights into defocus blur.

Abstract:
Coherent diffraction imaging (CDI) is a computational technique for reconstructing a complex-valued optical field from an intensity measurement. The approach is to illuminate an object with a coherent beam of light to form a diffraction pattern, and use a phase retrieval algorithm to reconstruct the object's complex transmittance from the measurement. However, as the name implies, conventional CDI assumes highly coherent illumination. Recent works therefore extend CDI to account for partial coherence and imperfect detection, by modeling light as an incoherent mixture of multiple fields (e.g., multiple wavelengths) and recovering each field simultaneously. In this work, we make strides towards the practical implementation and usage of multi-wavelength diffraction imaging. In particular, we provide novel analysis of the noise characteristics of multi-wavelength diffraction imaging, and show that it is preferable to coherent diffraction imaging under high signal-independent noise. Additionally, we present a compact coded diffraction imaging system and corresponding phase retrieval algorithms to robustly and simultaneously recover complex fields representing multiple wavelengths. Using a novel mixed-norm color prior, our prototype system reconstructs a larger number of multi-wavelength fields from fewer measurements than existing methods, and supports applications such as micron-scale optical path difference measurement via synthetic wavelength holography.

Abstract:
Existing action quality assessment (AQA) methods mainly learn deep representations at the video level to score diverse actions. Due to the lack of a fine-grained understanding of actions in videos, they suffer from low credibility and accuracy, thus insufficient for stringent applications, such as competitive sports and sports injury rehabilitation. We argue that a fine-grained understanding of actions requires the model to parse actions in semantics, time, and space, which is the key to the credibility and accuracy of the AQA technique. Based on this insight, we propose a new human-centric fine-grained action quality assessment method named Unified Fine-grained spatial-temporal action Parser, namely Uni-FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in semantics, time, and space, minimizing the impact of invalid backgrounds during the assessment. In addition, we construct human-centric foreground action mask annotations for the FineDiving, AQA-7, and MTL-AQA datasets, respectively called FineDiving-HM, AQA-7-HM, and MTL-AQA-HM. With refined spatio-temporal annotations on diverse target action procedures, Uni-FineParser can provide a potential for human-centric fine-grained action quality assessment with better interpretability. Through extensive experiments, we demonstrate the effectiveness of Uni-FineParser, which outperforms state-of-the-art methods while supporting more tasks of human-centric action understanding.

Abstract:
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled and shifted target domain. The majority of existing UDA-SS works typically consider images whilst recent attempts have extended further to tackle videos by modeling the temporal dimension. Although two lines of research share the major challenges – overcoming the underlying domain distribution shift, their studies are largely independent. It causes several issues: (1) The insights gained from each line of research remain fragmented, leading to a lack of holistic understanding of the problem and potential solutions. (2) Preventing the unification of methods and best practices across two scenarios (images and videos) will lead to redundant efforts and missed opportunities for cross-pollination of ideas. (3) Without a unified approach, the knowledge and advancements made in one scenario may not be effectively transferred to the other, leading to suboptimal performance and slower progress. Under this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore the unified UDA-SS from a general domain augmentation perspective, serving as a unifying framework, enabling improved generalization, and potential for cross-pollination, ultimately contributing to the practical impact and overall progress. Specifically, we propose a Quad-directional Mixup (QuadMix) method, characterized by tackling intra-domain discontinuity, fragmented gap bridging, and feature inconsistencies through four-directional paths designed for intra- and inter-domain mixing within an explicit feature space. 
To deal with temporal shifts within videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment, which is extendable to image scenarios. Extensive experiments show that QuadMix outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.

Abstract:
In this paper, we study two challenging but less-touched problems in image restoration, namely, i) how to quantify the relationship between image degradations and ii) how to improve the performance of a specific restoration task using the quantified relationship. To tackle the first challenge, we propose the Degradation Relationship Index (DRI), which is defined as the mean drop rate difference in validation loss between two models, where one is trained solely with the anchor degradation and the other is trained with both anchor and auxiliary degradations. By quantifying the degradation relationship using DRI, we reveal that i) a positive DRI consistently indicates performance improvement when a beneficial auxiliary degradation is incorporated during training; ii) the proportion of auxiliary degradation is crucial to the anchor task performance. In other words, performance improvement is achieved only when the anchor and auxiliary degradations are combined in an appropriate proportion. Based on these observations, we further propose a simple yet effective Degradation Proportion Determination (DPD) method to estimate whether a given degradation combination can enhance performance on the anchor restoration task with the assistance of auxiliary degradation. Extensive experimental results verify the effectiveness and generalizability of our method on noise, rain streak, haze and snow.

Abstract:
Object detection is critical in autonomous driving, and it is more practical yet challenging to localize objects of unknown categories: an endeavour known as Class-Agnostic Object Detection (CAOD). Existing studies on CAOD predominantly rely on RGB cameras, but these frame-based sensors usually have high latency and limited dynamic range, leading to safety risks under extreme conditions like fast-moving objects, overexposure, and darkness. In this study, we turn to the event-based vision, featured by its sub-millisecond latency and high dynamic range, for robust CAOD. We propose Detecting Every Object in Events (DEOE), an approach aimed at achieving high-speed, class-agnostic object detection in event-based vision. Built upon the fast event-based backbone: recurrent vision transformer, we jointly consider the spatial and temporal consistencies to identify potential objects. The discovered potential objects are assimilated as soft positive samples to avoid being suppressed as backgrounds. Moreover, we introduce a disentangled objectness head to separate the foreground-background classification and novel object discovery tasks, enhancing the model's generalization in localizing novel objects while maintaining a strong ability to filter out the background. Extensive experiments confirm the superiority of our proposed DEOE in both open-set and closed-set settings, outperforming strong baseline methods.

Abstract:
Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: 1) learning representations from multimodal data without labels, 2) fusion of different modalities, and 3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider 1) objectives for learning from multimodal unlabeled data via self-supervision, 2) model architectures from the perspective of different multimodal fusion strategies, and 3) pair-free learning strategies for coarse-grained and fine-grained alignment. We also review real-world applications of SSML algorithms in diverse fields, such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML.

Abstract:
Despite the great success achieved, deep learning technologies usually suffer from data scarcity issues in real-world applications, where existing methods mainly explore sample relationships in a vanilla way from the perspectives of either the input or the loss function. In this paper, we propose a batch transformer module, BatchFormerV1, to equip deep neural networks themselves with the abilities to explore sample relationships in a learnable way. Basically, the proposed method enables data collaboration, e.g., head-class samples will also contribute to the learning of tail classes. Considering that exploring instance-level relationships has very limited impacts on dense prediction, we generalize and refer to the proposed module as BatchFormerV2, which further enables exploring sample relationships for pixel-/patch-level dense representations. In addition, to address the train-test inconsistency where a mini-batch of data samples are neither necessary nor desirable during inference, we also devise a two-stream training pipeline, i.e., a shared model is first jointly optimized with and without BatchFormerV2 which is then removed during testing. The proposed module is plug-and-play without requiring any extra inference cost. Lastly, we evaluate the proposed method on over ten popular datasets, including 1) different data scarcity settings such as long-tailed recognition, zero-shot learning, domain generalization, and contrastive learning; and 2) different visual recognition tasks ranging from image classification to object detection and panoptic segmentation.

Abstract:
In unsupervised meta-learning, the clustering-based pseudo-labeling approach is an attractive framework, since it is model-agnostic, allowing it to synergize with supervised algorithms to learn from unlabeled data. However, the pseudo-labels suffer from clustering noise and semantic chaos problems, further impacting the effectiveness of meta-learning. In this paper, we analyze and optimize the pseudo-labeling process, including encoding and clustering, aiming to generate semantic-like pseudo-labels to narrow the gap between unsupervised and supervised meta-learning. First, during the encoding, we observe that the embedding space of existing methods lacks clustering-friendly properties, which is the primary reason for clustering noise. To address this issue, we minimize the inter-to-intra-class similarity ratio to generate clustering-friendly embedding features and validate our approach through comprehensive experiments. Then, during the clustering, we find that the semantic quality of pseudo-labels is not adequately controlled, resulting in semantic chaos of pseudo-labels. We propose a semantic-stability index to measure the semantic quality of pseudo-labels quantitatively. Based on this index, we propose the Semantic-aware Pseudo-label Reassignment mechanism to generate semantic-like pseudo-labels for all samples. Our approach is model-agnostic and can easily be integrated into existing supervised methods. To demonstrate its generalization ability, we integrate it into two representative algorithms: MAML and EP. The results on three main few-shot benchmarks clearly show that the proposed method achieves significant improvement compared to state-of-the-art models. Notably, our approach also outperforms the corresponding supervised method in three tasks.

Abstract:
This paper explores stochastic multi-level compositional optimization, where the objective function is a composition of multiple smooth functions. Traditional methods for solving this problem suffer from either sub-optimal sample complexities or require huge batch sizes. To address these limitations, we introduce the Stochastic Multi-level Variance Reduction (SMVR) method. In the expectation case, our SMVR method attains the optimal sample complexity of $\mathcal{O}(1/\epsilon^3)$ to find an $\epsilon$-stationary point for non-convex objectives. When the function satisfies convexity or the Polyak-Łojasiewicz (PL) condition, we propose a stage-wise SMVR variant. This variant improves the sample complexity to $\mathcal{O}(1/\epsilon^2)$ for convex functions and $\mathcal{O}(1/(\mu\epsilon))$ for functions meeting the $\mu$-PL condition or $\mu$-strong convexity. These complexities match the lower bounds not only in terms of $\epsilon$ but also in terms of $\mu$ (for PL or strongly convex functions), without relying on large batch sizes in each iteration. Furthermore, in the finite-sum case, we develop the SMVR-FS algorithm, which can achieve a complexity of $\mathcal{O}(\sqrt{n}/\epsilon^2)$ for non-convex objectives, $\mathcal{O}(\sqrt{n}/\epsilon \log(1/\epsilon))$ for convex functions and $\mathcal{O}(\sqrt{n}/\mu \log(1/\epsilon))$ for objectives satisfying the $\mu$-PL condition, where $n$ denotes the number of functions in each level. To make use of adaptive learning rates, we propose the Adaptive SMVR method, which maintains the same complexities while demonstrating faster convergence in practice.

Abstract:
Attribute-missing graph learning, a common yet challenging problem, has recently attracted considerable attention. Existing efforts have at least one of the following limitations: 1) lack a noise filtering and information enhancing scheme, resulting in less comprehensive data completion; 2) isolate the node attribute and graph structure encoding processes, introducing more parameters and failing to take full advantage of the two types of information; and 3) impose overly strict distribution assumptions on the latent variables, leading to biased or less discriminative node representations. To tackle the issues, based on the idea of introducing intimate information interaction between the two information sources, we propose Weight-sharing Attribute-missing Graph autoEncoder (WAGE) to boost the expressive capacity of node representations for high-quality missing attribute reconstruction. Specifically, three strategies have been conducted. Firstly, we entangle the attribute embedding and structure embedding by introducing a weight-sharing architecture to share the parameters learned by both processes, which allows the network training to benefit from more abundant and diverse information. Secondly, we introduce a K-nearest neighbor-based dual non-local learning mechanism to improve the quality of data imputation by revealing unobserved high-confidence connections while filtering unreliable ones. Thirdly, we manually mask the connections on multiple adjacency matrices and force the structure-oriented embedding sub-network to recover the actual adjacency matrix, thus enforcing the resulting network to be able to selectively exploit more high-order discriminative features for data completion. Extensive experiments on six benchmark datasets demonstrate the effectiveness and superiority of WAGE against state-of-the-art competitors.

Abstract:
Contemporary multi-modal trackers achieve strong performance by leveraging complex backbones and fusion strategies, but this comes at the cost of computational efficiency, limiting their deployment in resource-constrained settings. On the other hand, compact multi-modal trackers are more efficient but often suffer from reduced performance due to limited feature representation. To mitigate the performance gap between compact and more complex trackers, we introduce a cross-modality distillation framework. This framework includes a complementarity-aware mask autoencoder designed to enhance cross-modal interactions by selectively masking patches within a modality, thereby forcing the model to learn more robust multi-modal representations. Additionally, we present a specific-common feature distillation module that transfers both modality-specific and shared information from a more powerful model’s backbone to the compact model. Moreover, we develop a multi-path selection distillation module to guide a simple fusion module in learning more accurate multi-modal information from a sophisticated fusion mechanism using multiple paths. Extensive experiments on six multi-modal tracking benchmarks demonstrate that the proposed tracker, despite being lightweight, outperforms most state-of-the-art methods, highlighting its effectiveness. Notably, our tiny variant achieves a PR score of 67.5% on LasHeR, a PR score of 58.5% on DepthTrack, and a PR score of 73.1% on VisEvent with only 6.5 M parameters, while operating at 126 FPS on an NVIDIA 2080Ti GPU.

Abstract:
The current studies of provable robustness for deep neural networks (DNNs) usually assume that the class distribution is overall balanced. However, in real-world applications, especially for safety-sensitive systems, the class distribution often exhibits a long-tailed property. It is well-known that the Area Under the ROC Curve (AUC) is a more proper metric for long-tailed learning problems. Motivated by this fact, an AUC-oriented provable robustness learning framework (named AUCPro) is first proposed in this paper. The key is to construct a proxy model smoothed by the isotropic Gaussian noise and then consider optimizing the proxy model from the AUC-oriented learning point of view. Theoretically, we provide a certified safety region for AUCPro within which the model would be free from the $\ell_2$ adversarial attacks. Most importantly, we propose a novel standard to theoretically study the robustness generalization toward unseen data for provable robustness learning approaches. To the best of our knowledge, such a problem remains barely considered in the machine learning community. To be specific, under a general principle for performance-robustness trade-off, we prove that the generalization ability of the resulting model could be equivalently expressed as the expected adversarial risk of AUC under $\ell_2$ perturbation. On top of this, we present two practical settings to explore the excess risk formed by the difference between the empirical risk of AUCPro and the derived generalization performance. Finally, comprehensive experiments speak to the efficacy of our proposed algorithm.

Abstract:
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this article, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale.

Abstract:
In open-world environments, classification models should be adept at identifying out-of-distribution (OOD) data whose semantics differ from in-distribution (ID) data, leading to the emerging research in OOD detection. As a promising learning scheme, outlier exposure (OE) enables the models to learn from auxiliary OOD data, enhancing model representations in discerning between ID and OOD patterns. However, these auxiliary OOD data often do not fully represent real OOD scenarios, potentially biasing our models in practical OOD detection. Hence, we propose a novel OE-based learning method termed Wasserstein Distribution-agnostic Outlier Exposure (W-DOE), which is both theoretically sound and experimentally superior to previous works. The intuition is that by expanding the coverage of training-time OOD data, the models will encounter fewer unseen OOD cases upon deployment. In W-DOE, we achieve additional OOD data to enlarge the OOD coverage, based on a new data synthesis approach called implicit data synthesis (IDS). It is driven by our new insight that perturbing model parameters can lead to implicit data transformation, which is simple to implement yet effective to realize. Furthermore, we suggest a general learning framework to search for the synthesized OOD data that can benefit the models most, ensuring the OOD performance for the enlarged OOD coverage measured by the Wasserstein metric. Our approach comes with provable guarantees for open-world settings, demonstrating that broader OOD coverage ensures reduced estimation errors and thereby improved generalization for real OOD cases. We conduct extensive experiments across a series of representative OOD detection setups, further validating the superiority of W-DOE against state-of-the-art counterparts in the field.

Abstract:
Existing Neural Radiance Fields (NeRF) methods suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multi-space neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which leads to a better understanding of the neural network toward the existence of reflective and refractive objects. Our multi-space scheme works as an enhancement to existing NeRF methods, with only small computational overheads needed for training and inferring the extra-space outputs. We design different multi-space modules for representative MLP-based and grid-based NeRF methods, which improve Mip-NeRF 360 by 4.15 dB in PSNR with 0.5% extra parameters and further improve TensoRF by 2.71 dB with 0.046% extra parameters on reflective regions without degrading the rendering quality on other regions. We further construct a novel dataset consisting of 33 synthetic scenes and 7 real captured scenes with complex reflection and refraction, where we design complex camera paths to fully benchmark the robustness of NeRF-based methods. Extensive experiments show that our approach significantly outperforms the existing single-space NeRF methods for rendering high-quality scenes concerned with complex light paths through mirror-like objects.

Abstract:
We present the first framework capable of synthesizing the all-in-focus neural radiance field (NeRF) from inputs without manual refocusing. Without refocusing, the camera will automatically focus on the fixed object for all views, and current NeRF methods typically using one camera fail due to the consistent defocus blur and a lack of sharp reference. To restore the all-in-focus NeRF, we introduce the dual-camera from smartphones, where the ultra-wide camera has a wider depth-of-field (DoF) and the main camera possesses a higher resolution. The dual camera pair saves the high-fidelity details from the main camera and uses the ultra-wide camera’s deep DoF as reference for all-in-focus restoration. To this end, we first implement spatial warping and color matching to align the dual camera, followed by a defocus-aware fusion module with learnable defocus parameters to predict a defocus map and fuse the aligned camera pair. We also build a multi-view dataset that includes image pairs of the main and ultra-wide cameras in a smartphone. Extensive experiments on this dataset verify that our solution, termed DC-NeRF, can produce high-quality all-in-focus novel views and compares favorably against strong baselines quantitatively and qualitatively. We further show DoF applications of DC-NeRF with adjustable blur intensity and focal plane, including refocusing and split diopter.

Abstract:
The exploration of quantum advantages with Quantum Neural Networks (QNNs) is an exciting endeavor. Recurrent neural networks, the widely used framework in deep learning, suffer from the gradient vanishing and exploding problem, which limits their ability to learn long-term dependencies. To address this challenge, in this work, we develop the sequential model of Quantum Gated Recurrent Neural Networks (QGRNNs). This model naturally integrates the gating mechanism into the framework of the variational ansatz circuit of QNNs, enabling efficient execution on near-term quantum devices. We present rigorous proof that QGRNNs can preserve the gradient norm of long-term interactions throughout the recurrent network, enabling efficient learning of long-term dependencies. Meanwhile, the architectural features of QGRNNs can effectively mitigate the barren plateau phenomenon. The effectiveness of QGRNNs in sequential learning is convincingly demonstrated through various typical tasks, including solving the adding problem, learning gene regulatory networks, and predicting stock prices. The hardware-efficient architecture and superior performance of our QGRNNs indicate their promising potential for finding quantum advantageous applications in the near term.

Abstract:
CRFs model spatial coherence in classical and deep learning computer vision. The most common CRF is called pairwise, as it connects pixel pairs. There are two types of pairwise CRF: sparse and dense. A sparse CRF connects the nearby pixels, leading to a linear number of connections in the image size. A dense CRF connects all pixel pairs, leading to a quadratic number of connections. While dense CRF is a more general model, it is much less efficient than sparse CRF. In fact, only Gaussian edge dense CRF is used in practice, and even then with approximations. We propose a new pairwise CRF, which we call sparse non-local CRF. Like dense CRF, it has non-local connections, and, therefore, it is more general than sparse CRF. Like sparse CRF, the number of connections is linear, and, therefore, our model is efficient. Besides efficiency, another advantage is that our edge weights are unrestricted. We show that our sparse non-local CRF models properties similar to that of Gaussian dense CRF. We also discuss connections to other CRF models. We demonstrate the usefulness of our model on classical and deep learning applications, for two and multiple labels.
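
The linear-versus-quadratic claim above can be checked with a small edge count. This is only a sketch: it assumes that "nearby pixels" means 4-connected grid neighbours, which the abstract does not specify, and the function names `sparse_edges` and `dense_edges` are hypothetical.

```python
def sparse_edges(height, width):
    """Edges in a 4-connected grid CRF (assumed neighbourhood): linear in pixel count."""
    horizontal = height * (width - 1)  # left-right neighbour pairs
    vertical = width * (height - 1)    # top-bottom neighbour pairs
    return horizontal + vertical


def dense_edges(height, width):
    """Edges in a dense CRF connecting every pixel pair: quadratic in pixel count."""
    n = height * width
    return n * (n - 1) // 2


small = (sparse_edges(4, 4), dense_edges(4, 4))
large = (sparse_edges(8, 8), dense_edges(8, 8))
print(small, large)
```

Going from a 4 × 4 to an 8 × 8 image quadruples the pixel count. Under the 4-connectivity assumption the sparse edge count grows from 24 to 112, while the dense count grows from 120 to 2016.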

Abstract:
The Concept Bottleneck Model (CBM) is an interpretable neural network that leverages high-level concepts to explain model decisions and conduct human-machine interaction. However, in real-world scenarios, the deficiency of informative concepts can impede the model's interpretability and subsequent interventions. This paper proves that insufficient concept information can lead to an inherent dilemma of concept and label distortions in CBM. To address this challenge, we propose the Decoupling Concept Bottleneck Model (DCBM), which comprises two phases: 1) DCBM for prediction and interpretation, which decouples heterogeneous information into explicit and implicit concepts while maintaining high label and concept accuracy, and 2) DCBM for human-machine interaction, which automatically corrects labels and traces wrong concepts via mutual information estimation. The construction of the interaction system can be formulated as a light min-max optimization problem. Extensive experiments expose the success of alleviating concept/label distortions, especially when concepts are insufficient. In particular, we propose the Concept Contribution Score (CCS) to quantify the interpretability of DCBM. Numerical results demonstrate that CCS can be guaranteed by the Jensen-Shannon divergence constraint in DCBM. Moreover, DCBM expresses two effective human-machine interactions, including forward intervention and backward rectification, to further promote concept/label accuracy via interaction with human experts.

Abstract:
Due to advancements in digital cameras, it is easy to gather multiple images (or videos) from an object under different conditions. Therefore, image-set classification has attracted more attention, and different solutions were proposed to model them. A popular way to model image sets is subspaces, which form a manifold called the Grassmann manifold. In this contribution, we extend the application of Generalized Relevance Learning Vector Quantization to deal with Grassmann manifold. The proposed model returns a set of prototype subspaces and a relevance vector. While prototypes model typical behaviours within classes, the relevance factors specify the most discriminative principal vectors (or images) for the classification task. They both provide insights into the model’s decisions by highlighting influential images and pixels for predictions. Moreover, due to learning prototypes, the model complexity of the new method during inference is independent of dataset size, unlike previous works. We applied it to several recognition tasks including handwritten digit recognition, face recognition, activity recognition, and object recognition. Experiments demonstrate that it outperforms previous works with lower complexity and can successfully model the variation, such as handwritten style or lighting conditions. Moreover, the presence of relevances makes the model robust to the selection of subspaces’ dimensionality.

Abstract:
Recent years have witnessed the success of deep networks in compressed sensing (CS), which allows for a significant reduction in sampling cost and has gained growing attention since its inception. In this paper, we propose a new practical and compact network dubbed PCNet for general image CS. Specifically, in PCNet, a novel collaborative sampling operator is designed, which consists of a deep conditional filtering step and a dual-branch fast sampling step. The former learns an implicit representation of a linear transformation matrix into a few convolutions and first performs adaptive local filtering on the input image, while the latter then uses a discrete cosine transform and a scrambled block-diagonal Gaussian matrix to generate under-sampled measurements. Our PCNet is equipped with an enhanced proximal gradient descent algorithm-unrolled network for reconstruction. It offers flexibility, interpretability, and strong recovery performance for arbitrary sampling rates once trained. Additionally, we provide a deployment-oriented extraction scheme for single-pixel CS imaging systems, which allows for the convenient conversion of any linear sampling operator to its matrix form to be loaded onto hardware like digital micro-mirror devices. Extensive experiments on natural image CS, quantized CS, and self-supervised CS demonstrate the superior reconstruction accuracy and generalization ability of PCNet compared to existing state-of-the-art methods, particularly for high-resolution images.

Abstract:
Understanding 3D human interactions is fundamental for fine-grained scene analysis and behavioural modeling. However, most of the existing models predict incorrect, lifeless 3D estimates, that miss the subtle human contact aspects–the essence of the event–and are of little use for detailed behavioral understanding. This paper addresses such issues with several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3D contact signature prediction; (2) we show how such components can be leveraged to ensure contact consistency during 3D reconstruction; (3) we construct several large datasets for learning and evaluating 3D contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3D motion capture dataset with 631 sequences containing 2,525 contact events, 728,664 ground truth 3D poses, as well as FlickrCI3D, a dataset of 11,216 images, with 14,081 processed pairs of people, and 81,233 facet-level surface correspondences. Finally, (4) we propose methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup and (5) annotate all 3D interaction motions in CHI3D with textual descriptions.

Abstract:
The risk-controlling prediction sets (RCPS) framework is a general tool for transforming the output of any machine learning model to design a predictive rule with rigorous error rate control. The key idea behind this framework is to use labeled hold-out calibration data to tune a hyper-parameter that affects the error rate of the resulting prediction rule. However, the limitation of such a calibration scheme is that with limited hold-out data, the tuned hyper-parameter becomes noisy and leads to a prediction rule with an error rate that is often unnecessarily conservative. To overcome this sample-size barrier, we introduce a semi-supervised calibration procedure that leverages unlabeled data to rigorously tune the hyper-parameter without compromising statistical validity. Our procedure builds upon the prediction-powered inference framework, carefully tailoring it to risk-controlling tasks. We demonstrate the benefits and validity of our proposal through two real-data experiments: few-shot image classification and early time series classification.

Abstract:
With the advancements in technology and monitoring tools, we often encounter multivariate graph signals, which can be seen as the realizations of multivariate graph processes, and revealing the relationship between their constituent quantities is one of the important problems. To address this issue, we propose a cross-spectral analysis tool for bivariate graph signals. The main goal of this study is to extend the scope of spectral analysis of graph signals to bivariate graph signals. In this study, we define joint weak stationarity graph processes and introduce graph cross-spectral density and coherence for bivariate graph processes. We propose several estimators for the cross-spectral density and investigate the theoretical properties of the proposed estimators. Furthermore, we demonstrate the effectiveness of the proposed estimators through numerical experiments, including simulation studies and a real data application. Finally, as an interesting extension, we discuss robust spectral analysis of graph signals in the presence of outliers.

Abstract:
Moire patterns, unwanted color artifacts in images and videos, arise from the interference between spatially high-frequency scene contents and the spatial discrete sampling of digital cameras. Existing demoireing methods primarily rely on single-camera image/video processing, which faces two critical challenges: 1) distinguishing moire patterns from visually similar real textures, and 2) preserving tonal consistency and temporal coherence while removing moire artifacts. To address these issues, we propose a dual-camera framework that captures synchronized videos of the same scene: one in focus (retaining high-quality textures but may exhibit moire patterns) and one defocused (with significantly reduced moire patterns but blurred textures). We use the defocused video to help distinguish moire patterns from real texture, so as to guide the demoireing of the focused video. We propose a frame-wise demoireing pipeline, which begins with an optical flow based alignment step to address any discrepancies in displacement and occlusion between the focused and defocused frames. Then, we leverage the aligned defocused frame to guide the demoireing of the focused frame using a multi-scale CNN and a multi-dimensional training loss. To maintain tonal and temporal consistency, our final step involves a joint bilateral filter to leverage the demoireing result from the CNN as the guide to filter the input focused frame to obtain the final output. Experimental results demonstrate that our proposed framework largely outperforms state-of-the-art image and video demoireing methods.

Abstract:
Evaluating AI systems, particularly large models, is an essential yet computationally expensive task. The use of extensive benchmarks often leads to substantial computational/human costs that may even exceed those of pretraining. The efficiency of AI model evaluation focuses on estimating the model’s score on the full benchmark based on its responses to a smaller subset. Various empirical selection methods have been proposed to identify valuable subsets within these benchmarks. In this paper, we formally define and approximate the subset selection problem inherent in efficient evaluation. We prove that this problem actually optimizes a submodular function and that a unified subset can be identified using a simple greedy algorithm. Importantly, this approach is the first to provide theoretical guarantees of bias control and generalizability in score estimation. Using language models as a case study, experimental results across 11 different benchmarks validate its superiority in estimating model scores and maintaining ranking consistency. It can achieve accurate score estimation using no more than 30% of the full benchmark, thus facilitating efficient and sparse benchmark design.
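
The greedy procedure mentioned above can be sketched as follows. This is a minimal illustration of greedy maximization of a monotone submodular function, not the paper's objective: the facility-location score and the toy similarity matrix are assumptions introduced here, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def facility_location(similarity: Sequence[Sequence[float]]) -> Callable[[Set[int]], float]:
    """Assumed stand-in objective: how well the chosen items 'cover' all items."""
    def objective(subset: Set[int]) -> float:
        if not subset:
            return 0.0
        # Each item is credited with its best similarity to any chosen item.
        return sum(max(row[j] for j in subset) for row in similarity)
    return objective


def greedy_select(n_items: int, budget: int,
                  objective: Callable[[Set[int]], float]) -> List[int]:
    """Repeatedly add the item with the largest marginal gain."""
    chosen: Set[int] = set()
    order: List[int] = []
    current = objective(chosen)
    for _ in range(min(budget, n_items)):
        best_item, best_gain = -1, float("-inf")
        for item in range(n_items):
            if item in chosen:
                continue
            gain = objective(chosen | {item}) - current
            if gain > best_gain:
                best_item, best_gain = item, gain
        chosen.add(best_item)
        order.append(best_item)
        current += best_gain
    return order


if __name__ == "__main__":
    # Toy similarity between four benchmark items (made-up numbers).
    sim = [
        [1.0, 0.9, 0.1, 0.0],
        [0.9, 1.0, 0.2, 0.1],
        [0.1, 0.2, 1.0, 0.8],
        [0.0, 0.1, 0.7, 1.0],
    ]
    print(greedy_select(4, 2, facility_location(sim)))
```

On this toy matrix the first pick is the item that covers the first cluster best, and the second pick comes from the other cluster, since a near-duplicate of the first pick adds little marginal gain.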

Abstract:
We present an approach to solving hard geometric optimization problems in the RANSAC framework. The hard minimal problems arise from relaxing the original geometric optimization problem into a minimal problem with many spurious solutions. Our approach avoids computing large numbers of spurious solutions. We design a learning strategy for selecting a starting problem-solution pair that can be numerically continued to the problem and the solution of interest. We demonstrate our approach by developing a RANSAC solver for the problem of computing the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view. On average, we can solve a single problem in under 70 μs. We also benchmark and study our engineering choices on the very familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.

Abstract:
Person re-identification (ReID) is to identify the same person across non-overlapping camera views. After a decade of development, the methods based on deep networks have achieved high performance on benchmarks and become mainstream. In applications, the features of gallery images extracted by deep learning-based methods are stored to speed up the query process and protect the sensitive information contained in the images. Unfortunately, it has been demonstrated that turning the images into features cannot properly protect privacy, as these features could be reversed to the corresponding images, revealing the sensitive information they contain. Therefore, to prevent privacy leakage, recent methods learn their features against some feature reversal methods, and most conventional reversal methods focus on minimizing the difference between a reconstruction and its original image. However, there could be many reasonable reconstruction results from a single feature, and the conventional reversal methods will inevitably generate reconstruction results that lie in a different distribution from that of the original images, which cannot properly assess the private information for learning to protect and thus hamper the privacy-protected feature learning. To mitigate this problem, we enforce the reconstructions to follow the same distribution as the original images by the generative adversarial network (GAN). We operate this GAN-based feature reversal module accompanied by the conventional ReID feature extraction module and form a novel GAN-based feature privacy-protected person ReID model, which is expected to protect feature privacy against reversal attacks and maintain ReID utility. We demonstrate that optimizing the ReID model to accommodate privacy protection faces a double adversarial objective and is thus challenging. As a remedy, we design a novel two-step training and lazy update strategy that alternately optimizes the feature extraction module and stabilizes the update process of the GAN-based feature reversal module. To evaluate the efficiency of the model in balancing its ReID utility and feature privacy protection, we introduce a novel metric called utility-reversibility ratio (URR). Compared with existing privacy-protected feature extraction models, the proposed method achieves a better balance between privacy protection and person ReID performance. Extensive experiments validate that our model can effectively protect feature privacy at a tiny accuracy cost, and validate the effectiveness of our model with the emerging diffusion model.

Abstract:
Accurately modeling detailed interactions between human/hand and object is an appealing yet challenging task. Current multi-view capture systems are only capable of reconstructing multiple subjects into a single, unified mesh, which fails to model the states of each instance individually during interactions. To address this, previous methods use template-based representations to track human/hand and object. However, the quality of the reconstructions is limited by the descriptive capabilities of the templates so these methods inherently struggle with geometric details, pressing deformations and invisible contact surfaces. In this work, we propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation. However, the real-captured data is presented as a holistic mesh, unable to provide instance-level supervision. To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances. Specifically, synthetic data, created by randomly combining individual scans of humans/hands and objects, guides the network to learn a coarse prior of instances. Meanwhile, real-captured data helps in learning the overall geometry and restricting interpenetration in contact areas. As demonstrated in experiments, our method Ins-HOI supports instance-level reconstruction and provides reasonable and realistic invisible contact surfaces even in cases of extremely close interaction. To facilitate research on this task, we collect a large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans with real-world human-chair and hand-object interactions. The code and data will be public for research purposes.

Abstract:
High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis demonstrate that our method effectively alleviates aliasing while successfully retaining details after demodulation. As a result, the proposed approach considerably enhances existing state-of-the-art segmentation models (e.g., Mask2Former-Swin-T +1.5 mIoU, InternImage-T +1.4 mIoU on ADE20K). Furthermore, ARS also enhances the performance of powerful Deformable Convolution (+0.8 mIoU on Cityscapes) by maintaining relative positional order during non-uniform sampling. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks.
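
The Frequency Scaling Property invoked above can be stated for the simplest case. The block below is a one-dimensional, uniform-scaling sketch with symbols introduced here for illustration (f, g, F, G, s, ω); ARS as described is adaptive and non-uniform, so this shows only the underlying principle.

```latex
% Stretching a signal by a factor s > 1 compresses its spectrum by the same factor.
\[
  g(x) = f\!\left(\frac{x}{s}\right), \quad s > 1
  \qquad\Longleftrightarrow\qquad
  G(\omega) = s\, F(s\,\omega),
\]
% so a component of f at frequency \omega_0 appears in g at \omega_0 / s.
```

Under this reading, sampling a region more densely plays the role of stretching it, which moves its content to lower frequencies before a downsampling layer is applied.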

Abstract:
Random feature latent variable models (RFLVMs) are state-of-the-art tools for uncovering structure in high-dimensional, non-Gaussian data. However, their reliance on Monte Carlo sampling significantly limits scalability, posing challenges for large-scale applications. To overcome these limitations, we develop a scalable RFLVM framework based on variational Bayesian inference (VBI), a deterministic and optimization-based alternative to sampling methods. Applying VBI to RFLVMs is nontrivial due to two key challenges: (i) the lack of an explicit probability density function (PDF) for Dirichlet process (DP) mixing weights, and (ii) the inefficiency of existing VBI approaches when handling the high-dimensional variational parameters of RFLVMs. To address these issues, we adopt the stick-breaking construction for the DP, which provides an explicit and tractable PDF over mixing weights, and propose a novel inference algorithm, block coordinate descent variational inference (BCD-VI), which partitions variational parameters into blocks and applies tailored solvers to optimize them efficiently. The resulting scalable model, referred to as SRFLVM, supports various likelihoods; we demonstrate its effectiveness under Gaussian and logistic settings. Extensive experiments on diverse benchmark datasets show that SRFLVM achieves superior scalability, computational efficiency, and performance in latent representation learning and missing data imputation, consistently outperforming state-of-the-art latent variable models, including deep generative approaches.

Abstract:
Query-based object detectors and segmenters have made great progress in their respective tasks by employing an iterative refinement decoder. These query-based methods directly represent object instances with a set of learnable queries. These query vectors are progressively refined to stable, meaningful representations through a sequence of decoder layers, and then used to directly predict object locations (mask or box) and categories with customized heads. In this paper, we present a novel query-based object decoder design with infinite refinement (DEQ-Decoder) through a deep equilibrium model (DEQ). Our DEQ-Decoder models the query vector refinement as the fixed point solving of an implicit (DEQ) layer. To be more specific to query refinement, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement. Accordingly, we are able to incorporate refinement awareness into the DEQ-Decoder training with the inexact gradient back-propagation (RAG). In addition, to stabilize the training of our DEQ-Decoder and improve its generalization ability, we devise a deep supervision scheme on the optimization path of DEQ-Decoder with refinement-aware perturbation (RAP). To demonstrate the effectiveness of DEQ-Decoder, we apply it to object detection and instance segmentation. For object detection, we propose DEQDet based on our DEQ-Decoder. DEQDet converges faster, consumes less memory, and achieves better results than the baseline counterpart (AdaMixer). In particular, our DEQDet with ResNet50 backbone and 300 queries achieves 49.6 mAP and 33.9 AP_s on the MS COCO benchmark under the 2× training scheme (24 epochs). For instance segmentation, our DEQSeg achieves much better box mAP metrics and slightly better mask metrics for different mask decoding branches.

Abstract:
Graph machine learning has been extensively studied in both academia and industry. Although booming with a vast number of emerging methods and techniques, most of the literature is built on the in-distribution hypothesis, i.e., testing and training graph data are identically distributed. However, this in-distribution hypothesis can hardly be satisfied in many real-world graph scenarios where the model performance substantially degrades when there exist distribution shifts between testing and training graph data. To solve this critical problem, out-of-distribution (OOD) generalization on graphs, which goes beyond the in-distribution hypothesis, has made great progress and attracted ever-increasing attention from the research community. In this paper, we comprehensively survey OOD generalization on graphs and present a detailed review of recent advances in this area. First, we provide a formal problem definition of OOD generalization on graphs. Second, we categorize existing methods into three classes from conceptually different perspectives, i.e., data, model, and learning strategy, based on their positions in the graph machine learning pipeline, followed by detailed discussions for each category. We also review the theories related to OOD generalization on graphs and introduce the commonly used graph datasets for thorough evaluations. Finally, we share our insights on future research directions.

Abstract:
This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i²RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5M equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5M by +2.6↑. When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level.

Affiliations: Tsinghua University, Shenzhen, China; .AI, Beijing, China; School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’An, China; University of Chinese Academy of Sciences, Beijing, China; PCA Lab, Nanjing University of Science and Technology, Nanjing, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Huawei Noah’s Ark Lab, Shenzhen, Canada; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Artificial Intelligence Innovation and Incubation (Al’) Institute of Fudan University, Shanghai, China; Shenzhen University, Shenzhen, China

Abstract:
Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans.

Abstract:
Owing to the excellent capability in dealing with label ambiguity, Label Distribution Learning (LDL), as an emerging machine learning paradigm, has received extensive research in recent years. Though remarkable progress has been achieved in various tasks, one limitation with existing LDL methods is that they are all based on the i.i.d. assumption that training and test data are identically and independently distributed. As a result, they suffer obvious performance degradation and are no longer applicable when tested in out-of-distribution scenarios, which severely limits the application of LDL in many tasks. In this paper, we identify and investigate the Generalizable Label Distribution Learning (GLDL) problem. To handle such a challenging problem, we delve into the characteristics of GLDL and find that the label annotations changing with the variability of the domains is the underlying reason for the performance degradation of the existing methods. Inspired by this observation, we explore domain-invariant feature-label correlation information to reduce the impact of label annotations changing with domains and propose two practical methods. Extensive experiments verify the superior performance of the proposed methods. Our work fills the gap in benchmarks and techniques for practical GLDL problems.

Abstract:
The accumulation of adversarial perturbations in the feature space makes it impossible for Deep Neural Networks (DNNs) to know what features are robust and reliable, and thus DNNs can be fooled by relying on a single contaminated feature. Numerous defense strategies attempt to improve their robustness by denoising, deactivating, or recalibrating non-robust features. Despite their effectiveness, we still argue that these methods are under-explored in terms of determining how trustworthy the features are. To address this issue, we propose a novel Evidence-based Multi-Feature Fusion (termed EMFF) for adversarial robustness. Specifically, our EMFF approach introduces evidential deep learning to help DNNs quantify the belief mass and uncertainty of the contaminated features. Subsequently, a novel multi-feature evidential fusion mechanism based on Dempster’s rule is proposed to fuse the trusted features of multiple blocks within an architecture, which further helps DNNs avoid the induction of a single manipulated feature and thus improve their robustness. Comprehensive experiments confirm that compared with existing defense techniques, our novel EMFF method has obvious advantages and effectiveness in both scenarios of white-box and black-box attacks, and also prove that by integrating into several adversarial training strategies, we can improve robustness across distinct architectures, including traditional CNNs and recent vision Transformers with a few extra parameters and almost the same cost.
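
To illustrate the kind of fusion Dempster's rule performs, the sketch below combines two "opinions", each made of per-class belief masses plus one uncertainty mass that together sum to 1. This reduced form (mass only on single classes and on the whole label set) is common in evidential deep learning, but it is an assumption here: the abstract does not give EMFF's exact formulation, and the function name and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def dempster_combine(b1: Sequence[float], u1: float,
                     b2: Sequence[float], u2: float) -> Tuple[List[float], float]:
    """Combine two opinions (belief masses b, uncertainty u) with Dempster's rule.

    Assumes each opinion satisfies sum(b) + u == 1 and both cover the same classes.
    """
    if len(b1) != len(b2):
        raise ValueError("both opinions must cover the same classes")
    # Conflict: mass assigned to pairs of different classes.
    conflict = sum(b1[i] * b2[j]
                   for i in range(len(b1))
                   for j in range(len(b2)) if i != j)
    if conflict >= 1.0:
        raise ValueError("opinions are in total conflict; combination undefined")
    scale = 1.0 / (1.0 - conflict)
    # A class keeps mass when both agree, or when one agrees and the other is unsure.
    fused_b = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1)
               for k in range(len(b1))]
    fused_u = scale * u1 * u2
    return fused_b, fused_u


if __name__ == "__main__":
    beliefs, uncertainty = dempster_combine([0.6, 0.1], 0.3, [0.5, 0.2], 0.3)
    print(beliefs, uncertainty)
```

In this toy case both sources favour the first class, so the fused belief in it rises and the fused uncertainty drops below either input uncertainty, which is the behaviour that lets several moderately trusted features outweigh a single manipulated one.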

Abstract:
Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few-labeled support samples of the target task. Nevertheless, existing approaches may still fail in the open world due to the inevitable in-distribution (ID) and out-of-distribution (OOD) noise from both support and query samples of the target task. With limited support samples available, i) the adverse effect of the dual noises can be severely amplified during task adaptation, and ii) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose DEnoised Task Adaptation (DETA++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a clean prototype loss and a noise entropy maximization loss are proposed to achieve noise-robust task adaptation. Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model’s robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.

Abstract:
Stereopsis has widespread appeal in computer vision and robotics as it is the predominant way by which we perceive depth to navigate our 3D world. Event cameras are novel bio-inspired sensors that detect per-pixel brightness changes asynchronously, with very high temporal resolution and high dynamic range, enabling machine perception in high-speed motion and broad illumination conditions. The high temporal precision also benefits stereo matching, making disparity (depth) estimation a popular research area for event cameras ever since their inception. Over the last 30 years, the field has evolved rapidly, from low-latency, low-power circuit design to current deep learning (DL) approaches driven by the computer vision community. The bibliography is vast and difficult to navigate for non-experts due to its highly interdisciplinary nature. Past surveys have addressed distinct aspects of this topic, in the context of applications, or focusing only on a specific class of techniques, but have overlooked stereo datasets. This survey provides a comprehensive overview, covering both instantaneous stereo and long-term methods suitable for simultaneous localization and mapping (SLAM), along with theoretical and empirical comparisons. It is the first to extensively review DL methods as well as stereo datasets, even providing practical suggestions for creating new benchmarks to advance the field. The main advantages and challenges faced by event-based stereo depth estimation are also discussed. Despite significant progress, challenges remain in achieving optimal performance in not only accuracy but also efficiency, a cornerstone of event-based computing. We identify several gaps and propose future research directions. We hope this survey inspires future research in depth estimation with event cameras and related topics, by serving as an accessible entry point for newcomers, as well as a practical guide for seasoned researchers in the community.

Abstract:
Out-of-distribution (OOD) detection is critical for identifying test samples that deviate from in-distribution (ID) data, ensuring network robustness and reliability. This paper presents a flexible framework for OOD knowledge distillation that extracts OOD-sensitive information from a network to develop a binary classifier capable of distinguishing between ID and OOD samples in both scenarios, with and without access to training ID data. To accomplish this, we introduce Confidence Amendment (CA), an innovative methodology that transforms an OOD sample into an ID one while progressively amending prediction confidence derived from the network to enhance OOD sensitivity. This approach enables the simultaneous synthesis of both ID and OOD samples, each accompanied by an adjusted prediction confidence, thereby facilitating the training of a binary classifier sensitive to OOD. Theoretical analysis provides bounds on the generalization error of the binary classifier, demonstrating the pivotal role of confidence amendment in enhancing OOD sensitivity. Extensive experiments spanning various datasets and network architectures confirm the efficacy of the proposed method in detecting OOD samples.

Abstract:
Parameter-efficient tunings (PETs) have demonstrated impressive performance and promising perspectives in training large models, while they are still confronted with a common problem: the trade-off between learning new content and protecting old knowledge, leading to zero-shot generalization collapse, and cross-modal hallucination. In this paper, we reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection, and first propose a unified framework called Parameter Efficient Gradient Projection (PEGP). We introduce orthogonal gradient projection into different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting even for large-scale models. It therefore modifies the gradient towards the direction that has less impact on the old feature space, with less extra memory space and training time. We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets, and experiments comprehensively demonstrate its efficiency in reducing forgetting in class, online class, domain, task, and multi-modality continual settings.
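As a rough illustration of the orthogonal gradient projection idea mentioned in this abstract, the sketch below removes from a gradient its components along an orthonormal basis of an "old feature" subspace. The function names, the toy basis, and the gradient values are assumptions made for illustration; the abstract does not specify how PEGP builds the subspace or applies the projection inside each PET paradigm.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(grad, basis):
    """Return grad minus its components along each vector in basis.

    Assumption: basis is orthonormal, so the components can be
    subtracted one by one without re-orthogonalizing.
    """
    out = list(grad)
    for u in basis:
        coeff = dot(grad, u)
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


# Hypothetical old-feature subspace spanned by the first two axes.
old_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]

projected = project_orthogonal(grad, old_basis)
print(projected)
```

An update along `projected` has zero inner product with every direction in `old_basis`, which is the sense in which such a step leaves the old subspace undisturbed.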

Abstract:
The progress on Hyperspectral image (HSI) super-resolution (SR) is still lagging behind the research of RGB image SR. HSIs usually have a high number of spectral bands, so accurately modeling spectral band interaction for HSI SR is hard. Also, training data for HSI SR is hard to obtain so the dataset is usually rather small. In this work, we propose a new test-time training method to tackle this problem. Specifically, a novel self-training framework is developed, where more accurate pseudo-labels and more accurate LR-HR relationships are generated so that the model can be further trained with them to improve performance. In order to better support our test-time training method, we also propose a new network architecture to learn HSI SR without modeling spectral band interaction and propose a new data augmentation method Spectral Mixup to increase the diversity of the training data at test time. We also collect a new HSI dataset with a diverse set of images of interesting objects ranging from food to vegetation, to materials, and to general scenes. Extensive experiments on multiple datasets show that our method can improve the performance of pre-trained models significantly after test-time training and outperform competing methods significantly for HSI SR.

Abstract:
Recent advances in self-supervised learning have witnessed great achievements, especially with the introduction of contrastive learning, where the goal is to maximize the mutual information between different augmentations of the same image, i.e., positive pairs. However, such optimization does not necessarily correspond to optimal representation due to noisy samples, thus inevitably being over-confident in the relevance between views. As a result, the learned model would capture spurious correlation and retain superfluous information that deteriorates representations. In this paper, we facilitate contrastive learning by reducing superfluous relevance between positive views. To this end, we introduce the representation entropy minimization regularization over the objective of vanilla contrastive learning, which forces representations to retain possibly the least information, thus alleviating superfluous relevance from irrelevant views. Then, we derive the analytical expression of the proposed objective by converting it to an information bottleneck problem and solving it via variational approximation, which leads to a novel contrastive learning framework, termed CLIMB, short for Contrastive Learning via variational InforMation Bottleneck. Experiments over multiple benchmarks demonstrate that CLIMB brings consistent improvement. Notably, using DINO as an instantiation, CLIMB achieves 4.5% and 3.5% gain under the k-NN classification metric with EfficientNet-B0 and ResNet-50 as backbones, respectively.

Abstract:
Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi-annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean-teacher framework, SEE, that leverages the vision foundation model, “Segment Anything Model (SAM)”, to generate pseudo-labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low-quality segmentation masks, we introduce a series of strategies for pseudo-label generation, storage, and supervision. These strategies aim to produce informative pseudo-labels, store the best pseudo-labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid-granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single-object and multiple-object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state-of-the-art performance. Furthermore, SEE can serve as a plug-and-play solution, enhancing the performance of existing models.
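For readers unfamiliar with mean-teacher training, the sketch below shows the usual exponential moving average (EMA) update by which a teacher model tracks a student model. It illustrates the generic mean-teacher mechanism only; the abstract does not state SEE's exact update rule, and the parameter names and momentum value here are hypothetical.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    teacher and student map parameter names to scalar values here;
    in practice these would be tensors. The momentum is an assumed value.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy parameters (hypothetical values).
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

With a momentum close to 1 the teacher changes slowly, which is why its outputs are commonly treated as more stable targets than the student's.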

Abstract:
Real-world graphs are typically complex, exhibiting heterogeneity in the global structure, as well as strong heterophily within local neighborhoods. While a growing body of literature has revealed the limitations of graph neural networks (GNNs) in handling homogeneous graphs with heterophily, little work has been conducted on investigating the heterophily properties in the context of heterogeneous graphs. To bridge this research gap, we identify the heterophily in heterogeneous graphs using metapaths and propose two practical metrics to quantitatively describe the levels of heterophily. Our empirical investigations on real-world heterogeneous graphs have revealed that heterogeneous graph neural networks (HGNNs), which inherit many mechanisms from GNNs designed for homogeneous graphs, struggle to generalize to heterogeneous graphs with heterophily or low levels of homophily. To address the challenge, we present Hetero$^2$Net, a heterophily-aware HGNN that incorporates masked metapath prediction and masked label prediction tasks to effectively and flexibly handle both homophilic and heterophilic heterogeneous graphs. We evaluate the performance of Hetero$^2$Net on five real-world heterogeneous graph benchmarks with varying levels of heterophily. Experimental results demonstrate that Hetero$^2$Net outperforms strong baselines in the semi-supervised node classification task. In particular, Hetero$^2$Net scales to an industrial-scale commercial graph with 13 M nodes and 157 M edges, demonstrating its effectiveness in handling large and complex heterogeneous graphs.

Abstract:
In multi-labeled complementary label learning (MLCLL), a complementary label (CL) represents an irrelevant label for an instance. Utilizing CLs instead of relevant labels as annotations simplifies the annotation process in multi-label learning (MLL) tasks, underscoring the practicality of the MLCLL problem. However, existing MLCLL approaches mainly focus on scenarios where an instance is associated with a single CL. This restricts their applicability in situations where annotators provide multiple CLs per instance. To address this limitation, we propose a novel paradigm called multi-label learning with multiple complementary labels (ML-MCL), which allows each instance to be associated with multiple CLs simultaneously. Through analyzing the generation process of multiple CLs, we construct the relationship between relevant labels and CLs. This assists in deriving a tailored risk-consistent estimator to solve MLCLL with multiple CLs. Theoretically, we establish an estimation error bound for this estimator, with a convergence rate of $\mathcal{O}(1/\sqrt{n})$. Furthermore, we observed that unbounded gradients can be produced in the derived estimator when optimizing with certain loss functions, which may lead to unstable optimization. To mitigate this issue, we enhance the estimator with a confidence truncation loss, stabilizing the optimization process. Experimental results confirm the effectiveness of our approach, showing improved learning stability and performance in MLCLL tasks involving multiple CLs.

Abstract:
Scene flow estimation captures fine-grained motion in the spatial domain but lacks coherence in the temporal domain. To address this, this study proposes long-term scene flow estimation (LSFE), a comprehensive task that can simultaneously capture the fine-grained and long-term 3D motion in an online manner. We introduce SceneTracker, the first LSFE network that adopts an iterative approach to approximate the optimal 3D trajectory. The network dynamically and simultaneously indexes and constructs appearance correlation and depth residual features. Transformers are then employed to explore and utilize long-range connections within and between trajectories. With detailed experiments, SceneTracker shows superior capabilities in addressing 3D spatial occlusion and depth noise interference, highly tailored to the needs of the LSFE task. We build a real-world evaluation dataset, LSFDriving, for the LSFE field and use it in experiments to further demonstrate the advantage of SceneTracker in generalization abilities.

Abstract:
Scattering-based computed tomography (CT) recovers a heterogeneous volumetric scattering medium using images taken from multiple directions. It is a nonlinear problem. Prior art mainly approached it by explicit physics-based optimization of image-fitting, being slow and difficult to scale. Scale is particularly important when the objects constitute large cloud fields, where volumetric recovery is important for climate studies. Besides speed, imaging and recovery need to be flexible, to efficiently handle variable viewing geometries and resolutions. These can be caused by perturbation in camera poses or fusion of data from different types of observational sensors. There is a need for fast variable imaging projection scattering tomography of clouds (VIP-CT). We develop a learning-based solution, using a deep-neural network (DNN) which trains on a large physics-based labeled volumetric dataset. The DNN parameters are oblivious to the domain scale, hence the DNN can work with arbitrarily large domains. VIP-CT offers much better quality than the state of the art. The inference speed and flexibility of VIP-CT make it effectively real-time in the context of spaceborne observations. The paper is the first to demonstrate CT of a real cloud using empirical data directly in a DNN. VIP-CT may offer a model for a learning-based solution to nonlinear CT problems in other scientific domains. Our code is available online.

Abstract:
Traditional hand-held light field cameras only observe a small fraction of the cone of light emitted by a scene point. As a consequence, the study of interesting angular effects like iridescence is beyond the scope of such cameras. This paper envisions a new design for sensing light fields with wide baselines, so as to sense a significantly larger fraction of the cone of light emitted by scene points. Our system achieves this by imaging the scene, indirectly, through an ellipsoidal mirror. We show that an ellipsoidal mirror maps a wide cone of light from locations near one of its foci to a narrower cone at its other focus; thus, by placing a conventional light field camera at a focus, we can observe a wide-baseline light field from the scene near the other focus. We show via simulations and a lab prototype that wide-baseline light fields excel in the traditional applications involving changes in focus and perspective. Additionally, the larger cone of light that they observe allows the study of iridescence and thin-film interference. Perhaps surprisingly, the larger cone of light allows us to estimate surface normals of scene points by reasoning about their visibility.
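The focus-to-focus behaviour of an ellipsoidal mirror rests on the classical reflective property of the ellipse: a ray leaving one focus reflects off the curve toward the other focus. The sketch below checks this numerically in a 2D cross-section. The semi-axes and the function name are assumptions chosen for illustration, and the code does not model the paper's actual optical system or the cone-narrowing analysis.

```python
import math


def reflect_from_focus(a, b, t):
    """Reflect a ray from focus F1 at the ellipse point with parameter t.

    Returns (cross, along): cross is the 2D cross product between the
    reflected direction and the vector from the hit point to focus F2
    (zero when they are parallel); along is their dot product (positive
    when the reflected ray heads toward F2).
    """
    c = math.sqrt(a * a - b * b)
    f1, f2 = (-c, 0.0), (c, 0.0)
    p = (a * math.cos(t), b * math.sin(t))

    # Unit outward normal from the gradient of x^2/a^2 + y^2/b^2.
    nx, ny = p[0] / (a * a), p[1] / (b * b)
    n_len = math.hypot(nx, ny)
    nx, ny = nx / n_len, ny / n_len

    # Unit direction of the incoming ray from F1 to the hit point.
    dx, dy = p[0] - f1[0], p[1] - f1[1]
    d_len = math.hypot(dx, dy)
    dx, dy = dx / d_len, dy / d_len

    # Mirror reflection: r = d - 2 (d . n) n.
    k = 2.0 * (dx * nx + dy * ny)
    rx, ry = dx - k * nx, dy - k * ny

    # Compare with the direction from the hit point to F2.
    tx, ty = f2[0] - p[0], f2[1] - p[1]
    return rx * ty - ry * tx, rx * tx + ry * ty


# Assumed semi-axes a = 2, b = 1; sample several points on the ellipse.
results = [reflect_from_focus(2.0, 1.0, t) for t in (0.3, 1.0, 2.0, 4.0, 5.5)]
for cross, along in results:
    print(f"cross={cross:.2e}  along={along:.3f}")
```

In each case the cross product vanishes up to rounding and the dot product is positive, consistent with every ray from one focus being redirected to the other.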

Abstract:
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and testing data by adapting a given model w.r.t. any testing sample. This task is particularly important when the test environment changes frequently. Although some recent attempts have been made to handle this task, we still face two key challenges: 1) prior methods have to perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications; 2) while existing TTA solutions can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as catastrophic forgetting). To this end, we have proposed an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even when the samples are inherently uncertain, leading to overconfident predictions that underestimate the data uncertainty. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we compare the divergence between predictions from the full network and its sub-networks to measure the reducible model uncertainty, on which we propose a test-time uncertainty reduction strategy with divergence minimization loss to encourage consistent predictions instead of overconfident ones. To further re-calibrate predicting confidence on different samples, we utilize the disagreement among predicted labels as an indicator of the data uncertainty. Based on this, we devise a min-max entropy regularization to selectively increase and decrease predicting confidence for confidence re-calibration. Note that EATA-C and EATA differ in the adaptation objective, while EATA-C still benefits from the active sample selection criterion and anti-forgetting Fisher regularization proposed in EATA. Extensive experiments on image classification and semantic segmentation verify the effectiveness of our proposed methods.
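To make the notion of selecting "reliable" samples for entropy minimization more concrete, the sketch below computes the prediction entropy of each test sample and keeps only those below a threshold. It is a simplified illustration under stated assumptions: the threshold value, the toy logits, and the function names are hypothetical, and the non-redundancy criterion, the Fisher regularizer, and the EATA-C calibration terms are not modelled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_logits, threshold):
    """Indices of samples whose prediction entropy is below threshold."""
    return [
        i
        for i, logits in enumerate(batch_logits)
        if entropy(softmax(logits)) < threshold
    ]


# Toy batch of 3-class logits (hypothetical values).
batch = [
    [8.0, 0.0, 0.0],   # confident prediction
    [0.1, 0.0, -0.1],  # nearly uniform prediction
    [0.0, 5.0, 0.0],   # confident prediction
]
# Assumed threshold: a fraction of the maximum entropy log(num_classes).
threshold = 0.4 * math.log(3)

selected = select_reliable(batch, threshold)
print(selected)
```

Only the selected samples would contribute to the entropy loss and its backward pass, which is how such a filter can reduce the per-sample optimization cost noted in the abstract.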

Abstract:
Deep learning has significantly propelled the development of photometric stereo by handling the challenges posed by unknown reflectance and global illumination effects. However, how supervised learning-based photometric stereo networks resolve these challenges remains to be elucidated. In this paper, we aim to reveal how existing methods address these challenges by revisiting their deep features, deep feature encoding strategies, and network architectures. Based on the insights gained from our analysis, we propose ESSENCE-Net, which effectively encodes deep shading features with an easy-first-encoding strategy, enhances shading features with shading supervision, and accurately decodes normal with spatial context-aware attention. The experimental results verify that the proposed method outperforms state-of-the-art methods on three benchmark datasets, whether with dense or sparse inputs.

Abstract:
Accurate hyperspectral image (HSI) interpretation is critical for providing valuable insights into various earth observation-related applications such as urban planning, precision agriculture, and environmental monitoring. However, existing HSI processing methods are predominantly task-specific and scene-dependent, which severely limits their ability to transfer knowledge across tasks and scenes, thereby reducing the practicality in real-world applications. To address these challenges, we present HyperSIGMA, a vision transformer-based foundation model that unifies HSI interpretation across tasks and scenes, scalable to over one billion parameters. To overcome the spectral and spatial redundancy inherent in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450 K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA’s versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, real-world applicability, and computational efficiency.

Abstract:
Existing top-performance autonomous driving systems typically rely on the multi-modal fusion strategy for reliable scene understanding. This design is however fundamentally restricted due to overlooking the modality-specific strengths and finally hampering the model performance. To address this limitation, in this work, we introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout, enabling their unique characteristics to be exploited during the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder is designed to iteratively refine the predictions by alternately aggregating information from separate representations in a unified modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.

Abstract:
This paper highlights a problem of evaluation metrics adopted in the open-vocabulary segmentation. The evaluation process relies heavily on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground truth categories. We first survey eleven similarity measurements between two categorical words using WordNet linguistics statistics, text embedding, or language models by comprehensive quantitative analysis and user study to tackle this issue. Based on those explored measurements, we design novel evaluation metrics, Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks. We benchmark the proposed evaluation metrics on twelve open-vocabulary methods in three segmentation tasks. Despite the relative subjectivity of similarity distance, we demonstrate that our metrics can still well evaluate the open ability of the existing open-vocabulary segmentation methods. We hope our work can bring the community new thinking about evaluating model ability for open-vocabulary segmentation.

Abstract:
Spatio-Temporal Video Grounding (STVG) aims at localizing the spatio-temporal tube of a specific object in an untrimmed video given a free-form natural language query. As the annotation of tubes is labor intensive, researchers are motivated to explore weakly supervised approaches in recent works, which usually results in significant performance degradation. To achieve a less expensive STVG method with acceptable accuracy, this work investigates the “single-frame supervision” paradigm that requires a single frame labeled with a bounding box within the temporal boundary of the fully supervised counterpart as the supervisory signal. Based on the characteristics of the STVG problem, we propose a Two-Stage Multiple Instance Learning (T-SMILE) method, which creates pseudo labels by expanding the annotated frame to its contextual frames, thereby establishing a fully-supervised problem to facilitate further model training. The innovations of the proposed method are threefold, including 1) utilizing multiple instance learning to dynamically select instances in positive bags for the recognition of starting and ending timestamps, 2) learning highly discriminative query features by incorporating spatial prior constraints in cross-attention, and 3) designing a curriculum learning-based strategy that iteratively assigns dynamic weights to spatial and temporal branches, thereby gradually adapting to the learning branch with larger difficulty. To facilitate future research on this task, we also contribute a large-scale benchmark containing 12,469 videos on complex scenes with single-frame annotation. The extensive experiments on two benchmarks demonstrate that T-SMILE significantly outperforms all weakly-supervised methods. Remarkably, it also performs better than some fully-supervised methods associated with much more annotation labor costs.

Abstract:
Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language (“cross-modal”) decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advances in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer’s overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. Based on the idea of conducting cross-modal feature fusion in the visual feature encoding stage, we propose a unified framework named Language-Aware Vision Transformer (LAVT), which leverages the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results can be harvested with a light-weight mask predictor. One of the key components in the proposed system is a dense attention mechanism for collecting pixel-specific linguistic cues. When dealing with video inputs, we present the video LAVT framework and design a 3D version of this component by introducing multi-scale convolutional operators arranged in a parallel fashion, which can exploit spatio-temporal dependencies at different granularity levels. We further introduce unified LAVT, a single framework that can handle both image and video inputs with enhanced segmentation capability on the unified referring segmentation task. Our methods surpass previous state-of-the-art methods on seven benchmarks for referring image segmentation and referring video segmentation. The code to reproduce our experiments is available at LAVT-RS.

Abstract:
Recently, post-training quantization (PTQ) has become the de facto way to produce efficient low-precision neural networks without long-time retraining. Despite its low cost, current PTQ works fail to succeed under the extremely low-bit setting. In this work, we delve into extremely low-bit quantization and construct a unified theoretical analysis, which provides an in-depth understanding of the reason for the failure of low-bit quantization. According to the theoretical study, we argue that the existing methods fail in low-bit schemes due to significant perturbation on weights and lack of consideration of activation quantization. To this end, we propose Brecq and QDrop to respectively solve these two challenges, based on which a Q-Limit framework is constructed. Then the Q-Limit framework is further extended to support a mixed precision quantization scheme. To the best of our knowledge, this is the first work that can push the limit of PTQ down to INT2. Extensive experiments on various handcrafted and searched neural architectures are conducted for both visual recognition/detection tasks and language processing tasks. Without bells and whistles, our PTQ framework can attain low-bit ResNet and MobileNetV2 comparable with quantization-aware training (QAT), establishing a new state-of-the-art for PTQ.
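To give a feel for why weight perturbation grows at very low bit-widths, the sketch below applies plain min-max uniform quantization to a small weight vector and compares the mean squared error at 8, 4, and 2 bits. This is a generic rounding scheme used only for illustration; it is not Brecq, QDrop, or the Q-Limit framework, and the weight values and function names are hypothetical.

```python
def quantize_uniform(weights, bits):
    """Round weights to a uniform grid with 2**bits levels over [min, max].

    Assumption: the weights are not all equal, so the step size is nonzero.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weight vector (hypothetical values).
weights = [-0.62, -0.31, -0.05, 0.02, 0.18, 0.44, 0.71]

errors = {
    bits: mean_squared_error(weights, quantize_uniform(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"INT{bits}: mse={err:.6f}")
```

On this toy vector the rounding error rises sharply as the bit-width drops from 8 to 2, which is the kind of weight perturbation the abstract identifies as one cause of failure in low-bit PTQ.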

Abstract:
Recent advancements in model pruning have focused on developing new algorithms and improving upon benchmarks. However, the practical application of these algorithms across various models and platforms remains a significant challenge. To address this challenge, we propose ONNXPruner, a versatile pruning adapter designed for the ONNX format models. ONNXPruner streamlines the adaptation process across diverse deep learning frameworks and hardware platforms. A novel aspect of ONNXPruner is its use of node association trees, which automatically adapt to various model architectures. These trees clarify the structural relationships between nodes, guiding the pruning process, particularly highlighting the impact on interconnected nodes. Furthermore, we introduce a tree-level evaluation method. By leveraging node association trees, this method allows for a comprehensive analysis beyond traditional single-node evaluations, enhancing pruning performance without the need for extra operations. Experiments across multiple models and datasets confirm ONNXPruner’s strong adaptability and increased efficacy. Our work aims to advance the practical application of model pruning.

Abstract:
Incorporating heterogeneous representations from different architectures has facilitated various vision tasks, e.g., some hybrid networks combine transformers and convolutions. However, complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture is heterogeneous from the base model. In this process, HSSL endows the base model with new characteristics in a representation learning way without structural changes. To comprehensively understand the HSSL, we conduct experiments on various heterogeneous pairs containing a base model and an auxiliary head. We discover that the representation quality of the base model moves up as their architecture discrepancy grows. This observation motivates us to propose a search strategy that quickly determines the most suitable auxiliary head for a specific base model to learn and several simple but effective methods to enlarge the model discrepancy. The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection.

Abstract:
Snapshot compressive imaging (SCI) has emerged as a novel way of capturing hyperspectral images. It operates an optical encoder to compress the 3D data into a 2D measurement and adopts a software decoder for the signal reconstruction. Recently, a representative SCI set-up of coded aperture snapshot compressive imager (CASSI) with a Transformer reconstruction backend has shown high-fidelity sensing performance. However, dominant spatial and spectral attention designs show limitations in hyperspectral modeling. The spatial attention values describe the inter-pixel correlation but overlook the across-spectra variation within each pixel. The spectral attention size is unscalable to the token spatial size and thus bottlenecks information allocation. Besides, CASSI entangles the spatial and spectral information into a 2D measurement, placing a barrier for information disentanglement and modeling. In addition, CASSI blocks the light with a physical binary mask, yielding the masked data loss. To tackle the above challenges, we propose a spatial-spectral (S$^2$-) Transformer implemented by a parallel attention design and a mask-aware learning strategy. First, we systematically explore pros and cons of different spatial (-spectral) attention designs, based on which we find performing both attentions in parallel well disentangles and models the blended information. Second, the masked pixels induce higher prediction difficulty and should be treated differently from unmasked ones. We adaptively prioritize the loss penalty attributing to the mask structure by referring to the mask-encoded prediction as an uncertainty estimator. We theoretically discuss the distinct convergence tendencies between masked/unmasked regions of the proposed learning strategy. Extensive experiments demonstrate that on average, the results of the proposed method are superior to the state-of-the-art methods. We empirically visualize and reason about the behaviour of spatial and spectral attentions, and comprehensively examine the impact of the mask-aware learning, both of which advance the physics-driven deep network design for the reconstruction with CASSI.

Abstract:
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve into a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, and explore both earlier traditional context-driven and current multimodal conditional methods, offering a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we address current limitations and envision some potential directions for future research.

Abstract:
Digital image forensics plays a crucial role in image authentication and manipulation localization. Despite the progress powered by deep neural networks, existing forgery localization methodologies exhibit limitations when deployed to unseen datasets and perturbed images (i.e., lack of generalization and robustness to real-world applications). To circumvent these problems and aid image integrity, this paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts. The rationale is grounded on the observation that most image signal processors (ISP) involve the demosaicing process, which introduces pixel correlations in pristine images. Moreover, manipulating operations, including splicing, copy-move, and inpainting, directly affect such pixel regularity. We, therefore, first split the input image into several blocks and design masked self-attention mechanisms to model the global pixel dependency in input images. Simultaneously, we optimize another local pixel dependency stream to mine local manipulation clues within input forgery images. In addition, we design novel Learning-to-Weight Modules (LWM) to combine features from the two streams, thereby enhancing the final forgery localization performance. To improve the training process, we propose a novel Pixel-Inconsistency Data Augmentation (PIDA) strategy, driving the model to focus on capturing inherent pixel-level artifacts instead of mining semantic forgery traces. This work establishes a comprehensive benchmark integrating 16 representative detection models across 12 datasets. Extensive experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints and achieves state-of-the-art generalization and robustness performance in image manipulation localization.

Abstract:
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (Proof) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded, and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture better task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets with various continual learning scenarios and various VLMs validate that Proof achieves state-of-the-art performance.

Abstract:
The development of a nonparametric and versatile clustering algorithm has been a longstanding challenge in unsupervised learning due to the exploratory nature of the clustering problem. This study presents a novel algorithm, named Gauging-$\delta$, which can handle diverse cluster shapes and operate in a nonparametric manner. The algorithm employs a hierarchical merging process that starts from individual data points until no further clusters can be merged. The central component of Gauging-$\delta$ is the adaptive mergeability function, which progressively determines if two clusters are mergeable considering the perceptual statistics of the clusters and their environment. Empirical evaluations on 105 synthetic datasets demonstrate the superiority of the proposed algorithm, particularly in accurately handling well-separated clusters. Experiments on real-world datasets highlight the impact of selecting appropriate data features and distance metrics on clustering results.

Abstract:
Feature drift is caused by the dynamic coupling of target features and degradation factors, which reduces underwater detector performance. We redefine feature drift as the instability of target features within boundary constraints while solving partial differential equations (PDEs). From this insight, we propose the Spatial Residual (SR) block, which uses SkipCut to establish effective constraints across the network width for solving PDEs and optimizes the solution space. It is implemented as a general-purpose backbone with 5 Spatial Residuals (BSR5) for complex feature scenarios. Specifically, BSR5 extracts discrete channel slices through SkipCut, where each sliced feature is parsed within the appropriate data capacity. In gradient backpropagation, SkipCut functions as a ShortCut, optimizing information flow and gradient allocation to enhance performance and accelerate training. Experiments on the RUOD dataset show that BSR5-integrated DETRs and YOLOs achieve state-of-the-art results for conventional and end-to-end detectors. Specifically, our BSR5-DETR improves AP by 1.3% and 2.7% over RT-DETR with ResNet-101, while reducing parameters by 41.6% and 6.6%, respectively. Further validation highlights BSR5's strong convergence and robustness, especially in training-from-scratch scenarios, making it well suited for data-scarce, resource-constrained, and real-time tasks.

Abstract:
We propose a method for unsupervised reconstruction of a temporally-consistent sequence of surfaces from a sequence of time-evolving point clouds. It yields dense and semantically meaningful correspondences between frames. We represent the reconstructed surfaces as atlases computed by a neural network, which enables us to establish correspondences between frames. The key to making these correspondences semantically meaningful is to guarantee that the metric tensors computed at corresponding points are as similar as possible. We have devised an optimization strategy that makes our method robust to noise and global motions, without a priori correspondences or pre-alignment steps. As a result, our approach outperforms state-of-the-art ones on several challenging datasets.

Abstract:
Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks. However, many model-free RL algorithms experience performance degradation due to inaccurate value estimation, particularly the overestimation of Q-values, which can lead to suboptimal policies. To address this issue, we previously proposed the Distributional Soft Actor-Critic (DSAC or DSACv1), an off-policy RL algorithm that enhances value estimation accuracy by learning a continuous Gaussian value distribution. Despite its effectiveness, DSACv1 faces challenges such as training instability and sensitivity to reward scaling, caused by high variance in critic gradients due to return randomness. In this paper, we introduce three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy: expected value substitution, twin value distribution learning, and variance-based critic gradient adjustment. The enhanced algorithm, termed DSAC with Three refinements (DSAC-T or DSACv2), is systematically evaluated across a diverse set of benchmark tasks. Without the need for task-specific hyperparameter tuning, DSAC-T consistently matches or outperforms leading model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T ensures a stable learning process and maintains robust performance across varying reward scales. Its effectiveness is further demonstrated through real-world application in controlling a wheeled robot, highlighting its potential for deployment in practical robotic tasks.

Abstract:
Matching suitable jobs with qualified candidates is crucial for online recruitment. Typically, users (i.e., candidates and employers) have specific expectations in the recruitment market, making them prefer similar jobs or candidates. Metric learning technologies provide a promising way to capture the similarity propagation between candidates and jobs. However, they rely on symmetric distance measures, failing to model users' asymmetric relationships in two-way selection. Additionally, users' behaviors (e.g., candidates) are highly affected by the feedback from their counterparts (e.g., employers), which can hardly be captured by the existing person-job fit methods that primarily explore homogeneous and undirected graphs. To address these problems, we propose a quasi-metric learning framework to capture the similarity propagation between candidates and jobs while modeling their asymmetric relations for bilateral person-job fit. Specifically, we propose a quasi-metric space that not only satisfies the triangle inequality to capture the fine-grained similarity between candidates and jobs, but also incorporates a tailored asymmetric measure to model users' two-way selection process in online recruitment. More importantly, the proposed quasi-metric learning framework can theoretically model recruitment rules from similarity and competitiveness perspectives, making it seamlessly align with bilateral person-job fit scenarios. To explore the mutual effects of two-sided users, we first organize candidates, employers, and their different-typed interactions into a heterogeneous relation graph, and then propose a relation-aware graph convolution network to capture users' mutual effects through their bilateral behaviors. Extensive experiments on several real-world datasets demonstrate the effectiveness of the proposed methods.

Abstract:
While deep neural networks (NNs) significantly advance image compressed sensing (CS) by improving reconstruction quality, the necessity of training current CS NNs from scratch constrains their effectiveness and hampers rapid deployment. Although recent methods utilize pre-trained diffusion models for image reconstruction, they struggle with slow inference and restricted adaptability to CS. To tackle these challenges, this paper proposes Invertible Diffusion Models (IDM), a novel, efficient, end-to-end diffusion-based CS method. IDM repurposes a large-scale diffusion sampling process as a reconstruction model, and fine-tunes it end-to-end to recover original images directly from CS measurements, moving beyond the traditional paradigm of one-step noise estimation learning. To enable such memory-intensive end-to-end fine-tuning, we propose a novel two-level invertible design to transform both 1) the multi-step sampling process and 2) the noise estimation U-Net in each step into invertible networks. As a result, most intermediate features are cleared during training, reducing GPU memory usage by up to 93.8%. In addition, we develop a set of lightweight modules to inject measurements into the noise estimator to further facilitate reconstruction. Experiments demonstrate that IDM outperforms existing state-of-the-art CS networks by up to 2.64 dB in PSNR. Compared to the recent diffusion-based approach DDNM, our IDM achieves up to 10.09 dB PSNR gain and 14.54 times faster inference.

Abstract:
In embodied vision, Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. The primary challenge of IIN arises from the need to recognize the target object across varying viewpoints while ignoring potential distractors. Existing map-based navigation methods typically use Bird’s Eye View (BEV) maps, which lack detailed texture representation of a scene. Consequently, while BEV maps are effective for semantic-level visual navigation, they struggle with instance-level tasks. To this end, we propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). The GaussNav framework enables the agent to memorize both the geometry and semantic information of the scene, as well as retain the textural features of objects. By matching renderings of similar objects with the target, the agent can accurately identify, ground, and navigate to the specified object. Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset.
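
For readers unfamiliar with the SPL metric quoted above, the following is a minimal sketch of its commonly used definition: each episode contributes its binary success multiplied by the ratio of the shortest-path length to the longer of the shortest-path length and the path actually taken, and the result is averaged over episodes. The function and argument names are illustrative only and are not taken from GaussNav; the sketch also assumes every shortest-path length is strictly positive.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 flags, one per episode.
    shortest_lengths: shortest-path length from start to goal (assumed > 0).
    path_lengths: length of the path the agent actually travelled.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        # A successful episode scores 1 only if the agent's path is as short
        # as the shortest path; longer paths are penalised proportionally.
        total += float(success) * shortest / max(taken, shortest)
    return total / len(successes)


if __name__ == "__main__":
    # Three episodes: optimal success, success with a detour, failure.
    print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 15.0]))
```

Because failed episodes contribute zero, SPL is bounded above by the plain success rate.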

Abstract:
A large amount of redundancy is widely present in convolutional neural networks (CNNs). Identifying the redundancy in the network and removing the redundant filters is an effective way to compress the CNN model size with a minimal reduction in performance. However, most of the existing redundancy-based pruning methods only consider the distance information between two filters, which can only model simple correlations between filters. Moreover, our experimental observations and analysis indicate that distance-based pruning methods are not applicable to high-dimensional features in CNN models. To tackle this issue, we propose a new pruning strategy based on high-order spectral clustering. In this approach, we use a hypergraph structure to construct complex correlations among filters, and obtain high-order information among filters by hypergraph structure learning. Finally, based on the high-order information, we can perform better clustering on the filters and remove the redundant filters in each cluster. Experiments on various CNN models and datasets demonstrate that our proposed method outperforms the recent state-of-the-art works. For example, with ResNet50, we achieve a 57.1% FLOPs reduction with no accuracy drop on ImageNet, which is the first to achieve lossless pruning with such a high compression ratio.

Abstract:
Traditional object detection models often lose the detailed outline information of the object. To address this problem, we propose the Fourier Series Object Detection (FSD). It encodes the closed curve of the object's outline into two one-dimensional periodic Fourier series. The Fourier Series Model (FSM) is constructed to regress the Fourier series for each object in the image. Thus, during inference, the detailed outline information of each object can be retrieved. We introduce Rolling Optimization Matching for Fourier loss to ensure that the model's learning process is not affected by the sequence of the starting points of the labeled contour points, speeding up the training process. The FSM demonstrates improved feature extraction and descriptive capabilities for non-rectangular or elongated object regions. The model achieves AP50 = 73.3% on the DOTA 1.5 dataset, surpassing the state-of-the-art (SOTA) method (66.86%) by 6.44 percentage points. On the UCAS dataset, the model achieves AP50 = 97.25%, also surpassing the performance indicators of the SOTA methods. Furthermore, we introduce the object's Fourier power spectrum to describe outline features and the Fourier vector to indicate its direction. This enhances the scene semantic representation of the object detection model and paves a new pathway for the evolution of object detection methodologies.
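
To make the idea of encoding a closed outline as two one-dimensional periodic Fourier series concrete, here is a minimal sketch. It assumes the contour is given as uniformly spaced (x, y) samples and treats x(t) and y(t) as two separate periodic signals, each summarized by a truncated real Fourier series. The helper names (`fourier_series`, `encode_outline`, `decode_outline`) are hypothetical, and the sketch illustrates only the representation, not the FSM network or the Rolling Optimization Matching loss.

```python
import math


def fourier_series(samples, order):
    """Truncated real Fourier series (a0, cos terms, sin terms) of one period."""
    n = len(samples)
    if order >= n / 2:
        raise ValueError("order must be below half the number of samples")
    a0 = sum(samples) / n
    cos_terms, sin_terms = [], []
    for k in range(1, order + 1):
        cos_terms.append(2.0 / n * sum(
            s * math.cos(2 * math.pi * k * j / n) for j, s in enumerate(samples)))
        sin_terms.append(2.0 / n * sum(
            s * math.sin(2 * math.pi * k * j / n) for j, s in enumerate(samples)))
    return a0, cos_terms, sin_terms


def evaluate(series, t):
    """Evaluate a series at curve parameter t in [0, 1)."""
    a0, cos_terms, sin_terms = series
    value = a0
    for k, (a, b) in enumerate(zip(cos_terms, sin_terms), start=1):
        value += a * math.cos(2 * math.pi * k * t) + b * math.sin(2 * math.pi * k * t)
    return value


def encode_outline(points, order):
    """Encode a closed contour as one series for x(t) and one for y(t)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return fourier_series(xs, order), fourier_series(ys, order)


def decode_outline(code, num_points):
    """Resample the outline from its two series."""
    series_x, series_y = code
    return [(evaluate(series_x, j / num_points), evaluate(series_y, j / num_points))
            for j in range(num_points)]


if __name__ == "__main__":
    # An ellipse centred at (3, 1) is captured exactly by first-order terms.
    ellipse = [(3 + 2 * math.cos(2 * math.pi * j / 16),
                1 + math.sin(2 * math.pi * j / 16)) for j in range(16)]
    code = encode_outline(ellipse, order=2)
    print(decode_outline(code, 4))
```

Note that the coefficients depend on which contour point is taken as t = 0, which is the starting-point sensitivity that the abstract's matching scheme is designed to remove.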

Abstract:
Autonomous driving systems require a comprehensive understanding and accurate prediction of the surrounding environment to facilitate informed decision-making in complex scenarios. Recent advances in learning-based systems have highlighted the importance of integrating prediction and planning. However, this integration poses significant alignment challenges, ranging from consistency between prediction patterns to the interaction between future prediction and planning. To address these challenges, we introduce a Hybrid-Prediction integrated Planning (HPP) framework, which operates through three novel modules collaboratively. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-specific motion forecasting. Our proposed MS-OccFormer module achieves spatial-temporal alignment with motion predictions across multiple granularities. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive dynamics among agents based on their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated into the Ego Planner and optimized by prediction guidance. The HPP framework establishes state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and safety in end-to-end configurations. Moreover, HPP’s interactive open-loop and closed-loop planning performance is demonstrated on the Waymo Open Motion Dataset (WOMD) and CARLA benchmark, outperforming existing integrated pipelines by achieving enhanced consistency between prediction and planning.

Abstract:
Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, real-world graphs invariably contain a certain proportion of heterophilous nodes, challenging the homophily assumption of traditional GNNs and hindering their performance. Most existing studies continue to design generic models with shared weights between heterophilous and homophilous nodes. Despite the incorporation of high-order messages or multi-channel architectures, these efforts often fall short. A minority of studies attempt to train different node groups separately but suffer from inappropriate separation metrics and low efficiency. In this paper, we first propose a new metric, termed Neighborhood Confusion (NC), to facilitate a more reliable separation of nodes. We observe that node groups with different levels of NC values exhibit certain differences in intra-group accuracy and visualized embeddings. These pave the way for Neighborhood Confusion-guided Graph Convolutional Network (NCGCN), in which nodes are grouped by their NC values and accept intra-group weight sharing and message passing. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that our framework can effectively separate nodes and yield significant performance improvement compared to the latest methods.

Abstract:
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited capability for modeling visual cues. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is then used to generate masks accordingly. The accuracy of the hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantically demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.

Abstract:
Regular unsupervised domain adaptive person re-identification (ReID) focuses on adapting a model from a source domain to a fixed target domain. However, an adapted ReID model can hardly retain previously-acquired knowledge and generalize to unseen data. In this paper, we propose a Dual-level Joint Adaptation and Anti-forgetting (DJAA) framework, which incrementally adapts a model to new domains without forgetting the source domain and each adapted target domain. We explore the possibility of using prototype and instance-level consistency to mitigate the forgetting during the adaptation. Specifically, we store a small number of representative image samples and corresponding cluster prototypes in a memory buffer, which is updated at each adaptation step. With the buffered images and prototypes, we regularize the image-to-image similarity and image-to-prototype similarity to rehearse old knowledge. After the multi-step adaptation, the model is tested on all seen domains and several unseen domains to validate the generalization ability of our method. Extensive experiments demonstrate that our proposed method significantly improves the anti-forgetting, generalization and backward-compatible ability of an unsupervised person ReID model.

Abstract:
Snapshot Mosaic Hyperspectral Cameras (SMHCs) are popular hyperspectral imaging devices for acquiring both color and motion details of scenes. However, the narrow-band spectral filters in SMHCs may negatively impact their motion perception ability, resulting in blurry SMHC frames. In this paper, we propose a hardware-software collaborative approach to address the blurring issue of SMHCs. Our approach involves integrating SMHCs with neuromorphic event cameras for efficient event-enhanced SMHC frame deblurring. To achieve spectral information recovery guided by event signals, we formulate a spectral-aware Event-based Double Integral (sEDI) model that links SMHC frames and events from a spectral perspective, providing principled model design insights. Then, we develop a Diffusion-guided Noise Awareness (DNA) training framework that utilizes diffusion models to learn noise-aware features and promote model robustness towards camera noise. Furthermore, we design an Event-enhanced Hyperspectral frame Deblurring Network (EvHDNet) based on sEDI, which is trained with DNA and features improved spatial-spectral learning and modality interaction for reliable SMHC frame deblurring. Experiments on both synthetic data and real data show that the proposed DNA + EvHDNet outperforms state-of-the-art methods on both spatial and spectral fidelity. The code and dataset will be made publicly available.

Abstract:
Vision-Language Pre-training (VLP) has shown promising performance in various tasks by learning a generic image-text representation space. However, most existing VLP methods encounter the Noisy Correspondence (NC) problem which refers to wrongly matched image-text pairs harvested from the wild. In this paper, we empirically study the influence of NC on the VLP model and obtain the following two observations. First, the NC will largely degrade the performance in downstream tasks even via fine-tuning, indicating the necessity of handling NC in the pre-training period. Second, the influence of NC varies in different pre-training objectives, suggesting the objective-customized solution for achieving NC robustness. Based on the above observations, we propose a novel NoisE-robust Vision-languagE pRe-training method (NEVER) to endow the VLP model with robustness against NC. In brief, NEVER first divides the training data into clean and noisy subsets in a progressive and adaptive manner. Then NEVER employs the positive learning (PL) and negative learning (NL) on the splits to enjoy model convergence and noise robustness, respectively. To further handle the false negative in PL and NL, NEVER proposes to smoothen and sharpen the training targets with the predictions from a twin momentum model. Extensive experiments on the various V+L tasks verify the effectiveness of the proposed method.

Abstract:
Nonlocal self-similarity (NSS) is an important prior that has been successfully applied in multi-dimensional data processing tasks, e.g., image and video recovery. However, existing NSS-based methods are solely suitable for meshgrid data such as images and videos, but are not suitable for emerging off-meshgrid data, e.g., point cloud and weather data. In this work, we revisit the NSS from the continuous representation perspective and propose a novel Continuous Representation-based NonLocal method (termed as CRNL), which has two innovative features as compared with classical nonlocal methods. First, based on the continuous representation, our CRNL unifies the measure of self-similarity for on-meshgrid and off-meshgrid data and thus is naturally suitable for both of them. Second, the nonlocal continuous groups can be more compactly and efficiently represented by the coupled low-rank function factorization, which simultaneously exploits the similarity within each group and across different groups, while classical nonlocal methods neglect the similarity across groups. This elaborately designed coupled mechanism allows our method to enjoy favorable performance over conventional NSS methods in terms of both effectiveness and efficiency. Extensive multi-dimensional data processing experiments on-meshgrid (e.g., image inpainting and image denoising) and off-meshgrid (e.g., weather data prediction and point cloud recovery) validate the versatility, effectiveness, and efficiency of our CRNL as compared with state-of-the-art methods.

Abstract:
Multimodal learning is expected to boost model performance by integrating information from different modalities. However, its potential is not fully exploited because the widely-used joint training strategy, which has a uniform objective for all modalities, leads to imbalanced and under-optimized uni-modal representations. Specifically, we point out that there often exists a modality with more discriminative information, e.g., vision for playing football and sound for blowing wind. Such a modality can dominate the joint training process, resulting in other modalities being significantly under-optimized. To alleviate this problem, we first analyze the under-optimized phenomenon from both the feed-forward and the back-propagation stages during optimization. Then, On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies are proposed to modulate the optimization of each modality, by monitoring the discriminative discrepancy between modalities during training. Concretely, OPM weakens the influence of the dominant modality by dropping its feature with a dynamic probability in the feed-forward stage, while OGM mitigates its gradient in the back-propagation stage. In experiments, our methods demonstrate considerable improvement across a variety of multimodal tasks. These simple yet effective strategies enhance performance not only in vanilla and task-oriented multimodal models, but also in more complex multimodal tasks, showcasing their effectiveness and flexibility.
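
The gradient-modulation idea can be illustrated with a small sketch. It assumes two modalities, each with a scalar "discriminative score" for the current batch (for example, the mean softmax probability assigned to the correct class), and it shrinks the gradient scale of whichever modality is currently ahead. The tanh-shaped coefficient, the `alpha` parameter, and the function name are hypothetical choices made for illustration; the abstract does not specify the exact formula used by OGM.

```python
import math


def modulation_coefficients(score_a, score_b, alpha=1.0):
    """Return gradient scale factors (k_a, k_b) for two modalities.

    score_a, score_b: positive discriminative scores for the current batch.
    alpha: hypothetical strength of the modulation.
    The dominant modality gets a factor in (0, 1); the other keeps 1.0.
    """
    if score_a <= 0 or score_b <= 0:
        raise ValueError("scores must be positive")
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    def damp(ratio):
        # ratio > 1 means this modality dominates; damp smoothly from 1.0.
        return 1.0 - math.tanh(alpha * (ratio - 1.0)) if ratio > 1.0 else 1.0

    return damp(score_a / score_b), damp(score_b / score_a)


if __name__ == "__main__":
    k_audio, k_visual = modulation_coefficients(0.8, 0.4)
    # In a training loop, each encoder's gradients would be multiplied by its
    # coefficient before the optimizer step.
    print(k_audio, k_visual)
```

With equal scores both factors are 1.0, so the sketch leaves balanced training untouched.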

Abstract:
Achieving generalization for deep learning models has usually suffered from the bottleneck of annotated sample scarcity. As a common way of tackling this issue, few-shot learning focuses on “episodes”, i.e., sampled tasks that help the model acquire knowledge that generalizes to unseen categories: the better the episodes, the higher a model's generalizability. Despite extensive research, the characteristics of episodes and their potential effects are relatively less explored. A recent paper discussed that different episodes exhibit different prediction difficulties, and coined a new metric “hardness” to quantify episodes, which however is too wide-range for an arbitrary dataset and thus remains impractical for realistic applications. In this paper, we therefore conduct, for the first time, an algebraic analysis of the critical factors influencing episode hardness, supported by experimental demonstrations, which reveals that episode hardness largely depends on the classes within an episode. Importantly, we propose an efficient pre-sampling hardness assessment technique named Inverse-Fisher Discriminant Ratio (IFDR). This enables sampling hard episodes at the class level via a class-level (CL) sampling scheme that drastically decreases quantification cost. Delving deeper, we also develop a variant called class-pair-level (CPL) sampling, which further reduces the sampling cost while guaranteeing the sampled distribution. Finally, comprehensive experiments conducted on benchmark datasets verify the efficacy of our proposed method.
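
As a rough illustration of what an inverse Fisher-style ratio measures, the sketch below scores a pair of classes by within-class spread divided by the squared distance between class means, so that overlapping classes receive a higher (harder) score. This is only one plausible reading of the name: the abstract does not give the exact IFDR definition, and the function names and the two-class, Euclidean formulation are assumptions made for the example.

```python
def class_mean(features):
    """Per-dimension mean of a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def class_spread(features, mean):
    """Mean squared Euclidean distance of the features to their class mean."""
    return sum(
        sum((f[d] - mean[d]) ** 2 for d in range(len(mean))) for f in features
    ) / len(features)


def inverse_fisher_ratio(class_a, class_b):
    """Within-class spread over between-class separation (higher = harder).

    Raises ZeroDivisionError if the two class means coincide.
    """
    mean_a, mean_b = class_mean(class_a), class_mean(class_b)
    between = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    within = class_spread(class_a, mean_a) + class_spread(class_b, mean_b)
    return within / between


if __name__ == "__main__":
    a = [[0.0, 0.0], [2.0, 0.0]]
    far = [[4.0, 0.0], [6.0, 0.0]]
    near = [[2.0, 0.0], [4.0, 0.0]]
    # The nearer class pair scores higher, i.e. it is harder to separate.
    print(inverse_fisher_ratio(a, far), inverse_fisher_ratio(a, near))
```

A score of this kind can be computed once per class pair before any episode is sampled, which is the sense in which a pre-sampling assessment avoids evaluating every candidate episode.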

Abstract:
Most of the existing panoramic video navigation approaches are saliency-driven, whereby off-the-shelf saliency detection tools are directly employed to aid the navigation approaches in localizing video content that should be incorporated into the navigation path. In view of the dilemma faced by our research community, we rethink whether “saliency clues” are really appropriate for the panoramic video navigation task. According to our in-depth investigation, we argue that using “saliency clues” cannot generate a satisfying navigation path: such a path fails to represent the given panoramic video well, and its views also have low aesthetic quality. In this paper, we present a brand-new navigation paradigm. Although our model is still trained on eye-fixations, our methodology can additionally enable the trained model to perceive the “meaningful” degree of the given panoramic video content. Outwardly, the proposed new approach is saliency-free, but inwardly, it is developed from saliency while being biased toward a “meaningful-driven” design; thus, it can generate a navigation path with more appropriate content coverage. Besides, this paper is the first attempt to devise an unsupervised learning scheme to ensure all localized meaningful views in the navigation path have high aesthetics. Thus, the navigation path generated by our approach can also bring users an enjoyable watching experience. As this is a new topic in its infancy, we have devised a series of quantitative evaluation schemes, including objective verifications and subjective user studies. All these innovative attempts would have great potential to inspire and promote this research field in the near future.

Abstract:
Multi-source image fusion combines the information coming from multiple images into one data, thus improving imaging quality. This topic has aroused great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention based transformer methods can capture spatial and channel similarities. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation embedding it into the FCSA framework as a non-local prior within an optimization problem. Some different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.

Abstract:
Data augmentation plays a critical role in self-supervised learning, including anomaly detection. While hand-crafted transformations such as image rotations can achieve impressive performance on image data, effective transformations of non-image data are lacking. In this work, we study learning such transformations for end-to-end anomaly detection on arbitrary data. We find that a contrastive loss–which encourages learning diverse data transformations while preserving the relevant semantic content of the data–is more suitable than previously proposed losses for transformation learning, a fact that we prove theoretically and empirically. We demonstrate that anomaly detection using neural transformation learning can achieve state-of-the-art results for time series data, tabular data, text data and graph data. Furthermore, our approach can make image anomaly detection more interpretable by learning transformations at different levels of abstraction.

Abstract:
The number of categories of instances in the real world is normally huge, and each instance may contain multiple labels. To distinguish these massive labels utilizing machine learning, eXtreme Label Classification (XLC) has been established. However, as the number of categories increases, the number of parameters and nonlinear operations in the classifier also rises. This results in a Classifier Computational Overload Problem (CCOP). To address this, we propose a Multi-Head Encoding (MHE) mechanism, which replaces the vanilla classifier with a multi-head classifier. During the training process, MHE decomposes extreme labels into the product of multiple short local labels, with each head trained on these local labels. During testing, the predicted labels can be directly calculated from the local predictions of each head. This reduces the computational load geometrically. Then, according to the characteristics of different XLC tasks, e.g., single-label, multi-label, and model pretraining tasks, three MHE-based implementations, i.e., Multi-Head Product, Multi-Head Cascade, and Multi-Head Sampling, are proposed to more effectively cope with CCOP. Moreover, we theoretically demonstrate that MHE can achieve performance approximately equivalent to that of the vanilla classifier by generalizing the low-rank approximation problem from Frobenius-norm to Cross-Entropy. Experimental results show that the proposed methods achieve state-of-the-art performance while significantly streamlining the training and inference processes of XLC tasks.
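
The decomposition of one extreme label into a product of short local labels can be sketched as a mixed-radix encoding: a global label index is written as one "digit" per head, each head only has to predict its own digit, and the global label is recovered by recombining the digits. The mixed-radix scheme and the names below are assumptions used for illustration; the abstract does not state which decomposition MHE uses.

```python
import math


def encode_label(label, head_sizes):
    """Split a global label index into one local label per head."""
    total = math.prod(head_sizes)
    if not 0 <= label < total:
        raise ValueError(f"label must be in [0, {total})")
    local_labels = []
    for size in reversed(head_sizes):
        label, digit = divmod(label, size)
        local_labels.append(digit)
    return local_labels[::-1]


def decode_label(local_labels, head_sizes):
    """Recombine per-head local labels into the global label index."""
    label = 0
    for digit, size in zip(local_labels, head_sizes):
        if not 0 <= digit < size:
            raise ValueError("local label out of range for its head")
        label = label * size + digit
    return label


def predict_label(head_logits, head_sizes):
    """Take the argmax of each head and decode the global label."""
    local = [max(range(len(logits)), key=logits.__getitem__) for logits in head_logits]
    return decode_label(local, head_sizes)


if __name__ == "__main__":
    head_sizes = [10, 10, 10]          # 1000 global labels, 30 output units
    print(encode_label(123, head_sizes))
    print(sum(head_sizes), math.prod(head_sizes))
```

In this sketch three heads with 10 outputs each cover 1000 labels with 30 output units in total, which illustrates how the classifier size can grow with the sum of the head sizes while the label space grows with their product.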

Abstract:
The field of natural language processing (NLP) has witnessed significant progress in recent years, with a notable focus on improving large language models’ (LLM) performance through innovative prompting techniques. Among these, prompt engineering coupled with structures has emerged as a promising paradigm, with designs such as Chain-of-Thought, Tree of Thoughts, or Graph of Thoughts, in which the overall LLM reasoning is guided by a structure such as a graph. As illustrated with numerous examples, this paradigm significantly enhances the LLM’s capability to solve numerous tasks, ranging from logical or mathematical reasoning to planning or creative writing. To facilitate the understanding of this growing field and pave the way for future developments, we devise a general blueprint for effective and efficient LLM reasoning schemes. For this, we conduct an in-depth analysis of the prompt execution pipeline, clarifying and clearly defining different concepts. We then build the first taxonomy of structure-enhanced LLM reasoning schemes. We focus on identifying fundamental classes of harnessed structures, and we analyze the representations of these structures, algorithms executed with these structures, and many others. We refer to these structures as reasoning topologies, because their representation becomes to a degree spatial, as they are contained within the LLM context. Our study compares existing prompting schemes using the proposed taxonomy, discussing how certain design choices lead to different patterns in performance and cost. We also outline theoretical underpinnings, relationships between prompting and other parts of the LLM ecosystem such as knowledge bases, and the associated research challenges. Our work will help to advance future prompt engineering techniques.

Abstract:
This paper focuses on the problem of preventing information leakage in neural networks, i.e., assuming that attackers have obtained intermediate-layer features of a neural network, and preventing attackers from inverting these features to the input with private information. We propose a generic method to slightly revise each arbitrary traditional neural network into a multiary-valued rotation-equivariant neural network (RENN) for preventing information leakage. Specifically, we convert real-valued features in the network into multi-ary features, and each element in the feature vector is a multi-ary number. We hide the input information into a certain phase of the multi-ary feature, and rotate the multi-ary feature for attribute obfuscation in the encryption process. The rotation axis and angle can be considered as the private key. In this way, even when attackers have obtained network parameters and intermediate-layer features, they still cannot extract input information without knowing the rotation information. More crucially, the encryption operation does not damage the spatial correlations between features, so that the encrypted features can be easily processed by convolution operations in the neural network without difficulties. In order to implement successful encryption and decryption, the RENN is designed to satisfy the rotation equivariance property. To this end, we propose a set of rules to revise classic operations in the neural network to ensure the rotation equivariance property. Besides, we prove that the $d$-ary RENN is downward compatible with the $d'$-ary RENN when $d' < d$

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Radiology, West China Hospital, Sichuan University, Chengdu, China; College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Department of Mathematics, Sichuan University, Chengdu, China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-Sen University, Shenzhen, China

Abstract:
Medical phrase grounding is crucial for identifying relevant regions in medical images based on phrase queries, facilitating accurate image analysis and diagnosis. However, current methods rely on manual extraction of key phrases from medical reports, reducing efficiency and increasing the workload for clinicians. Additionally, the lack of model confidence estimation limits clinical trust and usability. In this paper, we introduce a novel task, Medical Report Grounding (MRG), which aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. To address this challenge, we propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases by embedding a unique token, $<\mathtt{BOX}>$, into the vocabulary to enhance detection capabilities. A vision encoder-decoder processes the embedded token and input image to generate grounding boxes. Critically, uMedGround incorporates an uncertainty-aware prediction model, significantly improving the robustness and reliability of grounding predictions. Experimental results demonstrate that uMedGround outperforms state-of-the-art medical phrase grounding methods and fine-tuned large visual-language models, validating its effectiveness and reliability. This study represents a pioneering exploration of the MRG task, marking the first-ever endeavor in this domain. Additionally, we demonstrate the applicability of uMedGround in medical visual question answering and class-based localization tasks, where it highlights visual evidence aligned with key diagnostic phrases, supporting clinicians in interpreting various types of textual inputs, including free-text reports, visual question answering queries, and class labels.

Abstract:
Traditional 3D object detectors, whether fully-, semi-, or weakly-supervised, rely heavily on extensive human annotations. In contrast, this paper introduces an unsupervised 3D object detector that automatically discerns object patterns without such annotations. To achieve this, we propose a Commonsense Prototype-based Detector (CPD) for unsupervised 3D object detection. CPD first constructs Commonsense Prototypes (CProto) to represent the geometric center and size of objects. It then generates high-quality pseudo-labels and guides detector convergence using size and geometry priors from CProto. Building on CPD, we further introduce CPD++, an enhanced version that improves performance by leveraging motion cues. CPD++ learns localization from stationary objects and recognition from moving objects, facilitating the mutual transfer of localization and recognition knowledge between these two object types. Both CPD and CPD++ outperform existing state-of-the-art unsupervised 3D detectors. Furthermore, when trained on Waymo Open Dataset (WOD) and tested on KITTI, CPD++ achieves 89.25% 3D Average Precision (AP) on the moderate car class at a 0.5 IoU threshold, reaching 95.3% of the performance attained by fully supervised counterparts. These results underscore the significant advancements brought by our method.

Abstract:
Precise segmentation of thermal infrared images is crucial in domains like surveillance, medical diagnostics, intelligent transportation, accurate guidance and remote sensing. However, current thermal segmentation methods often oversimplify by treating thermal images as grayscale, neglecting vital physical factors such as thermal imaging effects and material information, thereby constraining segmentation precision. To address these limitations, we propose TherNet, a novel thermal infrared segmentation framework integrating thermal imaging effects and material physical information. The study elucidates the impacts of object radiation, inter-object thermal exchange, atmospheric scattering, and camera thermal inertia on thermal infrared imaging, developing four modules to model or rectify these physical processes. To validate the proposed framework, two large-scale infrared datasets were created: TI-Cityscapes for multi-class semantic segmentation in traffic scenes (4,200 frames, 18 classes), and TBRSD for single-object blind road segmentation (5,180 frames from a pedestrian perspective). The proposed methods achieved state-of-the-art performance across three infrared semantic segmentation datasets and the blind road segmentation dataset, underscoring the pivotal role of leveraging physical properties. TherNet provides innovative perspectives and robust benchmarks for future developments in the domain.

Abstract:
As XR technology continues to advance rapidly, 3D generation and editing are increasingly crucial. Among these, stylization plays a key role in enhancing the appearance of 3D models. By utilizing stylization, users can achieve consistent artistic effects in 3D editing using a single reference style image, making it a user-friendly editing method. However, recent NeRF-based 3D stylization methods encounter efficiency issues that impact the user experience, and their implicit nature limits their ability to accurately transfer geometric pattern styles. Additionally, the ability for artists to apply flexible control over stylized scenes is considered highly desirable to foster an environment conducive to creative exploration. To address the above issues, we introduce StylizedGS, an efficient 3D neural style transfer framework with adaptable control over perceptual factors based on 3D Gaussian Splatting representation. We propose a filter-based refinement to eliminate floaters that affect the stylization effects in the scene reconstruction process. The nearest neighbor-based style loss is introduced to achieve stylization by fine-tuning the geometry and color parameters of 3DGS, while a depth preservation loss with other regularizations is proposed to prevent the tampering of geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during the stylization to possess customization capabilities. Our method achieves high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method concerning both stylization quality and inference speed.

Abstract:
This paper aims to learn multiple comprehensive text prompts that describe visual concepts from coarse to fine, thereby endowing pre-trained VLMs with better transfer ability to various downstream tasks. We focus on exploring this idea on transformer-based VLMs since this kind of architecture achieves more compelling performance than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of pre-trained VLMs cannot naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt) to excavate representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into the frequency domain via the Discrete Cosine Transform (DCT). Taking advantage of the energy concentration and information orthogonality of DCT, we can obtain compact, informative and disentangled local visual information by leveraging specific frequency components of the transformed frequency features. To better fit with transformer architectures, FCPrompt further adopts and optimizes different text prompts to respectively align with the global and frequency-based local visual information via a dual-branch framework. Finally, the learned text prompts can thus describe the entire visual concepts from coarse to fine comprehensively. Extensive experiments indicate that FCPrompt achieves state-of-the-art performance on various benchmarks.
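The energy-concentration property of the DCT mentioned in the abstract above can be checked on a toy signal. This is a minimal sketch under stated assumptions: the 16-element ramp stands in for one row of encoder features, the helper names `dct2` and `energy_fraction` are hypothetical, and nothing here reproduces how FCPrompt selects its frequency components.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1D sequence (direct O(N^2) evaluation)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Orthonormal scaling: sqrt(1/N) for the DC term, sqrt(2/N) otherwise.
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def energy_fraction(coeffs, keep):
    """Share of the total energy held by the first `keep` coefficients."""
    energy = [c * c for c in coeffs]
    return sum(energy[:keep]) / sum(energy)


feature = [float(i) for i in range(16)]   # toy stand-in for one feature row
coeffs = dct2(feature)
low_freq_share = energy_fraction(coeffs, keep=4)
```

Because the transform is orthonormal, the total energy of `coeffs` equals that of `feature`, and for a smooth input such as this ramp most of that energy sits in the first few coefficients. That is what makes a small set of frequency components a compact summary.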

Abstract:
The Laplace-Beltrami operator (LBO) has established itself in the field of non-rigid shape analysis due to its many useful properties such as being invariant under isometric transformation, having a countable eigensystem forming an orthonormal basis, and fully characterizing geodesic distances of the manifold. However, this invariance only applies under isometric deformations, which leads to a performance breakdown in many real-world applications. In recent years, emphasis has been placed upon extracting optimal features using deep learning methods; however, spectral signatures play a crucial role and still add value. In this paper, we take a step back, revisiting the LBO and proposing a supervised way to learn several operators on a manifold. Depending on the task, by applying these functions, we can train the LBO eigenbasis to be more task-specific. The optimization of the LBO leads to enormous improvements to established descriptors such as the heat kernel signature in various tasks such as retrieval, classification, segmentation, and correspondence, proving the adaptation of the LBO eigenbasis to both global and highly local learning settings.

Abstract:
The lottery ticket hypothesis (LTH) has increased attention to pruning neural networks at initialization. We study this problem in the linear setting. We show that finding a sparse mask at initialization is equivalent to the sketching problem introduced for efficient matrix multiplication. This gives us tools to analyze the LTH problem and gain insights into it. Specifically, using the mask found at initialization, we bound the approximation error of the pruned linear model at the end of training. We theoretically justify previous empirical evidence that the search for sparse networks may be data independent. By using the sketching perspective, we suggest a generic improvement to existing algorithms for pruning at initialization, which we show to be beneficial in the data-independent case.

Abstract:
Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.

Abstract:
The training of Generative Adversarial Networks (GANs) for high-fidelity images has predominantly relied on large-scale datasets. Emerging research, particularly on GANs ‘lottery tickets’, suggests that dense GANs models have sparse sub-networks capable of superior performance with limited data. However, the conventional process to uncover these ‘lottery tickets’ involves a resource-intensive train-prune-retrain cycle. Addressing this, our paper introduces Re-GAN, a novel, data-efficient approach for GANs training that dynamically reconfigures the GANs architecture during training. This method focuses on iterative pruning of non-important connections and regrowing them, thereby preventing premature loss of important features and maintaining the model’s representational strength. Re-GAN provides a more stable and efficient solution for GANs models with limited data, offering an alternative to existing progressive growing methods and GANs tickets. While Re-GAN has already demonstrated its potential in image generation across diverse datasets, domains, and resolutions, in this paper, we significantly expand our study. We incorporate new applications, notably Image-to-Image translation, include additional datasets, provide in-depth analyses, and explore compatibility with data augmentation techniques. This expansion not only broadens the scope of Re-GAN but also establishes it as a generic training methodology, demonstrating its effectiveness and adaptability in different GANs scenarios.

Abstract:
Self-attention and transformers have been widely used in deep learning. Recent efforts have been devoted to incorporating transformer blocks into different neural architectures, including those with convolutions, leading to various visual transformers for computer vision tasks. In this paper, we propose a novel and compact transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a learnable scheme. LTM-Transformer is compatible with many popular and compact transformer networks, and it reduces the FLOPs and the inference time of the visual transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in popular visual transformers, including MobileViT, EfficientViT, ViT, and Swin, with LTM-Transformer blocks, leading to LTM-Transformer networks with different backbones. The LTM-Transformer is motivated by reduction of Information Bottleneck, and a novel and separable variational upper bound for the IB loss is derived. The architecture of the mask module in our LTM blocks which generates the token merging mask is designed to reduce the derived upper bound for the IB loss. Extensive results on computer vision tasks evidence that LTM-Transformer renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers.

Abstract:
Hypergraphs, with their ability to model complex, beyond pair-wise correlations, present a significant advancement over traditional graphs for capturing intricate relational data across diverse domains. However, the integration of hypergraphs into self-supervised learning (SSL) frameworks has been hindered by the intricate nature of high-order structural variations. This paper introduces the Self-Supervised Hypergraph Training Framework via Structure-Aware Learning (SS-HT), designed to enhance the perception and measurement of these variations within hypergraphs. The SS-HT framework employs a “Masking and Re-Masking” strategy to bolster feature reconstruction in Hypergraph Neural Networks (HGNNs), addressing the limitations of traditional SSL methods. It also introduces a metric strategy for local high-order correlation changes, streamlining the computational efficiency of structural distance calculations. Extensive experiments on 11 datasets demonstrate SS-HT’s superior performance over existing SSL methods for both low-order and high-order data. Notably, the framework significantly reduces data labeling dependency, achieving a 32% improvement over HGNN in the downstream task fine-tuning phase under the 1% labeled data setting in the Cora-CC dataset. Ablation studies further validate SS-HT’s scalability and its capacity to augment the performance of various HGNN methods, underscoring its robustness and applicability in real-world scenarios.

Abstract:
We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging due to the absence of expert demonstrations for training and minimal environment structural prior to guide navigation. To confront these challenges, we propose a Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria for decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM’s constraints, generates a value map on the fly and refines it using superpixel clustering to improve navigation stability. CA-Nav achieves the state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12% and 13% in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.

Abstract:
This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. The advances of the POEM model consist of two main aspects. First, concerning the modeling of the problem, we propose embedding a static basis point within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo space, 2) carry features from the multi-view images, and 3) encompass the hand. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and employ randomization in the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands.

Abstract:
Deep learning (DL) requires large amounts of labeled data, which is extremely time-consuming and labor-intensive to obtain for medical image segmentation tasks. Meta-learning focuses on developing learning strategies that enable quick adaptation to new tasks with limited labeled data. However, rich-class medical image segmentation datasets for constructing meta-learning multi-tasks are currently unavailable. In addition, data collected from various healthcare sites and devices may present significant distribution differences, potentially degrading the model’s performance. In this paper, we propose a task augmentation-based meta-learning method for retinal image segmentation (TAMS) to alleviate the labor-intensive annotation demand. A retinal Lesion Simulation Algorithm (LSA) is proposed to automatically generate multi-class retinal disease datasets with pixel-level segmentation labels, such that meta-learning tasks can be augmented without collecting data from various sources. In addition, a novel simulation function library is designed to control the generation process and ensure interpretability. Moreover, a generative simulation network (GSNet) with an improved adversarial training strategy is introduced to maintain high-quality representations of complex retinal diseases. TAMS is evaluated on three different OCT and CFP image datasets, and comprehensive experiments demonstrate that TAMS achieves segmentation performance superior to that of state-of-the-art models.

Abstract:
We introduce WildVideo, an open-world benchmark dataset designed to address how to assess hallucination of Large Multi-modal Models (LMMs) for understanding video-language interaction in the wild. Our WildVideo comprehensively tests the perceptual, cognitive, and contextual comprehension hallucination of LMMs through both single-turn and multi-turn open-ended question-answering (QA) tasks on videos captured from two human perspectives (i.e. first-person view and third-person view). We define 9 distinct tasks that challenge LMMs across multi-level perceptual tasks (e.g., static and dynamic perception), multi-aspect cognitive tasks (e.g., commonsense, world knowledge), and multi-faceted contextual comprehension tasks (e.g., contextual ellipsis, cross-turn retrieval). The benchmark consists of 1,318 meticulously curated videos, supplemented with 13,704 single-turn QA pairs and 1,585 multi-turn dialogues (up to 5 turns). We evaluated 14 commonly-used LMMs on WildVideo, revealing significant hallucination issues of current LMMs, highlighting substantial gaps in their current capabilities.

Abstract:
Thanks to extracting the intra-sample representation, the convolutional neural network (CNN) has achieved excellent performance in vision tasks. However, its numerous convolutional layers incur a high training cost. Recently, graph neural networks (GNNs), a bilinear model, have succeeded in exploring the underlying topological relationship among the graph data with a few graph neural layers. Unfortunately, due to the lack of graph structure and high-cost inference in large-scale scenarios, they cannot be directly applied to non-graph data. Inspired by these complementary strengths and weaknesses, we discuss a natural question: how to bridge these two heterogeneous networks? In this paper, we propose a novel CNN2GNN framework to unify CNN and GNN via distillation. First, to break the limitations of GNN, we design a differentiable sparse graph learning module as the head of the networks. It can dynamically learn the graph for inductive learning. Then, a response-based distillation is introduced to transfer the knowledge and bridge these two heterogeneous networks. Notably, due to extracting the intra-sample representation of a single instance and the topological relationship among the datasets simultaneously, the performance of the distilled “boosted” two-layer GNN on Mini-ImageNet is much higher than that of CNNs containing dozens of layers, such as ResNet152.

Abstract:
Tracking and reconstructing deformable objects with little texture is challenging due to the lack of features. Here we introduce “invisible markers” for accurate and robust correspondence matching and tracking. Our markers are visible only under ultraviolet (UV) light. We build a novel imaging system for capturing videos of deformed objects under their original untouched appearance (which may have little texture) and, simultaneously, with our markers. We develop an algorithm that first establishes accurate correspondences using video frames with markers, and then transfers them to the untouched views as ground-truth labels. In this way, we are able to generate high-quality labeled data for training learning-based algorithms. We contribute a large real-world dataset, DOT, for tracking deformable objects with little or no texture. Our dataset has about one million video frames of various types of deformable objects. We provide ground truth tracked correspondences in both 2D and 3D. We benchmark state-of-the-art methods on optical flow and deformable object reconstruction using our dataset, which poses great challenges. By training on DOT, their performance significantly improves, not only on our dataset, but also on other unseen data.

Abstract:
Differentiable 3D-Gaussian splatting (GS) is emerging as a prominent technique in computer vision and graphics for reconstructing 3D scenes. GS represents a scene as a set of 3D Gaussians with varying opacities and employs a computationally efficient splatting operation along with analytical derivatives to compute the 3D Gaussian parameters given scene images captured from various viewpoints. Unfortunately, capturing surround view ($360^\circ$ viewpoint) images is impossible or impractical in many real-world imaging scenarios, including underwater imaging, rooms inside a building, and autonomous navigation. In these restricted baseline imaging scenarios, the GS algorithm suffers from a well-known ‘missing cone’ problem, which results in poor reconstruction along the depth axis. In this paper, we demonstrate that using transient data (from sonars) allows us to address the missing cone problem by sampling high-frequency data along the depth axis. We extend the Gaussian splatting algorithms for two commonly used sonars and propose fusion algorithms that simultaneously utilize RGB camera data and sonar data. Through simulations, emulations, and hardware experiments across various imaging scenarios, we show that the proposed fusion algorithms lead to significantly better novel view synthesis (5 dB improvement in PSNR) and 3D geometry reconstruction (60% lower Chamfer distance).

Abstract:
Single-photon cameras (SPCs) have emerged as a promising new technology for high-resolution 3D imaging. A single-photon 3D camera determines the round-trip time of a laser pulse by precisely capturing the arrival of individual photons at each camera pixel. Constructing photon-timestamp histograms is a fundamental operation for a single-photon 3D camera. However, in-pixel histogram processing is computationally expensive and requires a large amount of memory per pixel. Digitizing and transferring photon timestamps to an off-sensor histogramming module is bandwidth and power hungry. Can we estimate distances without explicitly storing photon counts? Yes: here we present an online approach for distance estimation suitable for resource-constrained settings with limited bandwidth, memory and compute. The two key ingredients of our approach are (a) processing photon streams using race logic, which maintains photon data in the time-delay domain, and (b) constructing count-free equi-depth histograms as opposed to conventional equi-width histograms. Equi-depth histograms are a more succinct representation for “peaky” distributions, such as those obtained by an SPC pixel from a laser pulse reflected by a surface. Our approach uses a binner element that converges on the median (or, more generally, on another $k$-quantile) of a distribution. We cascade multiple binners to form an equi-depth histogrammer that produces multi-bin histograms. Our evaluation shows that this method can provide at least an order of magnitude reduction in bandwidth and power consumption while maintaining distance reconstruction accuracy similar to that of conventional histogram-based processing methods.
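The idea of a binner that converges on a quantile without storing counts can be illustrated in software with a simple streaming tracker. This is a minimal sketch under stated assumptions: the fixed-step update rule, the function name `track_quantile`, and the toy timestamps are hypothetical, and the race-logic hardware described in the abstract is not modeled.

```python
def track_quantile(stream, q=0.5, start=0.0, step=1.0):
    """Count-free quantile tracker: nudge one stored value per sample.

    The estimate moves up by q*step when a sample lies above it and down
    by (1-q)*step when a sample lies below it, so the expected drift is
    zero where a fraction q of the samples falls below the estimate.
    """
    estimate = start
    for x in stream:
        if x > estimate:
            estimate += q * step
        elif x < estimate:
            estimate -= (1.0 - q) * step
    return estimate


# Toy "peaky" timestamp stream: a pulse near 40 plus sparse background hits.
timestamps = [10, 40, 41, 39, 40, 90, 40, 41, 39, 40] * 50
median_estimate = track_quantile(timestamps, q=0.5)
```

Only a single running value is kept, which is the point of the count-free formulation. Running the same tracker with `q` set to 1/4, 1/2 and 3/4 would give approximate bin edges for a four-bin equi-depth histogram, although the step size trades convergence speed against jitter around the target.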

Abstract:
A large number of works aim to alleviate the impact of noise due to an underlying conventional assumption of the negative role of noise. However, some existing works show that the assumption does not always hold. In this paper, we investigate how to benefit the classical models by random noise under the framework of Positive-incentive Noise (Pi-Noise) (Li, 2024). Since the ideal objective of Pi-Noise is intractable, we propose to optimize its variational bound instead, namely variational Pi-Noise (VPN). With the variational inference, a VPN generator implemented by neural networks is designed for enhancing base models and simplifying the inference of base models, without changing the architecture of base models. Benefiting from the independent design of base models and VPN generators, the VPN generator can work with most existing models. From the extensive experiments on different base models (including linear models, ResNet, ViT, etc.) it is shown that the proposed VPN generator can improve the base models. It is appealing that the trained VPN generator prefers to blur the irrelevant ingredients in complicated images, which meets our expectations.

Abstract:
Prompt tuning is a valuable technique for adapting visual language models (VLMs) to different downstream tasks, such as domain generalization and learning from a few examples. Previous methods have utilized Context Optimization approaches to deduce domain-shared or cross-modality prompt tokens, which enhance generalization and discriminative ability in textual or visual contexts. However, these prompt tokens, inferred from training data, cannot adapt perfectly to the distribution of the test dataset. This work introduces a novel approach called Bi-modality Individual-aware Prompt Tuning (BIP) by explicitly incorporating the individual's essential prior knowledge into the learnable prompt to enhance their discriminability and generalization. The critical insight of BIP involves applying the Textual Knowledge Embedding (TKE) and Visual Knowledge Embedding (VKE) models to project the class-aware textual essential knowledge and the instance-aware essential knowledge into the class-aware prompt and instance-aware prompt, referred to as Textual-level Class-aware Prompt tuning (TCP) and Visual-level Instance-aware Prompt tuning (VIP). On the one hand, TCP integrates the generated class-aware prompts into the Text Encoder to produce a dynamic class-aware classifier to improve generalization on unseen domains. On the other hand, VIP uses the instance-aware prompt to generate the dynamic visual embedding of each instance, thereby enhancing the discriminative capability of visual embedding. Comprehensive evaluations demonstrate that BIP can be used as a plug-and-play module easily integrated with existing methods and achieves superior performance on 15 benchmarks across four tasks.

Abstract:
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we leverage the powerful self-supervised image encoder (i.e., DINOv2) to extract the discriminative identity feature of the target object. Besides, we complement the identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Starting from the task of object insertion, we further extend the framework of AnyDoor to a general solution with region-to-region image reference. With the different definitions of the source region and target region, the tasks of object insertion, object removal, and image variation could be integrated into one model without introducing extra parameters. In addition, we investigate incorporating other conditions like the mask, pose skeleton, and depth map as additional guidance to achieve more controllable generation.

Abstract:
Hand-based multimodal biometrics have attracted significant attention due to their high security and performance. However, existing methods fail to adequately decouple various hand biometric traits, limiting the extraction of unique features. Moreover, effective feature extraction for multiple hand traits remains a challenge. To address these issues, we propose a novel method for the precise decoupling of hand multimodal features called ‘Normalized-Full-Palmar-Hand’ and construct an authentication system based on this method. First, we propose HSANet, which accurately segments various hand regions with diverse backgrounds based on low-level details and high-level semantic information. Next, we establish two hand multimodal biometric databases with HSANet: SCUT Normalized-Full-Palmar-Hand Database Version 1 (SCUT_NFPH_v1) and Version 2 (SCUT_NFPH_v2). These databases include full hand images, semantic masks, and images of various hand biometric traits obtained from the same individual at the same scale, totaling 157,500 images. Third, we propose the Full Palmar Hand Authentication Network framework (FPHandNet) to extract unique features of multiple hand biometric traits. Finally, extensive experimental results, performed via the publicly available CASIA, IITD, COEP databases, and our proposed databases, validate the effectiveness of our methods.

Abstract:
Multi-interest learning for sequential recommendation aims to predict the next item according to the user's multi-faceted interests, given the user's historical interactions. Existing methods mainly consist of a multi-interest extractor that embeds the user interactions into the user multi-interest embeddings, and a multi-interest aggregator that aggregates the learned multi-interest embeddings into the final user embedding, which is used for predicting the user's rating of an item. Despite their effectiveness, existing methods have two key limitations: 1) they directly feed the user interactions into the multi-interest extractor and aggregator, while ignoring their different learning objectives, and 2) they merely consider the centrality of the user interactions to capture the user interests, while overlooking their dispersion. To tackle these limitations, we propose a prompt-based multi-interest learning method (PoMRec), where specific prompts are inserted into the input user interactions to make them adaptive to the multi-interest extractor and aggregator. Moreover, we utilize both the mean and variance embeddings of user interactions to embed the user's multiple interests for comprehensive user interest learning. We conduct extensive experiments on three public datasets, and the results verify that our proposed PoMRec outperforms the state-of-the-art multi-interest learning methods.

Abstract:
Recent advancements in image denoising algorithms have significantly improved visual performance. However, they also introduce new challenges for image quality assessment (IQA) indicators to provide evaluations that align with human visual perception. To address the limitations of current single-indicator methods, we propose a comprehensive IQA framework that integrates multiple indicators to achieve a holistic assessment of image quality. We first develop a large-scale denoised image dataset to show the diversity of distortions. Then, we employ structural equation modeling to establish correlations among three fundamental aspects of image quality, i.e., structural similarity, information loss, and perceptual gain. Through a series of regressions and iterative refinements, we eliminate indicators with low accuracy and high redundancy, thus resulting in a robust and optimal indicator system. Finally, we systematically validate the reliability and effectiveness of the proposed system through statistical analysis and evaluate its performance across three key tasks, i.e., image quality prediction, IQA indicator comparison, and denoising algorithm optimization. Experimental results demonstrate that the proposed system not only offers highly reliable and valid assessments but also provides valuable insights for the analysis and application of IQA indicators.

Abstract:
Understanding videos, especially aligning them with textual data, presents a significant challenge in computer vision. The advent of vision-language models (VLMs) like CLIP has sparked interest in leveraging their capabilities for enhanced video understanding, showing marked advancements in both performance and efficiency. However, current methods often neglect vital user-generated metadata such as video titles. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. More recently, we witness the flourishing of large language models (LLMs) like ChatGPT. Cap4Video++ harnesses the synergy of vision-language models (VLMs) and large language models (LLMs) to generate video captions, utilized in three key phases: (i) Input stage employs Semantic Pair Sampling to extract beneficial samples from captions, aiding contrastive learning. (ii) Intermediate stage sees Video-Caption Cross-modal Interaction and Adaptive Caption Selection work together to bolster video and caption representations. (iii) Output stage introduces a Complementary Caption-Text Matching branch, enhancing the primary video branch by improving similarity calculations. Our comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks clearly demonstrate Cap4Video++'s superiority over existing models, highlighting its effectiveness in utilizing automatically generated captions to advance video understanding.

Abstract:
Conventional compressed sensing (CS) algorithms typically apply a uniform sampling rate to different image blocks. A more strategic approach could be to allocate the number of measurements adaptively, based on each image block’s complexity. In this paper, we propose a Measurement-Bounds-based Rate-Adaptive Image Compressed Sensing Network (MB-RACS) framework, which aims to adaptively determine the sampling rate for each image block in accordance with traditional measurement bounds theory. Moreover, since in real-world scenarios statistical information about the original image cannot be directly obtained, we suggest a multi-stage rate-adaptive sampling strategy. This strategy sequentially adjusts the sampling ratio allocation based on the information gathered from previous samplings. We formulate the multi-stage rate-adaptive sampling as a convex optimization problem and address it using a combination of Newton’s method and binary search techniques. Our experiments demonstrate that the proposed MB-RACS method surpasses current leading methods, with experimental evidence also underscoring the effectiveness of each module within our proposed framework.

Abstract:
The Vehicle Routing Problem (VRP) is a classic optimization problem with diverse real-world applications. The neighborhood search has emerged as an effective approach, yielding high-quality solutions across different VRPs. However, most existing studies exhaustively explore all considered neighborhoods with a pre-fixed order, leading to an inefficient search process. To address this issue, this paper proposes a Learning-aided Neighborhood Search algorithm (LaNS) that employs a cutting-edge multi-agent reinforcement learning-driven adaptive operator/neighborhood selection mechanism to achieve efficient routing for VRP. Within this framework, two agents serve as high-level instructors, collaboratively guiding the search direction by selecting perturbation/improvement operators from a pool of low-level heuristics. Furthermore, to equip the agents with comprehensive information for learning guidance knowledge, we have developed a new informative state representation. This representation transforms the spatial route structures into an image-like tensor, allowing us to extract spatial features using a convolutional neural network. Comprehensive evaluations on diverse VRP benchmarks, including the capacitated VRP (CVRP), multi-depot VRP (MDVRP) and cumulative multi-depot VRP with energy constraints, demonstrate LaNS's superiority over the state-of-the-art neighborhood search methods as well as the existing learning-guided neighborhood search algorithms.

Abstract:
Recently, person re-identification (ReID) has developed rapidly owing to its broad practical applications, and various settings have been proposed, e.g., traditional ReID, clothes-changing ReID, and visible-infrared ReID. However, current studies primarily focus on single specific tasks, which limits model applicability in real-world scenarios. This paper aims to address this issue by introducing a novel instruct-ReID task that unifies 6 existing ReID tasks in one model and retrieves images based on provided visual or textual instructions. Instruct-ReID is the first exploration of a general ReID setting, where 6 existing ReID tasks can be viewed as special cases by assigning different instructions. To facilitate research in this new instruct-ReID task, we propose a large-scale OmniReID++ benchmark equipped with diverse data and comprehensive evaluation methods, e.g., task-specific and task-free evaluation settings. In the task-specific evaluation setting, gallery sets are categorized according to specific ReID tasks. We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework. For the task-free evaluation setting, where target person images are retrieved from task-agnostic gallery sets, we further propose a new method called IRM++ with novel memory bank-assisted learning. Extensive evaluations of IRM and IRM++ on the OmniReID++ benchmark demonstrate the superiority of our proposed methods, achieving state-of-the-art performance on 10 test sets.

Abstract:
Lanes are critical in the vision navigation system of intelligent vehicles. A lane is by nature a traffic sign with high-level semantics, yet it has a specific local pattern that requires detailed low-level features to localize accurately. Using different feature levels is of great importance for accurate lane detection, but it is still under-explored. On the other hand, current lane detection methods still struggle to detect complex dense lanes, such as Y-shaped or fork-shaped ones. In this work, we present the Cross Layer Refinement Network, which aims to fully utilize both high-level and low-level features in lane detection. In particular, it first detects lanes with high-level semantic features and then performs refinement based on low-level features. In this way, we can exploit more contextual information to detect lanes while leveraging local-detailed features to improve localization accuracy. We present Fast-ROIGather to gather global context, which further enhances the representation of lane features. To detect dense lanes accurately, we propose the Correlation Discrimination Module (CDM) to discriminate the correlation of dense lanes, enabling nearly cost-free high-quality dense lane prediction. In addition to our novel network design, we introduce the LineIoU loss, which regresses lanes as a whole unit to improve localization accuracy. Experiments demonstrate that our approach significantly outperforms the state-of-the-art lane detection methods.

Abstract:
In image retrieval, standard evaluation metrics rely on score ranking, e.g., average precision (AP), recall at k (R@k), and normalized discounted cumulative gain (NDCG). In this work, we introduce a general framework for robust and decomposable rank losses optimization. It addresses two major challenges for end-to-end training of deep neural networks with rank losses: non-differentiability and non-decomposability. First, we propose a general surrogate for the ranking operator, SupRank, that is amenable to stochastic gradient descent. It provides an upper bound for rank losses and ensures robust training. Second, we use a simple yet effective loss function to reduce the decomposability gap between the averaged batch approximation of ranking losses and their values on the whole training set. We apply our framework to two standard metrics for image retrieval: AP and R@k. Additionally, we apply our framework to hierarchical image retrieval. We introduce an extension of AP, the hierarchical average precision $\mathcal{H}$-AP, and optimize it as well as the NDCG. Finally, we create the first hierarchical landmarks retrieval dataset. We use a semi-automatic pipeline to create hierarchical labels, extending the large-scale Google Landmarks v2 dataset.
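To make the two metrics concrete, the following is a minimal sketch of AP and R@k for a single query, written in plain Python. The function names are illustrative, AP is normalized by the number of relevant items present in the ranked list, and R@k follows the common retrieval convention of scoring 1 when at least one relevant item appears in the top k; these conventions are assumptions, since the abstract does not spell them out. The sort in `rank_by_score` is the step that makes such metrics piecewise constant in the scores, which is the non-differentiability the abstract refers to.

```python
def rank_by_score(scores, labels):
    """Return relevance labels ordered by descending similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is 1 if the i-th result is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def recall_at_k(ranked_relevance, k):
    """R@k (assumed convention): 1.0 if any relevant item is in the top k."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0


if __name__ == "__main__":
    ranked = rank_by_score([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
    print(average_precision(ranked), recall_at_k(ranked, 1))
```

Because a small change in a score leaves the ordering, and hence both metrics, unchanged until two items swap places, gradients of these metrics with respect to the scores are zero almost everywhere; this is why a smooth surrogate is needed for training.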

Abstract:
Monocular depth estimation, similar to other image-based tasks, is prone to erroneous predictions due to ambiguities in the image, for example, caused by dynamic objects or shadows. For this reason, pixel-wise uncertainty assessment is required for safety-critical applications to highlight the areas where the prediction is unreliable. We address this in a post hoc manner and introduce gradient-based uncertainty estimation for already trained depth estimation models. To extract gradients without depending on the ground truth depth, we introduce an auxiliary loss function based on the consistency of the predicted depth and a reference depth. The reference depth, which acts as pseudo ground truth, is in fact generated using a simple image or feature augmentation, making our approach simple and effective. To obtain the final uncertainty score, the derivatives w.r.t. the feature maps from single or multiple layers are calculated using back-propagation. We demonstrate that our gradient-based approach is effective in determining the uncertainty without re-training using the two standard depth estimation benchmarks KITTI and NYU. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches.

Abstract:
Deep generative models have gained considerable attention in low-level vision tasks due to their powerful generative capabilities. Among these, diffusion model-based approaches, which employ a forward diffusion process to degrade an image and a reverse denoising process for image generation, have become particularly prominent for producing high-quality, diverse samples with intricate texture details. Despite their widespread success in low-level vision, there remains a lack of a comprehensive, insightful survey that synthesizes and organizes the advances in diffusion model-based techniques. To address this gap, this paper presents the first comprehensive review focused on denoising diffusion models applied to low-level vision tasks, covering both theoretical and practical contributions. We outline three general diffusion modeling frameworks and explore their connections with other popular deep generative models, establishing a solid theoretical foundation for subsequent analysis. We then categorize diffusion models used in low-level vision tasks from multiple perspectives, considering both the underlying framework and the target application. Beyond natural image processing, we also summarize diffusion models applied to other low-level vision domains, including medical imaging, remote sensing, and video processing. Additionally, we provide an overview of widely used benchmarks and evaluation metrics in low-level vision tasks. Our review includes an extensive evaluation of diffusion model-based techniques across six representative tasks, with both quantitative and qualitative analysis. Finally, we highlight the limitations of current diffusion models and propose four promising directions for future research. This comprehensive review aims to foster a deeper understanding of the role of denoising diffusion models in low-level vision.

Abstract:
Text-to-Image (T2I) models have advanced significantly, but their growing popularity raises security concerns due to their potential to generate harmful images. To address these issues, we propose UPAM, a novel framework to evaluate the robustness of T2I models from an attack perspective. Unlike prior methods that focus solely on textual defenses, UPAM unifies the attack on both textual and visual defenses. Additionally, it enables gradient-based optimization, overcoming reliance on enumeration for improved efficiency and effectiveness. To handle cases where T2I models block image outputs due to defenses, we introduce Sphere-Probing Learning (SPL) to enable optimization even without image results. Following SPL, our model bypasses defenses, inducing the generation of harmful content. To ensure semantic alignment with attacker intent, we propose Semantic-Enhancing Learning (SEL) for precise semantic control. UPAM also prioritizes the naturalness of adversarial prompts using In-context Naturalness Enhancement (INE), making them harder for human examiners to detect. Additionally, we address the issue of iterative queries–common in prior methods and easily detectable by API defenders–by introducing Transferable Attack Learning (TAL), allowing effective attacks with minimal queries. Extensive experiments validate UPAM’s superiority in effectiveness, efficiency, naturalness, and low query detection rates.

Abstract:
Weakly supervised object localization (WSOL) aims to localize objects using only image-level labels as supervision. Although recent advancements incorporating transformers into WSOL have resulted in improvements, these methods often rely on category-agnostic attention maps, leading to suboptimal object localization. This paper presents a novel CLIP-Driven TRansformer (CDTR) that learns category-aware representations for accurate object localization. Specifically, we first propose a Category-aware Stimulation Module (CSM) that embeds learnable category biases into self-attention maps, enhancing the learning process with auxiliary supervision. Additionally, an Object Constraint Module (OCM) is designed to refine object regions in a self-supervised manner, leveraging the discriminative potential of the self-attention maps provided by CSM. To create a synergistic connection between CSM and OCM, we further develop a Semantic Kernel Integrator (SKI), which generates a semantic kernel for self-attention maps. Meanwhile, we explore the CLIP model and design a Semantic Boost Adapter (SBA) to enrich object representations by integrating semantic-specific image and text representations into self-attention maps. Extensive experimental evaluations on benchmark datasets, such as CUB-200-2011 and ILSVRC, highlight the superior performance of our CDTR framework.

Abstract:
We present Class-agnostic Repetitive action Counting (CaRaCount), a novel approach to count repetitive human actions in the wild using time-series data from wearable devices. CaRaCount is the first few-shot class-agnostic method, able to count repetitions of any action class with only a short exemplar data sequence containing a few examples from the action class of interest. To develop and evaluate this method, we collect a large-scale time-series dataset of repetitive human actions in various contexts, containing smartwatch data from 10 subjects performing 50 different activities. Experiments on this dataset and three other activity counting datasets, namely Crossfit, Recofit, and MM-Fit, show that CaRaCount can count repetitive actions with low error, and that it outperforms other baselines and state-of-the-art action counting methods. Finally, with a user experience study, we evaluate the usability of our real-time implementation. Our results highlight the efficiency and effectiveness of our approach when deployed outside laboratory environments.

Abstract:
Foundation models have emerged as critical components in a variety of artificial intelligence applications, and showcase significant success in natural language processing and several other domains. Meanwhile, the field of graph machine learning is witnessing a paradigm transition from shallow methods to more sophisticated deep learning approaches. The capabilities of foundation models in generalization and adaptation motivate graph machine learning researchers to discuss the potential of developing a new graph learning paradigm. This paradigm envisions models that are pre-trained on extensive graph data and can be adapted for various graph tasks. Despite this burgeoning interest, there is a noticeable lack of clear definitions and systematic analyses pertaining to this new domain. To this end, this article introduces the concept of Graph Foundation Models (GFMs), and offers an exhaustive explanation of their key characteristics and underlying technologies. We proceed to classify the existing work related to GFMs into three distinct categories, based on their dependence on graph neural networks and large language models. In addition to providing a thorough review of the current state of GFMs, this article also outlines potential avenues for future research in this rapidly evolving domain.

Abstract:
Clustering techniques are often validated using benchmark datasets where class labels are used as ground-truth clusters. However, depending on the datasets, class labels may not align with the actual data clusters, and such misalignment hampers accurate validation. Therefore, it is essential to evaluate and compare datasets regarding their cluster-label matching (CLM), i.e., how well their class labels match actual clusters. Internal validation measures (IVMs), like Silhouette, can compare CLM over different labeling of the same dataset, but are not designed to do so across different datasets. We thus introduce Adjusted IVMs as fast and reliable methods to evaluate and compare CLM across datasets. We establish four axioms that require validation measures to be independent of data properties not related to cluster structure (e.g., dimensionality, dataset size). Then, we develop standardized protocols to convert any IVM to satisfy these axioms, and use these protocols to adjust six widely used IVMs. Quantitative experiments (1) verify the necessity and effectiveness of our protocols and (2) show that adjusted IVMs outperform the competitors, including standard IVMs, in accurately evaluating CLM both within and across datasets. We also show that the datasets can be filtered or improved using our method to form more reliable benchmarks for clustering validation.

Abstract:
Real-world machine learning systems need to analyze test data that may differ from training data. In K-way classification, this is crisply formulated as open-set recognition, core to which is the ability to discriminate open-set data outside the K closed-set classes. Two conceptually elegant ideas for open-set discrimination are: 1) discriminatively learning an open-vs-closed binary discriminator by exploiting some outlier data as the open-set; and 2) unsupervised learning the closed-set data distribution with a GAN, using its discriminator as the open-set likelihood function. However, the former generalizes poorly to diverse open test data due to overfitting to the training outliers, which are unlikely to exhaustively span the open-world. The latter does not work well, presumably due to the unstable training of GANs. Motivated by the above, we propose OpenGAN, which addresses the limitation of each approach by combining them with several technical insights. First, we show that a carefully selected GAN-discriminator on some real outlier data already achieves the state-of-the-art. Second, we augment the available set of real open training examples with adversarially synthesized “fake” data. Third and most importantly, we build the discriminator over the features computed by the closed-world K-way networks. This allows OpenGAN to be implemented via a lightweight discriminator head built on top of an existing K-way network. Extensive experiments show that OpenGAN significantly outperforms prior open-set methods.

Abstract:
In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.

Abstract:
Recent advances in hand-crafted neural architectures for visual recognition underscore the pressing need to explore architecture designs comprising diverse building blocks. Concurrently, neural architecture search (NAS) methods have gained traction as a means to alleviate human efforts. Nevertheless, whether NAS methods can efficiently and effectively manage diversified search spaces featuring disparate candidates, such as Convolutional Neural Networks (CNNs) and transformers, remains an open question. In this work, we introduce a novel unsupervised NAS approach called BossNAS (Block-wisely Self-supervised Neural Architecture Search), which aims to address the problem of inaccurate predictive architecture ranking caused by a large weight-sharing space while mitigating the potential ranking issue caused by biased supervision. To achieve this, we factorize the search space into blocks and introduce a novel self-supervised training scheme called Ensemble Bootstrapping, to train each block separately in an unsupervised manner. In the search phase, we propose an unsupervised Population-Centric Search, optimizing the candidate architecture towards the population center. Additionally, we enhance our NAS method by integrating masked image modeling and present BossNAS++ to overcome the lack of dense supervision in our block-wise self-supervised NAS. In BossNAS++, we introduce a training technique named Masked Ensemble Bootstrapping for the block-wise supernet, accompanied by a Masked Population-Centric Search scheme to promote fairer architecture selection. Our family of models, discovered through BossNAS and BossNAS++, delivers impressive results across various search spaces and datasets. Our transformer model discovered by BossNAS++ attains a remarkable accuracy of 83.2% on ImageNet with only 10.5B MAdds, surpassing DeiT-B by 1.4% while maintaining a lower computation cost. Moreover, our approach excels in architecture rating accuracy, achieving Spearman correlations of 0.78 and 0.76 on the canonical MBConv search space with ImageNet and the NATS-Bench size search space with CIFAR-100, respectively, outperforming state-of-the-art NAS methods.

Abstract:
This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd counting model. First, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth. Second, we enhance the transformer decoder by using density tokens to specialize the forward passes of the decoders w.r.t. different density intervals. Third, we design an interleaving consistency self-supervised learning mechanism to learn from unlabeled data efficiently. Extensive experiments on four datasets show that our method outperforms the competitors by a large margin under various labeled ratio settings.

Abstract:
The inherent complexity of image semantics engenders a fascinating variability in relationships between images. For instance, under a certain condition, two images may demonstrate similarity, while under different circumstances, the same pair could exhibit absolute dissimilarity. A singular feature space is therefore insufficient for capturing the nuanced semantic relationships that exist between samples. Conditional Similarity Learning (CSL) aims to address this gap by learning multiple, distinct feature spaces. Existing approaches in CSL often fail to capture the intricate similarity relationships between samples across different semantic conditions, particularly in weakly-supervised settings where condition labels are absent during training. To address this limitation, we introduce Distance Induced Semantic COndition VERification NETwork (DiscoverNet), a unified framework designed to cater to a range of CSL scenarios: supervised CSL (sCSL), weakly-supervised CSL (wsCSL), and semi-supervised CSL (ssCSL). In addition to traditional linear projections, we also introduce a prompt learning technique utilizing a transformer encoding layer to create diverse embedding spaces. Our framework incorporates a Condition Match Module (CMM) that dynamically matches different training triplets with corresponding embedding spaces, adapting to varying levels of supervision. We also shed light on existing evaluation biases in wsCSL and introduce two novel criteria for a more robust evaluation. Through extensive experiments and visualizations on benchmark datasets such as UT-Zappos-50k and Celeb-A, we substantiate the efficacy and interpretability of DiscoverNet.

Abstract:
Quantization is an efficient model compression method that represents the network with fixed-point or low-bit numbers. Existing quantization methods treat network quantization as a single-objective optimization that pursues high accuracy (performance optimization) while satisfying the quantization constraint. However, owing to the non-differentiability of the quantization operation, it is challenging to integrate the quantization operation into network training and obtain optimal parameters. In this paper, a novel multi-objective convex quantization for efficient model compression is proposed. Specifically, network training is modeled as a multi-objective optimization to find a network with both high precision and low quantization error (these two goals are somewhat contradictory and affect each other). To achieve effective multi-objective optimization, this paper designs a quantization error function that is differentiable and ensures computational convexity in each period, so as to avoid the non-differentiable back-propagation of the quantization operation. Then, we perform a time-series self-distillation training scheme on the multi-objective optimization framework, which distills its past softened labels and combines them with the hard targets to guarantee controllable and stable performance convergence during training. Finally, and more importantly, a new dynamic Lagrangian coefficient adaptation is designed to adjust the gradient magnitudes of the quantization loss and the performance loss and to balance the two losses during training. The proposed method is evaluated on well-known benchmarks (MNIST, CIFAR-10/100, ImageNet, Penn Treebank, and Microsoft COCO), and experimental results show that it achieves outstanding performance compared to existing methods.
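
To make the two competing objectives concrete, the following is a minimal sketch, not the paper's method. It assumes a symmetric uniform quantizer, a plain mean-squared quantization error, and a fixed weighting coefficient; the names `uniform_quantize`, `quantization_error`, `combined_loss`, and `lam` are hypothetical. It makes no attempt to reproduce the differentiable, convex error function or the dynamic Lagrangian coefficient described in the abstract.

```python
import numpy as np

def uniform_quantize(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Round weights to a symmetric uniform grid (assumed quantizer)."""
    levels = 2 ** (num_bits - 1) - 1          # largest integer code
    max_abs = float(np.max(np.abs(w)))
    if max_abs == 0.0:
        return w.copy()
    scale = max_abs / levels                  # grid step
    codes = np.clip(np.round(w / scale), -levels, levels)
    return codes * scale

def quantization_error(w: np.ndarray, num_bits: int = 4) -> float:
    """Mean squared distance between weights and their quantized values."""
    return float(np.mean((w - uniform_quantize(w, num_bits)) ** 2))

def combined_loss(task_loss: float, w: np.ndarray, lam: float, num_bits: int = 4) -> float:
    """Weighted sum of the performance loss and the quantization error.
    `lam` is a fixed, hypothetical coefficient (the paper adapts it dynamically)."""
    return task_loss + lam * quantization_error(w, num_bits)

weights = np.array([-1.0, -0.4, 0.0, 0.3, 1.0])
q_weights = uniform_quantize(weights, num_bits=2)
q_err = quantization_error(weights, num_bits=2)
total = combined_loss(task_loss=0.7, w=weights, lam=2.0, num_bits=2)
```

Raising `lam` pushes the optimum toward weights that sit on the quantization grid, usually at some cost in task loss, which is the tension the abstract refers to.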

Abstract:
The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose RankFeat, a simple yet effective post hoc approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. RankFeat achieves state-of-the-art performance and reduces the average false positive rate (FPR95) by 17.90% compared with the previous best method. The success of RankFeat motivates us to investigate whether a similar phenomenon exists in the parameter matrices of neural networks. We thus propose RankWeight, which removes the rank-1 weight from the parameter matrices of a single deep layer. RankWeight is also post hoc and only requires computing the rank-1 matrix once. As a standalone approach, RankWeight is very competitive with other methods across various backbones. Moreover, RankWeight is compatible with a wide range of OOD detection methods. The combination of RankWeight and RankFeat establishes a new state of the art, achieving an FPR95 as low as 16.13% on the ImageNet-1k benchmark. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
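
The rank-1 removal described above can be sketched with a singular value decomposition. This is a minimal illustration on a random matrix, assuming the high-level feature has been reshaped to a 2-D array; the function name `remove_rank1` and the matrix shape are hypothetical, and the scoring step that follows in an actual OOD detector is not shown.

```python
import numpy as np

def remove_rank1(feature: np.ndarray) -> np.ndarray:
    """Subtract the rank-1 matrix built from the largest singular value
    and its left/right singular vectors."""
    u, s, vt = np.linalg.svd(feature, full_matrices=False)
    rank1 = s[0] * np.outer(u[:, 0], vt[0, :])
    return feature - rank1

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16))      # hypothetical (channels, H*W) feature matrix
out = remove_rank1(feat)

# Singular values before and after: the dominant one is gone,
# the remaining ones are unchanged.
s_before = np.linalg.svd(feat, compute_uv=False)
s_after = np.linalg.svd(out, compute_uv=False)
```

The same operation applied once to a weight matrix instead of a feature matrix corresponds to the idea behind RankWeight as stated in the abstract.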

Abstract:
By leveraging the blur-noise trade-off, imaging with non-uniform exposures largely extends the image acquisition flexibility in harsh environments. However, the limitation of conventional cameras in perceiving intra-frame dynamic information prevents existing methods from being implemented in the real-world frame acquisition for real-time adaptive camera shutter control. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system to avoid motion blur and alleviate instant noise, where the extremely low latency of events is leveraged to monitor the real-time motion and facilitate the scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by the non-uniform exposure times, we propose an event-based image denoising network within a self-supervised learning paradigm, i.e., SEID, exploring the statistics of image noise and inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To illustrate the effectiveness of the proposed NSC, we implement it in hardware by building a hybrid-camera imaging prototype system, with which we collect a real-world dataset containing well-synchronized frames and events in diverse scenarios with different target scenes and motion patterns. Experiments on the synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.

Abstract:
We consider the problems of testing and learning quantum $k$-junta channels, which are $n$-qubit to $n$-qubit quantum channels acting non-trivially on at most $k$ out of $n$ qubits and leaving the rest of the qubits unchanged. We show the following: 1) an $O(k)$-query algorithm to distinguish whether the given channel is a $k$-junta channel or is far from any $k$-junta channel, and a lower bound of $\Omega(\sqrt{k})$ on the number of queries; and 2) an $\widetilde{O}(4^k)$-query algorithm to learn a $k$-junta channel, and a lower bound of $\Omega(4^k/k)$ on the number of queries. This partially answers an open problem raised by (Chen et al., 2023). In order to settle these problems, we develop a Fourier analysis framework over the space of superoperators and prove several fundamental properties, which extends the Fourier analysis over the space of operators introduced in (Montanaro and Osborne, 2010). The distance metric we consider in this paper is obtained by Fourier analysis and is essentially the L2-distance between Choi representations. In addition, we introduce Influence-Sample to replace Fourier-Sample proposed in (Atici and Servedio, 2007). Our Influence-Sample includes only single-qubit operations and results in only a constant-factor decrease in efficiency.

Abstract:
Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video sequence, which is an important vision task with various real-world applications. Depending on whether the initial states of target objects are specified by provided annotations in the first frame or by the categories, VOT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Different definitions have led to divergent solutions for these two types of tasks, resulting in redundant training expenses and parameter overhead. In this paper, combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for the association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline, eliminating the need for task-specific architectures and reducing redundancy in model parameters. We conduct extensive experiments on seven prominent tracking datasets of different tracking tasks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, and demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.

Abstract:
We study the domain adaptation task for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition power from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, it is crucial to exploit human cues in videos when recognizing actions across domains. However, existing methods are prone to losing human cues and instead tend to exploit the correlation between non-human contexts and associated actions for recognition, and contexts of interest that are agnostic to actions reduce recognition performance in the target domain. To overcome this problem, we focus on uncovering human-centric action cues for domain adaptive action recognition, and our idea is to investigate two aspects of human-centric action cues, namely human cues and human-context interaction cues. Accordingly, our proposed Human-Centric Transformer (HCTransformer) develops a decoupled human-centric learning paradigm to explicitly concentrate on human-centric action cues in domain-variant video feature learning. Our HCTransformer first conducts human-aware temporal modeling with a human encoder, aiming to avoid a loss of human cues during domain-invariant video feature learning. Then, using a Transformer-like architecture, HCTransformer exploits domain-invariant and action-correlated contexts with a context encoder, and further models domain-invariant interaction between humans and action-correlated contexts. We conduct extensive experiments on three benchmarks, namely UCF-HMDB, Kinetics-NecDrone and EPIC-Kitchens-UDA, and the state-of-the-art performance demonstrates the effectiveness of our proposed HCTransformer.

Abstract:
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, simultaneously building vision-language, audio-language, and audiovisual-language alignment. MGC learns to generate text tokens under conditions of vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, containing 1 million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR can learn strong multimodal correlations and generalize to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks.

Abstract:
Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over 13% and 7% on the final session of miniImageNet and CIFAR-100 FSCIL benchmarks.

Abstract:
Video snapshot compressive imaging (SCI) encodes the target dynamic scene compactly into a snapshot and reconstructs its high-speed frame sequence afterward, greatly reducing the required data footprint and transmission bandwidth as well as enabling high-speed imaging with a low frame rate intensity camera. In implementation, high-speed dynamics are encoded via temporally varying patterns, and only frames at corresponding temporal intervals can be reconstructed, while the dynamics occurring between consecutive frames are lost. To unlock the potential of conventional snapshot compressive videography, we propose a novel hybrid “intensity+event” imaging scheme by incorporating an event camera into a video SCI setup. Our proposed system consists of a dual-path optical setup to record the coded intensity measurement and intermediate event signals simultaneously, which is compact and photon-efficient by collecting the half photons discarded in conventional video SCI. Correspondingly, we developed a dual-branch Transformer utilizing the reciprocal relationship between two data modes to decode dense video frames. Extensive experiments on both simulated and real-captured data demonstrate our superiority to state-of-the-art video SCI and video frame interpolation (VFI) methods. Benefiting from the new hybrid design leveraging both intrinsic redundancy in videos and the unique feature of event cameras, we achieve high-quality videography at 0.1 ms time intervals with a low-cost CMOS image sensor working at 24 FPS.

Abstract:
Information theory is an outstanding framework for measuring uncertainty, dependence, and relevance in data and systems. It has several desirable properties for real-world applications: it naturally deals with multivariate data, can handle heterogeneous data, and yields measures that can be interpreted. However, it has not been adopted by a wider audience because obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality. We propose an indirect way of estimating information based on a multivariate iterative Gaussianization transform. The proposed method has a multivariate-to-univariate property: it reduces the challenging estimation of multivariate measures to a composition of marginal operations applied in each iteration of the Gaussianization. Therefore, the convergence of the resulting estimates depends on the convergence of well-understood univariate entropy estimates, and the global error depends linearly on the number of times the marginal estimator is invoked. We introduce Gaussianization-based estimates for Total Correlation, Entropy, Mutual Information, and Kullback-Leibler Divergence. Results on artificial data show that our approach is superior to previous estimators, particularly in high-dimensional scenarios. We also illustrate the method's performance in different fields to obtain interesting insights. We make the tools and datasets publicly available to provide a test bed for analyzing future methodologies.
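
As an illustration of what a single marginal operation might look like, the sketch below Gaussianizes one variable by mapping its empirical ranks through the inverse standard-normal CDF. This is an assumed, rank-based form of marginal Gaussianization, not the authors' implementation; the function name `marginal_gaussianize` is hypothetical, and the remaining parts of the iterative multivariate transform and the information estimates themselves are not shown.

```python
from statistics import NormalDist

def marginal_gaussianize(samples):
    """Map 1-D samples to approximately standard-normal values via ranks."""
    n = len(samples)
    # Indices of the samples in ascending order of value.
    order = sorted(range(n), key=lambda i: samples[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the empirical CDF strictly inside (0, 1),
        # so the inverse CDF is always finite.
        out[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return out

data = [0.1, 5.0, 0.3, 2.0, 0.2]   # skewed toy sample
z = marginal_gaussianize(data)
```

The transform is monotone, so the ordering of the samples is preserved while their marginal distribution is reshaped toward a Gaussian.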

Abstract:
Person search aims to localize a person of interest in a large image gallery captured by multiple, non-overlapping cameras. Prevalent unified methods suffer from (1) noisy proposals with mis-detection and occlusion, and (2) large appearance variation within a class, which degrades prototype-based metric learning. To address these problems, we introduce Prototype-guided Attention Distillation (PAD), which exploits a prototype (a typical representation of an identity) as guidance for the attention module to consistently highlight identity-inherent regions across different poses. To utilize the knowledge encoded in prototypes for matching unseen IDs, PAD conducts attention distillation to guide student Re-ID queries by deeply mimicking attention maps from the prototype query. Additionally, to address large intra-class variation induced by pose or camera views, we extend PAD with multiple part prototypes representing consistent local regions across different instances. Furthermore, we exploit an adaptive momentum strategy for robust attention distillation in PAD to update more distinct prototypes. Extensive experiments conducted on CUHK-SYSU and PRW demonstrate the effectiveness of PAD, showcasing state-of-the-art performance. Moreover, our distilled attention surprisingly highlights multiple distinct regions for person search.

Abstract:
For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) methods directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called “Prompt and Transfer” (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder to focus on the object of interest (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) cross-modal linguistic information is introduced to initialize prompts for each task; 2) a Semantic Prompt Transfer (SPT) precisely transfers the class-specific semantics within the images to prompts; 3) a Part Mask Generator (PMG) works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-art results on 11 benchmarks.

Affiliations: Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Laboratory of Image and Video Understanding for Social Security, and School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; School of Computer Science, Faculty of Engineering, The University of Sydney, Camperdown, NSW, Australia; Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China; Nanyang Technological University, Singapore

Abstract:
Real-world data may contain a considerable amount of noisily labeled examples, which usually mislead the training algorithm and result in degraded classification performance on test data. Therefore, Label Noise Learning (LNL) was proposed, of which one popular research trend focused on estimating the critical statistics (e.g., sample mean and sample covariance), to recover the clean data distribution. However, existing methods may suffer from the unreliable sample selection process or can hardly be applied to multi-class cases. Inspired by the centroid estimation theory, we propose Per-Class Statistic Estimation (PCSE), which establishes the quantitative relationship between the clean (first-order and second-order) statistics and the corresponding noisy statistics for every class. This relationship is further utilized to induce a generative classifier for model inference. Unlike existing methods, our approach does not require sample selection from the instance level. Moreover, our PCSE can serve as a general post-processing strategy applicable to various popular networks pre-trained on the noisy dataset for boosting their classification performance. Theoretically, we prove that the estimated statistics converge to their ground-truth values as the sample size increases, even if the estimated label transition matrix is biased. Empirically, we conducted intensive experiments on various binary and multi-class datasets, and the results demonstrate that PCSE achieves more precise statistic estimation as well as higher classification accuracy when compared with state-of-the-art methods in LNL.

Abstract:
Optical flow has made great progress in clean scenes, but suffers degradation under adverse weather due to the violation of the brightness constancy and gradient continuity assumptions of optical flow. Existing methods mainly adopt domain adaptation to transfer motion knowledge from the clean to the degraded domain through one-stage adaptation. However, this direct adaptation is ineffective, since there exists a large gap due to adverse weather and scene style between clean and real degraded domains. Moreover, even within the degraded domain itself, static weather (e.g., fog) and dynamic weather (e.g., rain) have different impacts on optical flow. To address the above issues, we explore the synthetic degraded domain as an intermediate bridge between clean and real degraded domains, and propose a cumulative homogeneous-heterogeneous adaptation framework for real adverse weather optical flow. Specifically, for clean-degraded transfer, our key insight is that static weather possesses the depth-association homogeneous feature which does not change the intrinsic motion of the scene, while dynamic weather additionally introduces the heterogeneous feature which results in a significant boundary discrepancy in warp errors between clean and degraded domains. For synthetic-real transfer, we find that cost volume correlation shares a similar statistical histogram between synthetic and real degraded domains, which helps to holistically align the homogeneous correlation distribution for synthetic-real knowledge distillation. Under this unified framework, the proposed method can progressively and explicitly transfer knowledge from clean scenes to real adverse weather. In addition, we further collect a real adverse weather dataset with manually annotated optical flow labels and perform extensive experiments to verify the superiority of the proposed method.

Abstract:
Person Re-identification (ReID) has been extensively developed for a decade in order to learn the association of images of the same person across non-overlapping camera views. To overcome significant variations between images across camera views, numerous variants of ReID models have been developed for solving a number of challenges, such as resolution change, clothing change, occlusion, modality change, and so on. Despite the impressive performance of many ReID variants, these variants typically function distinctly and cannot be applied to other challenges. To the best of our knowledge, there is no versatile ReID model that can handle various ReID challenges at the same time. This work makes the first attempt at learning a versatile ReID model to solve such a problem. Our main idea is to form a two-stage prompt-based twin modeling framework called VersReID. VersReID first leverages the scene label to train a ReID Bank that contains abundant knowledge for handling various scenes, where several groups of scene-specific prompts are used to encode different scene-specific knowledge. In the second stage, we distill a V-Branch model with versatile prompts from the ReID Bank for adaptively solving the ReID of different scenes, eliminating the demand for scene labels during the inference stage. To facilitate training VersReID, we further introduce the multi-scene properties into self-supervised learning of ReID via a multi-scene priors data augmentation (MPDA) strategy. Through extensive experiments, we demonstrate the success of learning an effective and versatile ReID model for handling ReID tasks under multi-scene conditions without manual assignment of scene labels in the inference stage, including general, low-resolution, clothing change, occlusion, and cross-modality scenes.

Abstract:
The difficulty of fine-grained image classification mainly comes from a shared overall appearance across classes. Thus, recognizing discriminative details, such as the eyes and beaks of birds, is key to the task. However, this is particularly challenging when training data is limited. To address this, we propose Task Discrepancy Maximization (TDM), a task-oriented channel attention method tailored for fine-grained few-shot classification with two novel modules: the Support Attention Module (SAM) and the Query Attention Module (QAM). SAM highlights channels encoding class-wise discriminative features, while QAM assigns higher weights to object-relevant channels of the query. Based on these submodules, TDM produces task-adaptive features by focusing on channels that both encode class-discriminative details and are present in the query, for an accurate class-sensitive similarity measure between support and query instances. While TDM influences high-level feature maps by task-adaptive calibration of channel-wise importance, we further introduce the Instance Attention Module (IAM), which extends QAM and operates in intermediate layers of feature extractors to highlight object-relevant channels instance-wise. The merits of TDM and IAM and their complementary benefits are experimentally validated in fine-grained few-shot classification tasks. Moreover, IAM is also effective in coarse-grained and cross-domain few-shot classification.

Abstract:
The development of Neural Architecture Search (NAS) is hindered by high costs associated with evaluating network architectures. Recently, several zero-cost proxies have been proposed as a promising method to reduce the evaluation cost of network architectures in NAS. They can quickly estimate the final performance of the network in a few seconds during the initial phase. However, existing zero-cost proxies either ignore the network structure's impact on performance or are limited to specific tasks. To address these issues, we propose a novel zero-cost proxy called Skeleton Path Kernel Trace (SPKT) that leverages the whole network architecture's skeleton path structure information. We then integrate it into an effective Bayesian optimization for NAS framework called PATNAS, and demonstrate its efficacy on different datasets. The results show that our proposed SPKT zero-cost proxy can achieve a high correlation with the final performance of the network across multiple tasks. Furthermore, it can significantly accelerate the search process for finding the best-performing network architectures.

Abstract:
Iterative methods such as iterative closest point (ICP) for point cloud registration often suffer from bad local optimality (e.g., saddle points), due to the nature of nonconvex optimization. To address this fundamental challenge, in this paper we propose learning to form the loss landscape of a deep iterative method w.r.t. predictions at test time into a convex-like shape locally around each ground truth given data, namely Deep Loss Convexification (DLC), thanks to the overparametrization in neural networks. To this end, we formulate our learning objective based on adversarial training by manipulating the ground-truth predictions, rather than input data. In particular, we propose using star-convexity, a family of structured nonconvex functions that are unimodal on all lines that pass through a global minimizer, as our geometric constraint for reshaping loss landscapes, leading to (1) extra novel hinge losses appended to the original loss and (2) near-optimal predictions. We demonstrate state-of-the-art performance using DLC with existing network architectures for the tasks of training recurrent neural networks (RNNs), 3D point cloud registration, and multimodal image alignment.
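
For readers unfamiliar with star-convexity, the sketch below numerically checks the standard defining inequality, f(λx* + (1-λ)x) ≤ λf(x*) + (1-λ)f(x) for all λ in [0, 1], on a 1-D grid. It is only an illustration of the geometric property, not of the DLC training objective or its hinge losses; the helper name `is_star_convex` and both test functions are chosen here for illustration, and a grid check can only refute star-convexity, not prove it.

```python
import math

def is_star_convex(f, x_star, points, num_lambdas=50, tol=1e-12):
    """Check f(lam*x_star + (1-lam)*x) <= lam*f(x_star) + (1-lam)*f(x)
    for every grid point x and every lam on a uniform grid in [0, 1]."""
    f_star = f(x_star)
    for x in points:
        fx = f(x)
        for j in range(num_lambdas + 1):
            lam = j / num_lambdas
            mid = lam * x_star + (1.0 - lam) * x
            if f(mid) > lam * f_star + (1.0 - lam) * fx + tol:
                return False
    return True

def f_star_convex(x):
    # Nonconvex (concave for |x| > 2) but star-convex around its minimizer 0.
    return abs(x) * (1.0 - math.exp(-abs(x)))

def f_not_star_convex(x):
    # Minimizer at 0, but the inequality fails along every segment toward 0.
    return math.sqrt(abs(x))

grid = [i / 10.0 for i in range(-100, 101)]
ok = is_star_convex(f_star_convex, 0.0, grid)
bad = is_star_convex(f_not_star_convex, 0.0, grid)
```

The first function shows why star-convexity is a weaker requirement than convexity: the landscape may curve downward away from the minimizer as long as every segment toward the minimizer stays below the chord.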

Abstract:
The Diffusion Model (DM) has emerged as the SOTA approach for image synthesis. However, existing DMs do not perform well on some image-to-image translation (I2I) tasks. Unlike image synthesis, some I2I tasks, such as super-resolution, require generating results in accordance with GT images. Traditional DMs for image synthesis require extensive iterations and large denoising models to estimate entire images, which gives them strong generative ability but also leads to artifacts and inefficiency for I2I. To tackle this challenge, we propose a simple, efficient, and powerful DM framework for I2I, called DiffI2I. Specifically, DiffI2I comprises three key components: a compact I2I prior extraction network (CPEN), a dynamic I2I transformer (DI2Iformer), and a denoising network. We train DiffI2I in two stages: pretraining and DM training. For pretraining, GT and input images are fed into CPEN$_{S1}$ to capture a compact I2I prior representation (IPR) guiding DI2Iformer. In the second stage, the DM is trained to use only the input images to estimate the same IPR as CPEN$_{S1}$. Compared to traditional DMs, the compact IPR enables DiffI2I to obtain more accurate outcomes and employ a lighter denoising network and fewer iterations. Through extensive experiments on various I2I tasks, we demonstrate that DiffI2I achieves SOTA performance while significantly reducing computational burdens.

Abstract:
Clustering ensemble has been widely studied in data mining and machine learning. However, existing clustering ensemble methods do not pay attention to fairness, which is important in real-world applications, especially those involving humans. To address this issue, this paper proposes a novel fair clustering ensemble method, which takes multiple base clustering results as inputs and learns a fair consensus clustering result. When designing the algorithm, we observe that one of the widely used definitions of fairness may cause a cluster imbalance problem. To tackle this problem, we give a new definition of fairness that can simultaneously characterize fairness and cluster capacity equality. Based on this new definition, we design an extremely simple yet effective regularization term to achieve fairness and cluster capacity equality. We plug this regularization term into our clustering ensemble framework, leading to our new fair clustering ensemble method. Extensive experiments show that, compared with state-of-the-art clustering ensemble methods, our method not only achieves comparable or even better clustering performance, but also obtains much fairer results with better capacity equality, which demonstrates the effectiveness and superiority of our method.

Abstract:
This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim of improving computational efficiency. Our investigation commences with an examination of spatial redundancy, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are processed by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the temporal and sample-wise redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively “easier” videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while it preserves the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding significantly improved computational efficiency.
Empirically, extensive experiments based on seven widely-used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, Alzheimer’s and Parkinson’s diseases diagnosis with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines.

Abstract:
Pedestrian detection currently suffers from two issues in crowded scenes, occlusion and dense boundary prediction, which keep it challenging in complex real-world scenarios. In recent years, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown their strengths in addressing these issues: ViTs capture global feature dependencies to infer occluded parts, and CNNs make accurate dense predictions from local detailed features. Nevertheless, limited by their narrow receptive field, CNNs fail to infer occluded parts, while ViTs tend to ignore local features that are vital for distinguishing different pedestrians in a crowd. Therefore, it is essential to combine the advantages of CNNs and ViTs for pedestrian detection. However, manually designing a specific CNN-ViT hybrid network requires enormous time and resources for trial and error. To address this issue, we propose NAS-PED, the first Neural Architecture Search (NAS) framework specifically designed for pedestrian detection, which automatically designs an appropriate CNN-ViT hybrid backbone for the crowded pedestrian detection task. Specifically, we formulate transformers and convolutions with various kernel sizes in the same format, which provides an unconstrained space for diverse hybrid network search. Furthermore, to search for a suitable backbone, we propose an information-bottleneck-based NAS objective function, which treats the process of NAS as an information extraction process, preserving relevant information and suppressing redundant information from the dense pedestrians in crowd scenes. Extensive experiments on the CrowdHuman, CityPersons and EuroCity Persons datasets demonstrate the effectiveness of the proposed method. Our NAS-PED obtains absolute gains of 4.0% MR$^{-2}$ and 1.9% AP over the state-of-the-art (SOTA) pedestrian detection framework on the CrowdHuman dataset. For the CityPersons and EuroCity Persons datasets, the searched backbone achieves stable improvement across all three subsets, outperforming some large language-image pre-trained models.

Abstract:
In crowdsourcing scenarios, we can obtain multiple noisy labels for an instance from crowd workers and then aggregate these labels to infer the unknown true label of this instance. Due to the lack of expertise of workers, obtained labels usually contain a degree of noise. Existing studies usually focus on the crowdsourcing scenarios with low noise ratios but rarely focus on the crowdsourcing scenarios with high noise ratios. In this paper, we focus on the crowdsourcing scenarios with high noise ratios and propose a novel label aggregation algorithm called enhanced label distribution propagation (ELDP). First, ELDP harnesses an internal worker weighting method to estimate the weights of workers and then performs the first label distribution enhancement. Then, for instances not covered in the first enhancement, ELDP performs the second enhancement using a class membership estimation method based on the intra-cluster distance. Finally, ELDP propagates enhanced label distributions from accurately enhanced instances to inaccurately enhanced instances. Experimental results on both simulated and real-world crowdsourced datasets show that ELDP significantly outperforms all the other state-of-the-art label aggregation algorithms.

Abstract:
Depth completion, inferring dense depth maps from sparse measurements, is crucial for robust 3D perception. Although deep learning based methods have made tremendous progress on this problem, these models cannot generalize well across scenes that are unobserved in training, posing a fundamental limitation that has yet to be overcome. A careful analysis of existing deep neural network architectures for depth completion, which are largely borrowed from successful backbones for image analysis tasks, reveals that a key design bottleneck actually resides in the conventional normalization layers. These normalization layers are designed, on the one hand, to make training more stable and, on the other hand, to build more visual invariance across scene scales. However, in depth completion, the scale is actually what we want to robustly estimate in order to better generalize to unseen scenes. To mitigate this, we propose a novel scale propagation normalization (SP-Norm) method that propagates scales from input to output while preserving the normalization operator for easy convergence. More specifically, we rescale the input using features learned by a single-layer perceptron from the normalized input, rather than directly normalizing the input as conventional normalization layers do. We then develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone. We explore the composition of various basic blocks and architectures to achieve superior performance and efficient inference for generalizable depth completion. Extensive experiments are conducted on six unseen datasets with various types of sparse depth maps, i.e., randomly sampled 0.1%/1%/10% valid pixels, 4/8/16/32/64-line LiDAR points, and holes from Structured-Light. Our model consistently achieves the best accuracy with faster speed and lower memory use when compared to state-of-the-art methods.
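
As a rough illustration of the rescaling idea described above, the following Python sketch normalizes the input, feeds the normalized values to a single-layer perceptron, and multiplies the original input by the resulting gains. The layer-norm statistics, the sigmoid activation, the element-wise product, and all names (`layer_norm`, `sp_norm`, `weights`, `bias`) are assumptions made for this sketch, not details taken from the paper.

```python
import math


def layer_norm(x, eps=1e-12):
    """Zero-mean, unit-variance normalization of a feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sp_norm(x, weights, bias):
    """Hypothetical scale-propagating normalization.

    The gains are computed from the *normalized* input, so they do not
    depend on the input scale; the output is the *raw* input times the gains.
    """
    z = layer_norm(x)
    gains = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * zj for w, zj in zip(row, z)) + b
        gains.append(1.0 / (1.0 + math.exp(-pre_activation)))  # sigmoid (assumed)
    return [g * v for g, v in zip(gains, x)]


# Toy parameters for a 4-channel input (arbitrary values for illustration).
weights = [
    [0.2, -0.1, 0.0, 0.3],
    [0.0, 0.5, -0.2, 0.1],
    [-0.3, 0.1, 0.4, 0.0],
    [0.1, 0.0, -0.1, 0.2],
]
bias = [0.0, 0.1, -0.1, 0.05]

depth_like = [0.5, 2.0, 1.0, 4.0]
out_small = sp_norm(depth_like, weights, bias)
out_large = sp_norm([10.0 * v for v in depth_like], weights, bias)
```

Because the gains depend only on the normalized input, multiplying the input by a constant multiplies the output of this sketch by (almost exactly) the same constant, which is one way to read "propagating scale from input to output".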

Abstract:
This paper proposes an improved linear discriminant analysis called spectrally-corrected and regularized LDA (SRLDA). This approach incorporates design principles from both the spectrally-corrected covariance matrix and regularized discriminant analysis. With the support of large-dimensional random matrix theory, it is demonstrated that SRLDA achieves a globally optimal linear classification solution under the spiked model assumption. In simulation studies, the SRLDA classifier performs better than RLDA and ILDA, and close to the theoretical classifier. Empirical experiments across diverse datasets further show that the SRLDA algorithm excels in both classification accuracy and dimensionality reduction, outperforming currently employed tools.

Affiliations: College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, China; School of Artificial Intelligence, Anhui University, Hefei, China; School of Automation, Northwestern Polytechnical University, Xi’an, China; Chongqing University of Posts and Telecommunications, Chongqing, China; School of Computer Science and Center for Optical Imagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China; Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Abstract:
In this paper, we explain the mechanism of bilinear pooling as a module of hard sample generation, and find that bilinear pooling significantly expands the variances of the first-order vectors when it produces discriminative bilinear features. In conjunction with the extremely high dimensionality of the obtained bilinear features, those variances lead to overfitting in subsequent learning models. To solve this issue, we construct a bi-level optimization problem, where the high-level problem is the supervised classification loss, and the low-level problem is principal component analysis (PCA). Then, we find that PCA on bilinear features is equivalent to spectral clustering, which allows us to mathematically prove that the first $\log_2(C)$ principal components can support the discriminant information of $C$ classes. By removing the remaining principal components, the dimensionality and variances are simultaneously reduced. To the best of our knowledge, this is the first work providing a lower bound for dimension reduction for bilinear pooling. However, the PCA projection matrix $\mathbf{L}$ is prone to overfitting due to having many parameters. To address this issue, we propose a rank-$k$ general bilinear projection (RK-GBP) that decomposes $\mathbf{L}$ into two small matrices $\mathbf{U}$ and $\mathbf{V}$, which have fewer learnable parameters. Different from traditional bilinear projections used in factorized bilinear pooling (FBiP), our RK-GBP can preserve the orthogonality of columns in $\mathbf{L}$ by constraining the orthogonality of columns in $\mathbf{U}$ and $\mathbf{V}$. For computational efficiency, we relax the PCA in the low-level task into a dictionary learning problem, obtaining the rank-$k$ orthogonal factorization bilinear pooling (RK-OFBP). The RK-OFBP can be considered as a general form of current factorization bilinear pooling methods (e.g., Hadamard product-based ones). Finally, we evaluate our approach on fine-grained images and large-scale datasets, demonstrating that our proposed method not only produces extremely low-dimensional features but also outperforms other methods in classification tasks. For example, our RK-OFBP can employ 32-dimensional vectors to achieve comparable results to B-CNN (Lin, 2015) (dimension: $512$) for the 200-class classification task.
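
The link between a low-rank bilinear projection and Hadamard-product-style factorized pooling can be checked numerically. The sketch below assumes that one output dimension corresponds to one column of the projection matrix reshaped into a d×d matrix with a rank-k factorization `sum_r u_r v_r^T`; the function names and toy values are hypothetical, and the orthogonality constraints mentioned in the abstract are not modeled.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bilinear_feature(x):
    """Flattened outer product x x^T of a single local descriptor."""
    return [a * b for a in x for b in x]


def project_full(x, U, V):
    """Project the d*d bilinear feature with an explicit rank-k matrix.

    U and V are lists of k vectors (u_r, v_r); the explicit d*d projection
    is sum_r u_r v_r^T, flattened in the same order as bilinear_feature.
    """
    d, k = len(x), len(U)
    flat_projection = [
        sum(U[r][i] * V[r][j] for r in range(k))
        for i in range(d)
        for j in range(d)
    ]
    return dot(flat_projection, bilinear_feature(x))


def project_factorized(x, U, V):
    """Same quantity without forming the d*d feature: sum_r (u_r.x)(v_r.x)."""
    return sum(dot(u, x) * dot(v, x) for u, v in zip(U, V))


x = [1.0, 2.0, -1.0]
U = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # k = 2 left factors (toy values)
V = [[0.5, 1.0, 0.0], [1.0, -1.0, 2.0]]  # k = 2 right factors (toy values)

full_value = project_full(x, U, V)
factorized_value = project_factorized(x, U, V)
```

Both routes give the same value, but the factorized one needs only 2·k·d parameters per output dimension instead of d², which is the parameter saving the abstract refers to.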

Abstract:
Graph neural networks (GNNs) have proven effective in capturing relationships among nodes in a graph. This study introduces a novel perspective by considering a graph as a simplicial complex, encompassing nodes, edges, triangles, and $k$-simplices, enabling the definition of graph-structured data on any $k$-simplex. We design a novel Hodge-Laplacian heterogeneous graph attention network (HL-HGAT) to learn heterogeneous signal representations across $k$-simplices. The HL-HGAT incorporates three key components: HL convolutional filters (HL-filters), simplicial projection (SP), and simplicial attention pooling (SAP) operators, applied to $k$-simplices. HL-filters leverage the unique topology of $k$-simplices encoded by the Hodge-Laplacian (HL) operator, operating within the spectral domain of the $k$-th HL operator. To address computation challenges, we introduce a polynomial approximation for HL-filters, exhibiting spatial localization properties. Additionally, we propose a pooling operator to coarsen $k$-simplices, combining features through simplicial attention mechanisms of self-attention and cross-attention via transformers and SP operators, capturing topological interconnections across multiple dimensions of simplices. The HL-HGAT is comprehensively evaluated across diverse graph applications, including NP-hard problems, graph multi-label and classification challenges, and graph regression tasks in logistics, computer vision, biology, chemistry, and neuroscience. The results demonstrate the model’s efficacy and versatility in handling a wide range of graph-based scenarios.

Abstract:
We present a novel Geometry-aware Neural Interpolation (Geo-NI) framework for light field rendering. Previous learning-based approaches either perform direct interpolation via neural networks, which we dub Neural Interpolation (NI), or explore scene geometry for novel view synthesis, also known as Depth Image-Based Rendering (DIBR). Both kinds of approaches have their own strengths and weaknesses in addressing non-Lambertian effects and large disparity problems. In this paper, we combine the ideas behind these two kinds of approaches by launching the NI within a specific DIBR pipeline. Specifically, a DIBR network in the proposed Geo-NI serves to construct a novel reconstruction cost volume for neural interpolated light fields sheared by different depth hypotheses. The reconstruction cost can be interpreted as an indicator reflecting the reconstruction quality under a certain depth hypothesis, and is further applied to guide the rendering of the final high angular resolution light field. To implement the Geo-NI framework more practically, we further propose an efficient modeling strategy to encode high-dimensional cost volumes using a lower-dimensional network. By combining the strengths of NI and DIBR, the proposed Geo-NI is able to render views with large disparities with the help of scene geometry while also reconstructing non-Lambertian effects when depth is prone to be ambiguous. Extensive experiments on various datasets demonstrate the superior performance of the proposed geometry-aware light field rendering framework.

Abstract:
Survival analysis (SA) prediction involves the prediction of the time until an event of interest occurs (TTE), based on input attributes. The main challenge of SA is instances where the event is not observed (censored), typically because of an alternative (censoring) event. Most SA prediction methods suffer from drawbacks limiting the usage of advanced machine learning methods: ignoring the input of the censored samples, no separation between model and loss, and typically small datasets with high input dimensions. We propose a loss function, denoted suRvival Analysis lefT barrIer lOss (RATIO), that explicitly incorporates the censored samples' input in the prediction. RATIO accounts for the difference between censored and uncensored samples by only considering censoring events occurring after the predicted time, and through a linear term on the uncensored data event time. RATIO can be used with any prediction model. We further propose FIESTA, a data augmentation method combining the TTE of uncensored samples with the input of censored samples. We show that RATIO drastically improves the precision and reduces the bias of SA prediction in both models and real-life SA problems, and that FIESTA allows for the inclusion of high-dimensional data in SA methods even with a small number of uncensored samples.

Abstract:
This paper takes a crucial step in the development of energy-aware (EA) NAS methods by offering a benchmark that enhances the reproducibility and accessibility of EA-NAS research. Specifically, we introduce EA-HAS-Bench, the first large-scale energy-aware benchmark designed to enable the study of AutoML methods in achieving improved trade-offs between performance and search energy consumption. EA-HAS-Bench offers a vast architecture/hyperparameter joint search space, encompassing diverse configurations relevant to energy consumption, and proposes a novel surrogate model based on Bézier curves for predicting learning curves with versatile shapes and lengths. On the other hand, recent studies have started integrating large language models (LLMs) into AutoML frameworks to enhance model search efficiency and configuration prediction, yet challenges remain in adapting these methods for energy-efficient searches across vast configuration spaces, as they often neglect energy consumption metrics. As a result, we introduce the Language-Enhanced Shrinkage Search (LESS), a plug-and-play method that utilizes the analytical capabilities of LLMs to enhance the energy efficiency of existing hyperparameter optimization techniques. Moreover, we adapt existing AutoML algorithms to construct baselines. Our experiments demonstrate that these modified energy-aware AutoML methods and LESS achieve an improved balance between energy consumption and model performance.

Abstract:
Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Most of the existing methods rely on learning discriminative representations for prediction, often assuming that the underlying semantic components are correctly identified. However, this assumption does not always hold, leading to potential misidentifications that affect model robustness. Different from these discriminative methods, we propose a generative model that ensures Semantic-Components Identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified as semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Subsequently, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the component-wise identifiability of the SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results.

Abstract:
We present a new generalizable NeRF method that is able to directly generalize to new unseen scenarios and perform novel view synthesis with as few as two source views. The key to our approach lies in explicitly modeled correspondence matching information, which provides a geometry prior for the prediction of NeRF color and density for volume rendering. The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views, which is able to provide reliable cues about the surface geometry. Unlike previous methods where image features are extracted independently for each view, we model the cross-view interactions via Transformer cross-attention, which greatly improves the feature matching quality. Our method achieves state-of-the-art results on different evaluation settings, and the experiments show a strong correlation between our learned cosine feature similarity and volume density, demonstrating the effectiveness of our proposed method.
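
The matching cue described above (cosine similarity between features sampled at the 2D projections of a 3D point) can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: a pinhole camera with `x_cam = R X + t`, nearest-neighbour feature sampling, two-channel toy feature maps, and the helper names `project`, `sample`, `cosine`, and `matching_score`.

```python
import math


def project(point, K, R, t):
    """Pinhole projection of a 3D world point into pixel coordinates (u, v)."""
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    u = K[0][0] * cam[0] / cam[2] + K[0][2]
    v = K[1][1] * cam[1] / cam[2] + K[1][2]
    return u, v


def sample(feature_map, u, v):
    """Nearest-neighbour lookup in a feature map indexed as [row][col]."""
    height, width = len(feature_map), len(feature_map[0])
    col = min(max(int(round(u)), 0), width - 1)
    row = min(max(int(round(v)), 0), height - 1)
    return feature_map[row][col]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + eps)


def matching_score(point, view_a, view_b):
    """Cosine similarity of the features the point projects to in two views."""
    feats = []
    for feature_map, K, R, t in (view_a, view_b):
        u, v = project(point, K, R, t)
        feats.append(sample(feature_map, u, v))
    return cosine(feats[0], feats[1])


# Toy setup: 4x4 feature maps with one distinctive "surface" feature each.
background, surface = [0.0, 1.0], [1.0, 0.0]
feat_a = [[background] * 4 for _ in range(4)]
feat_b = [[background] * 4 for _ in range(4)]
feat_a[1] = [background, surface, background, background]
feat_b[1] = [background, background, [2.0, 0.0], background]

K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_a = (feat_a, K, identity, [0.0, 0.0, 0.0])
view_b = (feat_b, K, identity, [1.0, 0.0, 0.0])

score_on_surface = matching_score([0.0, 0.0, 2.0], view_a, view_b)
score_off_surface = matching_score([0.0, 0.0, 1.0], view_a, view_b)
```

In this toy scene the point at depth 2 projects onto matching features in both views and scores high, while the point at depth 1 along the same ray does not, which is the kind of geometry cue the similarity is meant to provide.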

Abstract:
Sign language pre-training (SLP) has significantly improved the performance of diverse sign language understanding (SLU) tasks. However, many existing methods employ pre-training techniques that are tailored to a specific task with small data scale, resulting in limited model generalization. Others focus solely on exploring visual cues, neglecting the semantic textual cues embedded in sign translation texts. These limitations inherently diminish the representative capacity of pre-trained models. To this end, we present a multimodal SLP framework that leverages rich visual contextual information and vision-language semantic consistency with massively available data to enhance the representative capability of sign language video. Specifically, we first curate a large-scale text-labeled sign pose dataset (~1.5M), namely SL-1.5M, from various sources to alleviate the scarcity of pre-training data. Subsequently, we propose a pre-training framework that integrates sign-text contrastive learning with masked pose modeling as the pretext task. In this way, our framework is able to effectively capture contextual cues within sign pose sequences and learn visual representations by aligning them with semantically rich text features in a latent space. Moreover, in order to grasp the comprehensive meaning of sign language videos, we concurrently model manual and non-manual information to ensure the holistic integrity of visual content. To validate the generalization and superiority of our proposed pre-trained framework, we conduct extensive experiments without intricate design on diverse SLU tasks, achieving new state-of-the-art performance on multiple benchmarks.

Abstract:
Remote sensing images usually contain various objects with complex structures at different locations within vast ground-area backgrounds. This poses a major challenge for conventional generative models in rendering remote sensing objects with correct shapes and clear textures. Integrating additional object-level controls is a potential way to improve generation quality, yet previous approaches inject the object-related conditions by specifying their locations, which limits the object layout in generated results. To enable high object fidelity, high layout diversity and object-customizable generation for remote sensing images, we propose a remote sensing image generation method based on object text decoupling, namely OTD-GAN. OTD-GAN takes advantage of the inherent text-to-image generation procedure and adaptively integrates the decoupled textual representations of visual objects into the global captions, thus achieving object-level controls without layout restrictions. Specifically, we design an object text decoupling module to predict a semantically consistent textual representation for each object. By decoupling the textual representation into a class-invariant part and an object-specific part, the converted representation is able to capture general semantics for similar objects as well as differentiated details for individual objects. After that, we use an object text semantic enhancement module to fuse the obtained object text representations with the global captions to enrich the object-related semantics within the textual modality. As a result, the generator benefits from the object conditions and improves generation quality while retaining the flexibility to create diverse layouts. Extensive experiments on remote sensing image-caption datasets including NWPU-Captions and RSICD demonstrate that our method achieves leading performance compared to existing state-of-the-art approaches.

Affiliations: School of Data Science, Chinese University of Hong Kong, Shenzhen, China; College of Control Science and Engineering and the State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China; State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; Department of Electronic and Computer Engineering, Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

Abstract:
Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental problem in computer vision. Existing works generally start from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose an optimal two-step algorithm to solve it: in the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; in the second step, we execute a one-step Gauss-Newton iteration on the manifold to refine the consistent estimator. We prove that the proposed estimator achieves the same asymptotic statistical properties as the ML estimator. The first is consistency, i.e., the estimator converges to the ground truth as the number of points increases. The second is asymptotic efficiency, i.e., the mean squared error of the estimator converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics give our estimator a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms state-of-the-art ones in terms of estimation accuracy and CPU time.
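
For readers unfamiliar with the baseline formulation that the abstract contrasts against, the sketch below checks the textbook epipolar constraint `x2^T E x1 = 0` with `E = [t]_x R`. It assumes the convention `X2 = R X1 + t` and normalized (calibrated) image coordinates; the helper names and numeric values are illustrative only, and this is not the two-step estimator proposed in the paper.

```python
import math


def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    return [
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ]


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, x):
    """3x3 matrix times 3-vector."""
    return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]_x R for the motion X2 = R X1 + t."""
    return matmul(cross_matrix(t), R)


def normalize_point(X):
    """Normalized homogeneous image coordinates of a 3D point in camera frame."""
    return [X[0] / X[2], X[1] / X[2], 1.0]


def epipolar_residual(E, x1, x2):
    """Algebraic residual x2^T E x1; zero for noise-free correspondences."""
    return sum(a * b for a, b in zip(x2, matvec(E, x1)))


# Toy relative pose: rotation about the z-axis plus a small translation.
c, s = math.cos(0.3), math.sin(0.3)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.2, -0.1]

X1 = [0.3, -0.4, 5.0]                              # point in the first camera frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]     # same point in the second frame

E = essential_matrix(R, t)
x1, x2 = normalize_point(X1), normalize_point(X2)
clean_residual = epipolar_residual(E, x1, x2)
noisy_residual = epipolar_residual(E, x1, [x2[0] + 0.1, x2[1], 1.0])
```

With exact correspondences the residual vanishes up to rounding, while a perturbed image point gives a non-zero residual. Minimizing such algebraic residuals is what essential-matrix methods do, and it does not correspond to the ML criterion under noisy image measurements.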

Abstract:
Neural Radiance Fields (NeRF) is a popular view synthesis technique that represents a scene as a continuous volumetric function, parameterized by multilayer perceptrons that provide the volume density and view-dependent emitted radiance at each location. While NeRF-based techniques excel at representing fine geometric structures with smoothly varying view-dependent appearance, they often fail to accurately capture and reproduce the appearance of glossy surfaces. We address this limitation by introducing Ref-NeRF, which replaces NeRF's parameterization of view-dependent outgoing radiance with a representation of reflected radiance and structures this function using a collection of spatially-varying scene properties. We show that together with a regularizer on normal vectors, our model significantly improves the realism and accuracy of specular reflections. Furthermore, we show that our model's internal representation of outgoing radiance is interpretable and useful for scene editing.
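
To make the idea of "reflected radiance" concrete, the sketch below computes the standard mirror-reflection direction of a viewing direction about a surface normal, `w_r = 2 (w_o . n) n - w_o`. The abstract does not spell out this formula, so treat it as an illustration under the assumptions that `w_o` points from the surface toward the camera and that `n` is a unit normal; the function name `reflect` is hypothetical.

```python
def reflect(w_o, n):
    """Mirror-reflect direction w_o about the unit normal n.

    w_o points away from the surface (toward the viewer); the result also
    points away from the surface, on the opposite side of the normal.
    """
    d = sum(a * b for a, b in zip(w_o, n))
    return [2.0 * d * nj - wj for wj, nj in zip(w_o, n)]


normal = [0.0, 0.0, 1.0]
view_dir = [0.6, 0.0, 0.8]            # unit vector toward the camera
reflected = reflect(view_dir, normal)  # flips the tangential component
head_on = reflect(normal, normal)      # looking straight down the normal
```

Querying appearance as a function of the reflected direction rather than the raw viewing direction makes a mirror-like highlight a simpler function to represent, which is the intuition behind the reparameterization.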

Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China; University of Sydney, Darlington, NSW, Australia; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; JD Explore Academy, Beijing, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; College of Computing and Data Science, Nanyang Technological University, Singapore

Abstract:
Reward finetuning has emerged as a powerful technique for aligning diffusion models with specific downstream objectives or user preferences. However, current approaches suffer from a persistent challenge of reward overoptimization, where models exploit imperfect reward feedback at the expense of overall performance. In this work, we identify three key contributors to overoptimization: (1) a granularity mismatch between the multi-step diffusion process and sparse rewards; (2) a loss of plasticity that limits the model’s ability to adapt and generalize; and (3) an overly narrow focus on a single reward objective that neglects complementary performance criteria. Accordingly, we introduce Constrained Diffusion Policy Optimization (CDPO), a novel reinforcement learning framework that addresses reward overoptimization from multiple angles. Firstly, CDPO tackles the granularity mismatch through a temporal policy optimization strategy that delivers step-specific rewards throughout the entire diffusion trajectory, thereby reducing the risk of overfitting to sparse final-step rewards. Then we incorporate a neuron reset strategy that selectively resets overactive neurons in the model, preventing overoptimization induced by plasticity loss. Finally, to avoid overfitting to a narrow reward objective, we integrate constrained reinforcement learning with auxiliary reward objectives serving as explicit constraints, ensuring a balanced optimization across diverse performance metrics.

Abstract:
Graph contrastive learning is usually performed by first conducting Graph Data Augmentation (GDA) and then employing a contrastive learning pipeline to train GNNs. GDA is therefore an important issue for graph contrastive learning. Various GDAs have been developed recently, which mainly involve dropping or perturbing edges, nodes, node attributes and edge attributes. However, to our knowledge, there is still no universal and effective augmentor that is suitable for different types of graph data. To address this issue, in this paper, we first introduce the graph message representation of graph data. Based on it, we then propose a novel Graph Message Augmentation (GMA), a universal scheme for reformulating many existing GDAs. The proposed unified GMA not only gives a new perspective for understanding many existing GDAs but also provides a universal and more effective graph data augmentation for graph self-supervised learning tasks. Moreover, GMA introduces an easy way to implement the mixup augmentor, which is natural for images but usually challenging for graphs. Based on the proposed GMA, we then propose a unified graph contrastive learning method, termed Graph Message Contrastive Learning (GMCL), that employs attribution-guided universal GMA for graph contrastive learning. Experiments on many graph learning tasks demonstrate the effectiveness and benefits of the proposed GMA and GMCL approaches.

Abstract:
We propose a novel spatial-temporal graph Mamba (STG-Mamba) for the music-guided dance video synthesis task, i.e., to translate the input music to a dance video. STG-Mamba consists of two translation mappings: music-to-skeleton translation and skeleton-to-video translation. In the music-to-skeleton translation, we introduce a novel spatial-temporal graph Mamba (STGM) block to effectively construct skeleton sequences from the input music, capturing dependencies between joints in both the spatial and temporal dimensions. For the skeleton-to-video translation, we propose a novel self-supervised regularization network to translate the generated skeletons, along with a conditional image, into a dance video. Lastly, we collect a new skeleton-to-video translation dataset from the Internet, containing 54,944 video clips. Extensive experiments demonstrate that STG-Mamba achieves significantly better results than existing methods.

Abstract:
We introduce a monocular-to-3D virtual try-on network based on a conditional 3D-aware Generative Adversarial Network (3D-GAN) for synthesizing multi-view try-on results from single monocular images. In contrast to previous 3D virtual try-on methods that rely on costly scanned meshes or pseudo-depth maps for supervision, our approach utilizes a conditional 3D-GAN trained solely on 2D images, greatly simplifying dataset construction and enhancing model scalability. Specifically, we propose a Generative monocular-to-3D Virtual Try-ON network (G3D-VTON) that integrates a 3D-aware conditional Parsing Module (3DPM), a U-Net Refinement Module (URM), and a Flow-based 2D Virtual Try-On Module (FTM). In our framework, the 3DPM is designed to generate a 3D representation of the virtual try-on result, thereby enabling multi-view rendering. To accomplish this, it is implemented using conditional generative semantic articulated fields, which leverage the 3D SMPL prior via inverse skinning to learn the Signed Distance Function (SDF) of the try-on results in a canonical pose space. This learned SDF enables the rendering of both a coarse human parsing map and a preliminary try-on output with explicit camera control. Furthermore, within 3DPM, we introduce deferred pose guidance to decouple style and pose conditions during training, thereby facilitating view controllable generation during inference. However, the rendered human parsing and try-on results exhibit imprecise shapes and blurry textures. To address these issues, the URM subsequently refines these rendered outputs using a refinement U-Net, and the FTM integrates the refined results with the 2D warped garment to generate the final try-on output with more accurate and realistic appearance details. 
Extensive experiments demonstrate that the proposed G3D-VTON effectively manipulates and generates faithful 3D human appearances wearing the desired garment, outperforming both 3D-GAN and depth-based 3D approaches while delivering superior visual results in 2D.

Abstract:
The pretraining-finetuning paradigm has become dominant in computer vision, yet strategically exploiting limited annotation budgets during finetuning remains unexplored. We introduce active finetuning, a novel task for selecting the most informative samples to annotate within this paradigm. We propose Sel4FT, a unified annotation selection framework that optimizes a parametric model in continuous feature space to identify a subset preserving the entire pool’s distribution while maintaining diversity. To address distribution shifts from data augmentation, we develop Sel4FT++ with augmentation-aware selection mechanisms. We theoretically prove that our approach minimizes the Earth Mover’s Distance between the selected subset and the full data pool. Our framework eliminates the iterative retraining and annotation process during selection, providing an efficient solution for real-world deployment. Extensive experiments on image classification, long-tailed recognition, and semantic segmentation demonstrate state-of-the-art performance with over 100× speedup compared to existing methods.

Abstract:
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g., a glass object being placed on a wooden surface), which we verify from video, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4 k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2 k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods. We also conduct analysis on: the temporal overlap between audio events, the temporal and label correlations between audio and visual modalities, the ambiguities in annotating materials from audio-only input, the importance of audio-only labels and the limitations of current models to understand actions that sound.

Abstract:
Deep neural network (DNN) deployment has been confined to larger hardware devices due to their expensive computational requirements. This challenge has recently reached another scale with the emergence of large language models (LLMs). In order to reduce both their memory footprint and latency, a promising technique is quantization. It consists in converting floating point representations to low bit-width fixed point representations, usually by assuming a uniform mapping onto a regular grid. This process, referred to in the literature as uniform quantization, may however be ill-suited as most DNN weights and activations follow a bell-shaped distribution. This is even worse on LLMs whose weight distributions are known to exhibit large, high impact, outlier values. In this work, we propose an improvement over the most commonly adopted way to tackle this limitation in deep learning models quantization, namely, non-uniform quantization. NUPES leverages automorphisms to preserve the scalar multiplications. Such transformations are derived from power functions. However, the optimization of the exponent parameter and weight values remains a challenging and novel problem which could not be solved with previous post training optimization techniques which only learn to round up or down weight values in order to preserve the predictive function. We circumvent this limitation with a new paradigm: learning new quantized weights over the entire quantized space. Similarly, we enable the optimization of the power exponent, i.e. the optimization of the quantization operator itself during training by alleviating all the numerical instabilities. The resulting predictive function is compatible with integer-only low-bit inference. We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations. 
Our empirical benchmarks highlight the ability of NUPES to circumvent the limitations of previous post-training quantization techniques on transformers and large language models in particular.
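
To make the idea of a power-function quantizer concrete, the following is a minimal sketch, not the NUPES procedure itself. It assumes a symmetric signed grid and the mapping sign(x)·|x|^a applied before rounding; the function names, the `scale` argument, and the 4-bit default are illustrative assumptions, and the exponent `a` is fixed by hand here rather than learned.

```python
import math

def power_quantize(x, scale, a, bits=4):
    """Map x to a signed integer level after a power transform (a=1 is uniform)."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4 bits
    t = math.copysign((abs(x) / scale) ** a, x)       # sign-preserving power map
    return max(-levels, min(levels, round(t * levels)))

def power_dequantize(q, scale, a, bits=4):
    """Invert the power transform to recover an approximate real value."""
    levels = 2 ** (bits - 1) - 1
    return math.copysign(scale * (abs(q) / levels) ** (1.0 / a), q)

# With a < 1 the grid is denser near zero, where bell-shaped weights concentrate.
small = 0.04
q_uniform = power_quantize(small, scale=1.0, a=1.0)   # rounds to level 0
q_power = power_quantize(small, scale=1.0, a=0.5)     # rounds to level 1
```

In this toy setting, a small weight that a uniform 4-bit grid rounds to zero keeps a nonzero level under the power map, which is the motivation the abstract gives for non-uniform quantization of bell-shaped distributions.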

Abstract:
Deep learning has advanced rapidly, but relies heavily on large-labeled datasets for effective training. This is particularly challenging in fields like medicine, where expert labeling is costly, labor-intensive, and prone to bias and error. Semi-supervised learning (SSL) addresses this challenge by reducing reliance on labeled data. SSL is closely tied to the concept of causation. However, recent works relating causality to SSL are limited by modeling only low-dimensional observations or designing a plug-in module to alleviate the class imbalance. In this paper, we take steps towards training causal generative models for semi-supervised learning, combining principles from causality and variational inference. We interpret the Mixup strategy as a stochastic intervention and introduce a consistency loss to promote coherent latent representations. Under reasonable assumptions, we provide theoretical guarantees that the learned latent representations align with true causal factors up to permissible ambiguities. The experimental results show the proposed approach achieves state-of-the-art performance on several medical datasets of different modalities. Additionally, we test our model on standard benchmarking datasets: CIFAR10, CIFAR100, and SVHN, where it achieves competitive performance.
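
For readers unfamiliar with Mixup, here is a minimal sketch of the standard formulation, in which two samples and their label vectors are blended with a coefficient drawn from a Beta distribution. It does not reproduce the paper's causal interpretation or its consistency loss; the function name, the `alpha` default, and the list-based inputs are assumptions made for illustration.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """Blend two (input, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

# Toy usage: two 2-dimensional inputs with one-hot labels.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

Because both inputs and labels are mixed with the same coefficient, the blended label remains a valid probability vector.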

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositional concepts composed of seen single concepts. One of the problems of CZSL is to model attributes interacting with objects and objects interacting with attributes. In this work, we focus on this problem and propose Dual-Stream Conditional Network (DSCNet) that learns dual-stream conditional concepts as a solution, where the conditional visual and semantic embeddings of attributes and objects are learned. First, we argue that the condition of the attribute or object is supposed to contain the recognized object and input image, or the recognized attribute and input image. Next, for each concept which can either be an attribute or object, in the semantic stream, we propose to encode the recognized object or attribute semantic features and the input image visual features as the encoded condition, which is then injected into all concept semantic embeddings by a semantic cross encoder to acquire conditional semantic embeddings. In the visual stream, the conditional attribute or object visual embeddings are acquired by injecting the semantic features of the recognized object or attribute into the mapped attribute or object visual features. Experimental results on CZSL benchmarks demonstrate the superiority of our proposed method.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising representation for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. In this paper, we propose HAC++, which explicitly minimizes the representation’s entropy during optimization, enabling efficient arithmetic coding after training for compressed storage. Specifically, to reduce entropy, HAC++ leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate non-effective Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100× compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20× size reduction compared to Scaffold-GS.

Abstract:
In modern computer vision, the optimal representation of 3D shape remains task-dependent. One fundamental operation applied to such representations is differentiable rendering, which enables learning-based inverse graphics approaches. Standard explicit representations are often easily rendered, but can suffer from limited geometric fidelity, among other issues. On the other hand, implicit representations generally preserve greater fidelity, but suffer from difficulties with rendering, limiting scalability. In this work, we devise Directed Distance Fields (DDFs), which map a ray or oriented point (position and direction) to surface visibility and depth. This enables efficient differentiable rendering, obtaining depth with a single forward pass per pixel, as well as higher-order geometry with only additional backward passes. Using probabilistic DDFs (PDDFs), we can model the inherent discontinuities in the underlying field. We then apply DDFs to single-shape fitting, generative modelling, and 3D reconstruction, showcasing strong performance with simple architectural components via the versatility of our representation. Finally, since the dimensionality of DDFs permits view-dependent geometric artifacts, we conduct a theoretical investigation of the constraints necessary for view consistency. We find a small set of field properties that are sufficient to guarantee a DDF is consistent, without knowing which shape the field is expressing.

Abstract:
In this paper, we introduce OpenCIR, a fully-functional Conditional Image Repainting (CIR) model designed for local image editing. Given an image and a combination of conditions related to geometry, texture, and color, CIR models are required to repaint instances and seamlessly composite them with the original images. Previous CIR models suffer from limited object categories, restricted condition modalities, and demanded geometry precision. In contrast, leveraging the generative priors from pre-trained models, OpenCIR could repaint open object categories. Equipped with redesigned condition injection modules and the condition extension strategy, OpenCIR is able to understand open condition modalities. Adopting the contour refinement strategy, OpenCIR allows users to specify instances with open geometry precision. In addition, we contribute the Open-CIR dataset, which includes detailed annotations, tailored for the comprehensive training and evaluation of the OpenCIR model. Extensive experiments demonstrate that OpenCIR outperforms relevant state-of-the-art methods, achieving superior visual quality, and more favorable results by human evaluators.

Abstract:
Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities could result in the biased attention, driving the model to merely focus on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into the foreground and background representation via exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks.

Abstract:
Panchromatic (PAN) and multi-spectral (MS) remote satellite image fusion, known as pan-sharpening, aims to produce high-resolution MS images by combining the complementary information from the high-resolution, texture-rich PAN and the low-resolution but high spectral-resolution MS counterparts. Despite notable advancements in this field, the current state-of-the-art pan-sharpening techniques do not explicitly address the spatial resolution mismatching problem between the two modalities of PAN and MS images. This mismatching issue can lead to misalignment in feature representation and the creation of blurry artifacts in the model output, ultimately hindering the generation of high-frequency textures and impeding the performance improvement of such methods. To address the aforementioned spatial resolution mismatching problem in pan-sharpening, we propose a novel modality-aware feature-aligned pan-sharpening framework in this paper. The framework comprises three primary stages: modality-aware feature extraction, modality-aware feature aligning, and context integrated image reconstruction. First, we introduce the half-instance normalization strategy as the backbone to filter out the inconsistent features and promote the learning of consistent features between the PAN and MS modalities. Second, a learnable modality-aware feature interpolation is devised to effectively address the misalignment issue. Specifically, the extracted features from the backbone are integrated to predict the transformation offsets of each pixel, which allows for the adaptive selection of custom contextual information and enables the modality-aware features to be more aligned. Finally, within the context of the interactive offset correction, multi-stage information is aggregated to generate the feasible pan-sharpened model output. 
Extensive experimental results over multiple satellite datasets demonstrate that the proposed algorithm outperforms other state-of-the-art methods both qualitatively and quantitatively, exhibiting great generalization ability to real-world scenes.

Abstract:
Existing Graph Attention Networks (GATs) generally adopt the self-attention mechanism to learn graph edge attention, which usually returns dense attention coefficients over all neighbors and is thus sensitive to graph edge noise. To overcome this problem, sparse GATs are desirable and have garnered increasing interest in recent years. However, existing sparse GATs usually suffer from high training complexity and are not straightforward to apply to inductive learning tasks. To address these issues, we propose to learn sparse GATs by exploiting the spiking neuron (SN) mechanism, termed Graph Spiking Attention (GSAT). Specifically, spiking neurons are known to perform inexpensive information processing by converting the input data into discrete spike trains and returning sparse outputs. Inspired by this, this work exploits spiking neurons to learn sparse attention coefficients, resulting in an edge-sparsified graph for GNNs. GSAT can therefore perform message passing on the selected neighbors naturally, which makes it compact and robust w.r.t. graph noise. Moreover, GSAT can be used straightforwardly for inductive learning tasks. Extensive experiments on both transductive and inductive tasks demonstrate the effectiveness, robustness and efficiency of GSAT.

Abstract:
Diffusion models suffer severe object repetition and local distortion when the inference resolution differs from its pre-trained resolution. We propose AccDiffusion v2, an accurate method for patch-wise higher-resolution diffusion extrapolation without training. Our in-depth analysis in this paper shows that using an identical text prompt for different patches leads to repetitive generation, while the absence of a prompt undermines image details. In response, our AccDiffusion v2 novelly decouples the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of a patch. Further analysis reveals that local distortion arises from inaccurate descriptions in prompts about the local structure of higher-resolution images. To address this issue, AccDiffusion v2, for the first time, introduces an auxiliary local structural information through ControlNet during higher-resolution diffusion extrapolation aiming to mitigate the local distortions. Finally, our analysis indicates that global semantic information is conducive to suppressing both repetitive generation and local distortion. Hence, our AccDiffusion v2 further proposes dilated sampling with window interaction for better global semantic information during higher-resolution diffusion extrapolation. We conduct extensive experiments, including both quantitative and qualitative comparisons, to demonstrate the efficacy of our AccDiffusion v2. The quantitative comparison shows that AccDiffusion v2 achieves state-of-the-art performance in image generation extrapolation without training. The qualitative comparison intuitively illustrates that AccDiffusion v2 effectively suppresses the issues of repetitive generation and local distortion in image generation extrapolation.

Affiliations: Nanyang Technological University, Singapore; Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence, Masdar City, United Arab Emirates; VCIP, School of Computer Science, Nankai University, Tianjin, China; Northeastern University, Shenyang, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; Centre for Frontier AI Research (CFAR) and Institute of High Performance Computing(IHPC), Agency for Science, Technology and Research(A*STAR), Singapore; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China

Abstract:
Vision-language pre-training (VLP) models excel at interpreting both images and text but remain vulnerable to multimodal adversarial examples (AEs). Advancing the generation of transferable AEs, which succeed across unseen models, is key to developing more robust and practical VLP models. Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process, aiming to improve transferability by expanding the contrast space of image-text features. However, these methods focus solely on diversity around the current AEs, yielding limited gains in transferability. To address this issue, we propose to increase the diversity of AEs by leveraging the intersection regions along the adversarial trajectory during optimization. Specifically, we propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity. We provide a theoretical analysis to demonstrate the effectiveness of the proposed adversarial evolution triangle. Moreover, we find that redundant inactive dimensions can dominate similarity calculations, distorting feature matching and making AEs model-dependent with reduced transferability. Hence, we propose to generate AEs in the semantic image-text feature contrast space, which can project the original feature space into a semantic corpus subspace. The proposed semantic-aligned subspace can reduce the image feature redundancy, thereby improving adversarial transferability. Extensive experiments across different datasets and models demonstrate that the proposed method can effectively improve adversarial transferability and outperform state-of-the-art adversarial attack methods.

Abstract:
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes. Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when adding a perturbation to the weight. However, indiscriminate perturbation of SAM on all parameters is suboptimal and results in excessive computation, double the overhead of common optimizers like Stochastic Gradient Descent (SGD). In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation by a binary mask. To obtain the sparse mask, we provide two solutions based on Fisher information and dynamic sparse training, respectively. We investigate the impact of different masks, including unstructured, structured, and N:M structured patterns, as well as explicit and implicit forms of implementing sparse perturbation. We theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$. Sparse SAM has the potential to accelerate training and smooth the loss landscape effectively. Extensive experimental results on CIFAR and ImageNet-1K confirm that our method is superior to SAM in terms of efficiency, and the performance is preserved or even improved with a perturbation of merely 50% sparsity.
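
The sparse perturbation described above can be illustrated with a small sketch on a toy quadratic loss. This is an illustration under stated assumptions, not the SSAM implementation: the mask below simply keeps the coordinates with the largest squared gradient, a crude stand-in for the Fisher-information and dynamic-sparse-training masks named in the abstract, and all function names and hyperparameter defaults are hypothetical.

```python
import math

def grad(w, c):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i**2)."""
    return [ci * wi for ci, wi in zip(c, w)]

def topk_mask(g, sparsity):
    """Binary mask keeping the (1 - sparsity) fraction with largest g_i**2."""
    keep = max(1, round(len(g) * (1.0 - sparsity)))
    order = sorted(range(len(g)), key=lambda i: g[i] ** 2, reverse=True)
    kept = set(order[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(g))]

def sparse_sam_step(w, c, lr=0.1, rho=0.05, sparsity=0.5):
    """One SAM-style update where only masked coordinates are perturbed."""
    g = grad(w, c)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    mask = topk_mask(g, sparsity)
    # Ascent step of radius rho, zeroed outside the mask.
    eps = [rho * gi / norm * mi for gi, mi in zip(g, mask)]
    # Descent step using the gradient at the perturbed weights.
    g_pert = grad([wi + ei for wi, ei in zip(w, eps)], c)
    return [wi - lr * gi for wi, gi in zip(w, g_pert)], mask

w0 = [1.0, -2.0, 0.5, 0.1]
w1, mask = sparse_sam_step(w0, c=[1.0, 1.0, 1.0, 1.0])
```

With 50% sparsity, only two of the four coordinates receive the ascent perturbation; the other two are updated exactly as plain gradient descent would update them.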

Abstract:
Training perception systems for self-driving cars requires substantial 2D annotations that are labor-intensive to label manually. While existing datasets provide rich annotations on pre-recorded sequences, they fall short in labeling rarely encountered viewpoints, potentially hampering the generalization ability of perception models. In this paper, we present PanopticNeRF-360, a novel approach that combines coarse 3D annotations with noisy 2D semantic cues to generate high-quality panoptic labels and images from any viewpoint. Our key insight lies in exploiting the complementarity of 3D and 2D priors to mutually enhance geometry and semantics. Specifically, we propose to leverage coarse 3D bounding primitives and noisy 2D semantic and instance predictions to guide geometry optimization, by encouraging predicted labels to match panoptic pseudo ground truth. Simultaneously, the improved geometry assists in filtering 3D and 2D annotation noise by fusing semantics in 3D space via a learned semantic field. To further enhance appearance, we combine MLP and hash grids to yield hybrid scene features, striking a balance between high-frequency appearance and contiguous semantics. Our experiments demonstrate PanopticNeRF-360’s state-of-the-art performance over label transfer methods on the challenging urban scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360 enables omnidirectional rendering of high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance labels.

Abstract:
Fashion attribute editing is essential for combining the expertise of fashion designers with the potential of generative artificial intelligence. In this work, we focus on ‘any’ fashion attribute editing: 1) the ability to edit 78 fine-grained design attributes commonly observed in daily life; 2) the capability to modify desired attributes while keeping the rest components still; and 3) the flexibility to continuously edit on the edited image. To this end, we present the Any Fashion Attribute Editing (AFED) dataset, which includes 830 K high-quality fashion images from sketch and product domains, filling the gap for a large-scale, openly accessible fine-grained dataset. We also propose Twin-Net, a twin encoder-decoder GAN inversion method that offers diverse and precise information for high-fidelity image reconstruction. This inversion model, trained on the new dataset, serves as a robust foundation for attribute editing. Additionally, we introduce PairsPCA to identify semantic directions in latent space, enabling accurate editing without manual supervision. Comprehensive experiments, including comparisons with ten state-of-the-art image inversion methods and four editing algorithms, demonstrate the effectiveness of our Twin-Net and editing algorithm. All data and models are available at https://github.com/ArtmeScienceLab/AnyFashionAttributeEditing.

Abstract:
Federated learning (FL), recognized for its decentralized and privacy-preserving nature, faces vulnerabilities to backdoor attacks that aim to manipulate the model’s behavior on attacker-chosen inputs. Most existing defenses based on statistical differences take effect only against specific attacks. This limitation becomes significantly pronounced when malicious gradients closely resemble benign ones or the data exhibits non-IID characteristics, making the defenses ineffective against stealthy attacks. This paper revisits distance-based defense methods and uncovers two critical insights: First, Euclidean distance becomes meaningless in high dimensions. Second, a single metric cannot identify malicious gradients with diverse characteristics. As a remedy, we propose FedID, a simple yet effective strategy employing multiple metrics with dynamic weighting for adaptive backdoor detection. Besides, we present a modified z-score approach to select the gradients for aggregation. Notably, FedID does not rely on predefined assumptions about attack settings or data distributions and minimally impacts benign performance. We conduct extensive experiments on various datasets and attack scenarios to assess its effectiveness. FedID consistently outperforms previous defenses, particularly excelling in challenging Edge-case PGD scenarios. Our experiments highlight its robustness against adaptive attacks tailored to break the proposed defense and adaptability to a wide range of non-IID data distributions without compromising benign performance.
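
The abstract does not spell out which modified z-score FedID uses, so the sketch below shows the common median/MAD form as an assumption. It takes one hypothetical anomaly score per client (for example, a weighted combination of several distance metrics, which is not implemented here) and keeps the clients whose score is not an outlier on the high side. The 0.6745 constant and the 3.5 threshold are the conventional choices for this statistic, not values taken from the paper.

```python
from statistics import median

def modified_z_scores(scores):
    """Median/MAD-based z-scores, which are robust to a few extreme values."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    if mad == 0:
        return [0.0 for _ in scores]
    return [0.6745 * (s - med) / mad for s in scores]

def select_clients(scores, threshold=3.5):
    """Indices of clients whose anomaly score is not a high-side outlier."""
    z = modified_z_scores(scores)
    return [i for i, zi in enumerate(z) if zi <= threshold]

# Five clients with similar scores and one clear outlier (index 5).
client_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 9.0]
selected = select_clients(client_scores)
```

Using the median and the median absolute deviation rather than the mean and standard deviation keeps the statistic from being dragged toward the very outliers it is meant to detect.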

Abstract:
Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called h-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rules. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers a valuable reference for learning reliable likelihood in related fields.

Abstract:
Shared training approaches, such as multi-task learning (MTL) and gradient-based meta-learning, are widely used in various machine learning applications, but they often suffer from negative transfer, leading to performance degradation in specific tasks. While several optimisation techniques have been developed to mitigate this issue for pre-selected task cohorts, identifying optimal task combinations for joint learning—known as task grouping—remains underexplored and computationally challenging due to the exponential growth in task combinations and the need for extensive training and evaluation cycles. This paper introduces an efficient task grouping framework designed to reduce these overwhelming computational demands of the existing methods. The proposed framework infers pairwise task similarities through a sample-wise optimisation landscape analysis, eliminating the need for the shared model training required to infer task similarities in existing methods. With task similarities acquired, a graph-based clustering algorithm is employed to pinpoint near-optimal task groups, providing an approximate yet efficient and effective solution to the originally NP-hard problem. Empirical assessments conducted on 9 different datasets highlight the effectiveness of the proposed framework, revealing a five-fold speed enhancement compared to previous state-of-the-art methods. Moreover, the framework consistently demonstrates comparable performance, confirming its remarkable efficiency and effectiveness in task grouping.

Abstract:
One central theme in machine learning is function estimation from sparse and noisy data. An example is supervised learning where the elements of the training set are couples, each containing an input location and an output response. In the last decades, a substantial amount of work has been devoted to design estimators for the unknown function and to study their convergence to the optimal predictor, also characterizing the learning rate. These results typically rely on stationary assumptions where input locations are drawn from a probability distribution that does not change in time. In this work, we consider kernel-based ridge regression and derive convergence conditions under non stationary distributions, addressing also cases where stochastic adaption may happen infinitely often. This includes the important exploration-exploitation problems where e.g., a set of agents/robots has to monitor an environment to reconstruct a sensorial field and their movements rules are continuously updated on the basis of the acquired knowledge on the field and/or the surrounding environment.

Abstract:
The occlusion of the sun by clouds is one of the primary sources of uncertainties in solar power generation, and is a factor that affects the wide-spread use of solar power as a primary energy source. Real-time forecasting of cloud movement and, as a result, solar irradiance is necessary to schedule and allocate energy across grid-connected photovoltaic systems. Previous works monitored cloud movement using wide-angle field of view imagery of the sky. However, such images have poor resolution for clouds that appear near the horizon, which reduces their effectiveness for long term prediction of solar occlusion. Specifically, to be able to predict occlusion of the sun over long time periods, clouds that are near the horizon need to be detected, and their velocities estimated precisely. To enable such a system, we design and deploy a catadioptric system that delivers wide-angle imagery with uniform spatial resolution of the sky over its field of view. To enable prediction over a longer time horizon, we design an algorithm that uses carefully selected spatio-temporal slices of the imagery using estimated wind direction and velocity as inputs. Using ray-tracing simulations as well as a real testbed deployed outdoors, we show that the system is capable of predicting solar occlusion as well as irradiance for tens of minutes in the future, which is an order of magnitude improvement over prior work.

Abstract:
Recent advances in text-to-video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose MagicTime, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a simple yet effective two-stage Magic Adaptive Strategy, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called ChronoMagic, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting time-lapse video generation is a promising path toward building metamorphic simulators of the physical world.

Abstract:
When taking images against strong light sources, the resulting images often contain heterogeneous flare artifacts. These artifacts can significantly affect image visual quality and downstream computer vision tasks. Because collecting real data pairs of flare-corrupted/flare-free images for training flare removal models is challenging, current methods use the direct-add approach to synthesize training data. However, these methods do not consider automatic exposure and tone mapping in the image signal processing pipeline (ISP), which limits the generalization capability of deep models trained on such data. Besides, existing light source recovery methods can hardly recover multiple light sources due to the different sizes, shapes, and illuminance of various light sources. In this paper, we propose a solution to improve the performance of lens flare removal by revisiting the ISP, remodeling the principle of automatic exposure in the synthesis pipeline, and designing a more reliable light source recovery strategy. The new pipeline approaches realistic imaging by discriminating the local and global illumination through a convex combination, avoiding global illumination shifting and local over-saturation. Moreover, current deep models generalize only to specific devices due to the diversity of cameras’ ISPs. To achieve better generalization on different devices, we formulate the generalization problem as an adversarial training problem and embed an adversarial curve learning (ACL) paradigm in the synthesis pipeline to gain better performance. For recovering multiple light sources, our strategy convexly averages the input and output of the neural network based on illuminance levels, thereby avoiding the need for a hard threshold in identifying light sources. We also contribute a new flare removal testing dataset containing the flare-corrupted images captured by fifteen types of consumer electronics. The dataset facilitates the verification of the generalization capability of flare removal methods. Extensive experiments show that our solution can effectively improve the performance of lens flare removal and push the frontier toward more general situations.
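
The light source recovery idea in the abstract above (a convex average of the network input and output, weighted by illuminance instead of a hard threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the smoothstep weight, its `low` and `high` knees, the Rec. 709 luma coefficients, and all function names are assumptions introduced here.

```python
def luminance(rgb):
    """Rec. 709 luma of an (r, g, b) pixel with channels in [0, 1] (assumed measure)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def blend_weight(lum, low=0.9, high=1.0):
    """Hypothetical soft weight: 0 below `low`, 1 above `high`, smooth in between."""
    t = min(max((lum - low) / (high - low), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)  # smoothstep, so there is no hard cut-off


def recover_light_sources(input_img, output_img):
    """Per-pixel convex combination w * input + (1 - w) * output.

    Both images are flat lists of (r, g, b) tuples. Bright input pixels
    (likely light sources) keep the input value; all others keep the
    flare-removed network output.
    """
    result = []
    for pin, pout in zip(input_img, output_img):
        w = blend_weight(luminance(pin))
        result.append(tuple(w * a + (1.0 - w) * b for a, b in zip(pin, pout)))
    return result
```

Because the weight varies smoothly with illuminance, several light sources of different brightness are each restored in proportion to how saturated they are.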

Abstract:
Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction-following capability. To bridge this gap, we introduce the Otter model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruction-tuned as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.

Abstract:
Existing image fusion methods struggle to accommodate composite degradation and do not support users flexibly modulating the semantic objects of interest. To address these challenges, this study proposes a composite degradation-robust image fusion framework with language-driven semantics, called OmniFuse. Firstly, OmniFuse establishes a novel multi-modal information fusion paradigm based on the latent diffusion model (LDM). By projecting the information fusion function into the latent space of the LDM, the information fusion process is seamlessly integrated with the diffusion process. Thus, OmniFuse fully leverages the powerful generative capabilities of LDM to eliminate composite degradation, thereby achieving highly robust image fusion. Secondly, OmniFuse develops a language-driven controllable fusion strategy to strengthen fusion flexibility. It employs a language-driven feature fusion module (LFFM) to receive the specified localization prior, dynamically aggregating multi-modal features. Within LFFM, a visual enhancement regularization is introduced to highlight objects of interest for capturing perceptual attention, while reverse semantic driving is established to strengthen their semantic attributes. Together, the visual and semantic constraints can implicitly correct the imperfect localization prior, further refining the accuracy of language-driven control. Extensive experiments demonstrate the omnipotent performance of OmniFuse, with significant advantages in robustness and flexibility compared to state-of-the-art methods.

Abstract:
Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external data (i.e. generative vs. retrieval vs. original). Our key contributions are: 1) GenBench Construction: We devise GenBench, a broad benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks. 2) CLER Score: To address the insufficient correlation of existing metrics (e.g., FID, CLIP score) with downstream recognition performance, we propose CLER, a training-free metric indicating generative data’s efficiency for recognition tasks prior to training. 3) New Baselines: Comparisons of generative data with retrieved data from the same external pool help to elucidate the unique traits of generative data. 4) External Knowledge Injection: By fine-tuning special token embeddings for each category via Textual Inversion, performance improves across 17 datasets, except when dealing with low-resolution reference images. Our exhaustive benchmark and analysis spotlight generative data’s promise in visual recognition, while identifying key challenges for future investigation.

Abstract:
Despite the rapid advancements in few-shot segmentation (FSS), most existing methods in this domain are hampered by their reliance on the limited and biased information from only a small number of labeled samples. This limitation inherently restricts their capability to achieve sufficiently high levels of performance. To address this issue, this paper proposes a pioneering framework named LLaFS++, which, for the first time, applies large language models (LLMs) to FSS and achieves notable success. LLaFS++ leverages the extensive prior knowledge embedded in LLMs to guide the segmentation process, effectively compensating for the limited information contained in the few-shot labeled samples and thereby achieving superior results. To enhance the effectiveness of the text-based LLMs in FSS scenarios, we present several innovative and task-specific designs within the LLaFS++ framework. Specifically, we introduce an input instruction that allows the LLM to directly produce segmentation results represented as polygons, and propose a region-attribute corresponding table to simulate the human visual system and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization, and propose a novel inference method to mitigate potential oversegmentation hallucinations caused by the regional guidance information. Incorporating these designs, LLaFS++ constitutes an effective framework that achieves state-of-the-art results on multiple datasets including PASCAL-$5^i$, COCO-$20^i$, and FSS-1000. Our superior performance showcases the remarkable potential of applying LLMs to process few-shot vision tasks.

Abstract:
Learning to forecast or synthesize bimanual object manipulation sequences has broad applications in assistive robotics and extended reality. Previous methods have several limitations: (1) They can only forecast for short durations as the output deteriorates with longer predictions. (2) They minimize the MSE of fine motions, such as cutting or stirring, which are treated as noise and averaged out, leading to static outputs. (3) They model hand-object contact implicitly, resulting in unrealistic motion where the objects float in the air. We address long-term forecasting degradation by decomposing long sequences of bimanual actions into shorter subsequences defined by keystates, minimizing output quality deterioration. Segmenting sequences into meaningful keystates allows us to treat fine periodic motions as primitives without optimizing their MSE in raw trajectories. We construct a motion dictionary to store representative dynamics for each action category, queried at test time to generate fine motions. Lastly, we improve hand-object contact using a novel neural network that forecasts the pose for objects in motion, while encouraging hand-object contact through generative models for 3D hand grasps. We evaluate our approach on publicly available bimanual manipulation datasets, showing significant improvements over the state of the art.

Abstract:
Adversarial examples are well-known tools to evaluate the vulnerability of deep neural networks (DNNs). Although many adversarial attack algorithms have been developed, attacks remain challenging in the practical scenario where the model’s parameters and architecture are inaccessible to the attacker/evaluator, i.e., black-box adversarial attacks. Due to their practical importance, recent algorithms have made rapid progress, reflected by the quick increase in attack success rate and quick decrease in the number of queries to the target model. However, thorough evaluations and comparisons among these algorithms are lacking, which makes it difficult to track the real progress, analyze the advantages and disadvantages of different technical routes, and design a future development roadmap for this field. Thus, we aim to build a comprehensive benchmark of black-box adversarial attacks, called BlackboxBench. It mainly provides: 1) a unified, extensible and modular codebase, implementing 29 query-based attack algorithms and 30 transfer-based attack algorithms; 2) comprehensive evaluations: we evaluate the implemented algorithms against several mainstream model architectures on 2 widely used datasets (CIFAR-10 and a subset of ImageNet), leading to 14,950 evaluations in total; 3) thorough analysis and new insights, as well as analytical tools.

Abstract:
Traditional cameras face a trade-off between low-light performance and high-speed imaging: longer exposure times to capture sufficient light result in motion blur, whereas shorter exposures result in Poisson-corrupted noisy images. While burst photography techniques help mitigate this trade-off, conventional cameras are fundamentally limited in their sensor noise characteristics. Event cameras and single-photon avalanche diode (SPAD) sensors have emerged as promising alternatives to conventional cameras due to their desirable properties. SPADs are capable of single-photon sensitivity with microsecond temporal resolution, and event cameras can measure brightness changes up to 1 MHz with low bandwidth requirements. We show that these properties are complementary, and can help achieve low-light, high-speed image reconstruction with low bandwidth requirements. We introduce a sensor fusion framework to combine SPADs with event cameras to improve the reconstruction of high-speed, low-light scenes while reducing the high bandwidth cost associated with using every SPAD frame. Our evaluation, on both synthetic and real sensor data, demonstrates significant enhancements (>5 dB PSNR) in reconstructing low-light scenes at high temporal resolution (100 kHz) compared to conventional cameras. Event-SPAD fusion shows great promise for real-world applications, such as robotics or medical imaging.

Abstract:
In deep learning, initializing models with pre-trained weights has become the de facto practice for various downstream tasks. Many unsupervised domain adaptation (UDA) methods typically adopt a backbone pre-trained on ImageNet, and focus on reducing the source-target domain discrepancy. However, the impact of pre-training on adaptation has received little attention. In this study, we delve into UDA from the novel perspective of pre-training. We first demonstrate the impact of pre-training by analyzing the dynamic distribution discrepancies between the pre-training data domain and the source/target domain during adaptation. Then, we reveal that the target error also stems from the pre-training through the following two factors: 1) empirically, target error arises from the gradual degeneration of pre-trained knowledge during adaptation; 2) theoretically, the error bound depends on the difference between the gradients of the loss function on the target domain and on the pre-training data domain. To address these two issues, we redefine UDA as a three-domain problem, i.e., source domain, target domain, and pre-training data domain; then we propose a novel framework, named TriDA. We maintain the pre-trained knowledge and improve the error bound by incorporating pre-training data into adaptation for both vanilla UDA and source-free UDA scenarios. For efficiency, we introduce a selection strategy for pre-training data, and offer a solution with synthesized images when pre-training data is unavailable during adaptation. Notably, TriDA is effective even with a small amount of pre-training or synthesized images, and seamlessly complements UDA methods in both scenarios, demonstrating state-of-the-art performance across multiple benchmarks. We hope our work provides new insights for better understanding and application of domain adaptation.

Abstract:
This paper studies the problem of distribution matching (DM), which is a fundamental machine learning problem seeking to robustly align two probability distributions. Our approach is established on a relaxed formulation, called partial distribution matching (PDM), which seeks to match a fraction of the distributions instead of matching them completely. We theoretically derive the Kantorovich-Rubinstein duality for the partial Wasserstein-1 (PW) discrepancy, and develop a partial Wasserstein adversarial network (PWAN) that efficiently approximates the PW discrepancy based on this dual form. Partial matching can then be achieved by optimizing the network using gradient descent. Two practical tasks, point set registration and partial domain adaptation, are investigated, where the goals are to partially match distributions in 3D space and high-dimensional feature space, respectively. The experimental results confirm that the proposed PWAN effectively produces highly robust matching results, performing better than or on par with the state-of-the-art methods.

Abstract:
In this article, we consider policy evaluation in off-policy reinforcement learning and propose a novel procedure, Stochastic Preconditioned Temporal Difference (SPTD), that achieves the optimal convergence rate under linear function approximation. The computational complexity of each iteration is linear in the dimension of the feature space. Under Markovian sampling, we establish finite-sample rates when the target policy can be different from the behavior policy used for data generation. Our procedure is the first algorithm for off-policy policy evaluation that has the optimal rate $\mathcal{O}(1/t)$ under the mean square error. We also provide the first result on the asymptotic distribution and give the nearly optimal step size $\alpha_t = \mathcal{O}(t^{-2/3})$. The numerical performance of the procedure is studied in both on-policy and off-policy settings. Extensive numerical experiments demonstrate that our procedure uniformly outperforms existing methods.
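
To make the setting above concrete, the following sketch shows the stated step-size decay of order t^(-2/3) applied to a plain importance-weighted TD(0) update with linear function approximation. It is not the SPTD procedure: the preconditioning step is not reproduced, and the constant `c`, the discount `gamma`, and the function names are assumptions made for illustration only.

```python
def step_size(t, c=1.0):
    """Step size alpha_t = c * t^(-2/3) for iteration t >= 1 (c is an assumed constant)."""
    return c * t ** (-2.0 / 3.0)


def td0_update(theta, phi, phi_next, reward, rho, alpha, gamma=0.9):
    """One importance-weighted TD(0) update for V(s) ~ theta . phi(s).

    theta    -- current weight vector (list of floats)
    phi      -- feature vector of the current state
    phi_next -- feature vector of the next state
    rho      -- importance ratio pi(a|s) / b(a|s) between target and behavior policy
    The cost of one update is linear in the feature dimension.
    """
    v = sum(w * x for w, x in zip(theta, phi))
    v_next = sum(w * x for w, x in zip(theta, phi_next))
    delta = reward + gamma * v_next - v  # temporal-difference error
    return [w + alpha * rho * delta * x for w, x in zip(theta, phi)]
```

In a training loop, `step_size(t)` would be passed as `alpha` at iteration `t`, with `rho = 1.0` recovering the on-policy case.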

Abstract:
There has been rapid progress recently on 3D human rendering, including novel view synthesis and pose animation, based on the advances of neural radiance fields (NeRF). However, most existing methods focus on person-specific training and their training typically requires multi-view videos. This article deals with a new challenging task: rendering novel views and novel poses for a person unseen in training, using only multi-view still images as input without videos. For this task, we propose a simple yet surprisingly effective method to train a generalizable NeRF with multi-view images as conditional input. The key ingredient is a dedicated representation combining a canonical NeRF and a volume deformation scheme. Using a canonical space enables our method to learn shared properties of humans and easily generalize to different people. Volume deformation is used to connect the canonical space with input and target images and query image features for radiance and density prediction. We leverage the parametric 3D human model fitted on the input images to derive the deformation, which works quite well in practice when combined with our canonical NeRF. The experiments on both real and synthetic data with the novel view synthesis and pose animation tasks collectively demonstrate the efficacy of our method.

Abstract:
Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations. Typical approaches urge models to predict specific contents of masked tokens, which can be intuitively considered as teaching a student (the model) to solve given problems (predicting masked contents). Under such settings, the performance is highly correlated with mask strategies (the difficulty of provided problems). We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself. Intuitively, patches with high values of reconstruction loss can be regarded as hard samples, and masking those hard patches naturally becomes a demanding reconstruction task. To empower the model as a teacher, we propose Hard Patch Mining (HPM), predicting patch-wise losses and subsequently determining where to mask. Technically, we introduce an auxiliary loss predictor, which is trained with a relative objective to prevent overfitting to exact loss values. To gradually guide the training procedure, we propose an easy-to-hard mask strategy. Empirically, HPM brings significant improvements on both image and video benchmarks. Interestingly, solely incorporating the extra loss prediction objective leads to better representations, verifying the efficacy of determining where it is hard to reconstruct.
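
The two ingredients named above, a relative objective for the loss predictor and an easy-to-hard mask strategy, might look roughly like the sketch below. The pairwise logistic form of the objective, the linear `progress` schedule, and every function name are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
import random


def relative_loss(pred, true):
    """Pairwise ranking objective: the predictor only has to order patches
    by reconstruction loss, not regress the exact loss values."""
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if i == j or true[i] == true[j]:
                continue
            target = 1.0 if true[i] > true[j] else 0.0
            p = 1.0 / (1.0 + math.exp(-(pred[i] - pred[j])))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
            count += 1
    return total / max(count, 1)


def easy_to_hard_mask(pred_losses, mask_ratio, progress, rng=None):
    """Choose patch indices to mask.

    A fraction `progress` (0 at the start of training, 1 at the end) of the
    masked patches are those with the highest predicted loss; the remainder
    are sampled uniformly from the other patches.
    """
    rng = rng or random.Random(0)
    n = len(pred_losses)
    n_mask = int(round(mask_ratio * n))
    n_hard = int(round(progress * n_mask))
    ranked = sorted(range(n), key=lambda i: pred_losses[i], reverse=True)
    hard, rest = ranked[:n_hard], ranked[n_hard:]
    return sorted(hard + rng.sample(rest, n_mask - n_hard))
```

With `progress=0.0` the mask is purely random, as in standard masked modeling; with `progress=1.0` only the patches predicted to be hardest are masked.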

Abstract:
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models with weak labels, and is receiving significant attention due to its low annotation cost. Existing approaches focus on generating pseudo labels for supervision while largely neglecting the inherent semantic correlation among different pseudo labels. We observe that pseudo-labeled pixels that are close to each other in the feature space are more likely to share the same class, and those closer to the distribution centers tend to have higher confidence. Motivated by this, we propose to model the underlying label distributions and employ cross-label constraints to generate more accurate pseudo labels. In this paper, we develop a unified WSSS framework named Adaptive Gaussian Mixtures Model, which leverages a GMM to model the label distributions. Specifically, we calculate the feature distribution centers of pseudo-labeled pixels and build the GMM by measuring the distance between the centers and each pseudo-labeled pixel. Then, we introduce an Online Expectation-Maximization (OEM) algorithm and a novel maximization loss to optimize the GMM adaptively, aiming to learn more discriminative decision boundaries between different class-wise Gaussian mixtures. Based on the label distributions, we leverage the GMM to generate high-quality pseudo labels for more reliable supervision. Our framework is capable of handling different forms of weak labels: image-level labels, points, scribbles, blocks, and bounding boxes. Extensive experiments on PASCAL, COCO, Cityscapes, and ADE20K datasets demonstrate that our framework can effectively provide more reliable supervision and outperform the state-of-the-art methods under all settings.
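
As a rough illustration of the first step described above (computing class-wise feature centers and scoring each pixel by its distance to them), consider the sketch below. It assumes one isotropic Gaussian per class with a shared `sigma` and equal mixing weights, which is a simplification; the online EM updates and the maximization loss are not reproduced, and all names are hypothetical.

```python
import math


def class_centers(features, labels):
    """Mean feature vector of the pseudo-labeled pixels of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, v in enumerate(f):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def responsibilities(feature, centers, sigma=1.0):
    """Posterior probability of each class for one pixel feature under
    isotropic Gaussians centered at `centers` (equal priors assumed)."""
    logits = {
        y: -sum((a - b) ** 2 for a, b in zip(feature, c)) / (2.0 * sigma ** 2)
        for y, c in centers.items()
    }
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}
```

Pixels close to a center receive a near-one probability for that class, which matches the observation that such pixels tend to carry higher-confidence pseudo labels.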

Abstract:
Deep neural networks have significantly improved the performance of low-level vision tasks but also increased the difficulty of interpretability. A deep understanding of deep models is beneficial for both network design and practical reliability. To take up this challenge, we introduce causality theory to interpret low-level vision models and propose a model-/task-agnostic method called Causal Effect Map (CEM). With CEM, we can visualize and quantify the input-output relationships on either positive or negative effects. After analyzing various low-level vision tasks with CEM, we have reached several interesting insights, such as: (1) Using more information of input images (e.g., larger receptive field) does NOT always yield positive outcomes. (2) Attempting to incorporate mechanisms with a global receptive field (e.g., channel attention) into image denoising may prove futile. (3) Integrating multiple tasks to train a general model could encourage the network to prioritize local information over global context. Based on the causal effect theory, the proposed diagnostic tool can refresh our common knowledge and bring a deeper understanding of low-level vision models.

Abstract:
Unified, or more formally, all-in-one image restoration has emerged as a practical and promising low-level vision task for real-world applications. In this context, the key issue lies in how to deal with different types of degraded images simultaneously. Existing methods fit joint regression models over multi-domain degraded-clean image pairs of different degradations. However, due to the severe ill-posedness of inverting heterogeneous degradations, they often struggle with thoroughly perceiving the degradation semantics and rely on paired data for supervised training, yielding suboptimal restoration maps with structurally compromised results and lacking practicality for real-world or unpaired data. To break the barriers, we present a Degradation-Aware Residual-Conditioned Optimal Transport (DA-RCOT) approach that models (all-in-one) image restoration as an optimal transport (OT) problem for unpaired and paired settings, introducing the transport residual as a degradation-specific cue for both the transport cost and the transport map. Specifically, we formalize image restoration with a residual-guided OT objective by exploiting the degradation-specific patterns of the Fourier residual in the transport cost. More crucially, we design the transport map for restoration as a two-pass DA-RCOT map, in which the transport residual is computed in the first pass and then encoded as multi-scale residual embeddings to condition the second-pass restoration. This conditioning process injects intrinsic degradation knowledge (e.g., degradation type and level) and structural information from the multi-scale residual embeddings into the OT map, which thereby can dynamically adjust its behaviors for all-in-one restoration. Extensive experiments across five degradations demonstrate the favorable performance of DA-RCOT as compared to state-of-the-art methods, in terms of distortion measures, perceptual quality, and image structure preservation. Notably, DA-RCOT delivers superior adaptability to real-world scenarios even with mixed degradations and shows distinctive robustness to both degradation levels and the number of degradations.

Abstract:
Recent studies construct deblurred neural radiance fields (DeRF) using dozens of blurry images, which is impractical when only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse views for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to the inherent complexity arising from the simultaneous optimization of blur kernels and NeRF from sparse views. Sparse-DeRF successfully regularizes the complicated joint optimization, alleviating overfitting artifacts and enhancing the quality of radiance fields. The regularization consists of three key components: surface smoothness helps the model accurately predict the scene structure utilizing unseen and additional hidden rays derived from the blur kernel based on statistical tendencies of the real world; modulated gradient scaling helps the model adjust the amount of the backpropagated gradient according to the arrangements of scene objects; perceptual distillation improves the perceptual quality by overcoming the ill-posed multi-view inconsistency of image deblurring and distilling the pre-deblurred information, compensating for the lack of clean information in blurry images. We demonstrate the effectiveness of Sparse-DeRF with extensive quantitative and qualitative experimental results by training DeRF from 2-view, 4-view, and 6-view blurry images.

Abstract:
This paper defines a positive and unlabeled classification problem for standard GANs, which then leads to a novel technique to stabilize the training of the discriminator in GANs and deal with corrupted data. Traditionally, real data are taken as positive while generated data are negative. This positive-negative classification criterion is kept fixed throughout the learning process of the discriminator without considering the gradually improving quality of generated data, even if they could be more realistic than real data at times. In contrast, it is more reasonable to treat the generated data as unlabeled, which could be positive or negative according to their quality. The discriminator is thus a classifier for this positive and unlabeled classification problem, and we derive a new Positive-Unlabeled GAN (PUGAN). We theoretically discuss the global optimality the proposed model will achieve and the equivalent optimization goal. Empirically, we find that PUGAN can achieve comparable or even better performance than sophisticated discriminator stabilization methods. Considering the potential corrupted data problem in real-world scenarios, we further extend our approach to PUGAN-C, which treats real data as unlabeled, accounting for both clean and corrupted instances, and generated data as positive. The samples from the generator could be closer to the corrupted data within the unlabeled data at first, but within the framework of adversarial training, the generator will be optimized to cheat the discriminator and produce samples that are similar to the clean data. Experimental results on image generation from several corrupted datasets demonstrate the effectiveness and generalization of PUGAN-C.

Abstract:
With the recent advances in technology, a wide range of systems continue to collect a large amount of data over time and thus generate time series. Time-Series Anomaly Detection (TSAD) is an important task in various time-series applications such as e-commerce, cybersecurity, vehicle maintenance, and healthcare monitoring. However, this task is very challenging as it requires considering both the intra-variable dependency (relationships within a variable over time) and the inter-variable dependency (relationships between multiple variables) existing in time-series data. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of TSAD using graphs, referred to as G-TSAD. First, we explore the significant potential of graph representation for time-series data and its contributions to facilitating anomaly detection. Then, we review state-of-the-art graph anomaly detection techniques, mostly leveraging deep learning architectures, in the context of time series. For each method, we discuss its strengths, limitations, and the specific applications where it excels. Finally, we address both the technical and application challenges currently facing the field, and suggest potential future directions for advancing research and improving practical outcomes.

Abstract:
Due to the ill-posed nature of locating 3D objects based on image inputs, objects detected by camera-based detectors tend to have considerable uncertainty in their localization. Previous works in camera-based 3D detection and tracking represent each detected object as a single deterministic 3D bounding box, ignoring its localization uncertainty. We propose an uncertain representation of 3D objects to account for the indeterminacy of localizing objects in images. We model the localization uncertainty of objects during the detection process and represent the location of objects as a probability distribution in 3D space. For camera-based 3D detection, we propose to gather and suppress redundant predictions about an object to form its uncertain representation. For camera-based 3D multiple object tracking, we generalize the cross-frame association metric under the uncertain representation of objects for better tracking of objects with uncertain and unstable localization. As a plug-in module for camera 3D detectors, our proposed method brings a +3.5%/+3.2%/+3.7% NDS boost to BEVDet4D/BEVDet4D-Depth/DD3D on the nuScenes validation set and a +4.7% NDS boost to BEVDet4D-Depth on the nuScenes test set. With enhanced cross-frame association, our tracking method achieves a 48.2% AMOTA performance and reduces the remaining identity-switch cases to only 300 on the nuScenes test set.
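
One simple way to picture the idea of gathering redundant predictions into a location distribution is the sketch below, which summarizes several overlapping detections of the same object as a score-weighted mean and per-axis variance. The diagonal Gaussian form, the use of detection scores as weights, and the function name are assumptions for illustration and are not taken from the paper.

```python
def aggregate_predictions(centers, scores):
    """Summarize redundant detections of one object as a diagonal Gaussian.

    centers -- list of (x, y, z) box centers predicted for the same object
    scores  -- detection confidences used as weights (assumed weighting)
    Returns (mean, variance), each a list with one entry per axis.
    """
    total = sum(scores)
    dim = len(centers[0])
    mean = [
        sum(s * c[d] for c, s in zip(centers, scores)) / total
        for d in range(dim)
    ]
    var = [
        sum(s * (c[d] - mean[d]) ** 2 for c, s in zip(centers, scores)) / total
        for d in range(dim)
    ]
    return mean, var
```

For two equally confident predictions that agree in x and y but differ by 2 m in depth, the result has zero variance in x and y and a variance of 1 along the depth axis, reflecting that depth is the uncertain direction.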

Abstract:
Spatio-temporal predictive learning plays a crucial role in self-supervised learning, with applications across a diverse range of fields. Previous approaches for temporal modeling fall into two categories: recurrent-based and recurrent-free methods. The former, while meticulously processing frames one by one, neglect short-term spatio-temporal information redundancies, leading to inefficiencies. The latter naively stack frames sequentially, overlooking the inherent temporal dependencies. In this paper, we re-examine the two dominant temporal modeling approaches within the realm of spatio-temporal predictive learning, offering a unified perspective. Building upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales. Extensive experiments on a wide range of spatio-temporal predictive learning tasks demonstrate that USTEP achieves significant improvements over existing temporal modeling approaches, thereby establishing it as a robust solution for a wide range of spatio-temporal applications.

Abstract:
Anomaly detection has garnered extensive applications in real industrial manufacturing due to its remarkable effectiveness and efficiency. However, previous generative-based models have been limited by suboptimal reconstruction quality, hampering their overall performance. We introduce DiffusionAD, a novel anomaly detection pipeline comprising a reconstruction sub-network and a segmentation sub-network. A fundamental enhancement lies in our reformulation of the reconstruction process using a diffusion model into a noise-to-norm paradigm. Here, the anomalous region loses its distinctive features after being disturbed by Gaussian noise and is subsequently reconstructed into an anomaly-free one. Afterward, the segmentation sub-network predicts pixel-level anomaly scores based on the similarities and discrepancies between the input image and its anomaly-free reconstruction. Additionally, given the substantial decrease in inference speed due to the iterative denoising nature of diffusion models, we revisit the denoising process and introduce a rapid one-step denoising paradigm. This paradigm achieves hundreds of times acceleration while preserving comparable reconstruction quality. Furthermore, considering the diversity in the manifestation of anomalies, we propose a norm-guided paradigm to integrate the benefits of multiple noise scales, enhancing the fidelity of reconstructions. Comprehensive evaluations on four standard and challenging benchmarks reveal that DiffusionAD outperforms current state-of-the-art approaches and achieves comparable inference speed, demonstrating the effectiveness and broad applicability of the proposed pipeline.
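
The one-step denoising idea mentioned above can be illustrated with the standard DDPM identity that recovers a clean-image estimate directly from a noisy sample and a noise prediction, instead of iterating over all timesteps. Whether DiffusionAD uses exactly this form is an assumption here; `alpha_bar` denotes the cumulative noise-schedule product at the chosen timestep, and images are represented as flat lists of floats for simplicity.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def one_step_denoise(x_t, eps_pred, alpha_bar):
    """Direct estimate of x0 from x_t and the predicted noise:
    x0_hat = (x_t - sqrt(1 - alpha_bar) * eps_pred) / sqrt(alpha_bar)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]
```

If the noise prediction is exact, the estimate equals the original input; in practice the prediction comes from a network trained on normal samples, so anomalous regions are pulled toward an anomaly-free appearance in a single network evaluation.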

Abstract:
We present a novel solution for mesh-based deformation simulation from a spectral perspective. Unlike existing approaches that demand separate training for each garment or body type and often struggle to produce rich folds and lifelike dynamics, our method achieves the quality of physics-based simulations while maintaining superior efficiency within a unified model. The key to achieving this lies in the development of a spectrum-enhanced deformation network, a result of in-depth theoretical analysis bridging neural networks and garment deformations. This enhancement compels the network to focus on learning spectral information predominantly within the frequency band associated with intricate deformations. Furthermore, building upon standard blend skinning techniques, we introduce target-aware temporal skinning weights. The weights describe how the underlying human skeleton dynamically affects the mesh vertices according to the garment and body shape, as well as the motion state. We validate our method on various garments, bodies, and motions through extensive ablation studies. Finally, we conduct comparisons to confirm its superiority in generalization, deformation quality, and performance over several state-of-the-art methods.

Abstract:
Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, to provide fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we have introduced the PoseScript dataset. This dataset pairs more than six thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. Additionally, to increase the size of the dataset to a scale that is compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information, known as “posecodes”, using a set of simple but generic rules on the 3D keypoints. These posecodes are then combined into higher level textual descriptions using syntactic rules. With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions. To showcase the potential of annotated poses, we present three multi-modal learning tasks that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps 3D poses and textual descriptions into a joint embedding space, allowing for cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we establish a baseline for a text-conditioned model generating 3D poses. Thirdly, we present a learned process for generating pose descriptions. These applications demonstrate the versatility and usefulness of annotated poses in various tasks and pave the way for future research in the field.

Abstract:
Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, human brains can match images with texts well using their stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method, namely Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suited to specific datasets, we refine it using unpaired images and texts in a self-supervised learning manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations, and compute region-word similarity scores. Finally, the scores are aggregated based on bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary to existing models and can easily be used as a re-ranking method to substantially improve their zero-shot and cross-dataset image-text matching performance.

Abstract:
Pre-training and fine-tuning have been the de facto paradigm in vision-language domains. Along with the rapid growth of model sizes, fully fine-tuning these large-scale vision-language pre-training (VLP) models requires prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of a large pre-trained model by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is also inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized by a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel parameter-efficient transfer learning (PETL) method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models ranging from end-to-end networks to two-stage networks, and conduct extensive experiments on four VL tasks. Experimental results demonstrate that MoIL outperforms existing PETL methods in both performance and optimization efficiency. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
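To make the notion of a weight distance between LoRA and fine-tuning concrete, the sketch below builds a low-rank adapted weight W0 + BA and measures its squared Frobenius distance to a fully fine-tuned weight. The choice of the Frobenius norm, the toy matrices, and the helper names (`lora_weight`, `weight_distance`) are assumptions for illustration; the abstract does not specify the exact distance or how MoIL optimizes it.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w0, b, a):
    """Adapted weight W0 + B @ A, where B is (d x r) and A is (r x d) with small rank r."""
    delta = matmul(b, a)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def weight_distance(w1, w2):
    """Squared Frobenius distance between two weight matrices (assumed metric)."""
    return sum((x - y) ** 2
               for r1, r2 in zip(w1, w2)
               for x, y in zip(r1, r2))


w0 = [[1.0, 0.0], [0.0, 1.0]]      # frozen pre-trained weight (toy values)
w_ft = [[1.5, 0.5], [1.0, 2.0]]    # hypothetical fully fine-tuned weight

b_init = [[0.0], [0.0]]            # a zero B leaves the pre-trained weight unchanged
a = [[1.0, 1.0]]
dist_before = weight_distance(lora_weight(w0, b_init, a), w_ft)

b_fit = [[0.5], [1.0]]             # here the fine-tuning update happens to be exactly rank 1
dist_after = weight_distance(lora_weight(w0, b_fit, a), w_ft)
```

In this toy case the fine-tuning update is exactly rank 1, so a rank-1 adapter can drive the distance to zero; for real models the update is generally not low-rank and the distance is a residual approximation error.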

Abstract:
In this paper, we formally address universal object detection, which aims to detect every category in every scene. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of detectors. We propose UniDetector, a universal object detector that recognizes enormous categories in the open world. The critical points for UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces in training through image-text alignment, which guarantees sufficient information for universal representations. 2) it involves heterogeneous supervision training, which alleviates the dependence on the limited fully-labeled images. 3) it generalizes to the open world easily while keeping the balance between seen and unseen classes. 4) it further promotes generalizing to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable size so far, with only about 500 classes participating in training. UniDetector exhibits strong zero-shot ability on large-vocabulary datasets: it surpasses supervised baselines by more than 5% without seeing any corresponding images. On 13 detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.

Abstract:
Image denoising has progressed significantly due to the development of effective deep denoisers. To improve the performance in real-world scenarios, recent trends prefer to formulate superior noise models to generate realistic training data, or estimate noise levels to steer non-blind denoisers. In this paper, we bridge both strategies by presenting an innovative noise estimation and realistic noise synthesis pipeline. Specifically, we integrate a fine-grained statistical noise model and a contrastive learning strategy, with a unique data augmentation to enhance learning ability. Then, we use this model to estimate noise parameters on the evaluation dataset, which are subsequently used to craft a camera-specific noise distribution and synthesize realistic noise. One distinguishing feature of our methodology is its adaptability: our pre-trained model can directly estimate noise parameters for unknown cameras, making it possible to model unfamiliar sensor noise using only testing images, without calibration frames or paired training data. Another highlight is our attempt at estimating parameters for fine-grained noise models, which extends the applicability to even more challenging low-light conditions. Through empirical testing, our calibration-free pipeline demonstrates effectiveness in both normal and low-light scenarios, further solidifying its utility in real-world noise synthesis and denoising tasks.

Abstract:
Anomaly detection is a common application of machine learning. Out-of-distribution (OOD) detection in particular is a semi-supervised anomaly detection technique where the detection method is trained only on the inlier (in-distribution) samples; unlike the fully supervised variant, the distribution of the outlier samples is never explicitly modeled in OOD detection tasks. In this work, we design a novel GAN-based OOD detection network specifically intended to protect cyber-physical signal systems from a novel class of Trojan malware called the non-control data (NCD) attack, which evades conventional malware detection techniques. Inspired in part by the classical locally most powerful (LMP) test in statistical inference, the proposed LMP-GAN trains the OOD detector (discriminator) by generating OOD samples that are aimed at making maximal alteration to the inlier samples while evading detection. We experimentally compare the results to the state-of-the-art anomaly detection methods to demonstrate the benefits and the appropriateness of the LMP-GAN OOD detector.

Affiliations: National Engineering Research Center for Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, School of Robotics, and the State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha, China; Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China; Department of Information Engineering and Computer Science, University of Trento, Trento, Italy; School of Computing and Communications, Lancaster University, Lancaster, U.K.; Department of Computer Science, The University of Western Australia, Crawley, WA, Australia

Abstract:
Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance.

Abstract:
The paper introduces DIST, an innovative knowledge distillation method that excels in learning from a superior teacher model. DIST differentiates itself from conventional techniques by adeptly handling the often significant prediction discrepancies between the student and teacher models. It achieves this by focusing on maintaining the relationships between their predictions, implementing a correlation-based loss to explicitly capture the teacher's intrinsic inter-class relations. Moreover, DIST uniquely considers the semantic similarities between different instances and each class at the intra-class level. The method is further enhanced by two significant improvements: (1) A teacher acclimation strategy, which effectively reduces the discrepancy between teacher and student, thereby optimizing the distillation process. (2) An extension of the DIST loss from the logit level to the feature level, a modification that proves especially beneficial for dense prediction tasks. DIST stands out for its simplicity, practicality, and adaptability to various architectures, model sizes, and training strategies. It consistently delivers state-of-the-art results across a range of applications, including image classification, object detection, and semantic segmentation.
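A correlation-based distillation loss of the kind described above can be sketched as follows. The sketch assumes Pearson correlation as the correlation measure and an unweighted sum of the two terms; both are assumptions for illustration rather than details confirmed by the abstract, and the function names are hypothetical. The inter-class term compares each instance's class distribution, while the intra-class term compares each class's scores across the instances of a batch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pearson(u, v, eps=1e-12):
    """Pearson correlation of two equal-length vectors (eps guards zero variance)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv + eps)


def correlation_loss(student_logits, teacher_logits):
    """Sum of inter-class (per-row) and intra-class (per-column) correlation losses."""
    s = [softmax(row) for row in student_logits]
    t = [softmax(row) for row in teacher_logits]
    # Inter-class: match the relation among classes within each instance.
    inter = sum(1.0 - pearson(sr, tr) for sr, tr in zip(s, t)) / len(s)
    # Intra-class: match the relation among instances within each class.
    s_cols, t_cols = list(zip(*s)), list(zip(*t))
    intra = sum(1.0 - pearson(sc, tc) for sc, tc in zip(s_cols, t_cols)) / len(s_cols)
    return inter + intra


teacher = [[2.0, 1.0, 0.1], [0.3, 1.5, 2.2], [1.0, 0.2, 0.7]]
loss_match = correlation_loss(teacher, teacher)               # student agrees with teacher
reversed_student = [row[::-1] for row in teacher]
loss_mismatch = correlation_loss(reversed_student, teacher)   # student reverses class order
```

Because correlation is invariant to shifting and positive scaling, a student is not penalized for differing from the teacher in the absolute values of its predictions, only for disagreeing on their relative structure.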

Abstract:
Imitation Learning (IL) learns from experts, on which most existing studies assume that the imitator will be deployed in stationary environments. However, real-world scenarios commonly involve perturbations, necessitating robust imitators for non-stationary scenarios. To fulfill this, we leverage a multi-modal expert dataset encompassing diverse dynamics, while still adhering to the shared goal between the experts and imitator. Different from conventional multi-modal IL work that considers reproducing the demonstrated different behaviors, we aim to imitate a policy that rapidly adapts to sudden dynamic changes, even when encountering dynamics unseen during training. We propose a method called Generalizable Multi-modal Adversarial Imitation Learning (GMAIL) for non-stationary dynamics, which adversarially trains a discriminator and a generator. Due to dynamic mismatch between the experts and the imitator, the optimal next state for the imitator may require several steps for the experts to reach, inspiring us to propose to take the state-next-state pairs within multiple steps in the demonstrated trajectories to facilitate imitation under dynamic mismatch. For quick identification of the changed dynamic, GMAIL learns a dynamics-sensitive generator by introducing a history-based context encoder. On a wide range of navigation, locomotion and autonomous driving tasks, empirical results illustrate the effectiveness of GMAIL.

Affiliations: Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China; School of Medicine, Zhejiang University, Hangzhou, China; College of Electrical Engineering, Zhejiang University, Hangzhou, China; Department of Automation, University of Science and Technology of China, Hefei, China; SenseTime Group Ltd., Hong Kong, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; School of Public Health, Zhejiang University, Hangzhou, China; CUHK Interdisciplinary Artificial Intelligence Research Institute, Hong Kong, China

Abstract:
Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks.

Abstract:
Foundation models are usually pre-trained on large-scale datasets and then adapted to different downstream tasks through tuning. This pre-training and then fine-tuning paradigm has become a standard practice in deep learning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners, considering one may not be able to access or fully fine-tune the pre-trained models. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term as Noisy Model Transfer Learning.

Abstract:
We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. The proposed decoupled framework efficiently handles universal and open-vocabulary object representations, allowing DVIS++ to conduct universal and open-vocabulary video segmentation. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in closed- and open-vocabulary settings.

Abstract:
The goal of object navigation task is to reach the expected objects using visual information in unseen environments. Previous works typically implement deep models as agents that are trained to predict actions based on visual observations. Despite extensive training, agents often fail to make wise decisions when navigating in unseen environments toward invisible targets. In contrast, humans demonstrate a remarkable talent to navigate toward targets even in unseen environments. This superior capability is attributed to the cognitive map in the hippocampus, which enables humans to recall past experiences in similar situations and anticipate future occurrences during navigation. It is also dynamically updated with new observations from unseen environments. The cognitive map equips humans with a wealth of prior knowledge, significantly enhancing their navigation capabilities. Inspired by human navigation mechanisms, we propose the Hierarchical Object-to-Zone (HOZ++) graph, which encapsulates the regularities among objects, zones, and scenes. The HOZ++ graph helps the agent to identify the current zone and the target zone, and computes an optimal path between them, then selects the next zone along the path as the guidance for the agent. Moreover, the HOZ++ graph continuously updates based on real-time observations in new environments, thereby enhancing its adaptability to new environments. Our HOZ++ graph is versatile and can be integrated into existing methods, including end-to-end RL and modular methods. Our method is evaluated across four simulators, including AI2-THOR, RoboTHOR, Gibson, and Matterport 3D. Additionally, we build a realistic environment to evaluate our method in the real world. Experimental results demonstrate the effectiveness and efficiency of our proposed method.

Abstract:
To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to compression information loss. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo designs a moving average of historical compensation errors to stably estimate the concurrent compression error and then adopts it to compensate for the concurrent gradient compression, yielding a less lossy compression. This mechanism allows it to be compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on non-convex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch’s FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam’s training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoEs.
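The compensation mechanism described above can be sketched as an error-feedback loop: a running average of past compression errors is added to the gradient before it is compressed. This is a minimal illustration, not LoCo itself. The uniform-grid `quantize` function, the decay `beta`, the grid `step`, and the class name `ErrorCompensator` are all hypothetical choices made for the sketch.

```python
def quantize(values, step=0.25):
    """Hypothetical low-bit compressor: round each entry to a uniform grid."""
    return [round(v / step) * step for v in values]


class ErrorCompensator:
    """Keeps a moving average of compression errors and applies it before compressing."""

    def __init__(self, size, beta=0.9, step=0.25):
        self.avg_error = [0.0] * size   # moving average of past compression errors
        self.beta = beta                # assumed decay of the moving average
        self.step = step                # assumed quantization grid spacing

    def compress(self, grad):
        # 1) Compensate the local gradient with the estimated compression error.
        compensated = [g + e for g, e in zip(grad, self.avg_error)]
        # 2) Compress the compensated gradient for communication.
        compressed = quantize(compensated, self.step)
        # 3) Measure the new compression error and fold it into the moving average.
        error = [c - q for c, q in zip(compensated, compressed)]
        self.avg_error = [self.beta * a + (1.0 - self.beta) * e
                          for a, e in zip(self.avg_error, error)]
        return compressed


comp = ErrorCompensator(size=2)
first = comp.compress([0.1, -0.3])      # state starts at zero, so this is plain quantization
state_after_first = list(comp.avg_error)
outputs = [comp.compress([0.1, -0.3]) for _ in range(20)]
```

Only the compressed values are communicated; the error state stays on the local node, which is why a scheme of this shape can sit in front of an existing optimizer without changing it.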

Abstract:
In domain adaptation (DA), the effectiveness of deep learning-based models is often constrained by batch learning strategies that fail to fully apprehend the global statistical and geometric characteristics of data distributions. Addressing this gap, we introduce “Global Awareness Enhanced Domain Adaptation” (GAN-DA), a novel approach that transcends traditional batch-based limitations. GAN-DA integrates a unique predefined feature representation (PFR) to facilitate the alignment of cross-domain distributions, thereby achieving a comprehensive global statistical awareness. This representation is innovatively expanded to encompass orthogonal and common feature aspects, which enhances the unification of global manifold structures and refines decision boundaries for more effective DA. Our extensive experiments, encompassing 27 diverse cross-domain image classification tasks, demonstrate GAN-DA's remarkable superiority, outperforming 24 established DA methods by a significant margin. Furthermore, our in-depth analyses shed light on the decision-making processes, revealing insights into the adaptability and efficiency of GAN-DA. This approach not only addresses the limitations of existing DA methodologies but also sets a new benchmark in the realm of domain adaptation, offering broad implications for future research and applications in this field.

Abstract:
In the image super-resolution (SR) field, recovering missing high-frequency textures has always been an important goal. However, deep SR networks based on pixel-level constraints tend to focus on stable edge details and cannot effectively restore random high-frequency textures. It was not until the emergence of the generative adversarial network (GAN) that GAN-based SR models achieved realistic texture restoration and quickly became the mainstream method for texture SR. However, GAN-based SR models still have some drawbacks, such as relying on a large number of parameters and generating fake textures that are inconsistent with ground truth. Inspired by traditional texture analysis research, this paper proposes a novel SR network based on local texture pattern estimation (LTPE), which can restore fine high-frequency texture details without GAN. A differentiable local texture operator is first designed to extract local texture structures, and a texture enhancement branch is used to predict the high-resolution local texture distribution based on the LTPE. Then, the predicted high-resolution texture structure map can be used as a reference for the texture fusion SR branch to obtain high-quality texture reconstruction. Finally, L1 loss and Gram loss are simultaneously used to optimize the network. Experimental results demonstrate that the proposed method can effectively recover high-frequency texture without using GAN structures. In addition, the restored high-frequency details are constrained by local texture distribution, thereby reducing significant errors in texture generation.

Abstract:
Despite notable advancements across various tasks, deep sequence recognition models are shown to grapple with the dilemma of over-confidence, leading to unreliable predicted confidence and making calibration necessary. Current efforts predominantly focus on classification model calibration, leaving the sequence recognition model calibration analysis underexplored and challenging. In this work, we discover that the primary reason for over-confidence in sequence recognition models stems from the one-hot encoding target sequence training paradigm and identify two distinct manifestations of over-confidence: perception and semantic context over-confidence. To address these challenges, we propose a heterogeneous correlation aware sequence regularization (HCSR) method that adaptively incorporates correlated sequences into training alongside the target sequence as additional supervision to keep the probability of the target sequence from arbitrarily escalating. Specifically, a correlated sequence mining (CSM) model is designed, capable of efficiently mining heterogeneous correlated sequences, which can be flexibly customized to search for specific types of correlated sequences on demand to facilitate the calibration of corresponding types of over-confidence in the model being calibrated, thereby achieving fine-grained calibration. Meanwhile, an adaptive calibration module is introduced to adaptively coordinate the optimization weights between the target sequence and correlated sequences, enabling the co-calibration among different samples. Comprehensive experiments conducted on several widely employed sequence recognition tasks demonstrate that the proposed method outperforms the current competing methods by a substantial margin.

Abstract:
Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Auto-Regressive (AR)-based models implement the recognition in a character-by-character manner, showing superiority in accuracy but with slow inference speed. Alternatively, Parallel Decoding (PD)-based models infer all characters in a single decoding pass, offering faster inference speed but generally worse accuracy. To realize the dual goals of “AR-level accuracy and PD-level speed”, we propose a Context Perception Parallel Decoder (CPPD) to perceive the related context and predict the character sequence in a PD pass. CPPD devises a character counting module to infer the occurrence count of each character, and a character ordering module to deduce the content-free reading order and positions. Meanwhile, the character prediction task associates the positions with characters. They together build a comprehensive recognition context, which benefits the decoder to focus accurately on characters with the attention mechanism, thereby improving the recognition accuracy. We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running much faster than existing leading models. Moreover, the plugged models achieve significant accuracy improvements.

Abstract:
This paper introduces a novel Bayesian approach to detect changes in the variance of a Gaussian sequence model, focusing on quantifying the uncertainty in the change point locations and providing a scalable algorithm for inference. We do that by framing the problem as a product of multiple single changes in the scale parameter. We fit the model through an iterative procedure similar to what is done for additive models. The novelty is that each iteration returns a probability distribution on time instances, which captures the uncertainty in the change point location. Leveraging a recent result in the literature, we can show that our proposal is a variational approximation of the exact model posterior distribution. We study the convergence of the algorithm and the change point localization rate. Extensive experiments in simulation studies and applications to biological data illustrate the performance of our method.

Abstract:
Remote photoplethysmography (rPPG) has been widely applied to measure heart rate from face videos. To increase the generalizability of the algorithms, domain generalization (DG) attracted increasing attention in rPPG. However, when rPPG is extended to simultaneously measure more vital signs (e.g., respiration and blood oxygen saturation), achieving generalizability brings new challenges. Although partial features shared among different physiological signals can benefit multi-task learning, the sparse and imbalanced target label space brings the seesaw effect over task-specific feature learning. To resolve this problem, we designed an end-to-end Mixture of Low-rank Experts for multi-task remote Physiological measurement (PhysMLE), which is based on multiple low-rank experts with a novel router mechanism, thereby enabling the model to adeptly handle both specifications and correlations within tasks. Additionally, we introduced prior knowledge from physiology among tasks to overcome the imbalance of label space under real-world multi-task physiological measurement. For fair and comprehensive evaluations, this paper proposed a large-scale multi-task generalization benchmark, named Multi-Source Synsemantic Domain Generalization (MSSDG) protocol. Extensive experiments with MSSDG and intra-dataset have shown the effectiveness and efficiency of PhysMLE. In addition, a new dataset was collected and made publicly available to meet the needs of the MSSDG. The code and data are available at https://github.com/WJULYW/PhysMLE.

Abstract:
While most recent autonomous driving systems focus on developing perception methods for ego-vehicle sensors, people tend to overlook an alternative approach: leveraging intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that the state-of-the-art vision-centric detection methods perform poorly on roadside cameras. This is because these methods mainly focus on recovering depth relative to the camera center, where the depth difference between the car and the ground quickly shrinks as the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight++, to address this issue. In essence, we regress the height to the ground to achieve a distance-agnostic formulation that eases the optimization process of camera-only perception methods. By incorporating both height and depth encoding techniques, we achieve a more accurate and robust projection from 2D to BEV spaces. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. In terms of the ego-vehicle scenario, BEVHeight++ surpasses depth-only methods with increases of +2.8% NDS and +1.7% mAP on the nuScenes test set, and even higher gains of +9.3% NDS and +8.8% mAP on the nuScenes-C benchmark with object-level distortion. Consistent and substantial performance improvements are achieved across the KITTI, KITTI-360, and Waymo datasets as well.

Abstract:
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available on our project webpage (https://antoyang.github.io/just-ask.html).
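
A contrastive loss between a video-question embedding and an answer embedding can be sketched as a softmax cross-entropy over dot-product scores. This is a generic illustration and not the paper's exact objective: it assumes that the other answers in the batch serve as negatives and that embeddings are plain lists of floats. The function name `contrastive_loss` is hypothetical.

```python
import math


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean cross-entropy in which row i of each list forms the positive pair.

    Every other answer in the batch acts as a negative for video-question i
    (an assumption of this sketch).
    """
    losses = []
    for i, vq in enumerate(vq_embeddings):
        # Dot-product score between this video-question pair and every answer.
        scores = [sum(a * b for a, b in zip(vq, ans)) for ans in answer_embeddings]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability assigned to the matching answer.
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


aligned = contrastive_loss([[10.0, 0.0], [0.0, 10.0]], [[10.0, 0.0], [0.0, 10.0]])
swapped = contrastive_loss([[10.0, 0.0], [0.0, 10.0]], [[0.0, 10.0], [10.0, 0.0]])
print(aligned < swapped)
```

Because answers are scored by embedding similarity rather than by a fixed classification layer, a formulation of this kind can rank any answer that the answer encoder can embed, which suits an open answer vocabulary.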

Abstract:
Neural volumetric representations such as Neural Radiance Fields (NeRF) have emerged as a compelling technique for learning to represent 3D scenes from images with the goal of rendering photorealistic images of the scene from unobserved viewpoints. However, NeRF's computational requirements are prohibitive for real-time applications: rendering views from a trained NeRF requires querying a multilayer perceptron (MLP) hundreds of times per ray. We present a method to train a NeRF, then precompute and store (i.e., “bake”) it as a novel representation called a Sparse Neural Radiance Grid (SNeRG) that enables real-time rendering on commodity hardware. To achieve this, we introduce 1) a reformulation of NeRF's architecture and 2) a sparse voxel grid representation with learned feature vectors. The resulting scene representation retains NeRF's ability to render fine geometric details and view-dependent appearance, is compact (averaging less than 90 MB per scene), and can be rendered in real-time (higher than 30 frames per second on a laptop GPU). Actual screen captures are shown in our video.
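
The per-ray cost mentioned above comes from compositing many samples along each ray. The sketch below shows the standard front-to-back alpha compositing used in volume rendering, with densities and colors passed in as precomputed values; in a NeRF each sample would come from an MLP query, whereas a baked grid would supply them by lookup. It does not reproduce SNeRG's reformulated architecture, and the function name `composite_ray` is hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: non-negative volume density at each sample
    colors:    (r, g, b) at each sample
    deltas:    spacing between adjacent samples
    Returns the accumulated pixel color and the remaining transmittance.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        # Opacity contributed by this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        # Contribution is attenuated by everything in front of the sample.
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


pixel, remaining = composite_ray(
    densities=[0.5, 2.0, 4.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    deltas=[0.1, 0.1, 0.1],
)
print(pixel, remaining)
```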

Abstract:
Recently, several spatial-temporal memory-based methods have verified that storing intermediate frames with masks as memory helps segment target objects in videos. However, they mainly focus on better matching between the current frame and memory frames without paying attention to the quality of the memory. Consequently, frames with poor segmentation masks may be memorized, leading to error accumulation problems. Besides, the linear increase of memory frames with the growth of frame numbers limits the ability of the models to handle long videos. To this end, we propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame, allowing the memory bank to selectively store accurately segmented frames and prevent error accumulation. Then, we combine the segmentation quality with temporal consistency to dynamically update the memory bank, enabling the models to handle videos of arbitrary length. The above operation ensures the reliability of memory frames and improves the quality of memory at the frame level. Moreover, we observe that the memory features extracted from reliable frames still contain noise and have limited representation capabilities. To address this problem, we propose to perform memory enhancement and anchoring on the basis of QDMN to improve the quality of memory at the feature level, resulting in a more robust and effective network, QDMN++. Our method achieves state-of-the-art performance on all popular benchmarks. Moreover, extensive experiments demonstrate that the proposed memory screening mechanism can be applied to any memory-based method as a generic plugin.
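
A quality-gated memory bank with a fixed capacity can be sketched as follows. This is only an illustration of the general idea, not QDMN itself: the quality score is assumed to be supplied by some external predictor, the temporal-consistency term described above is omitted, and the eviction rule (keep the first stored frame, drop the lowest-quality of the rest) is a hypothetical choice. The class name `MemoryBank` and its fields are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    capacity: int  # maximum number of stored frames; assumed to be >= 1
    quality_threshold: float  # minimum predicted mask quality to be stored
    frames: list = field(default_factory=list)  # (frame_index, quality) pairs

    def maybe_store(self, frame_index, quality):
        """Store the frame only if its predicted mask quality is high enough."""
        if quality < self.quality_threshold:
            return False
        self.frames.append((frame_index, quality))
        if len(self.frames) > self.capacity:
            # Hypothetical eviction rule: always keep the first stored frame
            # and drop the lowest-quality frame among the others.
            worst = min(self.frames[1:], key=lambda item: item[1])
            self.frames.remove(worst)
        return True


bank = MemoryBank(capacity=3, quality_threshold=0.8)
for index, quality in [(0, 1.0), (1, 0.5), (2, 0.9), (3, 0.85), (4, 0.95)]:
    bank.maybe_store(index, quality)
print([index for index, _ in bank.frames])
```

Bounding the capacity keeps the memory size constant regardless of video length, which is the property that allows arbitrarily long videos to be processed.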

Abstract:
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.

Abstract:
Unintentional action localization (UAL) is a challenging task that requires reasoning about action intention clues to detect the temporal locations of unintentional action occurrences in real-world videos. Previous efforts usually treated this task as a dense binary classification problem and did not fully explore the relationships between intention clues and unintentional actions, resulting in unsatisfactory performance on open-set scenarios during inference. In this paper, we propose a Transferable Unintentional Action Localization framework by introducing language-guided intention translation, which explicitly formulates unintentional action localization as an open-set localization problem. Our framework constructs a transferable reasoning model guided by natural language to translate the action intention of the entire video, which generates natural and powerful supervision signals for reconstructing complete action intention clues to address the problem of unintentional action localization. Because a video with a failure action is composed of intentional and unintentional parts connected by a transient action transition, our transferable reasoning model employs a transformer architecture to transfer knowledge between the intentional and unintentional parts for learning complementary semantic representations of these two parts, completing the action intention clue in an implicit supervision manner. We also present a dense voting scheme for detecting the action transition from intentional to unintentional using discriminative representations incorporating action intention clues. Extensive experiments demonstrate that our framework outperforms representative unintentional action localization methods in a wide range of open-set scenarios. In addition, we create a new unintentional sports video dataset, FS-Falls, and extend our framework from in-the-wild scenarios to competitive sports to demonstrate better generalization ability. We hope this work will provide a new perspective on creating powerful representations with complete action intention priors, which will help us better understand human action and capture underlying intention clues in real-world videos.

Abstract:
In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance.

Abstract:
Passive hyperspectral longwave infrared measurements are remarkably informative about the surroundings. Remote object material and temperature determine the spectrum of thermal radiance, and range, air temperature, and gas concentrations determine how this spectrum is modified by propagation to the sensor. We introduce a passive range imaging method based on computationally separating these phenomena. Previous methods assume hot and highly emitting objects; ranging is more challenging when objects’ temperatures do not deviate greatly from air temperature. Our method jointly estimates range and intrinsic object properties, with explicit consideration of air emission, though reflected light is assumed negligible. Inversion being underdetermined is mitigated by using a parametric model of atmospheric absorption and regularizing for smooth emissivity estimates. To assess where our estimate is likely accurate, we introduce a technique to detect which scene pixels are significantly influenced by reflected downwelling. Monte Carlo simulations demonstrate the importance of regularization, temperature differentials, and availability of many spectral bands. We apply our method to longwave infrared (8–13 μm) hyperspectral image data acquired from natural scenes with no active illumination. Range features from 15 m to 150 m are recovered, with good qualitative match to lidar data for pixels classified as having negligible reflected downwelling.

Abstract:
Transformer-based pre-trained models have advanced considerably in recent years, and the Transformer architecture has become one of the most important backbones in natural language processing. Recent works show that the attention mechanism inside the Transformer may not be necessary, and Transformer alternatives such as convolutional neural networks, multi-layer perceptrons, and state space models have also been investigated. Transformer-based models have two main limitations. First, they have quadratic time complexity due to the full attention mechanism, which leads to high computational costs. Second, they rely on the representation of a special token such as [CLS] to encode the entire text, which limits their sentence-level expressiveness. In this paper, we consider a graph recurrent network with linear time complexity for language model pre-training, which builds a graph structure for each sequence with local token-level communications, together with a sentence-level representation detached from other normal tokens. On both English and Chinese text understanding tasks, our model achieves performance comparable to existing pre-trained models while also achieving higher inference efficiency. Furthermore, we find that the representations generated by our model are more diverse and uniform than those of the Transformer, which alleviates problems in existing pre-trained models such as representation degradation.

Affiliations: School of Software, BNRist, THUIBCS, BLBCI, Tsinghua University, Beijing, China; Institute of Artificial Intelligence and Robotics, College of Artificial Intelligence, Xi’an Jiaotong University, Xi’an, China; Department of Mathematics, School of Science, Shanghai University, Shanghai, China; Department of Automation, Tsinghua University, Beijing, China; Department of Artificial Intelligence, School of Informatics, Institute of Artificial Intelligence, Fujian Engineering Research Center of Trusted Artificial Intelligence Analysis and Application, Media Analytics and Computing Laboratory, Xiamen University, Xiamen, China

Abstract:
We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, advancing beyond conventional feature-focused learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing for sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture in various scale models, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% $\text{AP}^{\text{val}}$ and 9% $\text{AP}^{\text{val}}$ improvements.

Abstract:
On account of the extreme settings, stealing the black-box model without its training data is difficult in practice. On this topic, along the lines of data diversity, this paper substantially makes the following improvements based on our conference version (dubbed STDatav1, short for Surrogate Training Data). First, to mitigate the undesirable impacts of the potential mode collapse while training the generator, we propose the joint-data optimization scheme, which utilizes both the synthesized data and the proxy data to optimize the surrogate model. Second, we propose the self-conditional data synthesis framework, an interesting effort that builds the pseudo-class mapping framework via grouping class information extraction to hold the class-specific constraints while holding the diversity. Within this new framework, we inherit and integrate the class-specific constraints of STDatav1 and design a dual cross-entropy loss to fit this new framework. Finally, to facilitate comprehensive evaluations, we perform experiments on four commonly adopted datasets, and a total of eight kinds of models are employed. These assessments witness the considerable performance gains compared to our early work and demonstrate the competitive ability and promising potential of our approach.

Abstract:
Retinex model-based methods have been shown to be effective in layer-wise manipulation with well-designed priors for low-light image enhancement (LLIE). However, the hand-crafted priors and conventional optimization algorithms adopted to solve the layer decomposition problem result in a lack of adaptivity and efficiency. To this end, this paper proposes a Retinex-based deep unfolding network (URetinex-Net++), which unfolds an optimization problem into a learnable network to decompose a low-light image into reflectance and illumination layers. By formulating the decomposition problem as an implicit-prior regularized model, three learning-based modules are carefully designed, responsible for data-dependent initialization, highly efficient unfolding optimization, and flexible component adjustment, respectively. In particular, the proposed unfolding optimization module, which introduces two networks to adaptively fit implicit priors in a data-driven manner, can realize noise suppression and detail preservation for the decomposed components. URetinex-Net++ is a further augmented version of URetinex-Net, which introduces a cross-stage fusion block to alleviate the color defect in URetinex-Net. As a result, boosted performance on LLIE is obtained in both visual quality and quantitative metrics, with only a few additional parameters and little additional time cost. Extensive experiments on real-world low-light images qualitatively and quantitatively demonstrate the effectiveness and superiority of the proposed URetinex-Net++ over state-of-the-art methods.

Abstract:
Document-level relation extraction (DocRE) aims at predicting the relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence and co-reference relations as edges. Their performance is difficult to improve further because the semantics and direction of the relation are not jointly considered in the graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., a mention graph and an entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise a translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights to follow a specific translation algebraic structure, thereby enhancing the reasoning skills of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and the triple contrastive loss together to guarantee the model’s stability and robustness. Comprehensive experiments on three widely used datasets show that TDGI achieves outstanding performance compared with state-of-the-art baselines.

Abstract:
This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment, and it can drastically cripple the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models to various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by a mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments.

Abstract:
In pedestrian attribute recognition (PAR), the loose umbrella term ‘attribute’ ranges from human soft-biometrics to worn accessories, and even extends to various subjective body descriptors. As a result, the vast coverage of ‘attributes’ implies that, instead of being over-specialized to limited attributes with exclusive characteristics, PAR should be approached from a much more fundamental perspective. To this end, given that most attributes are greatly under-represented in real-world datasets, we simply distill PAR into a visual task of multi-label recognition under significant data imbalance. Accordingly, we introduce feature re-sampled detached learning (FRDL) to decouple label-balanced learning from the curse of attribute co-occurrence. Specifically, FRDL is able to balance the sampling distribution of an attribute without biasing the label prior of co-occurring others. As a complementary method, we also propose gradient-oriented augment translating (GOAT) to alleviate the feature noise and semantics imbalance aggravated in FRDL. Integrated in a highly unified framework, FRDL and GOAT substantially refresh the state-of-the-art performance on various realistic benchmarks, while maintaining a minimal computational budget. Further analytical discussion and experimental evidence corroborate the veracity of our advancement: this is the first work that establishes labels-independent and impartial balanced learning for PAR.

Abstract:
The adversarial patch is one of the important forms of adversarial attack in the physical world. To improve the naturalness and aggressiveness of existing adversarial patches, location-aware patches have been proposed, where the patch's location on the target object is integrated into the optimization process to perform attacks. Although this is effective, efficiently finding the optimal location for placing the patches is challenging, especially under black-box attack settings. In this paper, we first empirically find that, for the same facial image, the regions where effective adversarial patch locations aggregate are quite similar across different face recognition models. Based on this observation, we then propose a novel framework called Distribution-Optimized Adversarial Patch (DOPatch) to efficiently search for the aggregation regions in a distribution modeling way. Using the distribution prior, we further design two query-based black-box attack methods: Location Optimization Attack (DOP-LOA) and Distribution Transfer Attack (DOP-DTA) to attack unseen face recognition models. We finally evaluate the proposed methods on various SOTA face recognition models and image recognition models (including the popular big models) to demonstrate their effectiveness and generalization. We also conduct extensive ablation studies and analyses to provide insights into the distribution of adversarial locations.

Abstract:
Subspace learning and Support Vector Machine (SVM) are two critical techniques in pattern recognition, playing pivotal roles in feature extraction and classification. However, how to learn the optimal subspace such that the SVM classifier can perform the best is still a challenging problem due to the difficulty in optimization, computation, and algorithm convergence. To address these problems, this paper develops a novel method named Optimal Discriminant Support Vector Machine (ODSVM), which integrates support vector classification with discriminative subspace learning in a seamless framework. As a result, the most discriminative subspace and the corresponding optimal SVM are obtained simultaneously to pursue the best classification performance. The efficient optimization framework is designed for binary and multi-class ODSVM. Moreover, a fast sequential minimization optimization (SMO) algorithm with pruning is proposed to accelerate the computation in multi-class ODSVM. Unlike other related methods, ODSVM has a strong theoretical guarantee of global convergence, highlighting its superiority and stability. Numerical experiments are conducted on thirteen datasets and the results demonstrate that ODSVM outperforms existing methods with statistical significance.

Abstract:
Robust 3D perception amidst corruption is a crucial task in the realm of 3D vision. Conventional data augmentation methods aimed at enhancing corruption robustness typically apply random transformations to all point cloud samples offline, neglecting sample structure, which often leads to over- or under-enhancement. In this study, we propose an alternative approach to address this issue by employing sample-adaptive transformations based on sample structure, through an auto-augmentation framework named AdaptPoint++. Central to this framework is an imitator, which initiates with Position-aware Feature Extraction to derive intrinsic structural information from the input sample. Subsequently, a Deformation Controller and a Mask Controller predict per-anchor deformation and per-point masking parameters, respectively, facilitating corruption simulations. In conjunction with the imitator, a discriminator is employed to curb the generation of excessive corruption that deviates from the original data distribution. Moreover, we integrate a perception-guidance feedback mechanism to steer the generation of samples towards an appropriate difficulty level. To effectively train the classifier using the generated augmented samples, we introduce a Structure Reconstruction-assisted learning mechanism, bolstering the classifier's robustness by prioritizing intrinsic structural characteristics over superficial discrepancies induced by corruption. Additionally, to alleviate the scarcity of real-world corrupted point cloud data, we introduce two novel datasets: ScanObjectNN-C and MVPNET-C, closely resembling actual data in real-world scenarios. Experimental results demonstrate that our method attains state-of-the-art performance on multiple corruption benchmarks.

Abstract:
Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt improved triplane to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.

Abstract:
Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Substantially, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.

Abstract:
Our understanding of the temporal dynamics of the Earth's surface has been significantly advanced by deep vision models, which often require a massive amount of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present scalable multi-temporal change data generators based on generative models, which are cheap and automatic, alleviating these data problems. Our main idea is to simulate a stochastic change process over time. We describe the stochastic change process as a probabilistic graphical model, namely the generative probabilistic change model (GPCM), which factorizes the complex simulation problem into two more tractable sub-problems, i.e., condition-level change event simulation and image-level semantic change synthesis. To solve these two problems, we present Changen2, a GPCM implemented with a resolution-scalable diffusion transformer which can generate time series of remote sensing images and corresponding semantic and change labels from labeled and even unlabeled single-temporal images. Changen2 is a “generative change foundation model” that can be trained at scale via self-supervision, and is capable of producing change supervisory signals from unlabeled single-temporal images. Unlike existing “foundation models”, our generative change foundation model synthesizes change data to train task-specific foundation models for change detection. The resulting model possesses inherent zero-shot change detection capabilities and excellent transferability. Comprehensive experiments suggest Changen2 has superior spatiotemporal scalability in data generation, e.g., a Changen2 model trained on 256² pixel single-temporal images can yield time series of any length and resolutions of 1,024² pixels. Changen2 pre-trained models exhibit superior zero-shot performance (narrowing the performance gap to 3% on LEVIR-CD and approximately 10% on both S2Looking and SECOND, compared to fully supervised counterparts) and transferability across multiple types of change tasks, including ordinary and off-nadir building change, land-use/land-cover change, and disaster assessment.

Abstract:
Multi-modal image fusion aims to generate a fused image by integrating and distinguishing the cross-modality complementary information from multiple source images. While the cross-attention mechanism with global spatial interactions appears promising, it only captures second-order spatial interactions, neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between multi-modalities. To bridge this gap, we introduce a Synergistic High-order Interaction Paradigm (SHIP), designed to systematically investigate spatial fine-grained and global statistics collaborations between the multi-modal images across two fundamental dimensions: 1) Spatial dimension: we construct spatial fine-grained interactions through element-wise multiplication, mathematically equivalent to global interactions, and then foster high-order formats by iteratively aggregating and evolving complementary information, enhancing both efficiency and flexibility. 2) Channel dimension: expanding on channel interactions with first-order statistics (mean), we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. We further introduce an enhanced version of the SHIP model, called SHIP++, which enhances the cross-modality information interaction representation through a cross-order attention evolving mechanism, cross-order information integration, and a residual information memorizing mechanism. Harnessing high-order interactions significantly enhances our model’s ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks in two significant multi-modal image fusion tasks: pan-sharpening, and infrared and visible image fusion.

Abstract:
Many existing adversarial attacks generate $L_p$-norm perturbations in image RGB space. Despite some achievements in transferability and attack success rate, the crafted adversarial examples are easily perceived by human eyes. Towards visual imperceptibility, some recent works explore unrestricted attacks without $L_p$-norm constraints, yet they lack transferability when attacking black-box models. In this work, we propose a novel imperceptible and transferable attack by leveraging both the generative and discriminative power of diffusion models. Specifically, instead of direct manipulation in pixel space, we craft perturbations in the latent space of diffusion models. Combined with well-designed content-preserving structures, we can generate human-insensitive perturbations embedded with semantic clues. For better transferability, we further “deceive” the diffusion model, which can be viewed as an implicit recognition surrogate, by distracting its attention away from the target regions. To our knowledge, our proposed method, DiffAttack, is the first that introduces diffusion models into the adversarial attack field. Extensive experiments conducted across diverse model architectures (CNNs, Transformers, and MLPs), datasets (ImageNet, CUB-200, and Stanford Cars), and defense mechanisms underscore the superiority of our attack over existing methods such as iterative attacks, GAN-based attacks, and ensemble attacks. Furthermore, we provide a comprehensive discussion on future research avenues in diffusion-based adversarial attacks, aiming to chart a course for this burgeoning field.

Abstract:
Non-maximum suppression (NMS) is an essential post-processing step for object detection. The de-facto standard for NMS, namely GreedyNMS, is not parallelizable and could thus be the performance bottleneck in object detection pipelines. MaxpoolNMS is introduced as a fast and parallelizable alternative to GreedyNMS. However, MaxpoolNMS is only capable of replacing the GreedyNMS at the first stage of two-stage detectors like Faster R-CNN. To address this issue, we observe that MaxpoolNMS employs the process of box coordinate discretization followed by local score argmax calculation, to discard the nested-loop pipeline in GreedyNMS to enable parallelizable implementations. In this paper, we introduce a simple Relationship Recovery module and a Pyramid Shifted MaxpoolNMS module to improve the above two stages, respectively. With these two modules, our PSRR-MaxpoolNMS is a generic and parallelizable approach, which can completely replace GreedyNMS at all stages in all detectors. Furthermore, we extend PSRR-MaxpoolNMS to the more powerful PSRR-MaxpoolNMS++. As for box coordinate discretization, we propose Density-based Discretization for better adherence to the target density of the suppression. As for local score argmax calculation, we propose an Adjacent Scale Pooling scheme for mining out the duplicated box pairs more accurately and efficiently. Extensive experiments demonstrate that both our PSRR-MaxpoolNMS and PSRR-MaxpoolNMS++ outperform MaxpoolNMS by a large margin. Additionally, PSRR-MaxpoolNMS++ not only surpasses PSRR-MaxpoolNMS but also attains competitive accuracy and much better efficiency when compared with GreedyNMS. Therefore, PSRR-MaxpoolNMS++ is a parallelizable NMS solution that can effectively replace GreedyNMS at all stages in all detectors.
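For context, the GreedyNMS baseline that this abstract sets out to replace can be sketched as follows. This is a minimal illustration of the standard greedy procedure, not the authors' code; the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions chosen for the example. Each candidate is compared against all boxes already kept, which is the nested-loop, order-dependent structure that resists parallelization.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    The outer loop visits boxes in descending score order; the inner check
    compares each candidate with every box kept so far, so each decision
    depends on all earlier ones.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical example: the second box overlaps the first heavily and is suppressed.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
kept = greedy_nms(example_boxes, example_scores)
```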

Abstract:
Bias in computer vision systems can perpetuate or even amplify discrimination against certain populations. Considering that bias is often introduced by biased visual datasets, many recent research efforts focus on training fair models using such data. However, most of them heavily rely on the availability of protected attribute labels in the dataset, which limits their applicability, while label-unaware approaches, i.e., approaches operating without such labels, exhibit considerably lower performance. To overcome these limitations, this work introduces FLAC, a methodology that minimizes mutual information between the features extracted by the model and a protected attribute, without the use of attribute labels. To do that, FLAC proposes a sampling strategy that highlights underrepresented samples in the dataset, and casts the problem of learning fair representations as a probability matching problem that leverages representations extracted by a bias-capturing classifier. It is theoretically shown that FLAC can indeed lead to fair representations, that are independent of the protected attributes. FLAC surpasses the current state-of-the-art on Biased-MNIST, CelebA, and UTKFace, by 29.1%, 18.1%, and 21.9%, respectively. Additionally, FLAC exhibits 2.2% increased accuracy on ImageNet-A and up to 4.2% increased accuracy on Corrupted-Cifar10. Finally, in most experiments, FLAC even outperforms the bias label-aware state-of-the-art methods.

Abstract:
Recently, Optimal Transport has been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012 – 2023, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight the recent development in computational Optimal Transport and its extensions, such as partial, unbalanced, Gromov and Neural Optimal Transport, and its interplay with Machine Learning practice.
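As a concrete point of reference for computational Optimal Transport, the entropy-regularized problem between two discrete distributions is commonly solved with Sinkhorn iterations. The sketch below is a generic textbook version, not taken from the survey; the function name, the regularization strength `reg`, and the iteration count are assumptions made for illustration.

```python
import math


def sinkhorn(a, b, cost, reg=0.1, n_iters=500):
    """Approximate the entropic OT plan between histograms a and b.

    a, b  : lists of non-negative weights, each summing to 1
    cost  : len(a) x len(b) matrix of pairwise costs
    reg   : entropic regularization strength (assumed value)
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical two-point example: mass should mostly stay on the zero-cost diagonal.
plan = sinkhorn([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
```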

Abstract:
In the Image Aesthetics Computing (IAC) field, most prior methods leveraged the off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which may lead to suboptimal performance. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, aiming to construct an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. 1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. 2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which the multi-attribute contrastive learning is proposed for obtaining more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets a new state of the art for IAC tasks.

Abstract:
Applying current machine learning algorithms in complex and open environments remains challenging, especially when different changing elements are coupled and the training data is scarce. For example, in the activity recognition task, the motion sensors may change position or fall off due to the intensity of the activity, leading to changes in feature space and finally resulting in label noise. Learning from such a problem where the dynamic features are coupled with noisy labels is crucial but rarely studied, particularly when the noisy samples in new feature space are limited. In this paper, we tackle the above problem by proposing a novel two-stage algorithm, called Adaptive Learning for Dynamic features and Noisy labels (ALDN). Specifically, optimal transport is first modified to map the previously learned heterogeneous model to the prior model of the current stage. Then, to fully reuse the mapped prior model, we add a simple yet efficient regularizer as the consistency constraint to assist both the estimation of the noise transition matrix and the model training in the current stage. Finally, two implementations with direct (ALDN-D) and indirect (ALDN-ID) constraints are illustrated for better investigation. More importantly, we provide theoretical guarantees for risk minimization of ALDN-D and ALDN-ID. Extensive experiments validate the effectiveness of the proposed algorithms.

Abstract:
We propose an online learning algorithm tailored for a class of machine learning models within a separable stochastic approximation framework. The central idea of our approach is to exploit the inherent separability in many models, recognizing that certain parameters are easier to optimize than others. This paper focuses on models where some parameters exhibit linear characteristics, which are common in machine learning applications. In our proposed algorithm, the linear parameters are updated using the recursive least squares (RLS) algorithm, akin to a stochastic Newton method. Subsequently, based on these updated linear parameters, the nonlinear parameters are adjusted using the stochastic gradient method (SGD). This dual-update mechanism can be viewed as a stochastic approximation variant of block coordinate gradient descent, where one subset of parameters is optimized using a second-order method while the other is handled with a first-order approach. We establish the global convergence of our online algorithm for non-convex cases in terms of the expected violation of first-order optimality conditions. Numerical experiments demonstrate that our method achieves significantly faster initial convergence and produces more robust performance compared to other popular learning algorithms. Additionally, our algorithm exhibits reduced sensitivity to learning rates and outperforms the recently proposed slimTrain algorithm (Newman et al. 2022). For validation, the code has been made available on GitHub.
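The dual-update idea described above can be illustrated on a toy separable model. The sketch below assumes a hypothetical model `y ≈ c0 + c1 * tanh(w * x)`, in which `c0` and `c1` enter linearly and `w` nonlinearly; the model, the forgetting factor, the learning rate, and all function names are assumptions for illustration and are not taken from the paper. For each sample, the linear parameters receive a recursive least squares (RLS) update, and the nonlinear parameter then takes one stochastic gradient step using the freshly updated linear parameters.

```python
import math
import random


def rls_sgd_fit(data, w0=0.5, lr=0.01, forgetting=0.99, p0=100.0):
    """Online fit of y ~ c[0] + c[1] * tanh(w * x) (toy separable model)."""
    w = w0
    c = [0.0, 0.0]
    P = [[p0, 0.0], [0.0, p0]]  # RLS inverse-correlation estimate
    for x, y in data:
        t = math.tanh(w * x)
        phi = [1.0, t]  # features seen by the linear parameters
        # RLS update of the linear parameters
        Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = forgetting + phi[0] * Pphi[0] + phi[1] * Pphi[1]
        k = [Pphi[0] / denom, Pphi[1] / denom]  # gain vector
        err = y - (c[0] * phi[0] + c[1] * phi[1])
        c = [c[0] + k[0] * err, c[1] + k[1] * err]
        # P is symmetric, so phi^T P equals (P phi)^T
        P = [[(P[i][j] - k[i] * Pphi[j]) / forgetting for j in range(2)]
             for i in range(2)]
        # SGD step on the nonlinear parameter, using the updated c
        resid = y - (c[0] + c[1] * t)
        grad_w = -resid * c[1] * (1.0 - t * t) * x  # d/dw of 0.5 * resid**2
        w -= lr * grad_w
    return c, w


def mse(data, c, w):
    """Mean squared error of the toy model on a dataset."""
    return sum((y - c[0] - c[1] * math.tanh(w * x)) ** 2 for x, y in data) / len(data)


# Hypothetical noise-free data generated from c = [0.5, 2.0], w = 1.5.
random.seed(0)
samples = [(x, 0.5 + 2.0 * math.tanh(1.5 * x))
           for x in (random.uniform(-2.0, 2.0) for _ in range(3000))]
c_hat, w_hat = rls_sgd_fit(samples)
```

A second-order update is cheap here because the linear block has only two parameters; the nonlinear block keeps the inexpensive first-order step.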

Abstract:
In this paper, we study the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles. We present V2X-ViTs, a robust cooperative perception framework with V2X communication using novel vision Transformer models. First, we present V2X-ViTv1 containing holistic attention modules that can effectively fuse information across on-road agents (i.e., vehicles and infrastructure). Specifically, V2X-ViTv1 consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention, which captures inter-agent interaction and per-agent spatial relationships. These key modules are designed in a unified Transformer architecture to handle common V2X challenges, including asynchronous information sharing, pose errors, and heterogeneity of V2X components. Second, we propose an advanced architecture, V2X-ViTv2, that enjoys increased ability for multi-scale perception. We also propose advanced data augmentation techniques tailored for V2X applications to improve performance. We construct a large-scale V2X perception dataset using CARLA and OpenCDA to validate our approach. Extensive experimental results on both synthetic and real-world datasets show that V2X-ViTs achieve state-of-the-art performance for 3D object detection and are robust even under harsh, noisy environments.

Abstract:
Previous deep learning-based event denoising methods mostly suffer from poor interpretability and difficulty in real-time processing due to their complex architecture designs. In this paper, we propose window-based event denoising, which simultaneously deals with a stack of events, whereas existing element-based denoising focuses on one event at a time. In addition, we give a theoretical analysis based on probability distributions in both the temporal and spatial domains to improve interpretability. In the temporal domain, we use timestamp deviations between the events being processed and the central event to judge the temporal correlation and filter out temporally irrelevant events. In the spatial domain, we choose maximum a posteriori (MAP) estimation to discriminate real-world events from noise and use learned convolutional sparse coding to optimize the objective function. Based on the theoretical analysis, we build a Temporal Window (TW) module and a Soft Spatial Feature Embedding (SSFE) module to process temporal and spatial information separately, and construct a novel multi-scale window-based event denoising network, named WedNet. The high denoising accuracy and fast running speed of our WedNet enable us to achieve real-time denoising in complex scenes. Extensive experimental results verify the effectiveness and robustness of our WedNet. Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.

Abstract:
To date, the widely adopted way to perform fixation collection in panoptic video is based on a head-mounted display (HMD), where users’ fixations are collected while they wear an HMD to explore the given panoptic scene freely. However, this widely used data collection method is insufficient for training deep models to accurately predict which regions in a given panoptic scene are most important when it contains intermittent salient events. The main reason is that there always exist “blind zooms” when using an HMD to collect fixations, since users cannot keep turning their heads to explore the entire panoptic scene all the time. Consequently, the collected fixations tend to be trapped in some local views, leaving the remaining areas as “blind zooms”. Therefore, fixation data collected using HMD-based methods that accumulate local views cannot accurately represent the overall global importance (the main purpose of fixations) of complex panoptic scenes. To address this, this paper introduces the auxiliary window with a dynamic blurring (WinDB) fixation collection approach for panoptic video, which does not need an HMD and is able to reflect the region-wise degree of importance well. Using our WinDB approach, we have released a new PanopticVideo-300 dataset, containing 300 panoptic clips covering over 225 categories. Specifically, since using WinDB to collect fixations is blind zoom free, there exists frequent and intensive “fixation shifting” in our new set, a very special phenomenon that has long been overlooked by previous research. Thus, we present an effective fixation shifting network (FishNet) to handle it. Together, the new fixation collection tool, dataset, and network have the potential to open a new age for fixation-related research and applications in $360^\circ$ environments.

Abstract:
We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image, each accurately placed at predefined positions with attributes such as category, color, and shape, strictly following user specifications. MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation. To address attribute leakage, we propose the Multi-Instance Generation Controller (MIGC). MIGC generates multiple instances through a divide-and-conquer strategy, breaking down multi-instance shading into single-instance tasks with singular attributes, later integrated. To provide more types of instance descriptions, we developed MIGC++. MIGC++ allows attribute control through text & images and position control through boxes & masks. Lastly, we introduced the Consistent-MIG algorithm to enhance the iterative MIG ability of MIGC and MIGC++. This algorithm ensures consistency in unmodified regions during the addition, deletion, or modification of instances, and preserves the identity of instances when their attributes are changed. We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods. Extensive experiments on these benchmarks, along with the COCO-Position benchmark and DrawBench, demonstrate that our methods substantially outperform existing techniques, maintaining precise control over aspects including position, attribute, and quantity.

Abstract:
Previous Face Anti-spoofing (FAS) methods face the challenge of generalizing to unseen domains, mainly because most existing FAS datasets are relatively small and lack data diversity. Thanks to the development of face recognition in the past decade, numerous real face images are available publicly, which are however neglected previously by the existing literature. In this paper, we propose an Anomalous cue Guided FAS (AG-FAS) method, which can effectively leverage large-scale additional real faces for improving model generalization via a De-fake Face Generator (DFG). Specifically, by training on a large-scale real face only dataset, the generator obtains the knowledge of what a real face should be like, and thus has the capability of generating a “real” version of any input face image. Consequently, the difference between the input face and the generated “real” face can be treated as cues of attention for the fake feature learning. With the above ideas, an Off-real Attention Network (OA-Net) is proposed which allocates its attention to the spoof region of the input according to the anomalous cue. Extensive experiments on a total of nine public datasets show our method achieves state-of-the-art results under cross-domain evaluations with unseen scenarios and unknown presentation attacks. Besides, we provide theoretical analysis demonstrating the effectiveness of the proposed anomalous cues.

Abstract:
Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to mitigate the bias in calculations within Banzhaf Interaction, we propose reconstructing the representation through a fusion of single-modal and cross-modal components. This reconstructed representation ensures fine granularity comparable to that of the single-modal representation, while also preserving the adaptive encoding characteristics of cross-modal representation. Additionally, we extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks. Extensive experiments on commonly used text-video retrieval, video-question answering, and video captioning benchmarks, with superior performance, validate the effectiveness and generalization of our method.
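For readers unfamiliar with the game-theoretic quantity involved, the pairwise Banzhaf interaction index from cooperative game theory can be computed by brute force as sketched below. This is the generic textbook definition, not the paper's hierarchical formulation; the characteristic function `toy_value` and the player indexing are hypothetical. For players `i` and `j`, the index averages `v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S)` over all coalitions `S` of the remaining players.

```python
from itertools import combinations


def banzhaf_interaction(n_players, value, i, j):
    """Brute-force Banzhaf interaction index between players i and j.

    value : function mapping a frozenset of player indices to a float
    """
    others = [p for p in range(n_players) if p not in (i, j)]
    total = 0.0
    count = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Added benefit of i and j acting together rather than separately
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
            count += 1
    return total / count  # count == 2 ** (n_players - 2)


def toy_value(coalition):
    """Hypothetical game: the coalition pays off only if players 0 and 1 both join."""
    return 1.0 if {0, 1} <= coalition else 0.0


pair_01 = banzhaf_interaction(3, toy_value, 0, 1)
pair_02 = banzhaf_interaction(3, toy_value, 0, 2)
```

In a video-text setting the players would be video clips and words, and the value function would score their joint correspondence; enumeration is exponential in the number of players, so practical systems need an approximation.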

Abstract:
Inspired by the Lottery Ticket Hypothesis (LTH), which highlights the existence of efficient subnetworks within larger, dense networks, a high-performing Winning Subnetwork (WSN) in terms of task performance under appropriate sparsity conditions is considered for various continual learning tasks. It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios. In Few-Shot Class Incremental Learning (FSCIL), a variation of WSN referred to as the Soft subnetwork (SoftNet) is designed to prevent overfitting when the data samples are scarce. Furthermore, the sparse reuse of WSN weights is considered for Video Incremental Learning (VIL). The use of Fourier Subneural Operator (FSO) within WSN is considered. It enables compact encoding of videos and identifies reusable subnetworks across varying bandwidths. We have integrated FSO into different architectural frameworks for continual learning, including VIL, TIL, and FSCIL. Our comprehensive experiments demonstrate FSO's effectiveness, significantly improving task performance at various convolutional representational levels. Specifically, FSO enhances higher-layer performance in TIL and FSCIL and lower-layer performance in VIL.

Abstract:
Recently, leveraging pre-training techniques to enhance point cloud models has become a prominent research topic. However, existing approaches typically require full fine-tuning of pre-trained models to achieve satisfactory performance on downstream tasks, which is storage-intensive and computationally demanding. To address this issue, we propose a novel Parameter-Efficient Fine-Tuning (PEFT) method for point cloud, called PointGST (Point cloud Graph Spectral Tuning). PointGST freezes the pre-trained model and introduces a lightweight, trainable Point Cloud Spectral Adapter (PCSA) for fine-tuning parameters in the spectral domain. The core idea is built on two observations: 1) The inner tokens from frozen models might present confusion in the spatial domain; 2) Task-specific intrinsic information is important for transferring the general knowledge to the downstream task. Specifically, PointGST transfers the point tokens from the spatial domain to the spectral domain, effectively de-correlating confusion among tokens by using orthogonal components for separation. Moreover, the generated spectral basis involves intrinsic information about the downstream point clouds, enabling more targeted tuning. As a result, PointGST facilitates the efficient transfer of general knowledge to downstream tasks while significantly reducing training costs. Extensive experiments on challenging point cloud datasets across various tasks demonstrate that PointGST not only outperforms its fully fine-tuning counterpart but also significantly reduces trainable parameters, making it a promising solution for efficient point cloud learning. Moreover, it achieves superior accuracies of 99.48%, 97.76%, and 96.18% on the ScanObjNN OBJ_BG, OBJ_ONLY, and PB_T50_RS datasets, respectively, establishing a new state-of-the-art, while using only 0.67% of the trainable parameters.

Abstract:
Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment, i.e., associations between image patches and text tokens. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

Affiliations: Laboratory of Advanced Theranostic Materials and Technology, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China; School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo, China; University Paris Dauphine, CNRS, UMR , CEREMADE, PSL Research University, Paris, France; Institute of High Performance Computing, A*STAR, Singapore; Eye Center, The Second Affiliated Hospital, Zhejiang University, Hangzhou, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Science, Shenzhen, China; Ningbo Eye Hospital, Wenzhou Medical University, Ningbo, China

Abstract:
Choroidal neovascularization (CNV), a primary characteristic of wet age-related macular degeneration (wet AMD), represents a leading cause of blindness worldwide. In clinical practice, optical coherence tomography angiography (OCTA) is commonly used for studying CNV-related pathological changes, due to its micron-level resolution and non-invasive nature. Thus, accurate segmentation of CNV regions and vessels in OCTA images is crucial for clinical assessment of wet AMD. However, challenges exist due to irregular CNV shapes and imaging limitations such as projection artifacts, noise, and boundary blurring. Moreover, the lack of publicly available datasets constrains CNV analysis. To address these challenges, this paper constructs the first publicly accessible CNV dataset (CNVSeg), and proposes a novel multilateral graph convolutional interaction-enhanced CNV segmentation network (MTG-Net). This network integrates both region and vessel morphological information, exploring semantic and geometric duality constraints within the graph domain. Specifically, MTG-Net consists of a multi-task framework and two graph-based cross-task modules: Multilateral Interaction Graph Reasoning (MIGR) and Multilateral Reinforcement Graph Reasoning (MRGR). The multi-task framework encodes rich geometric features of lesion shapes and surfaces, decoupling the image into three task-specific feature maps. MIGR and MRGR iteratively reason about higher-order relationships across tasks through a graph mechanism, enabling complementary optimization for task-specific objectives. Additionally, an uncertainty-weighted loss is proposed to mitigate the impact of artifacts and noise on segmentation accuracy. Experimental results demonstrate that MTG-Net outperforms existing methods, achieving a Dice score of 87.21% for region segmentation and 88.12% for vessel segmentation.
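The Dice metric quoted above measures overlap between a predicted and a reference segmentation mask. A minimal sketch for flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks (0/1 lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-pixel example: one shared foreground pixel out of 2 + 1.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```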

Affiliations: College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Center for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore; Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Department of Radiology, National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China; College of Computer Science and Electronic Engineering, Supercomputing and Cloud Computing Institute, Hunan University, Changsha, China; Department of Computer Science, State University of New York, New Paltz, NY, USA

Abstract:
Multi-view bipartite graph clustering (MVBGC) is an active pipeline in unsupervised learning to tackle the limited scalability issue of traditional graph clustering. Despite improved performance, numerous variants still fall under conventional modeling that plugs additional modules, which however induces increasingly intricate models and fails to reveal the inherent variable relationship. We make the first attempt to introduce probabilistic graphical models for modeling the multi-view bipartite graph clustering task, reformulating it as a maximum likelihood estimation (MLE) problem. Such a setting uncovers the underlying probabilistic correlations among the commonality, view-specific variables, and noisy components. By pruning redundancy and disturbance collectively referred to as noise, we prove that minimizing the total noise is an approximation of the lower bound of MLE for multi-view data observations. We further generalize the MLE setting with clustering-suited constraints, deriving a Generalized Probabilistic Graphical Modeling framework (GProM), achieving an interpretable, concise, and flexible MVBGC framework. Extensive experiments verify the effectiveness of our framework. Furthermore, statistical significance analysis reveals the effectiveness of different distribution assumptions, providing valuable insights for model design.

Abstract:
Medical images are usually collected from multiple clinical centers with various types of scanners. When confronted with such significant cross-domain distribution discrepancy, a deep network tends to capture similar patterns by multiple channels, while different cross-domain patterns are also allowed to rest in the same channel. Such channel redundancy limits the expressive capability of a representation, resulting in weaker generalization ability. To address this fundamental yet challenging issue, we propose a novel decoupled feature as query (DFQ) framework for domain generalized medical image representation learning. Its general idea is to leverage the channel-wise decoupled deep features as queries. In particular, a deep instance whitening transform with restricted isometry is proposed, which enforces each channel to be orthogonal to the remaining channels after decoupling. In addition, the long-range dependency between decoupled deep and shallow features is implicitly constrained to minimize channel redundancy throughout training. Extensive experiments show its state-of-the-art performance on three medical domain generalization tasks with four modalities.

Abstract:
Face Anti-Spoofing (FAS) is constantly challenged by new attack types and mediums, and thus it is crucial for a FAS model to not only mitigate Catastrophic Forgetting (CF) of previously learned spoofing knowledge on the training data during continual learning but also enhance the model’s generalization ability to potential spoofing attacks. In this paper, we first highlight that current strategies for catastrophic forgetting are not well-suited to the imperceptible nature of spoofing information in FAS and lack the focus on improving generalization capability. Then, the instance-wise dynamic central difference convolutional adapter module with the weighted ensemble strategy for Vision Transformer (ViT) is proposed for efficiently fine-tuning with low-shot data by extracting generalized spoofing texture information. Furthermore, we find that catastrophic forgetting in FAS can be reflected through the inconsistent attention matrices of ViT between different continual sessions, as the attention matrices embody relationships of spoofing clues between different patch tokens. Hence, we introduce attention consistency regularization by learning and reusing attention matrices to alleviate catastrophic forgetting. Finally, we devise new protocols and conduct extensive experiments to validate the superior performance of alleviating catastrophic forgetting and generalization on unseen domains.
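
For readers unfamiliar with central difference convolution (CDC), the sketch below shows the basic static operator on a single-channel image: an ordinary convolution response minus `theta` times the centre pixel scaled by the kernel sum. This is only the plain CDC building block under stated assumptions; the instance-wise dynamic adapter and the weighted ensemble described in the abstract are not reproduced, and the function name and default `theta` are illustrative choices.

```python
import numpy as np

def central_difference_conv2d(x, w, theta=0.7):
    """Valid-mode central difference convolution on a 2D array.

    out(p) = sum_n w(n) * x(p + n) - theta * x(p) * sum_n w(n),
    where p is the window centre. theta=0 reduces to an ordinary
    convolution (cross-correlation, as in deep learning frameworks).
    """
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=float)
    for i in range(oh):
        for j in range(ow):
            window = x[i:i + kh, j:j + kw]
            centre = x[i + kh // 2, j + kw // 2]
            out[i, j] = np.sum(w * window) - theta * centre * np.sum(w)
    return out
```

With `theta=1` the operator responds only to local intensity differences, so a constant image maps to zero; this is why CDC is often described as emphasizing fine-grained texture cues.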

Abstract:
Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character’s style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts.

Abstract:
The insufficient high-throughput modeling capability for high-dimensional, multiscale, and nonlinear real-world observations and measurements stands as one of the major impediments to modern scientific advancement. In this regard, machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven disposition. With the ever-increasing stream of research data collection, it would be appealing to automate the exploration of patterns and insights from observational data for discovering novel classes of phenotypes and entities. However, in biomedical investigation, the cumulative data is intrinsically subject to non-i.i.d. distribution and severe biases amongst different clusters, inducing disorganization and ambiguity in the learned representation space. To contend with these inherent challenges, in this paper, we present a geometry-constrained probabilistic modeling treatment on hyperspherical manifolds. It first parameterizes the approximated posterior of instance-wise embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent shift, and thereafter incorporates a suite of critical inductive biases to organically shape the layout of the tailored embedding space. Together, these advancements offer a systematic solution to regularize the uncontrollable risk for unseen class learning and prospecting. Furthermore, we propose a spectral graph-theoretic method to efficiently estimate the number of potential novel classes and endow the prediction with desirable taxonomy adaptability. Through extensive experiments under various settings, we demonstrate the effectiveness and general applicability of the proposed methods in recognizing and structurally phenotyping novel visual concepts.

Abstract:
The fusion of low-spatial-resolution hyperspectral image (LR-HSI) with high-spatial-resolution multispectral image (HR-MSI) has become an effective way to obtain the high-spatial-resolution hyperspectral image (HR-HSI). Currently, learning-based methods have emerged as the mainstream solution in this field. However, these methods typically rely on predefined or simplified degradation models during fusion training, resulting in inaccurate supervision of the fusion networks. Meanwhile, most methods overlook the degradation characteristics in designing the fusion networks, leading to a mismatch between the degradation and fusion processes. These limitations ultimately result in unsatisfactory fusion performance on real data. To enhance the practicality of learning-based methods, accurate degradation modeling and effective network design have become the critical priorities. We observe that, in practical scenarios, the degree of pixel degradation varies across different positions due to the unforeseen factors such as illumination variations and imaging system fluctuations. Considering this, we propose a non-uniform degradation model (NUD), which introduces non-uniformity into the degradation processes of LR-HSI and HR-MSI. In addition, we emphasize that the essence of fusion is to reverse the degradation process. Therefore, to align with the non-uniform degradation process, the fusion process should exhibit similar positional specificity. For this purpose, we propose a position-aware fusion network (PAF), which employs positional encoding to endow the fusion process with the position-aware attribute. Experimental results show that our proposed methods provide an effective solution for HSI fusion in practical scenarios.

Abstract:
Existing optimization-based methods for non-rigid registration typically minimize an alignment error metric based on the point-to-point or point-to-plane distance between corresponding point pairs on the source surface and target surface. However, these metrics can result in slow convergence or a loss of detail. In this paper, we propose SPARE, a novel formulation that utilizes a symmetrized point-to-plane distance for robust non-rigid registration. The symmetrized point-to-plane distance relies on both the positions and normals of the corresponding points, which yields a more accurate approximation of the underlying geometry and can achieve higher accuracy than existing methods. To solve this optimization problem efficiently, we introduce an as-rigid-as-possible regularization term to estimate the deformed normals and propose an alternating minimization solver using a majorization-minimization strategy. Moreover, for effective initialization of the solver, we incorporate a deformation graph-based coarse alignment that improves registration quality and efficiency. Extensive experiments show that the proposed method greatly improves the accuracy of non-rigid registration and maintains relatively high solution efficiency.
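
To make the difference between the two metrics concrete, the sketch below compares the classic point-to-plane residual with one common symmetrized form that uses the normals of both corresponding points. This is an illustrative assumption: SPARE's exact objective (which also involves deformed normals) may differ, and the function names are hypothetical.

```python
import numpy as np

def point_to_plane_error(p, q, n_q):
    """Sum of squared point-to-plane residuals ((p - q) . n_q)^2.

    p, q, n_q are (N, 3) arrays of source points, target points,
    and target normals for N correspondences.
    """
    r = np.einsum("ij,ij->i", p - q, n_q)
    return float(np.sum(r ** 2))

def symmetric_point_to_plane_error(p, n_p, q, n_q):
    """Sum of squared symmetric residuals ((p - q) . (n_p + n_q))^2."""
    r = np.einsum("ij,ij->i", p - q, n_p + n_q)
    return float(np.sum(r ** 2))
```

For example, with `p = (1, 0, 0)` and `q = (0, 1, 0)` on the unit circle and outward normals at both points, the symmetric residual is zero because the pair is consistent with a common curved surface, whereas the point-to-plane residual is nonzero.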

Abstract:
Temporal Graph Clustering (TGC) is a new task that has received little attention, focusing on node clustering in temporal graphs. Compared with existing static graph clustering, it can balance time and space requirements (Time-Space Balance) through an interaction sequence-based batch-processing pattern. However, two major challenges hinder the development of TGC: inapplicable clustering techniques and inapplicable datasets. To address these challenges, we propose a comprehensive benchmark, called BenchTGC. Specifically, we design a BenchTGC Framework to illustrate the paradigm of temporal graph clustering and improve existing clustering techniques to fit temporal graphs. In addition, we discuss problems with public temporal graph datasets and develop multiple datasets suitable for the TGC task, called BenchTGC Datasets. Through extensive experiments, we not only verify the advantages of BenchTGC, but also demonstrate the necessity and importance of the TGC task. We wish to point out that the dynamically changing and complex scenarios in the real world are the foundation of temporal graph clustering.

Abstract:
Story visualization aims to create visually compelling images or videos corresponding to textual narratives. Despite recent advances in diffusion models yielding promising results, existing methods still struggle to create a coherent sequence of subject-consistent frames based solely on a story. To this end, we propose DreamStory, an automatic open-domain story visualization framework that leverages LLMs and a novel multi-subject consistent diffusion model. DreamStory consists of (1) an LLM acting as a story director and (2) an innovative Multi-Subject consistent Diffusion model (MSD) for generating multiple consistent subjects across the images. First, DreamStory employs the LLM to generate descriptive prompts for subjects and scenes aligned with the story, annotating each scene’s subjects for subsequent subject-consistent generation. Second, DreamStory utilizes these detailed subject descriptions to create portraits of the subjects, with these portraits and their corresponding textual information serving as multimodal anchors (guidance). Finally, the MSD uses these multimodal anchors to generate story scenes with multiple consistent subjects. Specifically, the MSD includes Masked Mutual Self-Attention (MMSA) and Masked Mutual Cross-Attention (MMCA) modules. The MMSA module ensures detailed appearance consistency with reference images, while the MMCA module captures key attributes of subjects from their reference text to ensure semantic consistency. Both modules employ masking mechanisms to restrict each scene’s subjects to referencing the multimodal information of the corresponding subject, effectively preventing blending between multiple subjects. To validate our approach and promote progress in story visualization, we established a benchmark, DS-500, which can assess the overall performance of the story visualization framework, subject-identification accuracy, and the consistency of the generation model. Extensive experiments validate the effectiveness of DreamStory in both subjective and objective evaluations.

Abstract:
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel-level annotation is both labor-intensive and costly, primarily due to the intricate object-background boundaries. Addressing the core question, “Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?”, we propose an affirmative solution. We examine the learned attention patterns for camouflaged objects and introduce a robust zero-shot COS framework. Our findings reveal that while transformer models for salient object segmentation (SOS) prioritize global features in their attention mechanisms, camouflaged object segmentation exhibits both global and local attention biases. Based on these findings, we design a framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS. This enables efficient zero-shot transfer for COS. Specifically, we incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM encoder captures essential local features, while the PEFT module learns global and semantic representations from SOS datasets. To further enhance semantic granularity, we leverage the M-LLM to generate caption embeddings conditioned on visual cues, which are meticulously aligned with multi-scale visual features via MFA. This alignment enables precise interpretation of complex semantic contexts. Moreover, we introduce a learnable codebook to represent the M-LLM during inference, significantly reducing computational demands while maintaining performance. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_\beta^w$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Additionally, our method excels in polyp segmentation and underwater scene segmentation, outperforming challenging baselines in both zero-shot and supervised settings, which indicates its potential in various segmentation tasks.

Abstract:
Out-of-distribution (OOD) detection presents a significant challenge in deploying pattern recognition and machine learning models, as they frequently fail to generalize to data from unseen distributions. Recent advancements in vision-language models (VLMs), particularly CLIP, have demonstrated promising results in OOD detection through their rich multimodal representations. However, current CLIP-based OOD detection methods predominantly rely on single-modality in-distribution (ID) data (e.g., textual cues), overlooking the valuable information contained in ID visual cues. In this work, we demonstrate that incorporating ID visual information is crucial for unlocking CLIP’s full potential in OOD detection. We propose a novel approach, Dual-Pattern Matching (DPM), which effectively adapts CLIP for OOD detection by jointly exploiting both textual and visual ID patterns. Specifically, DPM refines visual and textual features through the proposed Domain-Specific Feature Aggregation (DSFA) and Prompt Enhancement (PE) modules. Subsequently, DPM stores class-wise textual features as textual patterns and aggregates ID visual features as visual patterns. During inference, DPM calculates similarity scores relative to both patterns to identify OOD data. Furthermore, we enhance DPM with lightweight adaptation mechanisms to further boost OOD detection performance. Comprehensive experiments demonstrate that DPM surpasses state-of-the-art methods on multiple benchmarks, highlighting the effectiveness of leveraging multimodal information for OOD detection. The proposed dual-pattern approach provides a simple yet robust framework for leveraging vision-language representations in OOD detection tasks.

Abstract:
Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and large spherical distortion brought by 360° images. To alleviate these problems, we introduce a novel representation, extended bounding field-of-view (eBFoV), for target localization and use it as the foundation of a general 360 tracking framework that is applicable to both omnidirectional visual object tracking and segmentation tasks. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset with 170 sequences and a testing subset with 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset.

Abstract:
In general, learning plentiful knowledge about known objects is an important human ability. Unknown objects can be assumed to depart from this familiar knowledge. Inspired by this idea, we explore leveraging the extracted knowledge to reason about a set of unknown concepts, which can then be used to address unsupervised out-of-distribution object detection (OOD-OD), a task that aims to detect unseen OOD objects without accessing any auxiliary OOD data during training. To this end, we propose a new approach, Unknown-Concept Guided Feature Diffusion (UCFD), which includes an object-related knowledge extractor and an unknown-concept guided diffusor for synthesizing virtual OOD features. Specifically, we define multiple learnable codewords to capture object-relevant visual knowledge from all object categories. To avoid degrading the detection performance on in-distribution (ID) objects, these codewords are utilized to enhance object features. Next, an unknown-concept pool is constructed by mixing up these extracted codewords. Finally, to reduce the impact of lacking OOD data for supervision, we design an unknown-concept guided diffusor, which leverages the sampled unknown concepts from the pool to guide the reverse process to generate expected OOD features that deviate from the familiar knowledge. The significant performance gains on three different tasks demonstrate the superiority of our method. Meanwhile, extensive visualization results show that our method can synthesize effective virtual OOD features.

Abstract:
Deep learning methods have demonstrated state-of-the-art performance in image restoration, especially when trained on large-scale paired datasets. However, acquiring paired data in real-world scenarios poses a significant challenge. Unsupervised restoration approaches based on generative adversarial networks (GANs) offer a promising solution without requiring paired datasets. Yet, these GAN-based approaches struggle to surpass the performance of conventional unsupervised GAN-based frameworks without significantly modifying model structures or increasing the computational complexity. To address these issues, we propose a self-collaboration (SC) strategy for existing restoration models. This strategy utilizes information from the previous stage as feedback to guide subsequent stages, achieving significant performance improvement without increasing the framework’s inference complexity. The SC strategy comprises a prompt learning (PL) module and a restorer (Res). It iteratively replaces the previous, less powerful fixed restorer $\overline{Res}$ in the PL module with a more powerful Res. The enhanced PL module generates better pseudo-degraded/clean image pairs, leading to a more powerful Res for the next iteration. Our SC can significantly improve the restorer’s performance by over 1.5 dB without adding extra parameters or computational complexity during inference. Meanwhile, existing self-ensemble (SE) and our SC strategies enhance the performance of pre-trained restorers from different perspectives. As SE increases computational complexity during inference, we propose a re-boosting module for the SC (Reb-SC) to improve the SC strategy further by incorporating SE into SC without increasing inference time. This approach further enhances the restorer’s performance by approximately 0.3 dB. Additionally, we present a baseline framework that includes parallel generative adversarial branches with complementary “self-synthesis” and “unpaired-synthesis” constraints, ensuring the effectiveness of the training framework. Extensive experimental results on restoration tasks demonstrate that the proposed model performs favorably against existing state-of-the-art unsupervised restoration methods.
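
The remark that self-ensemble (SE) raises inference cost can be seen from a minimal sketch of a flip-based self-ensemble: the restorer is run once per transformed copy of the input and the aligned outputs are averaged. The sketch assumes a hypothetical `restorer` callable that maps a 2D array to a 2D array of the same shape and uses only flips; it illustrates generic SE, not the Reb-SC module.

```python
import numpy as np

def self_ensemble(restorer, img):
    """Average restorer outputs over horizontal/vertical flips of img (H, W)."""
    outputs = []
    for flip_h in (False, True):
        for flip_v in (False, True):
            x = img
            if flip_h:
                x = x[:, ::-1]
            if flip_v:
                x = x[::-1, :]
            y = restorer(x)
            # Undo the flips so every output is aligned with the input.
            if flip_v:
                y = y[::-1, :]
            if flip_h:
                y = y[:, ::-1]
            outputs.append(y)
    return np.mean(outputs, axis=0)
```

This variant calls the restorer four times per image, which is the inference-time overhead that motivates folding SE into training instead.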

Abstract:
Domain generalization (DG) focuses on transferring domain-invariant knowledge from multiple source (training) domains to an a priori unseen target domain(s). This task implicitly requires that classes of interest are expressed in multiple sources (domain-shared) to break spurious domain-class correlations. However, real-world data scarcity challenges may often result in classes present in only a specific domain (domain-linked), which we show leads to extremely poor generalization. In this work, we introduce the domain-linked DG task to the community and develop a methodology to learn useful domain-invariant representations from domain-shared classes for domain-linked ones. Specifically, we propose FOND, a Fairness-inspired and cONtrastive learning objective for Domain-linked DG. Rigorous and reproducible experimental results communicate that FOND accomplishes state-of-the-art improvements for domain-linked classes, given a sufficient number of domain-shared classes and with minimal performance trade-offs. Complementary to these contributions, we theoretically analyze this task and provide practical insights for domain-linked class generalizability.

Abstract:
Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet, both RGB and 3D data are crucial for anomaly detection, and the datasets are seldom completely clean in practical scenarios. To address the above challenges, this paper takes a first look at RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces the Suspected References Selection module to filter a few normal samples from the training dataset, using the multimodal features extracted by the Initial Feature Extraction, and a Suspected Anomaly Map Computation module to generate a suspected anomaly map that focuses on abnormal regions as reference. Stage-II uses the suspected anomaly maps of the reference samples as reference, and takes image, point cloud, and text information as input to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.

Abstract:
To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve computational efficiency, many methods initialize a coarse depth map and then gradually refine it at higher resolutions. Recently, diffusion models have achieved great success in generation tasks. Starting from random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework that introduces diffusion models into MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining a lightweight 2D U-Net and a convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by the diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D.

Abstract:
Video event localization tasks include temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods tend to over-specialize on individual tasks, neglecting the equal importance of these different events for a complete understanding of video content. In this work, we aim to develop a unified framework to solve TAL, SED and AVEL tasks together to facilitate holistic video understanding. However, it is challenging since different tasks emphasize distinct event characteristics and there are substantial disparities in existing task-specific datasets (size/domain/duration). It leads to unsatisfactory results when applying a naive multi-task strategy. To tackle the problem, we introduce UniAV, a Unified Audio-Visual perception network to effectively learn and share mutually beneficial knowledge across tasks and modalities. Concretely, we propose a unified audio-visual encoder to derive generic representations from multiple temporal scales for videos from all tasks. Meanwhile, task-specific experts are designed to capture the unique knowledge specific to each task. Besides, instead of using separate prediction heads, we develop a novel unified language-aware classifier by utilizing semantic-aligned task prompts, enabling our model to flexibly localize various instances across tasks with an impressive open-set ability to localize novel categories. Extensive experiments demonstrate that UniAV, with its unified architecture, significantly outperforms both single-task models and the naive multi-task baseline across all three tasks. It achieves superior or on-par performances compared to the state-of-the-art task-specific methods on ActivityNet 1.3, DESED and UnAV-100 benchmarks.

Abstract:
Accurate 3D object detection from images can be hindered by inherent depth ambiguity. While knowledge distillation (KD) from privileged sensors such as LiDAR offers a promising direction, it often suffers from a critical cross-sensor domain gap. To address this, we introduce DK3D, a novel depth-aware knowledge distillation framework for 3D detection. Our core strategy involves providing the teacher with privileged ground-truth depth during training. This directly avoids the feature representation mismatch and subsequent inefficient knowledge transfer required when distilling from a LiDAR teacher (sparse, geometric) to a camera-based student (dense, semantic). DK3D introduces specialized modules tailored for two primary student paradigms. For depth-assisted models, we employ a channel-wise projection layer (CPL) and an adversarial scoring block (ASB) to align intermediate features at both the pixel and distribution levels. For depth-independent models, a novel vision-depth association module allows the student to implicitly reason about geometry by fusing depth cues with visual features. Both approaches are further enhanced by target-aware spatial response distillation, which captures complex inter-object spatial relationships. Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that DK3D significantly improves performance for both monocular and multi-view 3D detection, outperforming state-of-the-art methods. As a versatile, plug-and-play framework, DK3D boosts existing models without requiring additional training data or increasing the computational cost at inference.

Abstract:
Transformer-based approaches have shown promising performance in image restoration tasks due to their ability to model long-range dependencies, which are essential for recovering clear images. Although various efficient attention mechanisms have been proposed to address the intensive computational loads of transformers, they often suffer from redundant information and noisy interactions from irrelevant regions, as they consider all available tokens. In this work, we propose an Adaptive Sparse Transformer (AST-v2) to mitigate these issues by reducing noisy interactions in irrelevant areas and removing feature redundancy along channel dimension. AST-v2 incorporates two core components: an Adaptive Sparse Self-Attention (ASSA) block and a Feature Refinement Feed-forward Network (FRFN). ASSA adopts a dual-branch design, where the sparse branch guides the modulation of standard dense attention weights. This paradigm reduces the negative impact of irrelevant token interactions while preserving the important ones. Meanwhile, FRFN utilizes an enhance-and-ease scheme to eliminate feature redundancy across channels, enhancing the restoration of clear images. Experimental results on commonly used benchmarks show the competitive performance of our method for 6 restoration tasks, including rain streak removal, haze removal, shadow removal, snow removal, blur removal, and low-light enhancement. The code is available in the supplementary materials.

Abstract:
Pseudo-labeling is a dominant strategy for cross-domain semantic segmentation (CDSS), yet its effectiveness is limited by fragmented and noisy pixel-level predictions under severe domain shifts. To address this, we propose a semantic connectivity-driven pseudo-labeling framework, SeCo, which constructs and refines pseudo-labels at the connectivity level by aggregating high-confidence pixels into coherent semantic regions. The framework includes two key components: Pixel Semantic Aggregation (PSA), which leverages a dual prompting strategy to preserve category-specific granularity, and Semantic Connectivity Correction with Loss Distribution (SCC-LD), which filters noisy regions based on early-loss statistics. Building upon this foundation, we further present SeCoV2, which introduces SCC-Unc, a novel uncertainty-aware correction module that constructs a connectivity graph and enforces relational consistency for robust refinement in ambiguous regions. SeCoV2 also broadens the applicability of SeCo by extending evaluation to more challenging scenarios, including open-set and multimodal adaptation and semi-supervised domain generalization, and by validating compatibility with different interactive foundation segmentation models such as SAM (Kirillov et al., 2023), SEEM (Zou et al., 2023), and Fast-SAM (Zhao et al., 2023). Extensive experiments across six CDSS tasks demonstrate that SeCoV2 achieves consistent improvements over previous methods, with an average performance gain of up to +4.6%, establishing new state-of-the-art results. These findings highlight the effectiveness and generalization ability of SeCoV2 for robust adaptation in diverse real-world environments.

Abstract:
Content-based video copy localization (VCL) aims to detect and locate copied segments in pairs of videos. VCL requires fine-grained video analysis to robustly identify copied segments that have been edited. Despite recent progress, the prohibitive cost of annotating copied segments and the lack of a fine-grained benchmark hinder the development of effective VCL systems. In this work, we annotate a new real-world dataset, FiGVCL, with challenging scenarios designed to evaluate VCL methods. FiGVCL is carefully annotated to preserve the temporal correspondences observed in copied segments. Moreover, we propose a novel fine-grained VCL benchmark metric based on temporal correspondences to improve discriminability. Finally, we design a simple but effective baseline model that uses fine-grained local embeddings for accurate copied segment localization. We also present an unsupervised training strategy that outperforms previous supervised VCL methods.

Abstract:
Temporal action detection aims to predict temporal boundaries and category labels of actions in untrimmed videos. In the past years, many weakly supervised temporal action detection methods have been proposed to relieve the annotation cost of fully supervised methods. Due to the discrepancy between action localization and action classification, the two-branch structure is widely adopted by existing weakly supervised methods, where the classification branch is used to predict a category-wise score and the localization branch is used to predict a foreground score for each segment. Under the weakly supervised setting, model training is mainly guided by video-level or sparse segment-level annotations. As a result, the classification branch tends to focus on the most discriminative segments while ignoring less discriminative ones so as to minimize the classification cost, and the localization branch may assign high foreground scores to some negative segments. This phenomenon can severely damage the action detection performance, because the foreground scores and classification scores are combined in the testing stage for action detection. To deal with this problem, several methods have been proposed to encourage consistency between the classification branch and the localization branch. However, these methods only consider video-level or segment-level consistency, without requiring the relations among different segments to be consistent. In this paper, we propose a Cross-Task Relation-Aware Consistency (CRC) strategy for weakly supervised temporal action detection, including an intra-video consistency module and an inter-video consistency module. The intra-video consistency module encourages the relationships among segments from the same video to be consistent across the two branches, and the inter-video consistency module does the same for segments from different videos. These two modules are complementary to each other. Experimental results show that the proposed CRC strategy can consistently improve the performance of existing weakly supervised methods, including click-level supervised methods (e.g., LACP (Lee et al., 2021)), video-level supervised methods (e.g., DELU (Chen et al., 2022)) and unsupervised methods (e.g., BaS-Net (Lee et al., 2020)), verifying the generality and effectiveness of the proposed method.

Abstract:
In the challenging realm of image-to-image translation, most traditional methods require separate models for different translation directions, leading to inefficient use of computational resources. This paper introduces the Bidirectional Brownian Bridge Diffusion Model (BiBBDM), a novel approach that leverages Brownian Bridge processes for bidirectional image-to-image translation. Unlike conventional Diffusion Models (DMs) that treat image-to-image translation as a unidirectional conditional generation process, BiBBDM models the translation as a stochastic Brownian Bridge process, enabling simultaneous learning of bidirectional translation between two domains. This innovation allows our method to achieve bidirectional image translation using different sampling directions of a single model, eliminating the need for multiple models for both translation directions. To the best of our knowledge, BiBBDM is the first image translation framework to achieve simultaneous dual-domain sampling with the same model and parameters, based on Brownian Bridge diffusion processes. Extensive experimental results on various benchmarks demonstrate that BiBBDM achieves competitive performance, as evidenced by both visual inspection and quantitative metrics.
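
As a rough illustration of why a Brownian Bridge suits translation in both directions, the sketch below samples the marginal of a standard Brownian bridge pinned at a source image `x0` at time 0 and a target image `y` at time `T`. The mean interpolates linearly between the endpoints and the variance `t(T - t)/T` vanishes at both ends, so the same process can be read from either side. The function name, the `scale` factor, and the linear schedule are assumptions for illustration; the abstract does not specify the schedule or parameterization BiBBDM actually uses.

```python
import numpy as np

def bridge_marginal(x0, y, t, T, rng, scale=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and y (t=T).

    Hypothetical helper: `scale` multiplies the textbook bridge variance
    t * (T - t) / T and is not taken from the paper.
    """
    m = t / T                               # interpolation weight in [0, 1]
    mean = (1.0 - m) * x0 + m * y           # linear path between the two domains
    var = scale * t * (T - t) / T           # zero at both endpoints
    return mean + np.sqrt(var) * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = np.zeros((4, 4))                       # stand-in for a source-domain image
y = np.ones((4, 4))                         # stand-in for a target-domain image
x_mid = bridge_marginal(x0, y, t=5, T=10, rng=rng)
```

Because the noise disappears at `t = 0` and `t = T`, a denoiser trained on such intermediate states could in principle be stepped toward either endpoint, which is the intuition behind sampling in two directions with one model.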

Abstract:
Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM blocks, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Extensive experiments on the NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based, Transformer-based, and diffusion-based methods while maintaining high computational efficiency.

Abstract:
Semi-supervised 3D object detection from point clouds aims to train a detector with a small amount of labeled data and a large amount of unlabeled data. Among existing methods, pseudo-label based methods have achieved superior performance, and their core lies in how to select high-quality pseudo-labels with a designed quality evaluation criterion. Despite the success of these methods, they all consider localization and classification quality estimation from a global perspective. For localization quality, they use a global score threshold to filter out low-quality pseudo-labels and assign equal importance to each side during training, ignoring the fact that sides with different localization quality should not be treated equally. Besides, a large number of pseudo-labels are discarded due to the high global threshold, even though they may contain some correctly predicted sides that are helpful for model training. For classification quality, they usually combine the objectness score and the classification confidence score to filter out pseudo-labels. Their main focus is designing effective classification confidence evaluation metrics, neglecting the importance of predicting a better objectness score. In this paper, we propose SA3Det++, a side-aware quality estimation method for semi-supervised object detection, which consists of a probabilistic side localization strategy, a side-aware quality estimation strategy, and a soft pseudo-label selection strategy. Extensive results demonstrate that the proposed method consistently outperforms the baseline methods under different scenes and evaluation criteria.

Abstract:
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability to CIR. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach, named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion), that involves mapping the visual information of the reference image into a pseudo-word token in the CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets – FashionIQ, CIRR, and the proposed CIRCO – and two additional evaluation settings, namely domain conversion and object composition.

Abstract:
Contrastive learning has shown remarkable success in the domain of skeleton-based action recognition. However, the design of data transformations, which is crucial for effective contrastive learning, remains a challenging aspect in the context of skeleton-based action recognition. The difficulty lies in creating data transformations that capture rich motion patterns while ensuring that the transformed data retains the same semantic information. To tackle this challenge, we introduce an innovative framework called ActCLR+ (Actionlet-Dependent Contrastive Learning), which explicitly distinguishes between static and dynamic regions in a skeleton sequence. We begin by introducing the concept of actionlet, connecting self-supervised learning quantitatively with downstream tasks. Actionlets represent regions in the skeleton where features closely align with action prototypes, highlighting dynamic sequences as distinct from static ones. We propose an anchor-based method for unsupervised actionlet discovery, establishing a motion-adaptive data transformation approach based on this discovery. This motion-adaptive data transformation strategy tailors data transformations for actionlet and non-actionlet regions, respectively, introducing more diverse motion patterns while preserving the original motion semantics. Additionally, we incorporate a semantic-aware masked motion modeling technique to enhance the learning of actionlet representations. Our comprehensive experiments on well-established benchmark datasets such as NTU RGB+D and PKUMMD validate the effectiveness of our proposed method.

Abstract:
Recovering the intrinsic physical attributes of a scene from images, generally termed the inverse rendering problem, has been a central and challenging task in computer vision and computer graphics. In this paper, we present GUS-IR, a novel framework designed to address the inverse rendering problem for complicated scenes featuring rough and glossy surfaces. This paper starts by analyzing and comparing the effectiveness of two prominent shading techniques popularly used for inverse rendering, forward shading and deferred shading, in handling complex materials. More importantly, we propose a unified shading solution that combines the advantages of both techniques for better decomposition. In addition, we analyze the normal modeling in 3D Gaussian Splatting (3DGS) and utilize the shortest axis as the normal for each particle in GUS-IR, along with a depth-related regularization, resulting in improved geometric representation and better shape reconstruction. Furthermore, we enhance the probe-based baking scheme proposed by GS-IR to achieve more accurate ambient occlusion modeling and better handle indirect illumination. Extensive experiments have demonstrated the superior performance of GUS-IR in achieving precise intrinsic decomposition and geometric representation, supporting many downstream tasks (such as relighting and retouching) in computer vision, graphics, and extended reality.

Abstract:
Adversarial training has emerged as a highly effective way to improve the robustness of deep neural networks (DNNs). It is typically conceptualized as a min-max optimization problem over model weights and adversarial perturbations, where the weights are optimized using gradient descent methods, such as SGD. In this paper, we propose a novel approach by treating model weights as random variables, which paves the way for enhancing adversarial training through Second-Order Statistics Optimization (S²O) over model weights. We challenge and relax a prevalent, yet often unrealistic, assumption in prior PAC-Bayesian frameworks: the statistical independence of weights. From this relaxation, we derive an improved PAC-Bayesian robust generalization bound. Our theoretical developments suggest that optimizing the second-order statistics of weights can substantially tighten this bound. We complement this theoretical insight by conducting an extensive set of experiments that demonstrate that S²O not only enhances the robustness and generalization of neural networks when used in isolation, but also seamlessly augments other state-of-the-art adversarial training techniques.

Abstract:
In recent times, following the paradigm of DETR (DEtection TRansformer), query-based end-to-end instance segmentation (QEIS) methods have exhibited superior performance compared to CNN-based models, particularly when trained on large-scale datasets. Nevertheless, the effectiveness of these QEIS methods diminishes significantly when confronted with limited training data. This limitation arises from their reliance on substantial data volumes to effectively train the pivotal queries/kernels that are essential for acquiring localization and shape priors. To address this problem, we propose a novel method for unsupervised pre-training in low-data regimes. Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts (UPLVP), which improves QEIS models’ instance segmentation by bringing language-vision prompts to queries/kernels. Our method consists of three parts: (1) Masks Proposal: Utilizes language-vision models to generate pseudo masks based on unlabeled images. (2) Prompt-Kernel Matching: Converts pseudo masks into prompts and injects the best-matched localization and shape features to their corresponding kernels. (3) Kernel Supervision: Formulates supervision for pre-training at the kernel level to ensure robust learning. With the help of our pre-training method, QEIS models can converge faster and perform better than CNN-based models in low-data regimes. Experimental evaluations conducted on MS COCO, Cityscapes, and CTW1500 datasets indicate that the QEIS models’ performance can be significantly improved when pre-trained with our method.

Abstract:
Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despite its success, EDL incorporates several nonessential settings. In model construction, (1) a commonly ignored prior weight parameter is fixed to the number of classes, while its value actually impacts the balance between the proportion of evidence and its magnitude in deriving predictive scores. In model optimization, (2) the empirical risk features a variance-minimizing optimization term that biases the PDF towards a Dirac delta function, potentially exacerbating overconfidence. (3) Additionally, the structural risk typically includes a KL-divergence-minimizing regularization, whose optimization direction extends beyond the intended purpose and contradicts common sense, diminishing the information carried by the evidence magnitude. Therefore, we propose Re-EDL, a simplified yet more effective variant of EDL, by relaxing the nonessential settings and retaining the essential one, namely, the adoption of projected probability from subjective logic. Specifically, Re-EDL treats the prior weight as an adjustable hyperparameter rather than a fixed scalar, and directly optimizes the expectation of the Dirichlet PDF while deprecating both the variance-minimizing optimization term and the divergence regularization term. Extensive experiments and state-of-the-art performance validate the effectiveness of our method.
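
To make the role of the prior weight concrete, here is a minimal sketch of the projected probability from subjective logic with a uniform base rate: belief masses are `e_k / (S + W)`, the uncertainty mass is `W / (S + W)`, and the projected probability is `b_k + u / K`, where `S` is the total evidence and `W` the prior weight. With `W = K` this reduces to the mean of a Dirichlet with parameters `e_k + 1`. The function name and the exact parameterization are assumptions for illustration and may not match the one used in Re-EDL.

```python
import numpy as np

def projected_probability(evidence, prior_weight=None):
    """Projected probability under subjective logic with a uniform base rate.

    Hypothetical helper: `prior_weight` defaults to the number of classes K,
    the fixed setting the abstract attributes to conventional EDL.
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.shape[-1]
    W = float(K) if prior_weight is None else float(prior_weight)
    total = evidence.sum(axis=-1, keepdims=True)
    belief = evidence / (total + W)         # evidence-backed mass per class
    uncertainty = W / (total + W)           # mass left unassigned
    prob = belief + uncertainty / K         # spread the unassigned mass uniformly
    return prob, uncertainty.squeeze(-1)

p_default, u_default = projected_probability([4.0, 1.0, 0.0])
p_heavy, u_heavy = projected_probability([4.0, 1.0, 0.0], prior_weight=30.0)
```

A larger prior weight pulls the scores toward uniform for the same evidence, which is the balance between evidence proportion and evidence magnitude that the abstract describes.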

Abstract:
Large-scale high-resolution remote sensing images (LSHR) are increasingly adopted for object detection, since they capture finer details. However, LSHR images impose a substantial computational cost. Existing methods explore lightweight backbones and advanced oriented bounding box regression mechanisms. Nevertheless, they still rely on high-resolution inputs to maintain detection accuracy. We observe that LSHR images comprise extensive background areas that can be compressed to reduce unnecessary computation, while object regions contain details that can be preserved to improve detection accuracy. Thus, we propose a hybrid Gaussian deformation module that dynamically adjusts the sampling density at each location based on its relevance to the detection task, i.e., high-density sampling preserves more object regions and better retains detailed features, while low-density sampling diminishes the background proportion. Further, we introduce a bilateral deform-uniform detection framework to exploit the potential of the deformed sampled low-resolution images and original high-resolution images. Specifically, a deformed deep backbone takes the deformed sampled images as inputs to produce high-level semantic information, and a uniform shallow backbone takes the original high-resolution images as inputs to generate precise spatial location information. Moreover, we incorporate a deformation-aware feature registration module that calibrates the spatial information of deformed features, preventing degenerate regression solutions while maintaining feature activation. Subsequently, we introduce a feature relationship interaction fusion module to balance the contributions of features from both deformed and uniform backbones. Comprehensive experiments on three challenging datasets show that our method achieves superior performance compared with the state-of-the-art methods.

Affiliations: Electrical and Computer Engineering Department, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; State Key Laboratory of Complex & Critical Software Environment, Beihang University, Beijing, China; Department of Data Science and AI, Faculty of IT, Monash University, Clayton, VIC, Australia; Electrical Engineering Department, Yale University, New Haven, CT, USA; Department of Automation, Tsinghua University, Beijing, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore

Abstract:
Diffusion models, widely used for image generation, face significant challenges related to their broad applicability due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues. However, unlike traditional models, diffusion models critically rely on the time-step for the multi-round denoising. Typically, each time-step is encoded into a hypersensitive temporal feature by several modules. Despite this, existing PTQ methods do not optimize these modules individually. Instead, they employ unsuitable reconstruction objectives and complex calibration methods, leading to significant disturbances in the temporal feature and denoising trajectory, as well as reduced compression efficiency. To address these challenges, we introduce a novel quantization framework that includes three strategies: 1) TIB-based Maintenance: Based on our innovative Temporal Information Block (TIB) definition, Temporal Information-aware Reconstruction (TIAR) and Finite Set Calibration (FSC) are developed to efficiently align original temporal features. 2) Cache-based Maintenance: Instead of indirect and complex optimization for the related modules, pre-computing and caching quantized counterparts of temporal features are developed to minimize errors. 3) Disturbance-aware Selection: Employ temporal feature errors to guide a fine-grained selection between the two maintenance strategies for further disturbance reduction. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets, diffusion models and hardware confirms our superior performance and acceleration.

Affiliations: Key Laboratory of Big Data & Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University (BJTU), Beijing, China; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, Chinese Academy of Sciences (CASIA), Beijing, China

Abstract:
Color constancy, the human visual system’s ability to perceive consistent colors under varying illumination conditions, is crucial for accurate color perception. Recently, deep learning algorithms have been introduced into this task and have achieved remarkable results. However, existing methods are limited by the scale of current multi-illumination datasets and by model size, hindering their ability to learn discriminative features effectively and their practical value for deployment in cameras. To overcome these limitations, this paper proposes a multi-illumination color constancy approach based on self-supervised learning and knowledge distillation. This approach includes three phases: self-supervised pre-training, supervised fine-tuning, and knowledge distillation. During the pre-training phase, we train Transformer-based and U-Net-based encoders with two pretext tasks: a light normalization task to learn lighting color contextual representation and a grayscale colorization task to acquire objects’ inherent color information. For the downstream color constancy task, we fine-tune the encoders and design a lightweight decoder to obtain better illumination distributions with fewer parameters. During the knowledge distillation phase, we introduce a hybrid knowledge distillation technique to align CNN features with those of the Transformer and U-Net, respectively. Our proposed method outperforms state-of-the-art techniques on multi-illumination and single-illumination benchmarks. Extensive ablation studies and visualizations confirm the effectiveness of our model.

Abstract:
Multi-view clustering (MVC) is a fast-growing research direction. However, most existing MVC works focus on concrete objects (e.g., cats, desks) but ignore abstract objects (e.g., knowledge, thoughts), which are also important parts of our daily lives and more correlated to cognition. Relational knowledge, as a typical abstract concept, describes the relationship between entities. For example, “Cats like eating fishes,” as relational knowledge, reveals the relationship “eating” between “cats” and “fishes.” To fill this gap, we first point out that MVC on relational knowledge is considered an important scenario. Then, we construct 8 new datasets to lay research grounds for them. Moreover, a simple yet effective relational knowledge MVC paradigm (RK-MVC) is proposed by compensating the omitted sample-global correlations from the structural knowledge information. Concretely, the basic consensus features are first learned via adopted MVC backbones, and sample-global correlations are generated in both coarse-grained and fine-grained manners. In particular, the sample-global correlation learning module can be easily extended to various MVC backbones. Finally, both basic consensus features and sample-global correlation features are weighted fused as the target consensus feature. We adopt 9 typical MVC backbones in this paper for comparison from 7 aspects, demonstrating the promising capacity of our RK-MVC.

Abstract:
This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with <1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than 2× and 3× in speed with superior accuracy.
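
The idea of folding a difference convolution into a standard one can be illustrated with the simplest case, a central pixel difference: the response `sum_i w_i * (x_i - x_c)` equals a plain convolution whose center weight has the kernel sum subtracted. The sketch below checks this identity numerically. It covers only the central variant and uses hypothetical function names; the DCR strategy in the paper may handle other PDC types and details differently.

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 'valid' cross-correlation of a 2D array with a k x k kernel."""
    k = w.shape[0]
    out = np.empty((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def central_pdc(x, w):
    """Central pixel difference convolution: weights act on (x_i - x_center)."""
    k = w.shape[0]
    c = k // 2
    out = np.empty((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k]
            out[i, j] = np.sum(w * (patch - patch[c, c]))
    return out

def reparameterize(w):
    """Fold the difference into the kernel by subtracting sum(w) at the center."""
    w_std = w.copy()
    c = w.shape[0] // 2
    w_std[c, c] -= w.sum()
    return w_std

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 7))
w = rng.standard_normal((3, 3))
y_pdc = central_pdc(x, w)
y_std = conv2d_valid(x, reparameterize(w))   # same output, standard convolution only
```

After the fold, inference needs only the standard convolution, so the difference operation adds no extra cost at that stage.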

Abstract:
Most deraining methods work on daytime scenes while leaving nighttime deraining underexplored, where darkness and non-uniform illumination pose additional challenges. Consequently, night rain has a quite different appearance that varies by location and cannot be effectively handled. To address this issue, we propose a Rain Location Prior (RLP), learned implicitly from rainy images, to reflect rain location information and boost the performance of deraining models by prior injection. Then, we introduce a Rain Prior Injection Module (RPIM) with a multi-scale scheme to modulate the prior by attention and emphasize the features of rain streak areas for better injection efficiency. Finally, to alleviate the data scarcity issue and facilitate research on nighttime deraining, we propose the GTAV-NightRain dataset by considering the interaction between rain streaks and non-uniform illumination, and provide detailed instructions on the data collection pipeline, which is highly replicable and flexible enough to integrate challenging factors of rainy nights in the future. Our method outperforms the state-of-the-art backbone by 1.3 dB in PSNR and generalizes better on real data such as heavy rain and the presence of glow and glaring lights. Ablation studies are conducted to validate the effectiveness of each component, and we visualize RLP to show good interpretability. Moreover, we apply our method to daytime deraining and desnowing to show good generalizability to other location-dependent degradations. Our method is a step forward in nighttime deraining, and the GTAV-NightRain dataset may become a good complement to previous datasets.

Abstract:
Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representations from input videos for accurate prediction. Second, we sparsify the intermediate features of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representations with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with an information-guided strategy, prompting matting-related representations in full-precision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.

Abstract:
Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at https://github.com/WonderSeven/FNSDA.

Abstract:
This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate its effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach’s individual components.

Abstract:
In practical applications, the difficulty of multi-view data annotation poses a challenge for multi-view semi-supervised learning. Although some graph-based approaches have been proposed for this task, they often struggle to capture long-range information, suffer from memory bottlenecks, and usually encounter over-smoothing. To address these issues, this paper proposes an implicit model, named multi-channel Equilibrium Graph Neural Network (MEGNN). Through an equilibrium point iterative process, the proposed MEGNN naturally captures long-range information and effectively reduces memory consumption compared with explicit models. Furthermore, the proposed method deals with the issue of over-smoothing in deep graph convolutional networks through a residual connection and a shrinkage factor. We analyze the effect of the shrinkage factor on the information capturing capability of the model, and demonstrate that the proposed method does not encounter over-smoothing. Comprehensive experimental results demonstrate that the proposed method outperforms the state-of-the-art methods.

Abstract:
We propose that spaceborne polarimetric imagers can be calibrated, or self-calibrated using zodiacal light (ZL). ZL is created by a cloud of interplanetary dust particles. It has a significant degree of polarization in a wide field of view. From space, ZL is unaffected by terrestrial disturbances. ZL is insensitive to the camera location, so it is suited for simultaneous cross-calibration of satellite constellations. ZL changes on a scale of months, thus being a quasi-constant target in realistic calibration sessions. We derive a forward model for polarimetric image formation. Based on it, we formulate an inverse problem for polarimetric calibration and self-calibration, as well as an algorithm for the solution. The methods here are demonstrated in simulations. Towards these simulations, we render polarized images of the sky, including ZL from space, polarimetric disturbances, and imaging noise.

Abstract:
In real-life passive non-line-of-sight (NLOS) imaging there is an overwhelming amount of undesired scattered radiance, called clutter, that impedes reconstruction of the desired NLOS scene. This paper explores using the spectral domain of the scattered light field to separate the desired scattered radiance from the clutter. We propose two techniques: The first separates the multispectral scattered radiance into a collection of objects each with their own uniform color. The objects which correspond to clutter can then be identified and removed based on how well they can be reconstructed using NLOS imaging algorithms. This technique requires very few priors and uses off-the-shelf algorithms. For the second technique, we derive and solve a convex optimization problem assuming we know the desired signal's spectral content. This method is quicker and can be performed with fewer spectral measurements. We demonstrate both techniques using realistic scenarios. In the presence of clutter that is 50 times stronger than the desired signal, the proposed reconstruction of the NLOS scene is 23 times more accurate than typical reconstructions and 5 times more accurate than using the leading clutter rejection method.

Abstract:
We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.

Abstract:
Machine-learning models demand periodic updates to improve their average accuracy, exploiting novel architectures and additional data. However, a newly updated model may commit mistakes the previous model did not make. Such misclassifications are referred to as negative flips, experienced by users as a regression of performance. In this work, we show that this problem also affects robustness to adversarial examples, hindering the development of secure model update practices. In particular, when updating a model to improve its adversarial robustness, previously ineffective adversarial attacks on some inputs may become successful, causing a regression in the perceived security of the system. We propose a novel technique, named robustness-congruent adversarial training, to address this issue. It amounts to fine-tuning a model with adversarial training, while constraining it to retain higher robustness on the samples for which no adversarial example was found before the update. We show that our algorithm and, more generally, learning with non-regression constraints, provides a theoretically-grounded framework to train consistent estimators. Our experiments on robust models for computer vision confirm that both accuracy and robustness, even if improved after model update, can be affected by negative flips, and our robustness-congruent adversarial training can mitigate the problem, outperforming competing baseline methods.
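
The notion of a negative flip described above can be made concrete with a small sketch. The function name `negative_flip_rate` and the toy predictions are illustrative assumptions, not taken from the paper; the sketch only counts samples that the old model classified correctly and the updated model misclassifies. The same count could presumably be applied to robustness by replacing "classified correctly" with "no adversarial example found", but that extension is not shown here.

```python
def negative_flip_rate(labels, old_preds, new_preds):
    """Fraction of samples the old model got right and the new model gets wrong."""
    if not (len(labels) == len(old_preds) == len(new_preds)):
        raise ValueError("all sequences must have the same length")
    if not labels:
        return 0.0
    flips = sum(
        1
        for y, old, new in zip(labels, old_preds, new_preds)
        if old == y and new != y
    )
    return flips / len(labels)


# Toy data (hypothetical): the update raises accuracy from 4/6 to 5/6,
# yet the sample at index 3 regresses from correct to incorrect.
labels = [0, 1, 1, 0, 2, 2]
old_preds = [0, 1, 0, 0, 2, 1]
new_preds = [0, 1, 1, 1, 2, 2]
rate = negative_flip_rate(labels, old_preds, new_preds)
```

In this toy case the average accuracy improves while one of six samples is a negative flip, which is the situation the abstract describes as a perceived regression.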

Abstract:
Vision-Language Pretraining (VLP) has produced a series of powerful foundation models, which continuously advance the state of the art on various multimodal tasks. However, there has been limited exploration of their potential for large-scale image retrieval. In a real-world image retrieval system, images are collected together with user-annotated tags from the web. These tags contain various information about the corresponding image and could be used as weak supervision for image representation learning. In this paper, we seek to harness the powerful image-and-text alignment ability of VLP foundation models to enhance compact image representation. Specifically, we propose a new weakly supervised hashing framework, which alternately learns a deep hashing network and enhances the weak supervision. First, we extract the image and tag representations from VLP foundation models, and learn the deep hashing network with a policy gradient process, which directly optimizes the retrieval performance, i.e., mAP. Then, given the learned deep hashing network, we further enhance the weak supervision with a separate probabilistic decision process. This process also optimizes the retrieval performance, using the ground truth defined by the learned hashing network. These two processes are repeated alternately for a fixed number of steps. Experiments on public image datasets demonstrate the effectiveness of our method.

Abstract:
Understanding and reasoning about objects’ physical properties in the natural world is a fundamental challenge in artificial intelligence. While some properties like colors and shapes can be directly observed, others, such as mass and electric charge, are hidden from the objects’ visual appearance. This paper addresses the unique challenge of inferring these hidden physical properties from objects’ motion and interactions and predicting corresponding dynamics based on the inferred physical properties. We first introduce the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes limited videos of them moving and interacting under different initial conditions. The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions. Besides the synthetic videos from simulators, we also collect a real-world dataset to further test the physical reasoning abilities of different models. We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties, which leads to inferior performance. We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties from question answering. Leveraging an object-centric representation, PCR utilizes videos and the associated natural language to infer objects’ physical properties without dense object annotations. Furthermore, it incorporates property-aware graph networks to approximate the dynamic interactions among objects. PCR also employs a semantic parser to convert questions into semantic programs, and a program executor to execute the programs based on the learned physical properties and dynamics. After training, PCR demonstrates remarkable capabilities. It can detect and associate objects across frames, ground visible and hidden physical properties, make future and counterfactual predictions, and utilize these extracted representations to answer challenging questions. We hope the proposed ComPhy dataset and the PCR model present a promising step towards more comprehensive physical reasoning in AI systems.

Abstract:
Traffic scene perception underpins essential tasks like map construction and route planning in modern intelligent transportation systems, thus receiving extensive attention. However, existing methods tend to concentrate solely on specific elements, lacking a comprehensive understanding of various traffic scenes. This paper addresses the Visual Traffic Knowledge Graph Generation (VTKGG) task, aiming to extract and represent traffic information from various elements in the traffic scene image as a knowledge graph. To achieve this, we propose Query-Denoising Network (QDNet) to integrate multiple subtasks through different types of queries in an end-to-end manner. These queries facilitate information communication between different modules, streamlining the generation of visual traffic knowledge graphs by eliminating cumbersome intermediate steps. Considering the challenges in optimizing such a cascaded multi-task model, we incorporate the query-denoising method into the training process of QDNet. By introducing the noised query, enhancing the internal noise of the model, and forcing the model to recover the ground truth, our approach achieves accurate results. This strategy improves the robustness and performance of our model. We conduct extensive ablation and comparative experiments to demonstrate the superiority and effectiveness of our framework and strategy, and experiments on a similar task Panoptic Scene Graph Generation also demonstrate its superiority.

Abstract:
Current point cloud registration algorithms struggle to effectively handle both deformations and occlusions simultaneously. Our manifold analysis reveals this limitation arises from the inaccurate modeling of the shape's underlying manifold and the lack of an effective optimization strategy for fragmented manifold structures. In this paper, we present AniSym-Net, a novel non-rigid registration framework designed to address near-isometric deformation registration in the presence of occlusions. To encode an object's coarse topological properties and local geometric information, AniSym-Net introduces a novel anisotropic hybrid shape-motion deformation field. The effectiveness of the anisotropic hybrid shape-motion fields relies on both the holonomic constraints from the symplectic structure modeling in AniSym-Net and the motion-conditional cross-attention during fusion, which calibrates geometric features using velocity-boundary constrained point motion patterns. The harmonization of correspondences derived from anisotropic hybrid fields and those from motion-shape fields significantly mitigates registration errors and occlusions. This is achieved through the optimization of loop closures of cotangent bundles within the symplectic manifold framework. We conduct a comprehensive evaluation across five popular benchmarks, namely CAPE, DT4D, SAPIEN, FAUST, and DeepDeform, to demonstrate AniSym-Net's superior performance compared to state-of-the-art methods. Code will be publicly available.

Abstract:
Video prediction aims to predict future frames by modeling the complex spatiotemporal dynamics in videos. However, most existing methods model the temporal information and the spatial information of videos independently and do not fully explore the correlations between the two. In this paper, we propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond by exploring the significant spatiotemporal correlations in videos. On the one hand, the motion-aware attention weights are learned from the spatial states to help aggregate the temporal states in the temporal domain. On the other hand, the appearance-aware attention weights are learned from the temporal states to help aggregate the spatial states in the spatial domain. In this way, the temporal information and the spatial information become aware of each other in both domains, and the spatiotemporal receptive field is broadened for more reliable spatiotemporal modeling. Experiments are conducted not only on video prediction tasks (deterministic and stochastic) but also on a task beyond video prediction, early action recognition. Experimental results show that the proposed STAU can achieve satisfactory performance on all tasks compared with other methods.

Abstract:
Stochastic gradient descent (SGD) performed in an asynchronous manner plays a crucial role in training large-scale machine learning models. However, the generalization performance of asynchronous delayed SGD, which is an essential metric for assessing machine learning algorithms, has rarely been explored. Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization. In this paper, we investigate sharper generalization error bounds for SGD with asynchronous delay $\tau$. Leveraging the generating function analysis tool, we first establish the average stability of the delayed gradient algorithm. Based on this algorithmic stability, we provide upper bounds on the generalization error of $\widetilde{\mathcal{O}}\left(\frac{T-\tau}{n\tau}\right)$ and $\widetilde{\mathcal{O}}\left(\frac{1}{n}\right)$ for quadratic convex and strongly convex problems, respectively, where $T$ refers to the iteration number and $n$ is the amount of training data. Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm. Analogous analysis can be generalized to the random delay setting, and the experimental results validate our theoretical findings.
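
To make the delayed update concrete, the following is a minimal sketch of SGD with a fixed delay on a one-dimensional least-squares problem. Everything here (the function `delayed_sgd`, the toy data, the step size) is an illustrative assumption rather than the paper's setup. It shows only the update rule, in which the gradient at step t is evaluated at the iterate from tau steps earlier, and does not reproduce the generalization bounds.

```python
import random


def delayed_sgd(data, tau, steps, lr, seed=0):
    """SGD on the loss 0.5 * (w * x - y)**2 with gradients that are tau steps stale."""
    rng = random.Random(seed)
    history = [0.0]  # history[t] holds the iterate w_t, starting from w_0 = 0
    for t in range(steps):
        stale_w = history[max(0, t - tau)]  # iterate seen by the delayed worker
        x, y = data[rng.randrange(len(data))]
        grad = (stale_w * x - y) * x  # gradient evaluated at the stale iterate
        history.append(history[-1] - lr * grad)  # applied to the current iterate
    return history[-1]


# Hypothetical noiseless data generated by y = 3 * x.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 11)]
w_sync = delayed_sgd(data, tau=0, steps=2000, lr=0.1)
w_delayed = delayed_sgd(data, tau=4, steps=2000, lr=0.1)
```

With this small step size both runs approach w = 3. Setting `tau=0` recovers ordinary SGD, so the two calls differ only in the staleness of the gradient.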

Abstract:
In this work, we explore the rendering of photo-realistic free-viewpoint hand pose animation. We present HandNeRF, the first NeRF-based framework to reconstruct accurate appearance and geometry for interacting hands. To overcome the texture contamination and shape artifact problems when dealing with complex interacting scenarios, we further introduce HandNeRF++ to achieve better performance. In our advanced framework, a pose-driven deformation field is designed to establish correspondence from diverse poses to a canonical space, where the pose- and shape-disentangled NeRFs are optimized. To enhance the geometry and texture cues in rarely-observed areas for interacting hands, we establish a connection between the interacting hands by proposing the adaptive hand-sharing technique for cross-hand augmentation. Meanwhile, we further leverage the hand poses to generate fine-grained density priors, serving as valuable guidance for occlusion-aware geometry learning. Furthermore, a neural feature distillation method and a neural refiner are proposed to facilitate color optimization and further polish the renderings. With the collaboration of all the modules and strategies, our HandNeRF++ significantly advances the capabilities of NeRF-based 3D reconstruction in the context of interacting hands. Extensive experiments are conducted to validate the merits of the proposed frameworks. We report a series of state-of-the-art results both qualitatively and quantitatively.

Abstract:
Temporal action detection (TAD) is a vital challenge in computer vision and the Internet of Things, aiming to detect and identify actions within temporal sequences. While TAD has primarily been associated with video data, its applications can also be extended to sensor data, opening up opportunities for various real-world applications. However, applying existing TAD models to sensory signals presents distinct challenges such as varying sampling rates, intricate pattern structures, and subtle, noise-prone patterns. In response to these challenges, we propose a Sensory Temporal Action Detection (STADe) model. STADe leverages Fourier kernels and adaptive frequency filtering to adaptively capture the nuanced interplay of temporal and frequency features underlying complex patterns. Moreover, STADe embraces adaptability by employing deep fusion at varying resolutions and scales, making it versatile enough to accommodate diverse data characteristics, such as the wide spectrum of sampling rates and action durations encountered in sensory signals. Unlike conventional models with unidirectional category-to-proposal dependencies, STADe adopts a cross-cascade predictor to introduce bidirectional and temporal dependencies within categories. To extensively evaluate STADe and promote future research in sensory TAD, we establish three diverse datasets using various sensors, featuring diverse sensor types, action categories, and sampling rates. Experiments across one public and our three new datasets demonstrate STADe’s superior performance over state-of-the-art TAD models in sensory TAD tasks.

Abstract:
Interactive 3D segmentation in radiance fields is crucial for advanced 3D scene understanding and manipulation. However, existing methods often struggle to achieve both volumetric completeness and segmentation accuracy, primarily because they fail to consider the critical links between 2D prompt-based segmentations across multiple views. Motivated by this gap, we introduce Gaussian Prompter, a novel approach specifically designed for 3D Gaussian Splatting. The core idea behind Gaussian Prompter is to seamlessly integrate a Gaussian-centric segmentation paradigm by effectively linking various 2D prompts from multi-view segmentations to ensure consistent 3D segmentation. To realize this, we employ two tailored approaches: GaussBlend and PinPrompt. GaussBlend aggregates multi-view 2D segmentation masks into a cohesive 3D segmentation, ensuring both accuracy and completeness. PinPrompt leverages high-confidence prompts from adjacent views to enhance segmentation precision further. Additionally, to address the lack of complex datasets in 3D segmentation, we introduce the SegMip-360 dataset, which includes over 350 precisely annotated masks across seven scenes. Extensive experiments demonstrate that the Gaussian Prompter significantly outperforms state-of-the-art methods in both segmentation accuracy and completeness. Our code and video demonstrations can be found at our repository and project page.

Abstract:
Guided image super-resolution (GISR) aims to reconstruct a high-resolution (HR) target image from its low-resolution (LR) counterpart with the guidance of an HR image from another modality. Existing learning-based methods typically employ symmetric two-stream networks to extract features from both the guidance and target images, and then fuse these features at either an early or late stage through manually designed modules to facilitate joint inference. Despite significant performance, these methods still face several issues: i) the symmetric architectures treat images from different modalities equally, which may overlook the inherent differences between them; ii) lower-level features contain detailed information while higher-level features capture semantic structures, yet determining which layers should be fused and which fusion operations should be selected remains unresolved; iii) most methods achieve performance gains at the cost of increased computational complexity, so balancing the trade-off between computational complexity and model performance remains a critical issue. To address these issues, we propose a Dual-level Cross-modality Neural Architecture Search (DCNAS) framework to automatically design efficient GISR models. Specifically, we propose a dual-level search space that enables the NAS algorithm to identify effective architectures and optimal fusion strategies. Moreover, we propose a supernet training strategy that employs a performance predictor, trained with a pairwise ranking loss, to guide the supernet training process. To the best of our knowledge, this is the first attempt to introduce the NAS algorithm into GISR tasks. Extensive experiments demonstrate that the discovered model family, DCNAS-Tiny and DCNAS, achieves significant improvements on several GISR tasks, including guided depth map super-resolution, guided saliency map super-resolution, guided thermal image super-resolution, and pan-sharpening. Furthermore, we analyze the architectures searched by our method and provide some new insights for future research.

Abstract:
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential, particularly in zero-shot classification across diverse distribution shifts. Building on existing evaluations of overall classification robustness, this work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives. First, we investigate their robustness to variations in specific visual factors. Second, we assess two critical safety objectives—confidence uncertainty and out-of-distribution detection—beyond mere classification accuracy. Third, we evaluate the finesse with which CLIP models bridge the image and text modalities. Fourth, we extend our examination to 3D awareness in CLIP models, moving beyond traditional 2D image understanding. Finally, we explore the interaction between vision and language encoders within modern large multimodal models (LMMs) that utilize CLIP as the visual backbone, focusing on how this interaction impacts classification robustness. In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Our study uncovers several previously unknown insights into CLIP. For instance, the architecture of the visual encoder in CLIP plays a significant role in their robustness against 3D corruption. CLIP models tend to exhibit a bias towards shape when making predictions. Moreover, this bias tends to diminish after fine-tuning on ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision encoder, could exhibit benefits in classification performance for challenging categories over CLIP alone. Our findings are poised to offer valuable guidance for enhancing the robustness and reliability of CLIP models.

Abstract:
This paper introduces a one-stage deep uncalibrated photometric stereo (UPS) network, namely Fourier Uncalibrated Photometric Stereo Network (FUPS-Net), for non-Lambertian objects under unknown light directions. It departs from traditional two-stage methods that first explicitly learn lighting information and then estimate surface normals. Two-stage methods were deployed because the interplay of lighting with shading cues presents challenges for directly estimating surface normals without explicit lighting information. However, these two-stage networks are disjointed and separately trained so that the error in explicit light calibration will propagate to the second stage and cannot be eliminated. In contrast, the proposed FUPS-Net utilizes an embedded Fourier transform network to implicitly learn lighting features by decomposing inputs, rather than employing a disjointed light estimation network. Our approach is motivated by observations in the Fourier domain of photometric stereo images: lighting information is mainly encoded in amplitudes, while geometry information is mainly associated with phases. Leveraging this property, our method “decomposes” geometry and lighting in the Fourier domain as guidance, via the proposed Fourier Embedding Extraction (FEE) block and Fourier Embedding Aggregation (FEA) block, which generate lighting and geometry features for the FUPS-Net to implicitly resolve the geometry-lighting ambiguity. Furthermore, we propose a Frequency-Spatial Weighted (FSW) block that assigns weights to combine features extracted from the frequency domain and those from the spatial domain for enhancing surface reconstructions. FUPS-Net overcomes the limitations of two-stage UPS methods, offering better training stability, a concise end-to-end structure, and avoiding accumulated errors in disjointed networks. Experimental results on synthetic and real datasets demonstrate the superior performance of our approach and its simpler training setup, potentially paving the way for a new strategy in deep learning-based UPS methods.
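
The amplitude/phase observation above can be illustrated with a small, self-contained sketch. It is not the FEE or FEA block: it applies a plain discrete Fourier transform to one hypothetical row of pixel intensities and uses a uniform intensity scaling as a crude stand-in for a lighting change. All names (`dft`, `amplitude_phase`, `recombine`) and values are assumptions made for illustration.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def amplitude_phase(signal):
    """Split the spectrum into per-frequency amplitudes and phases."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Inverse DFT from amplitude and phase, returning the real part."""
    n = len(amplitude)
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [
        (sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n).real
        for k in range(n)
    ]


row = [0.2, 0.9, 0.4, 0.7, 0.1, 0.5, 0.8, 0.3]  # hypothetical pixel intensities
amp, pha = amplitude_phase(row)
bright_amp, bright_pha = amplitude_phase([2.0 * v for v in row])  # uniformly brighter
restored = recombine(amp, pha)
```

Doubling the intensities doubles every amplitude and leaves every phase unchanged, and recombining the original amplitude and phase recovers the row. Real lighting changes are far less uniform, so this shows only why the decomposition is a plausible guide, not that it separates lighting from geometry exactly.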

Abstract:
Although deep learning-based methods have shown great success in spatiotemporal predictive learning, the frameworks of those models are mainly designed by intuition. How to make spatiotemporal forecasting with theoretical guarantees is still a challenging issue. In this work, we tackle this problem by applying domain knowledge from the dynamical system to the framework design of deep learning models. An observer theory-guided deep learning architecture, called Spatiotemporal Observer, is designed for predictive learning of high dimensional data. The characteristics of the proposed framework are twofold: first, it provides the generalization error bound and convergence guarantee for spatiotemporal prediction; second, dynamical regularization is introduced to enable the model to learn system dynamics better during training. Further experimental results demonstrate that this framework could effectively model the spatiotemporal dynamics and make accurate predictions in both one-step-ahead and multi-step-ahead forecasting scenarios.

Abstract:
Reconstructing and editing 3D objects and scenes both play crucial roles in computer graphics and computer vision. Neural radiance fields (NeRFs) can achieve realistic reconstruction and editing results but suffer from inefficiency in rendering. Gaussian splatting significantly accelerates rendering by rasterizing Gaussian ellipsoids. However, Gaussian splatting utilizes a single Spherical Harmonic (SH) function to model both texture and lighting, limiting independent editing capabilities of these components. Recently, attempts have been made to decouple texture and lighting with the Gaussian splatting representation, but they may fail to produce plausible geometry and decomposition results on reflective scenes. Additionally, the forward shading technique they employ introduces noticeable blending artifacts during relighting, as the geometry attributes of Gaussians are optimized under the original illumination and may not be suitable for novel lighting conditions. To address these issues, we introduce DeferredGS, a method for decoupling and relighting the Gaussian splatting representation using deferred shading. To achieve successful decoupling, we model the illumination with a learnable environment map and define additional attributes such as texture parameters and normal direction on Gaussians, where the normal is distilled from a jointly trained signed distance function. More importantly, we apply deferred shading, resulting in more realistic relighting effects compared to previous methods. Both qualitative and quantitative experiments demonstrate the superior performance of DeferredGS in novel view synthesis and relighting tasks.

Abstract:
While convolution and self-attention mechanisms have dominated architectural design in deep learning, this survey examines a fundamental yet understudied primitive: the Hadamard product. Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive. We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations. The Hadamard product’s ability to model nonlinear interactions with linear computational complexity makes it particularly valuable for resource-constrained deployments and edge computing scenarios. We demonstrate its natural applicability in multimodal fusion tasks, such as visual question answering, and its effectiveness in representation masking for applications including image inpainting and pruning. This systematic review not only consolidates existing knowledge about the Hadamard product’s role in deep learning architectures but also establishes a foundation for future architectural innovations. Our analysis reveals the Hadamard product as a versatile primitive that offers compelling trade-offs between computational efficiency and representational power, positioning it as a crucial component in the deep learning toolkit.
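
As a minimal illustration of two of the uses named above, the sketch below applies the Hadamard (element-wise) product to representation masking and to a simple multiplicative fusion of two embeddings. The helper `hadamard` and all vectors are hypothetical and not drawn from the survey. The cost is one multiplication per element, which is the linear complexity the abstract refers to.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


# Representation masking: a binary mask zeroes out selected features.
features = [0.5, -1.2, 3.0, 0.8]
mask = [1.0, 0.0, 1.0, 0.0]
masked = hadamard(features, mask)

# Multiplicative fusion: each output entry is a second-order interaction
# between the matching entries of two (hypothetical) modality embeddings.
image_emb = [1.0, 2.0, 3.0]
text_emb = [0.5, -1.0, 2.0]
fused = hadamard(image_emb, text_emb)
```

Unlike concatenation or addition, the fused vector depends on products of the two inputs, which is one simple way a model can capture interactions between modalities.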

Abstract:
In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.

Abstract:
Vehicle-to-everything-aided autonomous driving (V2X-AD) has a huge potential to provide a safer driving solution. Despite extensive research in transportation and communication to support V2X-AD, the actual utilization of these infrastructures and communication resources in enhancing driving performance remains largely unexplored. This highlights the necessity of collaborative autonomous driving; that is, a machine learning approach that optimizes the information sharing strategy to improve the driving performance of each vehicle. This effort necessitates two key foundations: a platform capable of generating data to facilitate the training and testing of V2X-AD, and a comprehensive system that integrates full driving-related functionalities with mechanisms for information sharing. From the platform perspective, we present V2Xverse, a comprehensive simulation platform for collaborative autonomous driving. This platform provides a complete pipeline for collaborative driving: a multi-agent driving dataset generation scheme, a codebase for deploying full-stack collaborative driving systems, and closed-loop driving performance evaluation with scenario customization. From the system perspective, we introduce CoDriving, a novel end-to-end collaborative driving system that properly integrates V2X communication over the entire autonomous pipeline, promoting driving with shared perceptual information. The core idea is a novel driving-oriented communication strategy, that is, selectively complementing the driving-critical regions in single-view using sparse yet informative perceptual cues. Leveraging this strategy, CoDriving improves driving performance while optimizing communication efficiency. We build comprehensive benchmarks with V2Xverse, analyzing both modular performance and closed-loop driving performance. Experimental results show that CoDriving: i) significantly improves the driving score by 62.49% and drastically reduces the pedestrian collision rate by 53.50% compared to the SOTA end-to-end driving method, and ii) sustains its driving performance advantage under dynamically constrained communication conditions.

Abstract:
Radiographic images are similar to each other, making it challenging for diagnostic captioning to narrate fine-grained visual differences of clinical importance. In this paper, we propose a self-boosting framework integrating two novel strategies to learn tightly correlated image and text features for diagnostic captioning. The first strategy explicitly aligns image and text features by training an auxiliary task of image-text matching (ITM) jointly with the main task of report generation (RG) as two branches of a network model. The ITM branch explicitly learns image-text alignment and provides highly correlated visual and textual features for the RG branch to generate high-quality reports. The high-quality reports generated by the RG branch, in turn, are utilized as additional harder negative samples to push the ITM branch to evolve towards better image-text alignment. These two branches help improve each other progressively, so that the whole model is self-boosted without requiring external resources. The second strategy aligns image-sample space and report-sample space to achieve consistent image and text feature embeddings. To achieve this, the sample graph of the embedded ground-truth reports is built and used as the target to train the sample graph of the embedded images, so that the fine discrepancies in the ground-truth reports can be captured by the learned visual feature embeddings. Our proposed framework demonstrates its superiority on two medical report generation benchmarks, including the largest dataset, MIMIC-CXR.

Abstract:
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.

Abstract:
Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence. In reinforcement learning, rationally reusing the policies acquired from other tasks or human experts is critical for tackling problems that are difficult to learn from scratch. In this work, we present a framework called Selective Myopic bEhavior Control (SMEC), which results from the insight that the short-term behaviors of prior policies are sharable across tasks. By evaluating the behaviors of prior policies via a hybrid value function architecture, SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions. Empirical results on a collection of manipulation and locomotion tasks demonstrate that SMEC outperforms existing methods, and validate the ability of SMEC to leverage related prior policies.

Abstract:
As one of the most critical components in modern LP solvers, presolve in linear programming (LP) employs a rich set of presolvers to remove different types of redundancy in input problems by equivalent transformations. We found from extensive experiments that the presolve routine, that is, the method determining (P1) which presolvers to select, (P2) in what order to execute them, and (P3) when to stop, significantly impacts the efficiency of solving LPs. However, designing high-quality presolve routines is highly challenging due to the enormous search space, and further optimizing the routines on different tasks for high performance demands extensive domain knowledge and manual tuning. To tackle this problem, we propose the first learning-based framework, reinforcement learning for presolve (RL4Presolve), to learn high-quality presolve routines. An appealing feature is that we employ a novel adaptive action sequence that learns complex routines efficiently by generating combinations of presolvers automatically at each step. Extensive experiments demonstrate that RL4Presolve achieves significant improvement (up to roughly 90%) in the efficiency of solving LPs. Furthermore, we extract routines from learned policies for simple and efficient deployment, without GPU resources, in Huawei's supply chain, where extensive manual tuning for each separate task was required previously due to the high economic value.

Abstract:
The deployment of pre-trained models (PTMs) has greatly advanced the field of continual learning (CL), enabling positive knowledge transfer and resilience to catastrophic forgetting. To sustain these advantages for sequentially arriving tasks, a promising direction involves keeping the pre-trained backbone frozen while employing parameter-efficient tuning (PET) techniques to instruct representation learning. Despite the popularity of Prompt-based PET for CL, its empirical design often leads to sub-optimal performance in our evaluation of different PTMs and target tasks. To this end, we propose a unified framework for CL with PTMs and PET that provides both theoretical and empirical advancements. We first perform an in-depth theoretical analysis of the CL objective in a pre-training context, decomposing it into hierarchical components namely within-task prediction, task-identity inference and task-adaptive prediction. We then present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimizes the decomposed objective through incorporating task-specific and task-shared knowledge via mainstream PET techniques along with efficient recovery of pre-trained representations. Leveraging this framework, we delve into the distinct impacts of implementation strategy, PET technique and PET architecture, as well as adaptive knowledge accumulation amidst pronounced distribution changes. Finally, across various CL scenarios, our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines.

Abstract:
As commonly used implicit geometry representations, the signed distance function (SDF) is limited to modeling watertight shapes, while the unsigned distance function (UDF) is capable of representing various surfaces. However, its inherent theoretical shortcoming, i.e., the non-differentiability at the zero-level set, would result in sub-optimal reconstruction quality. In this paper, we propose the scaled-squared distance function (S2DF), a novel implicit surface representation for modeling arbitrary surface types. S2DF does not distinguish between inside and outside regions while effectively addressing the non-differentiability issue of UDF at the zero-level set. We demonstrate that S2DF satisfies a second-order partial differential equation of Monge-Ampere-type, allowing us to develop a learning pipeline that leverages a novel Monge-Ampere regularization to directly learn S2DF from raw unoriented point clouds without supervision from ground-truth S2DF values. Extensive experiments across multiple datasets show that our method significantly outperforms state-of-the-art supervised approaches that require ground-truth surface information as supervision for training.

Abstract:
This paper proposes a Retinex-driven reinforced diffusion model for low-light image enhancement, termed Diff-Retinex++, to address various degradations caused by low light. Our main approach integrates the diffusion model with Retinex-driven restoration to achieve physically-inspired generative enhancement, making it a pioneering effort. Specifically, Diff-Retinex++ consists of two-stage view modules: the Denoising Diffusion Model (DDM) and the Retinex-Driven Mixture of Experts Model (RMoE). First, DDM treats low-light image enhancement as an image generation task, benefiting from the powerful generation ability of the diffusion model to handle the enhancement. Second, we build Retinex theory into a plug-and-play supervision attention module. It leverages the latent features in the backbone and knowledge distillation to learn Retinex rules, and further regulates these latent features through the attention mechanism. In this way, it couples the relationship between Retinex decomposition and image enhancement in a new view, achieving dual improvement. In addition, the Low-Light Mixture of Experts preserves the vividness of the diffusion model and the fidelity of the Retinex-driven restoration to the greatest extent. Ultimately, the iteration of DDM and RMoE achieves the goal of a Retinex-driven reinforced diffusion model. Extensive experiments conducted on real-world low-light datasets qualitatively and quantitatively demonstrate the effectiveness, superiority, and generalization of the proposed method.

Abstract:
Anchor-based multi-view clustering has garnered much attention for its effectiveness in handling massive datasets. However, current methods either fail to consider intra-view similarity or require $\mathcal{O}(N^3)$ complexity to explore intra-view similarity, making efficient large-scale multi-view clustering difficult. This paper introduces a novel tensor low-frequency component (TLFC) operator, which achieves smooth representation among samples. Furthermore, this TLFC operator, which explores intra-view similarity, incorporates tensor nuclear norm (TNN) operator and consensus regularization that explore inter-view correlations, resulting in the development of tensor low-rank and low-frequency for scalable multi-view clustering (TLRLF4MVC). Iteratively, as intra-view sample similarity and complementary information across views achieve balance, the learned embedding features are mapped into a smooth and compact subspace, ultimately leading to outstanding clustering performance. Extensive experiments on six large-scale multi-view datasets demonstrate that TLRLF4MVC not only significantly outperforms state-of-the-art methods in terms of clustering accuracy but also achieves remarkable computational efficiency, particularly when handling massive data.

Abstract:
Recent Mixup-based data augmentation methods have integrated saliency information for richer supervisory signals. However, they often face significant computational burdens, require additional modules, or are constrained by specific architectures. To overcome these limitations, we present GuidedMixup, a model-agnostic, saliency-aware mixup strategy. Unlike previous methods that struggle with random pairings of discordant source and target images, we focus on matching harmonious pairs among mini-batch images and develop an efficient algorithm to identify image pairs with minimal conflict in salient regions. Thanks to these effective pairs, GuidedMixup employs simplified but fine-grained mask generation and adjusts the pixel-wise mixing ratio based solely on the relative saliency strength of paired images, avoiding complex optimization. Additionally, we introduce GuidedMixup++, which incorporates an optimal location search for efficiently relocating target images. GuidedMixup++ resizes target images and calculates minimal conflict for each pair candidate by considering all possible positions of target images, which is remarkably efficient powered by convolution operations. This information is then used to select pairs for mixing. Experimental results demonstrate that the proposed methods surpass other saliency-based techniques in terms of efficiency, generalization performance, and robustness against corrupted or reduced datasets, as well as in downstream tasks like object detection and instance segmentation.
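
The pixel-wise mixing ratio based on relative saliency strength can be illustrated with a minimal sketch. The formula below (each pixel of the first image is weighted by its share of the total saliency at that location) and the names `saliency_mix` and `eps` are assumptions made for illustration; the abstract does not give the exact rule, and the pairing algorithm is not shown.

```python
def saliency_mix(img_a, img_b, sal_a, sal_b, eps=1e-8):
    """Mix two equally sized grayscale images pixel by pixel.

    All arguments are 2D lists of floats. The weight of img_a at a pixel is
    its share of the total saliency there (an assumed, illustrative rule).
    Returns the mixed image and the per-pixel weight of img_a.
    """
    mixed, ratio = [], []
    for row_a, row_b, srow_a, srow_b in zip(img_a, img_b, sal_a, sal_b):
        mixed_row, ratio_row = [], []
        for a, b, sa, sb in zip(row_a, row_b, srow_a, srow_b):
            # eps keeps the ratio defined (0.5) where both saliencies are zero.
            w = (sa + eps) / (sa + sb + 2 * eps)
            ratio_row.append(w)
            mixed_row.append(w * a + (1.0 - w) * b)
        mixed.append(mixed_row)
        ratio.append(ratio_row)
    return mixed, ratio


def label_weight(ratio):
    """Average weight of img_a, usable as a soft-label coefficient (assumption)."""
    values = [w for row in ratio for w in row]
    return sum(values) / len(values)
```

Because the ratio is computed directly from the two saliency maps, no per-pair optimization is needed in this sketch, which matches the abstract's emphasis on avoiding complex optimization.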

Abstract:
Missing node attributes pose a common problem in real-world graphs, impacting the performance of graph neural networks’ representation learning. Existing GNNs often struggle to effectively leverage incomplete attribute information, as they are not specifically designed for graphs with missing attributes. To address this issue, we propose a novel node representation learning framework called Wasserstein Graph Neural Network (WGNN). Our approach aims to maximize the utility of limited observed attribute information and account for uncertainty caused by missing values. We achieve this by representing nodes as low-dimensional distributions obtained through attribute matrix decomposition. Additionally, we enhance representation expressiveness by introducing a unique message-passing schema that aggregates distributional information from neighboring nodes in the Wasserstein space. We evaluate the performance of WGNN in node classification tasks using both synthetic and real-world datasets under two missing-attribute scenarios. Moreover, we demonstrate the applicability of WGNN in recovering missing values and tackling matrix completion problems, specifically in graphs involving users and items. Experimental results on both tasks convincingly demonstrate the superiority of our proposed method.

Abstract:
Symmetry is a widespread phenomenon in nature. Recognizing symmetry can minimize redundancy to improve computing efficiency. In this paper, we take permutation-related combinatorial optimization problems as a starting point and explore the symmetric structure of their solution spaces through group theory. From a new perspective of group action, we discover that the meaningful symmetric feature within the solution space is subject to two conditions regarding the form of the objective function and the number of objects to be permuted. To exploit the symmetric features, we design a half-solution-space search strategy for various search operators, which are commonly used for permutation-related combinatorial optimization problems. The half-solution-space search strategy can make these operators explore more promising regions without additional computational effort. When the condition on the number of objects for symmetry is unsatisfied, we propose two dimension mapping approaches to construct the symmetric feature, making the half-solution-space search strategy applicable. We evaluate the proposed strategy on 68 popular benchmark instances from three problem classes: the single row facility layout problem (SRFLP), the traveling salesman problem (TSP), and the multi-objective traveling salesman problem (MOTSP). Experimental results show that algorithms embedded with the half-solution-space search strategy can achieve a more competitive performance than those not exploiting the symmetric features.
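
To make the idea of searching only half of a symmetric solution space concrete, here is a small sketch using one familiar symmetry: in a TSP with symmetric distances, a tour and its reversal have the same length. This choice of symmetry, the toy distance matrix, and the helper names are assumptions for illustration; the abstract does not specify the group action or the conditions the paper derives.

```python
from itertools import permutations


def tour_length(perm, dist):
    """Length of the closed tour that visits cities in the order given by perm."""
    n = len(perm)
    return sum(dist[perm[i]][perm[(i + 1) % n]] for i in range(n))


def canonical(perm):
    """Pick one representative from the pair {perm, reversed perm}."""
    forward = tuple(perm)
    backward = tuple(reversed(perm))
    return min(forward, backward)


# Toy symmetric distance matrix for four cities (hypothetical data).
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

all_perms = list(permutations(range(4)))
half = {canonical(p) for p in all_perms}

# Exhaustive search over the half space finds the same optimum as the full space.
best_full = min(tour_length(p, dist) for p in all_perms)
best_half = min(tour_length(p, dist) for p in half)
```

In this toy case the canonical set contains 12 of the 24 permutations, so the exhaustive search evaluates half as many candidates without losing the optimum.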

Abstract:
Stereo matching is a core component in many computer vision and robotics systems. Despite significant advances over the last decade, handling matching ambiguities in ill-posed regions and large disparities remains an open challenge. In this paper, we propose a new deep network architecture, called IGEV++, for stereo matching. The proposed IGEV++ constructs Multi-range Geometry Encoding Volumes (MGEV), which encode coarse-grained geometry information for ill-posed regions and large disparities, while preserving fine-grained geometry information for details and small disparities. To construct MGEV, we introduce an adaptive patch matching module that efficiently and effectively computes matching costs for large disparity ranges and/or ill-posed regions. We further propose a selective geometry feature fusion module to adaptively fuse multi-range and multi-granularity geometry features in MGEV. Then, we input the fused geometry features into ConvGRUs to iteratively update the disparity map. MGEV allows efficient handling of large disparities and ill-posed regions, such as occlusions and textureless regions, and enjoys rapid convergence during iterations. Our IGEV++ achieves the best performance on the Scene Flow test set across all disparity ranges, up to 768px. Our IGEV++ also achieves state-of-the-art accuracy on the Middlebury, ETH3D, KITTI 2012, and KITTI 2015 benchmarks. Specifically, IGEV++ achieves a 3.23% 2-pixel outlier rate (Bad 2.0) on the large disparity benchmark, Middlebury, representing error reductions of 31.9% and 54.8% compared to RAFT-Stereo and GMStereo, respectively. We also present a real-time version of IGEV++ that achieves the best performance among all published real-time methods on the KITTI benchmarks.
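
For readers unfamiliar with the reported metric, the sketch below computes a 2-pixel outlier rate (Bad 2.0) and a relative error reduction. It assumes flat lists of disparities and marks invalid ground-truth pixels with `None`; that masking convention and the function names are illustrative choices, not taken from any benchmark toolkit.

```python
def bad_pixel_rate(pred, gt, threshold=2.0):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold.

    pred and gt are flat sequences of disparities; gt entries equal to None
    are treated as invalid and skipped (an assumed convention).
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    outliers = sum(1 for e in errors if e > threshold)
    return 100.0 * outliers / len(errors)


def relative_reduction(baseline_error, new_error):
    """Error reduction of new_error relative to baseline_error, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For example, an error rate that drops from 4.0% to 3.0% corresponds to a 25% relative reduction, which is the sense in which the abstract's reduction figures are read here.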

Abstract:
Large foundational models, through upstream pre-training and downstream fine-tuning, have achieved immense success in the broad AI community due to improved model performance and significant reductions in repetitive engineering. By contrast, the transferable one-for-all models in the recommender system field, referred to as TransRec, have made limited progress. The development of TransRec has encountered multiple challenges, among which the lack of large-scale, high-quality transfer learning recommendation dataset and benchmark suites is one of the biggest obstacles. To this end, we introduce NineRec, a TransRec dataset suite that comprises a large-scale source domain recommendation dataset and nine diverse target domain recommendation datasets. Each item in NineRec is accompanied by a descriptive text and a high-resolution cover image. Leveraging NineRec, we enable the implementation of TransRec models by learning from raw multimodal features instead of relying solely on pre-extracted off-the-shelf features. Finally, we present robust TransRec benchmark results with several classical network architectures, providing valuable insights into the field.

Abstract:
Multimodal image fusion involves tasks like pan-sharpening and depth super-resolution. Both tasks aim to generate high-resolution target images by fusing the complementary information from the texture-rich guidance and low-resolution target counterparts. Both inherently involve reconstructing high-frequency information. Despite their inherent frequency domain connection, most existing methods operate solely in the spatial domain and rarely explore solutions in the frequency domain. This study addresses this limitation by proposing solutions in both the spatial and frequency domains. We devise a Spatial-Frequency Information Integration Network, abbreviated as SFINet, for this purpose. The SFINet includes a core module tailored for image fusion. This module consists of three key components: a spatial-domain information branch, a frequency-domain information branch, and a dual-domain interaction. The spatial-domain information branch employs the spatial convolution-equipped invertible neural operators to integrate local information from different modalities in the spatial domain. Meanwhile, the frequency-domain information branch adopts a modality-aware deep Fourier transformation to capture the image-wide receptive field for exploring global contextual information. In addition, the dual-domain interaction facilitates information flow and the learning of complementary representations. We further present an improved version of SFINet, SFINet++, that enhances the representation of spatial information by replacing the basic convolution unit in the original spatial domain branch with the information-lossless invertible neural operator. We conduct extensive experiments to validate the effectiveness of the proposed networks and demonstrate their outstanding performance against state-of-the-art methods in two representative multimodal image fusion tasks: pan-sharpening and depth super-resolution.

Abstract:
Recently, text-guided scalable vector graphics (SVG) synthesis has shown great promise in domains like iconography and sketching. However, existing Text-to-SVG methods often face challenges in editability, visual quality, and diversity. To address these issues, we propose a novel framework for text-guided SVG synthesis that significantly enhances editability, quality, and diversity. To enhance the editability of output SVGs, we introduce a Hierarchical Image VEctorization (HIVE) framework that operates at the semantic object level and supervises the optimization of components within the vector object. This approach facilitates the decoupling of vector graphics into distinct objects and component levels. Our proposed HIVE algorithm, informed by image segmentation priors, not only ensures a more precise representation of vector graphics but also enables fine-grained editing capabilities within vector objects. To improve the diversity of output SVGs, we present a Vectorized Particle-based Score Distillation (VPSD) approach. VPSD addresses over-saturation issues in existing methods and enhances sample diversity. A pre-trained reward model is incorporated to re-weight vector particles, improving aesthetic appeal and enabling faster convergence. Additionally, we design a novel adaptive vector primitives control strategy, which allows for the dynamic adjustment of the number of primitives, thereby enhancing the presentation of graphic details. Extensive experiments validate the effectiveness of the proposed method, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. We also show that our new method supports up to six distinct vector styles, capable of generating high-quality vector assets suitable for stylized vector design and poster design.

Abstract:
Federated learning (FL) has emerged as a significant distributed machine learning paradigm. It allows the training of a global model through user collaboration without the necessity of sharing their original data. Traditional FL generally assumes that each client’s data remains fixed or static. However, in real-world scenarios, data typically arrives incrementally, leading to a dynamically expanding data domain. In this study, we examine catastrophic forgetting within Federated Incremental Learning (FIL) and focus on the training resources, where edge clients may not have sufficient storage to keep all data or computational budget to implement complex algorithms designed for the server-based environment. We propose a general and low-cost framework for FIL named Re-Fed+, which is designed to help clients cache important samples for replay. Specifically, when a new task arrives, each client initially caches selected previous samples based on their global and local significance. The client then trains the local model using both the cached samples and the new task samples. From a theoretical perspective, we analyze how effectively Re-Fed+ can identify significant samples for replay to alleviate the catastrophic forgetting issue. Empirically, we show that Re-Fed+ achieves competitive performance compared to state-of-the-art methods.

Abstract:
Recent years have seen a tremendous growth in both the capability and popularity of automatic machine analysis of media, especially images and video. As a result, a growing need for efficient compression methods optimised for machine vision, rather than human vision, has emerged. To meet this growing demand, significant developments have been made in image and video coding for machines. Unfortunately, while there is a substantial body of knowledge regarding rate-distortion theory for human vision, the same cannot be said of machine analysis. In this paper, we greatly extend the current rate-distortion theory for machines, providing insight into important design considerations of machine-vision codecs. We then utilise this newfound understanding to improve several methods for learned image coding for machines. Our proposed methods achieve state-of-the-art rate-distortion performance on several computer vision tasks – classification, instance and semantic segmentation, and object detection.

Abstract:
The divergence between labeled training data and unlabeled testing data is a significant challenge for recent deep learning models. Unsupervised domain adaptation (UDA) attempts to solve this problem. Recent works show that self-training is a powerful approach to UDA. However, existing methods have difficulty in balancing scalability and performance. In this paper, we propose a hard-aware instance adaptive self-training framework for UDA on the task of semantic segmentation. To effectively improve the quality and diversity of pseudo-labels, we develop a novel pseudo-label generation strategy with an instance adaptive selector. We further enrich the hard class pseudo-labels with inter-image information through a skillfully designed hard-aware pseudo-label augmentation. Besides, we propose the region-adaptive regularization to smooth the pseudo-label region and sharpen the non-pseudo-label region. For the non-pseudo-label region, a consistency constraint is also constructed to introduce stronger supervision signals during model optimization. Our method is so concise and efficient that it can be easily generalized to other UDA methods. Experiments on GTA5 → Cityscapes, SYNTHIA → Cityscapes, and Cityscapes → Oxford RobotCar demonstrate the superior performance of our approach compared with the state-of-the-art methods.

Abstract:
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning reveals the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Given that direct pose regression networks currently exhibit suboptimal performance, most methods still resort to traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy by first establishing 2D-3D or 3D-3D correspondences followed by applying the RANSAC-based PnP or Kabsch algorithms, and further employing ICP for refinement. Despite the performance enhancement, the integration of traditional techniques makes the networks time-consuming and not end-to-end trainable. Orthogonal to them, this paper introduces a fully learning-based object pose estimator. In this work, we first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) to learn the 6D pose from monocular images in an end-to-end manner. Afterwards, we introduce a geometry-guided pose refinement module, enhancing pose accuracy when extra depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline GDRNPP (GDRN Plus Plus) conquered the leaderboard of the BOP Challenge for two consecutive years, becoming the first to surpass all prior methods that relied on traditional techniques in both accuracy and speed.

Abstract:
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offline manner usually suffers from a substantial domain gap between the VLN task and the LLM training corpus. This paper proposes a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions, leading to a significant mitigation of the domain gap in a cost-effective manner. Specifically, at each timestep, the LLM is prompted to forecast the navigational chain-of-thought by: 1) acting as a world model to imagine the next observation according to the instruction, 2) selecting the candidate observation that best aligns with the imagination, and 3) determining the action based on the reasoning from the prior steps. In this way, the action prediction can be effectively simplified, benefiting from the disentangled reasoning. Through constructing formalized labels for training, the LLM can learn to generate desired and reasonable chain-of-thought outputs for improving the action decision. Experimental results across various training settings and popular VLN benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room (R4R)) show the significant superiority of NavCoT over the direct action prediction variants. Through simple parameter-efficient finetuning, our NavCoT outperforms a recent GPT4-based approach with ~7% relative improvement on the R2R dataset. We believe that NavCoT will help unlock more task-adaptive and scalable LLM-based embodied agents, which are helpful for developing real-world robotics applications.

Abstract:
Visible-thermal small object detection (RGBT SOD) is a significant yet challenging task with a wide range of applications, including video surveillance, traffic monitoring, search and rescue. However, existing studies mainly focus on either visible or thermal modality, while RGBT SOD is rarely explored. Although some RGBT datasets have been developed, the insufficient quantity, limited diversity, unitary application, misaligned images and large target size cannot provide an impartial benchmark to evaluate RGBT SOD algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant objects (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Notably, over 81% of objects are smaller than 16×16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT image fusion, object detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large objects. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset, extensive evaluations have been conducted with IoU and SAFit metrics, including 30 recent state-of-the-art algorithms that cover four different types (i.e., visible generic object detection, visible SOD, thermal SOD and RGBT object detection).

Abstract:
We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how multi-branch features of the basic block and convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can significantly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our work, we train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7, RTMDet, and YOLO-v8. Taking the XS version of YOLO-MS as an example, it can achieve an AP score of 42+% on MS COCO, which is about 2% higher than RTMDet with the same model size. Furthermore, our work can also serve as a plug-and-play module for other YOLO models. For example, our method significantly advances the AP_s, AP_l, and AP of YOLOv8-N from 18+%, 52+%, and 37+% to 20+%, 55+%, and 40+%, respectively, with even fewer parameters and MACs.

Abstract:
Adversarial training has been proposed and widely recognized as a very effective method to defend against adversarial noise. However, the label flipping pattern on different classes still needs deeper exploration to identify potential problems and assist in further enhancing robustness. In this work, we model the class-flipping distribution via statistical investigations and find this distribution reveals two shortcomings: a highly misleading category is present in the model's predictions for data in each class, and the trends in class flipping are significantly different across classes. Based on these observations, we propose a Class-Flipping-aware Adversarial Training (CFAT) method. On the one hand, we obtain the most misleading categories for the data in each class by counting the samples flipped to different wrong categories, and utilize them as the targets to construct corresponding targeted adversarial samples. On the other hand, we take the proportions of samples flipped to the most misleading category as factors to scale the perturbation budgets of adversarial training samples for the data of the corresponding classes. Experimental results on datasets with different numbers of classes validate the effectiveness of the proposed method.
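
The counting step described above can be sketched as follows. The code finds, for each class, the wrong category that its samples are most often flipped to, along with the proportion of that class's samples flipped there. The budget rule in `scaled_budgets` (`base_eps * (1 + ratio)`) is a hypothetical placeholder, since the abstract only says the proportions act as scaling factors; the function names and the choice of denominator (all samples of the class) are likewise assumptions.

```python
def most_misleading(labels, preds, num_classes):
    """Return, per class, the most frequent wrong prediction and its proportion.

    labels and preds are sequences of integer class ids. The proportion is
    taken over all samples of the class (an assumed normalization).
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    totals = [0] * num_classes
    for y, p in zip(labels, preds):
        totals[y] += 1
        if p != y:
            counts[y][p] += 1  # sample of class y flipped to wrong class p

    targets, ratios = [], []
    for c in range(num_classes):
        wrong = [k for k in range(num_classes) if k != c]
        target = max(wrong, key=lambda k: counts[c][k])  # ties: lowest id wins
        targets.append(target)
        ratios.append(counts[c][target] / totals[c] if totals[c] else 0.0)
    return targets, ratios


def scaled_budgets(ratios, base_eps=8 / 255):
    """Hypothetical rule: enlarge the budget for classes that flip more often."""
    return [base_eps * (1.0 + r) for r in ratios]
```

The returned `targets` could then serve as target labels for a targeted attack during training, which is the role the abstract assigns to the most misleading categories.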

Abstract:
Autonomous driving simulation systems play a crucial role in enhancing self-driving data and simulating complex and rare traffic scenarios, ensuring navigation safety. However, traditional simulation systems, which often rely heavily on manual modeling and 2D image editing, struggle to scale to extensive scenes and to generate realistic simulation data. In this study, we present S-NeRF++, an innovative autonomous driving simulation system based on neural reconstruction. Trained on widely-used self-driving datasets, such as nuScenes and Waymo, S-NeRF++ can generate a large number of realistic street scenes and foreground objects with high rendering quality while offering considerable flexibility in manipulation and simulation. Specifically, S-NeRF++ is an enhanced neural radiance field for synthesizing large-scale scenes and moving vehicles, with improved scene parameterization and camera pose learning. The system effectively utilizes noisy and sparse LiDAR data to refine training and address depth outliers, ensuring high-quality reconstruction and novel-view rendering. It also provides a diverse foreground asset bank by reconstructing and generating different foreground vehicles to support comprehensive scenario creation. Moreover, we have developed an advanced foreground-background fusion pipeline that skillfully integrates illumination and shadow effects, further enhancing the realism of our simulations. With the high-quality simulated data provided by our S-NeRF++, we found that perception methods enjoy performance boosts on several autonomous driving downstream tasks, further demonstrating our proposed simulator's effectiveness.

Abstract:
Multi-modal learning aims to enhance performance by unifying models from various modalities but often faces the “modality imbalance” problem in real data, leading to a bias towards dominant modalities and neglecting others, thereby limiting its overall effectiveness. To address this challenge, the core idea is to balance the optimization of each modality to achieve a joint optimum. Existing approaches often employ a modal-level control mechanism for adjusting the update of each modal parameter. However, such a global-wise updating mechanism ignores the different importance of each parameter. Inspired by subnetwork optimization, we explore a uniform sampling-based optimization strategy and find it more effective than global-wise updating. According to the findings, we further propose a novel importance sampling-based, element-wise joint optimization method, called Adaptively Mask Subnetworks Considering Modal Significance (AMSS). Specifically, we incorporate mutual information rates to determine the modal significance and employ non-uniform adaptive sampling to select foreground subnetworks from each modality for parameter updates, thereby rebalancing multi-modal learning. Additionally, we demonstrate the reliability of the AMSS strategy through convergence analysis. Building upon theoretical insights, we further enhance the multi-modal mask subnetwork strategy using unbiased estimation, referred to as AMSS+. Extensive experiments reveal the superiority of our approach over comparison methods.

Abstract:
We propose Pixel2Pixel, a novel zero-shot image denoising framework that leverages the non-local self-similarity of images to generate a large number of training samples using only the input noisy image. This framework employs a compact convolutional neural network architecture to achieve high-quality image denoising. Given a single observed noisy image, we first aim to obtain multiple images with different noise versions. We ensure that the content remains as consistent as possible with the true signal of the noisy image while keeping the noise independent. Specifically, we construct a pixel bank tensor, where each pixel consists of the most similar pixels from the non-local region of the noisy image. Then, multiple training samples, also known as pseudo instances, can be derived from the pixel bank by random pixel sampling. By harnessing pixel-wise random sampling, Pixel2Pixel generates a large number of training pseudo instances, thus avoiding reliance on specific training data. In addition, this non-local pixel selection and random sampling strategy also helps to break down the spatial correlation of real-world noise. Since the proposed method requires neither accurate priors on the noise distribution nor clean training images, it is suitable for a wide range of noise types and different noise levels, exhibiting strong generalization ability, especially in real noisy scenes. Extensive experiments across various noise types show that Pixel2Pixel outperforms existing methods.

Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China; Artificial Intelligence Lab., Recod.ai, Institute of Computing, University of Campinas, Campinas, Brazil; State Key Laboratory of Multimedia Information Processing and National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; IHPC and CFAR, Agency for Science, Technology and Research (A*STAR), Singapore; Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong

Abstract:
Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness.

Abstract:
We focus on a very challenging task: imaging dynamic scenes at nighttime. Conventional RGB cameras struggle with the trade-off between long exposure for low-light imaging and short exposure for capturing dynamic scenes. Event cameras react to dynamic changes, with their high temporal resolution (microsecond) and dynamic range (120 dB), and thus offer a promising alternative. However, existing methods are mostly based on simulated datasets due to the lack of paired event-clean image data for nighttime conditions, where the domain gap leads to performance limitations in real-world scenarios. Moreover, most existing event reconstruction methods are tailored for daytime data, overlooking issues unique to low-light events at night, such as strong noise, temporal trailing, and spatial non-uniformity, resulting in unsatisfactory reconstruction results. To address these challenges, we construct the first real paired low-light event dataset (RLED) through a co-axial imaging system, comprising 80,400 spatially and temporally aligned image GTs and low-light events, which provides a unified training and evaluation dataset for existing methods. We further conduct a comprehensive analysis of the causes and characteristics of strong noise, temporal trailing, and spatial non-uniformity in nighttime events, and propose a nighttime event reconstruction network (NER-Net+). It includes a learnable event timestamps calibration module (LETC) to correct the temporal trailing events and a non-stationary spatio-temporal information enhancement module (NSIE) to suppress sensor noise and spatial non-uniformity. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in visual quality and generalization on real-world nighttime datasets.

Abstract:
With only video-level event labels, this paper targets the task of weakly-supervised audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize events that belong to each modality. Despite the recent progress, most existing approaches either ignore the unsynchronized property of audio-visual tracks or discount the complementary modality for explicit enhancement. We argue that a modality should provide ample presence evidence for an event, while the complementary modality offers absence evidence as a reference. However, to learn reliable evidence, we face challenging uncertainties caused by weak supervision and the complicated audio-visual data itself. To this end, we propose to collect Probabilistic Presence-Absence Evidence (PPAE) in a unified framework. Specifically, by leveraging uni-modal and cross-modal representations, a probabilistic presence-absence evidence collector (PAEC) is designed. To learn the evidence in a reliable range, we propose a joint-modal mutual learning (JML) process, which calibrates the evidence of diverse audible, visible, and audi-visible events adaptively and dynamically. Extensive experiments show that our method surpasses state-of-the-art methods (e.g., absolute gains of 3.1% and 4.2% in terms of event-level audio and visual metrics on the LLP dataset).

Abstract:
As a cross-topic of multi-view learning and multi-label classification, multi-view multi-label classification has gradually gained traction in recent years. The application of multi-view contrastive learning has further facilitated this process; however, the existing multi-view contrastive learning methods crudely separate the so-called negative pair, which largely results in the separation of samples belonging to the same category or similar ones. Besides, plenty of multi-view multi-label learning methods ignore the possible absence of views and labels. To address these issues, in this paper, we propose an incomplete multi-view missing multi-label classification network named RANK. In this network, a label-driven multi-view contrastive learning strategy is proposed to leverage supervised information to preserve the intra-view structure and perform the cross-view consistency alignment. Furthermore, we break through the view-level weights inherent in existing methods and propose a quality-aware subnetwork to dynamically assign quality scores to each view of each sample. The label correlation information is fully utilized in the final multi-label cross-entropy classification loss, effectively improving the discriminative power. Last but not least, our model is not only able to handle complete multi-view multi-label data, but also works on datasets with missing instances and labels. Extensive experiments confirm that our RANK outperforms existing state-of-the-art methods.

Abstract:
Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that largely deviate from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased (underestimating actual values), while those for small-valued outcomes are positively biased (overestimating actual values). We refer to this linear central tendency warped bias as the “systematic bias of machine learning regression”. In this paper, we first demonstrate that this systematic prediction bias persists across various machine learning regression models, and then delve into its theoretical underpinnings. To address this issue, we propose a general constrained optimization approach designed to correct this bias and develop computationally efficient implementation algorithms. Simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning regression models, our method effectively addresses the longstanding issue of “systematic bias of machine learning regression” in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.

Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to scale large language or visual-language models efficiently, these efforts typically involve fewer experts and limited modalities. To address this, our work presents a pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) cross-modality alignment using various connectors with different cross-modality data, 2) training modality-specific experts with cross-modality instruction data to activate experts’ preferences, and 3) tuning the whole Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.

Abstract:
Monocular 3D human mesh estimation faces challenges due to depth ambiguity and the complexity of mapping images to complex parameter spaces. Recent methods propose to use 3D poses as a proxy representation, which often loses crucial body shape information, leading to mediocre performance. Conversely, advanced motion capture systems, though accurate, are impractical for markerless wild images. Addressing these limitations, we introduce an innovative intermediate representation, virtual markers, which are learned from large-scale mocap data and mimic the effects of physical markers. Building upon virtual markers, we propose VMarker, which detects virtual markers from wild images; the intact mesh with realistic shapes can then be obtained by simple interpolation from these markers. To address occlusions that obscure 3D virtual marker estimation, we further enhance our method with VMarker-Pro, a probabilistic framework that models the distribution of 3D virtual marker positions using diffusion models, enabling the generation of multiple plausible meshes aligned with images for robust 3D mesh estimation. Our approaches surpass existing methods on three benchmark datasets, particularly demonstrating significant improvements on the SURREAL dataset, which features diverse body shapes. Additionally, VMarker-Pro excels in accurately modeling data distributions, significantly enhancing performance in occluded scenarios.

Abstract:
While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model’s ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.

Abstract:
There are two main methods that can be used to accelerate MRI reconstruction: parallel imaging and compressed sensing. To further accelerate the sampling process, the combination of these two methods has been extensively studied in recent years. However, existing MRI reconstruction methods often overlook the exploration of high-frequency information of images, leading to sub-optimal recovery of fine details in the reconstructed results. To address this issue, we conduct an in-depth analysis of image gradients and propose a novel MRI reconstruction model based on Maximum a Posteriori (MAP) estimation. We first establish the Cumulative Deviation from Maximum Gradient magnitude (CDMG) prior for fully sampled MR images through theoretical analysis, then incorporate this explicit CDMG prior along with an implicit deep prior to form the prior probability term. This combination of priors strikes a balance between physically informed constraints and data-driven adaptability, aiding in the recovery of meaningful high-frequency information. Additionally, we introduce a multi-order gradient operator to enhance the observation model, thereby improving the accuracy of the likelihood term. Through MAP estimation, we develop a novel accelerated MRI reconstruction model, the optimization of which is achieved by unrolling it into a convolutional neural network structure, referred to as DDGU-Net. Extensive experimental results demonstrate the effectiveness of our approach in reconstructing high-quality MR images and achieving state-of-the-art (SOTA) results, particularly at higher acceleration factors.

Abstract:
Differential equations have demonstrated intrinsic connections to network structures, linking discrete network layers through continuous equations. Most existing approaches focus on the interaction between ordinary differential equations (ODEs) and feature transformations, primarily working on input signals. In this paper, we study the partial differential equation (PDE) model of neural networks, viewing the neural network as a functional operating on a base model provided by the last layer of the classifier. Inspired by scale-space theory, we theoretically prove that this mapping can be formulated by a convection-diffusion equation, under interpretable and intuitive assumptions from both neural network and PDE perspectives. This theoretically certified framework covers various existing network structures and training techniques, offering a mathematical foundation and new insights into neural networks. Moreover, based on the convection-diffusion equation model, we design a new network structure that incorporates a diffusion mechanism into the network architecture from a PDE perspective. Extensive experiments on benchmark datasets and real-world applications confirm the effectiveness of the proposed model.

Abstract:
Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.

Abstract:
The vulnerability of 3D point cloud analysis to unpredictable rotations poses an open yet challenging problem: orientation-aware 3D domain generalization. Cross-domain robustness and adaptability of 3D representations are crucial but not easily achieved through rotation augmentation. Motivated by the inherent advantages of intricate orientations in enhancing generalizability, we propose an innovative rotation-adaptive domain generalization framework for 3D point cloud analysis. Our approach aims to alleviate orientational shifts by leveraging intricate samples in an iterative learning process. Specifically, we identify the most challenging rotation for each point cloud and construct an intricate orientation set by optimizing intricate orientations. Subsequently, we employ an orientation-aware contrastive learning framework that incorporates an orientation consistency loss and a margin separation loss, enabling effective learning of categorically discriminative and generalizable features with rotation consistency. Extensive experiments and ablations conducted on 3D cross-domain benchmarks firmly establish the state-of-the-art performance of our proposed approach in the context of orientation-aware 3D domain generalization.

Abstract:
Cross-modal 3D shape retrieval is a crucial and widely applied task in the field of 3D vision. Its goal is to construct retrieval representations capable of measuring the similarity between instances of different 3D modalities. However, existing methods face challenges due to the performance bottlenecks of single-modal representation extractors and the modality gap across 3D modalities. To tackle these issues, we propose a Heterogeneous Dynamic Graph Representation (HDGR) network, which incorporates context-dependent dynamic relations within a heterogeneous framework. By capturing correlations among diverse 3D objects, HDGR overcomes the limitations of ambiguous representations obtained solely from instances. Within the context of varying mini-batches, dynamic graphs are constructed to capture proximal intra-modal relations, and dynamic bipartite graphs represent implicit cross-modal relations, effectively addressing the two challenges above. Subsequently, message passing and aggregation are performed using Dynamic Graph Convolution (DGConv) and Dynamic Bipartite Graph Convolution (DBConv), enhancing features through heterogeneous dynamic relation learning. Finally, intra-modal, cross-modal, and self-transformed features are redistributed and integrated into a heterogeneous dynamic representation for cross-modal 3D shape retrieval. HDGR establishes a stable, context-enhanced, structure-aware 3D shape representation by capturing heterogeneous inter-object relationships and adapting to varying contextual dynamics. Extensive experiments conducted on the ModelNet10, ModelNet40, and real-world ABO datasets demonstrate the state-of-the-art performance of HDGR in cross-modal and intra-modal retrieval tasks. Moreover, under the supervision of robust loss functions, HDGR achieves remarkable cross-modal retrieval against label noise on the 3D MNIST dataset. The comprehensive experimental results highlight the effectiveness and efficiency of HDGR on cross-modal 3D shape retrieval.

Abstract:
Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains. However, these advancements are not solely attributable to novel feature transformation designs; certain benefits also arise from advanced network-level and block-level architectures. This paper aims to identify the real gains of popular convolution and attention operators through a detailed study. We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach, known as the “spatial token mixer” (STM). To facilitate an impartial comparison, we introduce a unified architecture to neutralize the impact of divergent network-level and block-level designs. Subsequently, various STMs are integrated into this unified framework for comprehensive comparative analysis. Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs. Our detailed analysis also reveals various findings about different STMs, including effective receptive fields, invariance, and adversarial robustness tests.

Abstract:
Recommender systems have been widely employed on various online platforms to improve user experience. In these systems, recommendation models are often learned from the users’ historical behaviors that are automatically collected. Notably, recommender systems differ slightly from ordinary supervised learning tasks. In recommender systems, there is an exposure mechanism that decides which items could be presented to each specific user, which breaks the i.i.d. assumption of supervised learning and brings biases into the recommendation models. In this paper, we focus on unbiased ranking losses weighted by inverse propensity scores (IPS), which are widely used in recommendations with implicit feedback labels. More specifically, we first highlight the fact that there is a gap between theory and practice in IPS-weighted unbiased loss. The existing pairwise loss could be theoretically unbiased by adopting an IPS weighting scheme. Unfortunately, the propensity scores are hard to estimate due to the inaccessibility of each user-item pair's true exposure status. In practical scenarios, we can only approximate the propensity scores. In this way, the theoretically unbiased loss would still be practically biased. To solve this problem, we first construct a theoretical framework to obtain a generalization upper bound of the current theoretically unbiased loss. The bound illustrates that we can ensure the theoretically unbiased loss's generalization ability if we lower its implementation loss and practical bias at the same time. To that aim, we suggest treating the feedback label $Y_{ui}$ as a noisy proxy for the exposure result $O_{ui}$ for each user-item pair $(u, i)$. Here we assume the noise rate meets the condition that $\hat{P}(O_{ui}=1, Y_{ui}\ne O_{ui}) < 1/2$. According to our analysis, this is a mild assumption that can be satisfied by many real-world applications. Based on this, we could train an accurate propensity model directly by leveraging a noise-resistant loss function. Then we could construct a practically unbiased recommendation model weighted by precise propensity scores. Lastly, experimental findings on public datasets demonstrate our suggested method's effectiveness.
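
To make the role of propensity weighting concrete, the following is a simplified sketch of an IPS-weighted pairwise ranking loss. It is an illustration under stated assumptions, not the loss analyzed in the paper: the logistic pairwise form, the clipping threshold `clip`, and the function names are all choices made here. It shows only how an estimated propensity for the clicked item reweights each pair, which is why errors in that estimate carry over into the loss.

```python
import math


def _neg_log_sigmoid(d):
    """Numerically stable -log(sigmoid(d)) = log(1 + exp(-d))."""
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def ips_pairwise_loss(pairs, clip=0.1):
    """Average IPS-weighted pairwise logistic loss.

    pairs: iterable of (score_pos, score_neg, propensity_pos), where
    propensity_pos is the estimated exposure probability of the clicked item.
    Propensities are clipped from below (an assumed safeguard) to limit variance.
    """
    total, n = 0.0, 0
    for s_pos, s_neg, theta in pairs:
        theta = max(theta, clip)  # avoid dividing by a near-zero propensity
        total += _neg_log_sigmoid(s_pos - s_neg) / theta
        n += 1
    return total / n if n else 0.0


# A rarely exposed clicked item (low propensity) contributes a larger weight.
loss_rare = ips_pairwise_loss([(0.0, 0.0, 0.5)])
loss_common = ips_pairwise_loss([(0.0, 0.0, 1.0)])
```

With identical scores, the pair whose clicked item has propensity 0.5 contributes twice the loss of the pair with propensity 1.0.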

Abstract:
The performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) persists due to the lack of inductive bias, notably when training from scratch with limited datasets. This paper identifies two crucial shortcomings in ViTs: spatial relevance and diverse channel representation. Thus, ViTs struggle to grasp fine-grained spatial features and robust channel representation due to insufficient data. We propose the Dynamic Hybrid Vision Transformer (DHVT) to address these challenges. Regarding the spatial aspect, DHVT introduces convolution in the feature embedding phase and feature projection modules to enhance spatial relevance. Regarding the channel aspect, the dynamic aggregation mechanism and a groundbreaking design “head token” facilitate the recalibration and harmonization of disparate channel representations. Moreover, we investigate the choices of the network meta-structure and adopt the optimal multi-stage hybrid structure without the conventional class token. The methods are then modified with a novel dimensional variable residual connection mechanism to leverage the potential of the structure sufficiently. This updated variant, called DHVT2, offers a more computationally efficient solution for vision-related tasks. DHVT and DHVT2 achieve state-of-the-art image recognition results, effectively bridging the performance gap between CNNs and ViTs. The downstream experiments further demonstrate their strong generalization capacities.

Abstract:
Contrastive learning, a discriminative self-learning framework, is one of the most popular representation learning methods and has a wide range of application scenarios. Although related techniques have been continuously updated in recent years, designing and seeking positive pairs is still unavoidable. Because of this requirement for explicit positive pairs, the use of contrastive learning is restricted in dense, multi-modal, and other scenarios where positive pairs are difficult to obtain. To solve this problem, in this paper, we design an auto-pairing mechanism called Implicit Relation Circulation (IRC) for discriminative self-learning frameworks. Its core idea is to conduct a random walk among multiple feature groups that we want to contrast but that lack explicit matchups, which we call the complex task (Task C). By linking the head and tail of the random walk to form a circulation with a simple task (Task S) containing easily obtained pairs, we can apply cycle consistency as supervision guidance to gradually and automatically learn the wanted positive pairs among the feature groups along the random walk. We provide several applications of IRC: we can learn 1) effective dense image pixel relations and representation with only image-level pairs; 2) 3D temporal point-level multi-modal point cloud relations and representation; and 3) even image representation with the help of language without off-the-shelf vision-language pairs. As an easy-to-use plug-and-play mechanism, we evaluate its universality and robustness with multiple self-learning algorithms, tasks, and datasets, achieving stable and significant improvements. As an illustrative example, IRC improves the SOTA performance by about 3.0 mIoU on image semantic segmentation, 1.5 mIoU on 3D segmentation, 1.3 mAP on 3D detection, and an average of 1.2 top-1 accuracy on image classification with the help of the auto-learned positive pairs. Importantly, these improvements are achieved with little parameter and computation overhead. We hope IRC can provide the community with new insight into discriminative self-learning.

Abstract:
Recovering whole-body mesh by inferring the abstract pose and shape parameters from visual content can obtain 3D bodies with realistic structures. However, the inferring process is highly non-linear and suffers from image-mesh misalignment, resulting in inaccurate reconstruction. In contrast, 3D keypoint estimation methods utilize the volumetric representation to achieve pixel-level accuracy but may predict unrealistic body structures. To address these issues, this paper presents a novel hybrid inverse kinematics solution, HybrIK, that integrates the merits of 3D keypoint estimation and body mesh recovery in a unified framework. HybrIK directly transforms accurate 3D joints to body-part rotations via twist-and-swing decomposition. The swing rotations are analytically solved with 3D joints, while the twist rotations are derived from visual cues through neural networks. To capture comprehensive whole-body details, we further develop a holistic framework, HybrIK-X, which enhances HybrIK with articulated hands and an expressive face. HybrIK-X is fast and accurate by solving the whole-body pose with a one-stage model. Experiments demonstrate that HybrIK and HybrIK-X preserve both the accuracy of 3D joints and the realistic structure of the parametric human model, leading to pixel-aligned whole-body mesh recovery. The proposed method significantly surpasses the state-of-the-art methods on various benchmarks for body-only, hand-only, and whole-body scenarios.

Abstract:
Thanks to the recent achievements in task-driven image quality enhancement (IQE) models like ESTR (Liu et al. 2023), the image enhancement model and the visual recognition model can mutually enhance each other's quantitation while producing high-quality processed images that are perceivable by our human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact: different levels of vision tasks have varying and sometimes conflicting requirements of image features. To address this problem, this paper proposes a generalized gradient promotion (GradProm) training strategy for task-driven IQE of medical images. Specifically, we partition a task-driven IQE system into two sub-models, i.e., a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, GradProm updates only the parameters of the image enhancement model using the gradients of both the visual recognition model and the image enhancement model, but only when the gradients of these two sub-models are aligned in the same direction, as measured by their cosine similarity. When the gradients of these two sub-models are not in the same direction, GradProm uses only the gradient of the image enhancement model to update its parameters. Theoretically, we have proved that the optimization direction of the image enhancement model will not be biased by the auxiliary visual recognition model under the implementation of GradProm. Empirically, extensive experimental results on four public yet challenging medical image datasets demonstrate the superior performance of GradProm over existing state-of-the-art methods.
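
The update rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: gradients are flat lists of floats, "aligned" is taken to mean a strictly positive cosine similarity, the two gradients are combined by simple addition, and the step is plain gradient descent. The names `gradprom_step`, `grad_enh`, `grad_rec`, and `lr` are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flat gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero gradient as "not aligned"
    return dot / (norm_u * norm_v)


def gradprom_step(params, grad_enh, grad_rec, lr=0.1):
    """One update of the enhancement-model parameters.

    grad_enh: gradient of the enhancement loss w.r.t. params.
    grad_rec: gradient of the recognition loss w.r.t. the same params.
    """
    if cosine_similarity(grad_enh, grad_rec) > 0.0:
        # Aligned: let the auxiliary recognition gradient contribute.
        update = [ge + gr for ge, gr in zip(grad_enh, grad_rec)]
    else:
        # Conflicting: fall back to the enhancement gradient alone.
        update = list(grad_enh)
    return [p - lr * g for p, g in zip(params, update)]
```

In this sketch the recognition gradient can only add to an update that already points in a compatible direction, which is the intuition behind the claim that the auxiliary model does not bias the enhancement model.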

Abstract:
Camouflaged Object Detection (COD) poses a significant challenge in computer vision, playing a critical role in applications. Existing COD methods often struggle to predict nuanced boundaries accurately and with high confidence. In this work, we introduce CamoDiffusion, a new learning method that employs a conditional diffusion model to generate masks that progressively refine the boundaries of camouflaged objects. In particular, we first propose an adaptive transformer conditional network, designed for integration into a Denoising Network, which facilitates iterative refinement of the saliency masks. Second, based on the classical diffusion model training, we investigate a variance noise schedule and a structure corruption strategy, which aim to enhance the accuracy of our denoising model by effectively handling uncertain input. Third, we introduce a Consensus Time Ensemble technique, which integrates intermediate predictions using a sampling mechanism, thus reducing overconfidence and incorrect predictions. Finally, we conduct extensive experiments on three benchmark datasets, which show that 1) our method is effective and general across both camouflaged and salient object detection tasks; 2) CamoDiffusion outperforms existing state-of-the-art methods; and 3) CamoDiffusion offers flexible enhancements, such as an accelerated version based on the VQ-VAE model and a skip approach.

Abstract:
In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is leveraging the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space. To bridge the gap between the disparate worlds of CLIP and StyleGAN, we introduce a new non-linear mapper, the CLIP2P mapper. Utilizing CLIP embeddings, this module is tailored to approximate the latent distribution in the StyleGAN's latent space, effectively acting as a connector between these two spaces. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations, even in visually challenging scenarios across different visual domains. Notably, UniTranslator generates high-quality translations that showcase domain relevance, diversity, and improved image quality. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.

Abstract:
The ambition of brain-inspired Spiking Neural Networks (SNNs) is to become a low-power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike-driven inference. This optimizes the spike firing pattern of spiking neurons, enhancing efficient training, reducing power consumption, improving performance, enabling easier scaling, and better utilizing neuromorphic chips. We also develop an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet-1k, we achieve state-of-the-art top-1 accuracy of 78.5%, 79.8%, 84.0%, and 86.2% with models containing 10 M, 19 M, 83 M, and 173 M parameters, respectively. For instance, the 10 M model outperforms the best existing SNN by 7.2% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5× and 3.9×, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low-power advantage, marking a significant step towards SNNs as a general visual backbone.

Abstract:
Semi-supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch (Yang et al. 2023) improves its precedents tremendously by amplifying the practice of weak-to-strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite the achieved progress, strangely, even in this flourishing era of numerous powerful vision models, almost all SSS works still stick to 1) using outdated ResNet encoders with small-scale ImageNet-1K pre-training, and 2) evaluating on the simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet-based encoders to more capable ViT-based encoders (e.g., DINOv2) that are pre-trained on massive data. A simple update of the encoder (even using 2× fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak-to-strong consistency from V1, but requiring less training cost and providing consistently better results. Additionally, witnessing the gradually saturated performance on Pascal and Cityscapes, we argue that the community should focus on more challenging benchmarks with complex taxonomy, such as the ADE20K and COCO datasets.

Abstract:
Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.

Abstract:
Event cameras are novel bio-inspired sensors, where individual pixels operate independently and asynchronously, generating intensity changes as events. Leveraging the microsecond resolution (no motion blur) and high dynamic range (compatible with extreme light conditions) of events, there is considerable promise in directly segmenting objects from sparse and asynchronous event streams in various applications. However, different from the rich cues in video object segmentation, it is challenging to segment complete objects from the sparse event stream. In this paper, we present the first framework for continuous-time object segmentation from event stream. Given the object mask at the initial time, our task aims to segment the complete object at any subsequent time in event streams. Specifically, our framework consists of a Recurrent Temporal Embedding Extraction (RTEE) module based on a novel ResLSTM, a Cross-time Spatiotemporal Feature Modeling (CSFM) module which is a transformer architecture with long-term and short-term matching modules, and a segmentation head. The historical events and masks (reference sets) are recurrently fed into our framework along with current-time events. The temporal embedding is updated as new events are input, enabling our framework to continuously process the event stream. To train and test our model, we construct both real-world and simulated event-based object segmentation datasets, each comprising event streams, APS images, and object annotations. Extensive experiments on our datasets demonstrate the effectiveness of the proposed recurrent architecture.

Abstract:
Neural-symbolic computing (NeSy), which pursues the integration of the symbolic and statistical paradigms of cognition, has been an active research area of Artificial Intelligence (AI) for many years. As NeSy shows promise of reconciling the advantages of reasoning and interpretability of symbolic representation and robust learning in neural networks, it may serve as a catalyst for the next generation of AI. In the present paper, we provide a systematic overview of the recent developments and important contributions of NeSy research. First, we introduce the history of this area, covering early work and foundations. We further discuss background concepts and identify key driving factors behind the development of NeSy. Afterward, we categorize recent landmark approaches along several main characteristics that underlie this research paradigm, including neural-symbolic integration, knowledge representation, knowledge embedding, and functionality. Next, we briefly discuss the successful application of modern NeSy approaches in several domains. Then, we benchmark several NeSy methods on three representative application tasks. Finally, we identify the open problems together with potential future research directions. This survey is expected to help new researchers enter this rapidly evolving field and accelerate the progress towards data- and knowledge-driven AI.

Abstract:
Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect – the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This work extends the VATr (Pippi et al. 2023) Styled-HTG approach by addressing the pre-processing and training issues that it faces, which are common to many HTG models. In particular, we propose generally applicable strategies for input preparation and training regularization that allow the model to achieve better performance and generalization capabilities. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research – the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.

Abstract:
Offset-based representation has emerged as a promising approach for modeling semantic relations between pixels and object motion, demonstrating efficacy across various computer vision tasks. In this paper, we introduce a novel one-stage multi-tasking network tailored to extend the offset-based approach to MOTS. Our proposed framework, named OffsetNet, is designed to concurrently address amodal bounding box detection, instance segmentation, and tracking. It achieves this by formulating these three tasks within a unified pixel-offset-based representation, thereby achieving excellent efficiency and encouraging mutual collaborations. OffsetNet achieves several remarkable properties: first, the encoder is empowered by a novel Memory Enhanced Linear Self-Attention (MELSA) block to efficiently aggregate spatial-temporal features; second, all tasks are decoupled fairly using three lightweight decoders that operate in a one-shot manner; third, a novel cross-frame offsets prediction module is proposed to enhance the robustness of tracking against occlusions. With these merits, OffsetNet achieves 76.83% HOTA on KITTI MOTS benchmark, which is the best result without relying on 3D detection. Furthermore, OffsetNet achieves 74.83% HOTA at 50 FPS on the KITTI MOT benchmark, which is nearly 3.3 times faster than CenterTrack with better performance. We hope our approach will serve as a solid baseline and encourage future research in this field.

Abstract:
The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors, such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, the GenPrint-generated images yield comparable or even superior accuracy to models trained solely on real data and further enhance performance when augmenting the diversity of existing real fingerprint datasets.

Abstract:
With the remarkable success of deep neural networks, there is a growing interest in research aimed at providing clear interpretations of their decision-making processes. In this paper, we introduce Attribution Equilibrium, a novel method to decompose output predictions into fine-grained attributions, balancing positive and negative relevance for clearer visualization of the evidence behind a network decision. We carefully analyze conventional approaches to decision explanation and present a different perspective on the conservation of evidence. We define the evidence as a gap between positive and negative influences among gradient-derived initial contribution maps. Then, we incorporate antagonistic elements and a user-defined criterion for the degree of positive attribution during propagation. Additionally, we consider the role of inactivated neurons in the propagation rule, thereby enhancing the discernment of less relevant elements such as the background. We conduct various assessments in a verified experimental environment with PASCAL VOC 2007, MS COCO 2014, and ImageNet datasets. The results demonstrate that our method outperforms existing attribution methods both qualitatively and quantitatively in identifying the key input features that influence model decisions.

Abstract:
We present a novel deep camera path optimization framework for minimum latency online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimation, path smoothing, and novel view synthesis. Most previous methods concentrate on motion estimation while path optimization receives less attention, particularly in the crucial online setting where future frames are inaccessible. In this work, we adopt off-the-shelf high-quality deep motion models for motion estimation and focus only on the path optimization. Specifically, our camera path smoothing network takes a short 2D camera path in a sliding window as input and outputs the stabilizing warp field of the last frame, which warps the coming frame to its stabilized position. We explore three motion densities: a global single camera path, local mesh-based bundled paths, and dense flow paths. A hybrid loss and an efficient motion smoothing attention (EMSA) module are proposed for spatially and temporally consistent path smoothing. Moreover, we build a motion dataset that contains stable and unstable motion pairs for training. Extensive experiments demonstrate that our method surpasses state-of-the-art online stabilization methods and rivals the performance of offline methods, offering compelling advancements in the field of video stabilization.

Abstract:
Graphs are the most ubiquitous data structures for representing relational datasets and performing inferences in them. They model, however, only pairwise relations between nodes and are not designed for encoding the higher-order relations. This drawback is mitigated by hypergraphs, in which an edge can connect an arbitrary number of nodes. Most hypergraph learning approaches convert the hypergraph structure to that of a graph and then deploy existing geometric deep learning methods. This transformation leads to information loss, and sub-optimal exploitation of the hypergraph's expressive power. We present HyperMSG, a novel hypergraph learning framework that uses a modular two-level neural message passing strategy to accurately and efficiently propagate information within each hyperedge and across the hyperedges. HyperMSG adapts to the data and task by learning an attention weight associated with each node's degree centrality. Such a mechanism quantifies both local and global importance of a node, capturing the structural properties of a hypergraph. HyperMSG is inductive, allowing inference on previously unseen nodes. Further, it is robust and outperforms state-of-the-art hypergraph learning methods on a wide range of tasks and datasets. Finally, we demonstrate the effectiveness of HyperMSG in learning multimodal relations through detailed experimentation on a challenging multimedia dataset.
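
The two-level message passing idea described above can be sketched as follows. This is a simplified illustration: both levels use a plain arithmetic mean, a node's own feature is included in its hyperedge message, and the update averages a node's feature with its neighborhood message. The learned attention weights, any sampling, and the actual aggregation functions of HyperMSG are not reproduced; the function name and data layout are assumptions made for this example.

```python
def two_level_message_passing(features, hyperedges):
    """One round of intra-hyperedge then inter-hyperedge aggregation.

    features: dict mapping node id -> list of floats (all the same length).
    hyperedges: list of lists of node ids; each may hold any number of nodes.
    """
    dim = len(next(iter(features.values())))

    def mean(vectors):
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    # Level 1: aggregate node features within each hyperedge.
    edge_messages = [mean([features[n] for n in edge]) for edge in hyperedges]

    # Level 2: for each node, aggregate across its incident hyperedges.
    updated = {}
    for node, feat in features.items():
        incident = [msg for edge, msg in zip(hyperedges, edge_messages)
                    if node in edge]
        if incident:
            # Combine the node's own feature with its neighborhood message.
            updated[node] = mean([feat, mean(incident)])
        else:
            updated[node] = list(feat)  # isolated node: keep its feature
    return updated
```

For example, with nodes `a`, `b`, `c` and hyperedges `[["a", "b"], ["b", "c"]]`, node `b` receives messages from both hyperedges while `a` and `c` each receive one, without first expanding the hyperedges into pairwise graph edges.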

Abstract:
Event cameras are innovative neuromorphic sensors that asynchronously capture the scene dynamics. Due to the event-triggering mechanism, such cameras record event streams with much shorter response latency and higher intensity sensitivity compared to conventional cameras. On the basis of these features, previous works have attempted to reconstruct high dynamic range (HDR) videos from events, but have either suffered from unrealistic artifacts or failed to provide sufficiently high frame rates. In this paper, we present a recurrent convolutional neural network that reconstructs high-speed HDR videos from event sequences, with key frame guidance to prevent potential error accumulation caused by the sparse event data. Additionally, to address the problem of severely limited real data, we develop a new optical system to collect a real-world dataset with paired high-speed HDR videos and event streams, facilitating future research in this field. Our dataset provides the first real paired dataset for event-to-HDR reconstruction, avoiding potential inaccuracies from simulation strategies. Experimental results demonstrate that our method can generate high-quality, high-speed HDR videos. We further explore the potential of our work in cross-camera reconstruction and downstream computer vision tasks, including object detection, panoramic segmentation, optical flow estimation, and monocular depth estimation under HDR scenarios.

Abstract:
Generalized zero-shot learning (GZSL) endeavors to identify the unseen categories using knowledge from the seen domain, necessitating the intrinsic interactions between the visual features and attribute semantic features. However, GZSL suffers from insufficient visual-semantic correspondences due to the attribute diversity and instance diversity. Attribute diversity refers to varying semantic granularity in attribute descriptions, ranging from low-level (specific, directly observable) to high-level (abstract, highly generic) characteristics. This diversity challenges the collection of adequate visual cues for attributes under a uni-granularity. Additionally, diverse visual instances corresponding to the same sharing attributes introduce semantic ambiguity, leading to vague visual patterns. To tackle these problems, we propose a multi-granularity progressive semantic-visual mutual adaption (PSVMA+) network, where sufficient visual elements across granularity levels can be gathered to remedy the granularity inconsistency. PSVMA+ explores semantic-visual interactions at different granularity levels, enabling awareness of multi-granularity in both visual and semantic elements. At each granularity level, the dual semantic-visual transformer module (DSVTM) recasts the sharing attributes into instance-centric attributes and aggregates the semantic-related visual regions, thereby learning unambiguous visual features to accommodate various instances. Given the diverse contributions of different granularities, PSVMA+ employs selective cross-granularity learning to leverage knowledge from reliable granularities and adaptively fuses multi-granularity features for comprehensive representations. Experimental results demonstrate that PSVMA+ consistently outperforms state-of-the-art methods.

Abstract:
While diffusion-based image restoration (IR) methods have achieved remarkable success, they are still limited by the low inference speed attributed to the necessity of executing hundreds or even thousands of sampling steps. Existing acceleration sampling techniques, though seeking to expedite the process, inevitably sacrifice performance to some extent, resulting in over-blurry restored outcomes. To address this issue, this study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps. Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Specifically, our proposed method establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency. A carefully formulated noise schedule is devised to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experimental evaluations demonstrate that the proposed method achieves superior or comparable performance to current state-of-the-art methods on four classical IR tasks, namely image super-resolution, image inpainting, blind face restoration, and image deblurring, even with only four sampling steps.
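
The residual-shifting idea described above can be illustrated with a small sketch. The abstract gives no formula, so everything here is an assumed form: the sample at step t is the high-quality image moved a fraction `eta_t` of the way along the residual toward the low-quality image, plus Gaussian noise with standard deviation `kappa * sqrt(eta_t)`. The names `eta_t` and `kappa`, the default value of `kappa`, and the linear four-step schedule are hypothetical choices made only for illustration.

```python
import math
import random


def shifted_sample(x0, y, eta_t, kappa=2.0, rng=None):
    """Draw a sample at one step of an assumed residual-shifting chain.

    x0: high-quality image as a flat list of pixel values.
    y: low-quality image with the same layout.
    eta_t: shifting fraction in [0, 1]; 0 stays at x0, 1 reaches y.
    kappa: assumed scalar controlling the noise strength.
    """
    rng = rng or random.Random(0)
    std = kappa * math.sqrt(eta_t)
    return [hq + eta_t * (lq - hq) + std * rng.gauss(0.0, 1.0)
            for hq, lq in zip(x0, y)]


def linear_schedule(num_steps):
    """Hypothetical schedule: eta rises linearly and ends at 1."""
    return [(t + 1) / num_steps for t in range(num_steps)]
```

Under these assumptions the chain starts near the high-quality image and ends near the low-quality one, so a reverse process can begin from the low-quality input rather than from pure noise, which is one intuition for why few sampling steps may suffice.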

Abstract:
In recent years, 3D models have been utilized in many applications, such as autonomous driving, 3D reconstruction, VR, and AR. However, the scarcity of 3D model data does not meet practical demands. Thus, generating high-quality 3D models efficiently from textual descriptions is a promising but challenging way to solve this problem. In this paper, inspired by the creative mechanisms of human imagination, which concretely supplement the target model from ambiguous descriptions built upon human experiential knowledge, we propose a novel text-3D generation model (T2TD). T2TD aims to generate the target model based on the textual description with the aid of experiential knowledge. Its target creation process simulates the imaginative mechanisms of human beings. In this process, we first introduce the text-3D knowledge graph to preserve the relationship between 3D models and textual semantic information, which provides related shapes like humans' experiential information. Second, we propose an effective causal inference model to select useful feature information from these related shapes, which removes the unrelated structure information and retains only the feature information strongly related to the textual description. Third, we adopt a novel multi-layer transformer structure to progressively fuse this strongly related structure information and textual information, compensating for the lack of structural information and enhancing the final performance of the 3D generation model. The final experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms the SOTA methods on the text2shape datasets.

Abstract:
Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as “learning to learn,” meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.

Abstract:
Deep learning models have emerged as strong and efficient tools that can be applied to a broad spectrum of complex learning problems and many real-world applications. However, more and more works show that deep models are vulnerable to adversarial examples. Compared to vanilla attack settings, this paper advocates a more practical setting of data-free black-box attack, in which the attackers have no access to the structures and parameters of the target model, nor to the intermediate features or any training data associated with the model. To tackle this task, previous methods generate transferable adversarial examples from a transparent substitute model to the target model. However, we found that these works are limited by taking a static substitute model structure for different targets, using hard synthesized examples only once, and still relying on data statistics of the target model. This may harm the performance of attacking the target model. To this end, we propose a novel Dynamic Routing and Knowledge Re-Learning framework (DraKe) to effectively learn a dynamic substitute model from the target model. Specifically, given synthesized training samples, a dynamic substitute structure learning strategy is proposed to adaptively generate the optimal substitute model structure via a policy network according to different target models and tasks. To facilitate the substitute training, we present a graph-based structure information learning scheme to capture the structural knowledge learned from the target model. To address the inherent limitation that data generated online can only be learned once, a dynamic knowledge re-learning strategy is proposed to adjust the weights of optimization objectives and re-learn hard samples. Extensive experiments on four public image classification datasets and one face recognition benchmark are conducted to evaluate the efficacy of our DraKe. We obtain significant improvement compared with state-of-the-art competitors. More importantly, our DraKe consistently achieves attack superiority for different target models (e.g., residual networks and vision transformers), showing great potential for complex real-world applications.

Abstract:
The restoration of hyperspectral image (HSI) plays a pivotal role in subsequent hyperspectral image applications. Despite the remarkable capabilities of deep learning, current HSI restoration methods face challenges in effectively exploring the spatial non-local self-similarity and spectral low-rank property inherently embedded with HSIs. This paper addresses these challenges by introducing a latent diffusion enhanced rectangle Transformer for HSI restoration, tackling the non-local spatial similarity and HSI-specific latent diffusion low-rank property. In order to effectively capture non-local spatial similarity, we propose the multi-shape spatial rectangle self-attention module in both horizontal and vertical directions, enabling the model to utilize informative spatial regions for HSI restoration. Meanwhile, we propose a spectral latent diffusion enhancement module that generates the image-specific latent dictionary based on the content of HSI for low-rank vector extraction and representation. This module utilizes a diffusion model to generatively obtain representations of global low-rank vectors, thereby aligning more closely with the desired HSI. A series of comprehensive experiments were carried out on four common hyperspectral image restoration tasks, including HSI denoising, HSI super-resolution, HSI reconstruction, and HSI inpainting. The results of these experiments highlight the effectiveness of our proposed method, as demonstrated by improvements in both objective metrics and subjective visual quality.

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions of seen primitives. Prior studies have attempted to either learn primitives individually (non-connected) or establish dependencies among them in the composition (fully-connected). In contrast, human comprehension of composition diverges from the aforementioned methods as humans possess the ability to make composition-aware adaptation for these primitives, instead of inferring them rigidly through the aforementioned methods. However, developing a comprehension of compositions akin to human cognition proves challenging within the confines of real space. This arises from the limitation of real-space-based methods, which often categorize attributes, objects, and compositions using three independent measures, without establishing a direct dynamic connection. To tackle this challenge, we expand the CZSL distance metric scheme to encompass complex spaces to unify the independent measures, and we establish an imaginary-connected embedding in complex space to model human understanding of attributes. To achieve this representation, we introduce an innovative visual bias-based attribute extraction module that selectively extracts attributes based on object prototypes. As a result, we are able to incorporate phase information in training and inference, serving as a metric for attribute-object dependencies while preserving the independent acquisition of primitives. We evaluate the effectiveness of our proposed approach on three benchmark datasets, illustrating its superiority compared to baseline methods.

Abstract:
The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four hierarchies, including pixel-level text, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate the pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In the AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. In the PS mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, while requiring 20× fewer training epochs.

Abstract:
Recently, quanta image sensors (QIS), which are ultra-fast, zero-read-noise binary image sensors, have demonstrated remarkable imaging capabilities in many challenging scenarios. Despite their potential, the adoption of these sensors is severely hampered by (a) high data rates and (b) the need for new computational pipelines to handle the unconventional raw data. We introduce a simple, low-bandwidth computational pipeline to address these challenges. Our approach is based on a novel streaming representation with a small memory footprint, efficiently capturing intensity information at multiple temporal scales. Updating the representation requires only 24 floating-point operations/pixel, which can be efficiently computed online at the native frame rate of the binary frames. We use a neural network operating on this representation to reconstruct videos in real-time (10-30 fps). We illustrate why such a representation is well-suited for these emerging sensors, and how it offers low latency and high frame rate while retaining flexibility for downstream computer vision. Our approach results in significant data bandwidth reductions (∼100×) and real-time image reconstruction and computer vision, with a 10^4-10^5× reduction in computation compared with the existing state-of-the-art approach (Ma et al. 2020), while maintaining comparable quality. To the best of our knowledge, our approach is the first to achieve online, real-time image reconstruction on QIS.
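As a rough illustration of what a streaming, multi-temporal-scale summary of binary frames could look like, the sketch below keeps one exponential moving average per time scale and updates it online with each incoming frame. This is an assumption made for illustration only: the abstract does not specify the representation, and the class name `MultiScaleEMA`, the decay values, and the update rule are hypothetical and are not claimed to match the 24 floating-point operations/pixel figure.

```python
import numpy as np


class MultiScaleEMA:
    """Hypothetical streaming summary of binary frames at several temporal scales.

    Each channel is an exponential moving average with its own decay:
    small decays track fast changes, large decays integrate over long windows.
    """

    def __init__(self, shape, decays=(0.5, 0.9, 0.99)):
        # One decay per temporal scale, broadcast over the (H, W) frame.
        self.decays = np.asarray(decays, dtype=np.float32).reshape(-1, 1, 1)
        self.state = np.zeros((len(decays), *shape), dtype=np.float32)

    def update(self, binary_frame):
        # Constant memory: only the running averages are stored, never the frames.
        frame = binary_frame.astype(np.float32)[None]
        self.state = self.decays * self.state + (1.0 - self.decays) * frame
        return self.state


# Usage: feed simulated 1-bit frames one at a time.
rng = np.random.default_rng(0)
rep = MultiScaleEMA((4, 4))
for _ in range(100):
    features = rep.update(rng.random((4, 4)) < 0.3)
```

In a pipeline of the kind described, a tensor like `features` would be the input to the reconstruction network in place of the raw binary stream.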

Abstract:
Domain Adaptation (DA) is used to reduce cross-domain differences between the labeled source and unlabeled target domains. As the existing semantic-based DA approaches mainly focus on extracting consistent knowledge under semantic guidance, they may fail in acquiring: (a) personalized knowledge between intra-class samples and (b) local knowledge of neighbor samples from different categories. Hence, a multi-semantic-granularity and target-sample oriented approach, called Adaptive Graph Learning with Semantic Promotability (AGLSP), is proposed, which consists of three parts: (a) Adaptive Graph Embedding with Semantic Guidance (AGE-SG) that adaptively estimates the promotability of target samples and learns variant semantic and geometrical components from the source and those semantically promotable target samples; (b) Semantically Promotable Sample Enhancement (SPSE) that further increases the discriminability and adaptability of tag granularity by mining the features of intra-class source and semantically promotable target samples with multi-granularities; and (c) Adaptive Graph Learning with Implicit Semantic Preservation (AGL-ISP) that forms the tag granularity by extracting commonalities between the source and those semantically non-promotable target samples. As AGLSP learns more semantics from the two domains, more cross-domain knowledge is transferred. Mathematical proofs and extensive experiments on seven datasets demonstrate the performance of AGLSP.

Abstract:
Unsupervised methods have received increasing attention in homography learning due to their promising performance and label-free training. However, existing methods do not explicitly consider the plane-induced parallax, making the prediction compromised on multiple planes. In this work, we propose a novel method HomoGAN to guide unsupervised homography estimation to focus on the dominant plane. First, a multi-scale transformer is designed to predict homography from the feature pyramids of input images in a coarse-to-fine fashion. Moreover, we propose an unsupervised GAN to impose coplanarity constraint on the predicted homography, which is realized by using a generator to predict a mask of aligned regions, and then a discriminator to check if two masked feature maps are induced by a single homography. Based on the global homography framework, we extend it to the local mesh-grid homography estimation, namely, MeshHomoGAN, where plane constraints can be enforced on each mesh cell to go beyond a single dominant plane, such that scenes with multiple depth planes can be better aligned. To validate the effectiveness of our method and its components, we conduct extensive experiments on large-scale datasets. Results show that our matching error is 22% lower than previous SOTA methods.

Abstract:
Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 13 representative LVLMs such as InstructBLIP and LLaVA, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates five categories of multimodal capabilities of LVLMs such as visual question answering and object hallucination on 42 in-domain text-related visual benchmarks, while the latter provides the user-level evaluation of LVLMs in an open-world question-answering scenario. The study investigates how specific features of LVLMs such as model configurations, modality alignment mechanisms, and training data affect multimodal understanding. By conducting a comprehensive comparison of these features on quantitative and arena evaluation, our study uncovers several innovative findings, which establish a fundamental framework for the development and evaluation of innovative strategies aimed at enhancing multimodal techniques.

Abstract:
Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite, and they lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the most general existing agent in Minecraft, capable of completing over 200 different tasks using control and observation spaces similar to those of humans. These tasks range from short-horizon tasks, e.g., “chopping trees”, to long-horizon ones, e.g., “obtaining a diamond pickaxe”. JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

Abstract:
RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal performance in complex scenarios. Inspired by hierarchical human visual systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network employing a “Divide-and-Conquer” strategy. This framework utilizes a unified encoder with specialized decoders, each addressing different subtasks of exploring modality-specific and modality-complementary information for RGB-T SOD, thereby enhancing the final saliency map prediction. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator (MFM) in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module (RASPM) in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module (MDAM) in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. 
To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios, even when dealing with incomplete modality data.

Abstract:
Recent studies in reinforcement learning have explored brain-inspired function approximators and learning algorithms to simulate brain intelligence and adapt to neuromorphic hardware. Among these approaches, reward-modulated spike-timing-dependent plasticity (R-STDP) is biologically plausible and energy-efficient, but suffers from a gap between its local learning rules and the global learning objectives, which limits its performance and applicability. In this paper, we design a recurrent winner-take-all network and propose the spiking variational policy gradient (SVPG), a new R-STDP learning method derived theoretically from the global policy gradient. Specifically, the policy inference is derived from an energy-based policy function using mean-field inference, and the policy optimization is based on a last-step approximation of the global policy gradient. These fill the gap between the local learning rules and the global target. In experiments including a challenging ViZDoom vision-based navigation task and two realistic robot control tasks, SVPG successfully solves all the tasks. In addition, SVPG exhibits better inherent robustness to various kinds of input, network parameters, and environmental perturbations than compared methods.

Abstract:
The multi-modality fusion strategy is currently the de facto most competitive solution for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention in which each BEV query extracts the spatial features from both point cloud and camera input, thus completing multi-modality information fusion under BEV space. For temporal information, we propose temporal self-attention to fuse the history BEV information recurrently. By comparing with other fusion paradigms, we demonstrate that the fusion method proposed in this work is both succinct and effective. Our approach achieves the new state-of-the-art 74.1% in terms of NDS metric on the nuScenes test set. In addition, we extend BEVFormer to encompass a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks.

Abstract:
In this paper, we study the problem of procedure planning in instructional videos, which aims to make a plan (i.e., a sequence of actions) given the current visual observation and the desired goal. Previous works cast this as a sequence modeling problem and leverage either intermediate visual observations or language instructions as supervision for autoregressive planning, resulting in complex learning schemes and expensive annotation costs. To avoid intermediate supervision annotation and the error accumulation caused by planning autoregressively, we propose a diffusion-based framework, coined PDPP (Projected Diffusion model for Procedure Planning), to directly model the whole action sequence distribution with the task label as supervision instead. Our core idea is to treat procedure planning as a distribution fitting problem under the given observations, thus transforming the planning problem into a sampling process from this distribution during inference. The diffusion-based modeling approach also effectively addresses the uncertainty issue in procedure planning. Based on PDPP, we further apply joint training to our framework to generate plans with varying horizon lengths using a single model and reduce the number of training parameters required. We instantiate our PDPP with three popular diffusion models and investigate a series of condition-introducing methods in our framework, including condition embeddings, Mixture-of-Experts (MoEs), two-stage prediction, and the Classifier-Free Guidance strategy. Finally, we apply our PDPP to the Visual Planners for human Assistance (VPA) problem, which requires the goal to be specified in natural language rather than visual observation. We conduct experiments on challenging datasets of different scales, and our PDPP model achieves state-of-the-art performance on multiple metrics, even compared with strongly supervised counterparts. These results further demonstrate the effectiveness and generalization ability of our model.

Abstract:
Social images are often associated with rich but noisy tags from community contributions. Although social tags can potentially provide valuable semantic training information for image retrieval, existing studies all fail to effectively filter noise by exploiting the cross-modal correlation between image content and tags. The current cross-modal vision-and-language representation learning methods, which selectively attend to the relevant parts of the image and text, show a promising direction. However, they are not suitable for social image retrieval since: (1) they deal with natural text sequences where the relationships between words can be easily captured by language models for cross-modal relevance estimation, while the tags are isolated and noisy; (2) they take an (image, text) pair as input, and consequently cannot be employed directly for unimodal social image retrieval. This paper tackles the challenge of utilizing cross-modal interactions to learn precise representations for unimodal retrieval. The proposed framework, dubbed CGVR (Cross-modal Guided Visual Representation), extracts accurate semantic representations of images from noisy tags and transfers this ability to an image-only hashing subnetwork by a carefully designed training scheme. To better capture correlated semantics and filter noise, it embeds a priori common-sense relationships among tags into attention computation for joint awareness of textual and visual context. Experiments show that CGVR achieves approximately 8.82 and 5.45 points improvement in MAP over the state-of-the-art on two widely used social image benchmarks. CGVR can serve as a new baseline for the image retrieval community.

Abstract:
Lifelong person re-identification (LReID) suffers from the catastrophic forgetting problem when learning from non-stationary data streams. Existing exemplar-based and knowledge distillation-based LReID methods encounter data privacy and limited acquisition capacity, respectively. In this paper, we introduce the prototype, which is under-investigated in LReID, to better balance knowledge retention and acquisition. Previous prototype-based works primarily focused on the classification task, where prototypes were modeled as discrete points or statistical distributions. However, they either discarded the distribution information or omitted instance-level diversity, which are crucial fine-grained clues for LReID. Furthermore, the domain shifts between data sources result in a feature gap between the new and old data, which restricts the utilization of the fine-grained information in prototypes. To address these challenges, we propose Distribution-aware Knowledge Aligning and Prototyping (DKP++), a novel framework for modeling and leveraging prototypes in LReID. First, an Instance-level Distribution Modeling network is introduced to capture the local diversity of each instance. Next, a Distribution-oriented Prototype Generation algorithm transforms the instance-level diversity into identity-level distributions which are stored as prototypes. Then, a Prototype-based Knowledge Transfer module distills the knowledge within the prototypes to the new model. To mitigate the impact of domain shifts during knowledge transfer, we introduce a privacy-friendly Distribution Aligning module that transforms new input data to fit the historical distribution, which is incorporated with feature-level alignment constraints to enhance the coherence between new and old knowledge, effectively improving historical prototype utilization. 
Extensive experiments demonstrate that our method achieves a superior balance between plasticity and stability, outperforming state-of-the-art LReID methods by a large margin.

Abstract:
Adversarial patches present significant challenges to the robustness of deep learning models, making the development of effective defenses critical for real-world applications. This paper introduces DIFFender, a novel DIFfusion-based DeFender framework that leverages the power of a text-guided diffusion model to counter adversarial patch attacks. At the core of our approach is the discovery of the Adversarial Anomaly Perception (AAP) phenomenon, which enables the diffusion model to accurately detect and locate adversarial patches by analyzing distributional anomalies. DIFFender seamlessly integrates the tasks of patch localization and restoration within a unified diffusion model framework, enhancing defense efficacy through their close interaction. Additionally, DIFFender employs an efficient few-shot prompt-tuning algorithm, facilitating the adaptation of the pre-trained diffusion model to defense tasks without the need for extensive retraining. Our comprehensive evaluation, covering image classification and face recognition tasks, as well as real-world scenarios, demonstrates DIFFender’s robust performance against adversarial attacks. The framework’s versatility and generalizability across various settings, classifiers, and attack methodologies mark a significant advancement in adversarial patch defense strategies. Beyond the popular visible domain, we have identified another advantage of DIFFender: its capability to easily expand into the infrared domain. Consequently, we demonstrate the flexibility of DIFFender, which can defend against both infrared and visible adversarial patch attacks using a single universal defense framework.

Abstract:
Imaging through scattering is challenging, as even a thin layer can randomly perturb light propagation and obscure hidden objects. Accurate closed-form modeling of forward scattering remains difficult, particularly for dynamically varying or thick layers. Here, we introduce a plug-and-play inverse solver based on video diffusion models with a physically grounded forward model tailored to dynamic scattering layers. Our method extends Diffusion Posterior Sampling (DPS) to the spatio-temporal domain, thereby capturing statistical correlations between video frames and scattered signals more effectively. Leveraging these temporal correlations, our approach recovers high-resolution spatial details that spatial-only methods typically fail to reconstruct. We also propose an inference-time optimization with a lightweight mapping network, enabling joint estimation of low-dimensional forward-model parameters without additional training. This joint optimization significantly enhances adaptability to unknown, time-varying degradations, making our method suitable for blind inverse scattering problems. We validate our method across diverse conditions, including different scene types, layer thicknesses, and scene-layer distances. Real-world experiments using multiple datasets confirm the robustness and effectiveness of our approach, even under real noise and forward-model approximation mismatches. Finally, we validate our method as a general video-restoration framework across dehazing, deblurring, inpainting, and blind restoration under complex optical aberrations.

Abstract:
Depth completion and super-resolution are crucial tasks for comprehensive RGB-D scene understanding, as they involve reconstructing the precise 3D geometry of a scene from sparse or low-resolution depth measurements. However, most existing methods either rely solely on 2D depth representations or directly incorporate raw 3D point clouds for compensation, which are still insufficient to capture the fine-grained 3D geometry of the scene. In this paper, we introduce Tri-Perspective View Decomposition (TPVD) frameworks that can explicitly model 3D geometry. To this end, (1) TPVD ingeniously decomposes the original 3D point cloud into three 2D views, one of which corresponds to the sparse or low-resolution depth input. (2) For sufficient geometric interaction, TPV Fusion is designed to update the 2D TPV features through recurrent 2D-3D-2D aggregation. (3) By adaptively searching for TPV affinitive neighbors, two additional refinement heads are developed for these two tasks to further improve the geometric consistency. Meanwhile, we build novel datasets named TOFDC for depth completion and TOFDSR for depth super-resolution. Both datasets are acquired using time-of-flight (TOF) sensors and color cameras on smartphones. Extensive experiments on TOFDC, KITTI, NYUv2, SUN RGBD, VKITTI, TOFDSR, RGB-D-D, Lu, and Middlebury datasets indicate that our TPVD outperforms previous depth completion and super-resolution methods, reaching the state of the art.

Abstract:
This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects’ motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes.

Abstract:
In this work, we present 3DCoMPaT++, a multimodal 2D/3D dataset with 160 million rendered views of more than 10 million stylized 3D shapes carefully annotated at the part-instance level, alongside matching RGB point clouds, 3D textured meshes, depth maps, and segmentation masks. 3DCoMPaT++ covers 42 shape categories, 275 fine-grained part categories, and 293 fine-grained material classes that can be compositionally applied to parts of 3D objects. We render a subset of one million stylized shapes from four equally spaced views as well as four randomized views, leading to a total of 160 million renderings. Parts are segmented at the instance level, with coarse-grained and fine-grained semantic levels. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. Additionally, we report the outcomes of a data challenge organized at the CVPR conference, showcasing the winning method’s utilization of a modified PointNet++ model trained on 6D inputs, and exploring alternative techniques for GCR enhancement. We hope our work will help ease future research on compositional 3D Vision.

Abstract:
Data Augmentation (DA) is a technique to increase the quantity and diversity of the training data, and thereby alleviate overfitting and improve generalisation. However, standard DA produces synthetic data for augmentation with limited diversity. Generative Adversarial Networks (GANs) may unlock additional information in a dataset by generating synthetic samples having the appearance of real images. However, these models struggle to simultaneously address three key requirements: fidelity and high-quality samples; diversity and mode coverage; and fast sampling. Indeed, GANs generate high-quality samples rapidly, but have poor mode coverage, limiting their adoption in DA applications. We propose LatentAugment, a DA strategy that overcomes the low diversity of GANs, opening them up for use in DA applications. Without external supervision, LatentAugment modifies latent vectors and moves them into latent space regions to maximise the synthetic images’ diversity and fidelity. It is also agnostic to the dataset and the downstream task. A wide set of experiments shows that LatentAugment improves the generalisation of a deep model translating from MRI to CT, beating both standard DA and GAN-based sampling. We further demonstrate its effectiveness when translating from low-energy mammograms to dual-energy subtracted images in contrast-enhanced spectral mammography. Moreover, still in comparison with GAN-based sampling, LatentAugment synthetic samples show superior mode coverage and diversity.

Abstract:
The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly provide guesses with assorted types and unknown interdependences that can result in unreliable predictions. Furthermore, existing techniques for programmatic weak supervision cannot provide assessments for the reliability of the probabilistic predictions for labels. This paper presents a methodology for programmatic weak supervision that can provide confidence intervals for label probabilities and obtain more reliable predictions. In particular, the methods proposed use uncertainty sets of distributions that encapsulate the information provided by LFs with unrestricted behavior and typology. Experiments on multiple benchmark datasets show the improvement of the presented methods over the state-of-the-art and the practicality of the confidence intervals presented.

Abstract:
Traditional classification problems assume that features and labels are fixed. However, this assumption is easily violated in open environments. For example, the exponential growth of web pages leads to an expanding feature space with the accumulation of keywords. At the same time, rapid refresh makes it difficult to obtain accurate labels for web pages, often resulting in rough annotations containing potentially correct labels, i.e., partial label set. In such cases, the coupling between the incremental feature space and the partial label set introduces more complex real-world challenges, which deserve attention but have not been fully explored. In this paper, we address this issue by introducing a novel incremental learning approach with Simultaneous Incremental Feature and Partial Label (SIFPL). SIFPL models the data evolution in dynamic and open environments in a two-stage way, consisting of a previous stage and an adapting stage, to deal with the associated challenges. Specifically, to ensure the reusability of the model during adaptation, we impose classifier consistency constraints to enhance the stability of the current model. This constraint leverages historical information from the previous stage to improve the generalization ability of the current model, providing a reliable foundation for further refining the model with new features. Regarding label disambiguation, we filter out incorrect candidate labels based on the principle of minimizing classifier loss, ensuring that the new features and labels effectively support the model’s adaptation to the incremental feature space, thereby further refining its performance. Furthermore, we also provide a solid theoretical analysis of the model’s generalization bounds, which can validate the efficiency of model inheritance. Experiments on benchmark and real-world datasets validate that the proposed method achieves better accuracy performance than the baseline methods in most cases.

Abstract:
Locating 3D objects from a single RGB image via Perspective-n-Point (PnP) is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest interpreting PnP as a differentiable layer, allowing for partial learning of 2D-3D point correspondences by backpropagating the gradients of pose loss. Yet, learning the entire set of correspondences from scratch is highly challenging, particularly for ambiguous pose solutions, where the globally optimal pose is theoretically non-differentiable w.r.t. the points. In this paper, we propose EPro-PnP, a probabilistic PnP layer for general end-to-end pose estimation, which outputs a distribution of pose with differentiable probability density on the SE(3) manifold. The 2D-3D coordinates and corresponding weights are treated as intermediate variables learned by minimizing the KL divergence between the predicted and target pose distributions. The underlying principle generalizes previous approaches, and resembles the attention mechanism. EPro-PnP can enhance existing correspondence networks, closing the gap between PnP-based methods and the task-specific leaders on the LineMOD 6DoF pose estimation benchmark. Furthermore, EPro-PnP helps to explore new possibilities of network design, as we demonstrate a novel deformable correspondence network with state-of-the-art pose accuracy on the nuScenes 3D object detection benchmark.
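The KL-divergence objective and its resemblance to attention can be illustrated with a heavily simplified, discretized sketch. It assumes a finite set of candidate poses, each with a scalar cost (for instance a summed weighted squared reprojection error), and a target distribution concentrated on one candidate; the predicted distribution is then a softmax over negative costs. The function name `kl_pose_loss` and the discrete setup are hypothetical and stand in for the continuous density on SE(3) that the abstract describes.

```python
import numpy as np


def kl_pose_loss(costs, target_index):
    """KL divergence from a one-hot target to softmax(-costs).

    Assumption for illustration: poses are a finite candidate set, so the
    normalizing integral over pose space becomes a log-sum-exp.
    """
    logits = -np.asarray(costs, dtype=np.float64)
    shift = logits.max()  # subtract the max for numerical stability
    log_norm = shift + np.log(np.exp(logits - shift).sum())
    log_prob = logits - log_norm  # log of the softmax, as in attention weights
    # For a one-hot target, KL(target || predicted) = -log p(target).
    return -log_prob[target_index]


# Usage: the loss is small when the target pose has by far the lowest cost.
print(kl_pose_loss([0.1, 5.0, 7.0], target_index=0))
print(kl_pose_loss([0.1, 5.0, 7.0], target_index=2))
```

Minimizing this quantity lowers the cost at the target pose while raising it elsewhere, which is why gradients can reach the correspondences even when several poses are plausible.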

Abstract:
Human faces contain rich semantic information that could hardly be described without a large vocabulary and complex sentence patterns. However, most existing text-to-image synthesis methods could only generate meaningful results based on limited sentence templates with words contained in the training set, which heavily impairs the generalization ability of these models. In this paper, we define a novel ‘free-style’ text-to-face generation and manipulation problem, and propose an effective solution, named AnyFace++, which is applicable to a much wider range of open-world scenarios. The CLIP model is involved in AnyFace++ for learning an aligned language-vision feature space, which also expands the range of acceptable vocabulary as it is trained on a large-scale dataset. To further improve the granularity of semantic alignment between text and images, a memory module is incorporated to convert the description with arbitrary length, format, and modality into regularized latent embeddings representing discriminative attributes of the target face. Moreover, the diversity and semantic consistency of generation results are improved by a novel semi-supervised training scheme and a series of newly proposed objective functions. Compared to state-of-the-art methods, AnyFace++ is capable of synthesizing and manipulating face images based on more flexible descriptions and producing realistic images with higher diversity.

Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Vincent Cartillier, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Devansh Kukreja, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina González, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jáchym Kolár, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran K. Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbeláez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard A. Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

Affiliations: FAIR, Menlo Park, CA, USA; University of Minnesota, Minneapolis, MN, USA; University of Catania, Catania, Italy; Meta Reality Labs, Menlo Park, CA, USA; Meta, London, U.K.; Carnegie Mellon University, Pittsburgh, PA, USA; UC Berkeley, Berkeley, CA, USA; Georgia Tech, Atlanta, GA, USA; University of Bristol, Bristol, U.K.; King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; National University of Singapore, Singapore; Carnegie Mellon University Africa, Kigali, Rwanda; Universidad de los Andes, Santiago, Chile; University of Tokyo, Tokyo, Japan; Indiana University, Bloomington, IN, USA; International Institute of Information Technology, Hyderabad, Telangana, India; Massachusetts Institute of Technology, Cambridge, MA, USA; University of Pennsylvania, Philadelphia, PA, USA

Abstract:
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.

Abstract:
Recent progress in Text-to-Image (T2I) generative models has enabled high-quality image generation. As performance and accessibility increase, these models are gaining significant traction and popularity: ensuring their fairness and safety is a priority to prevent the dissemination and perpetuation of biases. However, existing studies in bias detection focus on closed sets of predefined biases (e.g., gender, ethnicity). In this paper, we propose a general framework to identify, quantify, and explain biases in an open-set setting, i.e., without requiring a predefined set. This pipeline leverages a Large Language Model (LLM) to propose biases starting from a set of captions. Next, these captions are used by the target generative model to generate a set of images. Finally, Vision Question Answering (VQA) is leveraged for bias evaluation. We show two variations of this framework: OpenBias and GradBias. OpenBias detects and quantifies biases, while GradBias determines the contribution of individual prompt words to biases. OpenBias effectively detects both well-known and novel biases related to people, objects, and animals, and aligns closely with existing closed-set bias detection methods and human judgment. GradBias shows that neutral words can significantly influence biases, and it outperforms several baselines, including state-of-the-art foundation models.

Abstract:
Weakly supervised video anomaly detection has gained attention for its effective performance and cost-efficient annotation, using video-level labels to distinguish between normal and abnormal patterns. However, challenges arise from the diversity and incompleteness of anomalous events, complicating feature learning. Vision-language models offer promising approaches, but designing precise prompts remains difficult. This is because accommodating the diverse range of normal and anomalous scenarios in real-world settings is challenging, and the workload is significant. To tackle these issues, we propose integrating multilingualism and multiple prompts to improve feature learning. By utilizing prompts in various languages to define “anomaly” and “normalcy,” we tackle these concepts across different linguistic domains. In each domain, multiple prompts are employed for adaptive top-K prompt selection of snippets. To enhance visual feature learning, a multi-granularity attention module combining Transformer and Mamba is designed. Mamba’s long-range adaptation selection builds fine-grained temporal correlations among coarse-grained snippets, while Transformer enhances fine-grained information guided by coarse-grained information. Alongside a multilingual prompt guidance loss, we introduce a gradual directional loss to jointly optimize visual feature distribution and the top-K prompt selection. Our method demonstrates effectiveness on four video datasets and provides generalizability analyses on two medical datasets, including EMG and ECG temporal data.

Abstract:
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data.

Abstract:
Pedestrian trajectory prediction plays a crucial and fundamental role in many computer vision tasks. Most existing works utilize recurrent neural networks to extract temporal features from trajectories because their recursive structure is inherently well-suited for time series data. However, previous methods overlook the forgetting characteristics of pedestrians when modeling historical trajectories, which may cause the model to focus on the wrong positions of historical information. In this paper, we propose a simple yet effective Adaptive Forgetting-Controlled Recurrent Neural Network (AFC-RNN) for pedestrian trajectory prediction. The core idea of AFC-RNN is a novel Adaptive Forgetting Controller (AFC), which controls the forgetting degree of the historical information at each time step explicitly and adaptively. Specifically, AFC first learns memory factors for each time step based on the temporal correlation of observed trajectories using the self-attention mechanism. Then, AFC-RNN applies these memory factors to regulate the forgetting degree of observed features at each time step from RNN. Extensive experiments and ablation studies on ETH, UCY, SDD, and NBA datasets demonstrate that our method outperforms existing state-of-the-art approaches. Additionally, we provide a mathematical analysis to demonstrate the superiority of our adaptive forgetting strategy in the AFC-RNN over traditional RNNs for trajectory forgetting modeling.

Abstract:
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called UM-ODTrack). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: Video-level Sampling. We expand the model’s inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. Modality Scalable. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via one-shot training for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our UM-ODTrack achieves a new SOTA performance.

Abstract:
Recent advances in supervised learning have predominantly focused on regularizations, optimizers, and architectures, yet the potential of simultaneously optimizing data distributions and supervisory signals for training samples remains underexplored. In this paper, we propose a novel paradigm that leverages the benefits of image perturbations for rectifying data distributions. Our method, called DPL (Deep Perturbation Learning), introduces new insights into utilizing image perturbations and focuses on improving generalizability on normal samples, rather than resisting adversarial attacks. DPL formulates a differentiable function w.r.t. image perturbations and implements an alternative optimization process that seamlessly integrates with downstream tasks. However, the limitations of DPL stem from the inefficiency in employing differentiable targets caused by the exclusive optimization of image perturbations, while neglecting the critical role of supervisory signals in training effectiveness. These lead to an excessive number of DPL iterations and an inferior performance-cost trade-off. To tackle this, we extend DPL to DPL++ with synchronous optimization for image perturbations and label perturbations. In our DPL++ paradigm, the post-hoc application of perturbations to images and labels endows amendments toward both data distributions and supervisory signals, significantly furthering the generalizability of models over various benchmarks. Crucially, the proposed synchronous optimization process shares key differentiable objectives to reduce computational complexity, thereby achieving enhanced effectiveness within fewer optimization iterations. Theoretically, as a generic and flexible approach, DPL++ can be applied to a variety of backbone architectures (e.g., ResNet, DenseNet, and ViT) and downstream tasks (e.g., image classification and object detection).
To validate the efficacy of DPL++, we conduct extensive performance experiments and in-depth analytical studies on 2 visual tasks over 5 mainstream benchmarks across 13 backbone networks. The comprehensive results verify the superiority of DPL++ over DPL and demonstrate its promising capabilities for advancing decision-making capacity, risk minimization, class distinguishability, and training convergence.

Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computing, Macquarie University, Sydney, NSW, Australia; University of Science and Technology of China, Hefei, China; Australian Institute for Machine Learning (AIML), School of Computer Science, The University of Adelaide, Adelaide, SA, Australia; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; EECS, University of California at Merced, Merced, CA, USA

Abstract:
Given a piece of text, a video clip, and reference audio, the movie dubbing (also known as Visual Voice Cloning, V2C) task aims to generate speech that clones the reference voice and aligns well with the video in both emotion and lip movement, which is more challenging than conventional text-to-speech synthesis tasks. To align the generated speech with the inherent lip motion of the given silent video, most existing works utilize each video frame to query textual phonemes. However, such an attention operation usually leads to mumbled speech because different phonemes are fused for video frames corresponding to one phoneme (video frames are finer-grained than phonemes). To address this issue, we propose a diffusion-based movie dubbing architecture, which improves pronunciation by Hierarchical Phoneme Modeling (HPM) and generates better mel-spectrograms through Acoustic Diffusion Denoising (ADD). We term our model HD-Dubber. Specifically, our HPM bridges the visual information and corresponding speech prosody from three aspects: (1) aligning lip movement with the speech duration based on each phoneme unit by contrastive learning; (2) conveying facial expression to phoneme-level energy and pitch; and (3) injecting global emotions captured from video scenes into prosody. On the other hand, ADD exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via a parameterized Markov chain conditioned on textual phonemes and reference audio. ADD has two novel denoisers, the Style-adaptive Residual Denoiser (SRD) and the Phoneme-enhanced U-net Denoiser (PUD), to enhance speaker similarity and improve pronunciation quality. Extensive experimental results on the three benchmark datasets demonstrate the state-of-the-art performance of the proposed method. The source code and trained models will be made available to the public.

Abstract:
The relation modeling between actors and scene context advances video action detection where the correlation of multiple actors makes their action recognition challenging. Existing studies model each actor and scene relation to improve action recognition. However, the scene variations and background interference limit their effectiveness. In this paper, we propose to select actor-related scene context, rather than directly leveraging the raw video scene, to improve relation modeling. We develop a Cycle Actor-Context Relation network (CycleACR) where there is a symmetric graph that models the actor and context relations in a bidirectional form. Specifically, our CycleACR consists of two modules: 1) Actor-to-Context Reorganization (A2C-R), which adaptively collects actor features for context feature reorganizations, and 2) Context-to-Actor Enhancement (C2A-E), which dynamically utilizes the reorganized context features for actor feature enhancement. Stacking multiple CycleACR modules effectively captures the high-order relation and efficiently exchanges useful information between actors and context. To fully exploit time-dependent and holistic context information, we further design a parallel local and global temporal context modeling branch. The outputs of the two branches are integrated as the final context-enhanced actor feature representations. Finally, we propose a context-aware memory bank for long-term relation modeling. The proposed bank can effectively store actor-related scene context from other clips without additional memory overhead. Compared to existing designs that focus on C2A-E, our CycleACR introduces the core design of A2C-R for more effective relation modeling. This cycle modeling enables our CycleACR to achieve state-of-the-art performance on two popular action detection datasets: AVA (40.6 mAP) and UCF101-24 (84.7 mAP).
We also provide ablation studies and visualizations to show how our cycle actor-context relation modeling improves video action detection.

Abstract:
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with a 9-fold improvement in computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.

Abstract:
Density peaks clustering (DPC) is an excellent clustering algorithm that does not need any prior knowledge. However, DPC still has the following shortcomings: (1) The Euclidean distance it uses is not applicable to manifold data with multiple peaks. (2) The local density calculation in DPC is too simple, and the final results may fluctuate with the cutoff distance dc. (3) Centers selected manually from the decision graph may lead to a wrong number of clusters and poor performance. To address these shortcomings and improve the performance, a robust density peaks clustering algorithm for manifold data with multiple peaks (RDPCM) is proposed to reduce the sensitivity of clustering results to parameters. Motivated by DPC-GD, RDPCM replaces the Euclidean distance with geodesic distance, which is optimized by the improved mutual K-nearest neighbors. It better considers the local manifold structure of the datasets and obtains excellent results. In addition, the Davies-Bouldin Index based on Minimum Spanning Tree (MDBI) is proposed to select the ideal number of classes adaptively. Numerous experiments have established that RDPCM is more effective than other advanced clustering algorithms.
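
For orientation, the baseline DPC procedure that the abstract criticizes can be sketched as follows. This is a minimal illustration of plain DPC only (Euclidean distance, hard cutoff density), not of RDPCM: the function name `density_peaks` is hypothetical, and passing `n_clusters` directly is an assumption that stands in for reading centers off the decision graph.

```python
import numpy as np

def density_peaks(X, d_c, n_clusters):
    """Plain DPC sketch: Euclidean distance and a hard cutoff d_c (assumed inputs)."""
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density rho: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from densest to sparsest; a stable sort keeps ties deterministic.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; use its largest distance.
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]  # distance to the nearest point ranked denser
        parent[i] = j
    # Centers combine high density with a large distance to any denser point.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the globally densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Every other point inherits the label of its nearest denser neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta

# Two small, well-separated groups of four points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
labels, rho, delta = density_peaks(X, d_c=0.5, n_clusters=2)
print(labels.tolist())  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Both `rho` and the resulting labels depend on the choice of `d_c`, which is the sensitivity named in shortcoming (2); the Euclidean `dist` matrix is the part a geodesic variant would replace.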

Abstract:
With the maturity of 3D capture technology, the explosive growth of point cloud data has burdened the storage and transmission process. Traditional hybrid point cloud compression (PCC) tools relying on handcrafted priors have limited compression performance and are increasingly weak in addressing the burden induced by data growth. Recently, deep learning-based PCC methods have been introduced to continue to push the PCC performance boundary. With the thriving of deep PCC, the community urgently demands a systematic overview to summarize past progress and present future research directions. In this paper, we present a detailed review that covers popular point cloud datasets, algorithm evolution, benchmarking analysis, and future trends. Concretely, we first introduce several widely used PCC datasets according to their major properties. Then the algorithm evolution of existing studies on deep PCC, including lossy ones and lossless ones proposed for various point cloud types, is reviewed. Apart from academic studies, we also investigate the development of relevant international standards (i.e., MPEG standards and JPEG standards). To provide an in-depth understanding of the advances in deep PCC, we select a representative set of methods and conduct extensive experiments on multiple datasets. Comprehensive benchmarking comparisons and analysis reveal the pros and cons of previous methods. Finally, based on this analysis, we highlight the challenges and future trends of deep learning-based PCC, paving the way for further study.

Abstract:
Graph classification has been a prominent problem in graph machine learning fields. This problem has been investigated by leveraging message passing neural networks (MPNNs) to learn powerful graph representations. However, MPNNs extract topological semantics implicitly under label supervision, which could suffer from domain shift and label scarcity in unsupervised domain adaptation settings. In this paper, we propose an effective solution named Dual Variational Semantics Graph Mining (DREAM) for unsupervised graph domain adaptation by combining graph structural semantics from complementary perspectives. Besides a message passing branch to learn implicit semantics, our DREAM trains a path aggregation branch, which can provide explicit high-order structural semantics as a supplement. To train these two branches conjointly, we employ an expectation-maximization (EM) style variational framework for the maximization of likelihood. In the E-step, we fix the message passing branch and construct a graph-of-graph to indicate the geometric correlation between source and target domains, which would be adopted for the optimization of the other branch. In the M-step, we train the message passing branch and update the graph neural networks on the graph-of-graph with the other branch fixed. The alternative optimization improves the collaboration of knowledge from two branches. Extensive experiments on several benchmark datasets validate the superiority of the proposed DREAM compared with various baselines.

Abstract:
The transition matrix reveals the transition relationship between clean labels and noisy labels. It plays an important role in building statistically consistent classifiers for learning with noisy labels. However, in real-world applications, the transition matrix is usually unknown and has to be estimated. It is a challenging task to accurately estimate the transition matrix which usually depends on the instance. With both instances and noisy labels at hand, the major difficulty of estimating the transition matrix comes from the absence of clean label information. Recent work suggests that self-supervised learning methods can effectively infer clean label information. These methods could even achieve comparable performance with supervised learning on many benchmark datasets but without requiring any labels. Motivated by this, our paper presents a practical approach that harnesses self-supervised learning to extract clean label information, which reduces the estimation error of the instance-dependent transition matrix. By exploiting the estimated transition matrix, the performance of classifiers is improved. Empirical results on different datasets illustrate that our proposed methodology outperforms existing state-of-the-art methods in terms of both classification accuracy and transition matrix estimation.
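
As a rough illustration of how a transition matrix links clean and noisy labels, the sketch below applies the standard forward correction with NumPy. It assumes a single class-dependent matrix `T` shared by all samples, which is a simplification: in the instance-dependent setting described above, each sample would have its own matrix. The function names and the numbers in `T` are hypothetical.

```python
import numpy as np

def noisy_posterior(clean_posterior, T):
    """Map clean class posteriors to noisy-label posteriors.

    T[i, j] = P(noisy label = j | clean label = i); each row sums to 1.
    """
    return clean_posterior @ T

def forward_corrected_nll(clean_posterior, noisy_labels, T):
    """Negative log-likelihood of the observed noisy labels under T."""
    p_noisy = noisy_posterior(clean_posterior, T)
    picked = p_noisy[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(picked).mean()

# Hypothetical two-class matrix: class 0 flips to 1 with probability 0.2,
# and class 1 flips to 0 with probability 0.3.
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
clean = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
noisy = noisy_posterior(clean, T)
loss = forward_corrected_nll(clean[:2], np.array([0, 1]), T)
```

Training a classifier against this corrected loss is what makes it statistically consistent when `T` is accurate, which is why the estimation error of the transition matrix matters.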

Affiliations: School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), School of Computer Science, Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, China; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Key Laboratory of Intelligent Interaction and Applications (Ministry of Industry and Information Technology), Northwestern Polytechnical University, Xi’an, China; Institute of Artificial Intelligence (TeleAI), China Telecom Corporation Ltd., Beijing, China

Abstract:
LiDAR-based 3D object detection has recently seen significant advancements through active learning (AL), attaining satisfactory performance by training on a small fraction of strategically selected point clouds. However, in real-world deployments where streaming point clouds may include unknown or novel objects, the ability of current AL methods to capture such objects remains unexplored. This paper investigates a more practical and challenging research task: Open World Active Learning for 3D Object Detection (OWAL-3D), aimed at acquiring informative point clouds with new concepts. To tackle this challenge, we propose a simple yet effective strategy called Open Label Conciseness (OLC), which mines novel 3D objects with minimal annotation costs. Our empirical results show that OLC successfully adapts the 3D detection model to the open world scenario with just a single round of selection. Any generic AL policy can then be integrated with the proposed OLC to efficiently address the OWAL-3D problem. Based on this, we introduce the Open-CRB framework, which seamlessly integrates OLC with our preliminary AL method, CRB, designed specifically for 3D object detection. We develop a comprehensive codebase for easy reproducing and future research, supporting 15 baseline methods (i.e., active learning, out-of-distribution detection and open world detection), 2 types of modern 3D detectors (i.e., one-stage SECOND and two-stage PV-RCNN) and 3 benchmark 3D datasets (i.e., KITTI, nuScenes and Waymo). Extensive experiments evidence that the proposed Open-CRB demonstrates superiority and flexibility in recognizing both novel and known classes with very limited labeling costs, compared to state-of-the-art baselines.

Abstract:
Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, the real-world robustness is limited by animated synthetic datasets for training. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose Flow-Anything, a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks. This allows us to render optical flow and novel view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model the dynamic objects in the 3D representation. These two steps allow us to generate realistic datasets for training from large-scale single-view images, namely FA-Flow Dataset. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and supervised methods on synthetic datasets. Moreover, our models serve as a foundation model and enhance the performance of various downstream video tasks.

Abstract:
Image resampling is a basic technique that is widely employed in daily applications, such as camera photo editing. Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors. Still, these methods are not the perfect substitute for interpolation, due to the drawbacks in efficiency and versatility. In this work, we propose a novel method of Learning Resampling Function (termed LeRF), which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption of interpolation. Specifically, LeRF assigns spatially varying resampling functions to input image pixels and learns to predict the hyper-parameters that determine the shapes of these resampling functions with a neural network. Based on the formulation of LeRF, we develop a family of models, including both efficiency-orientated and performance-orientated ones. To achieve interpolation-level efficiency, we adopt look-up tables (LUTs) to accelerate the inference of the learned neural network. Furthermore, we design a directional ensemble strategy and edge-sensitive indexing patterns to better capture local structures. On the other hand, to obtain DNN-level performance, we propose an extension of LeRF to enable it in cooperation with pre-trained upsampling models for cascaded resampling. Extensive experiments show that the efficiency-orientated version of LeRF runs as fast as interpolation, generalizes well to arbitrary transformations, and outperforms interpolation significantly, e.g., up to 3 dB PSNR gain over Bicubic for ×2 upsampling on Manga109. Besides, the performance-orientated version of LeRF reaches comparable performance with existing DNNs at much higher efficiency, e.g., less than 25% running time on a desktop GPU.
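
The idea of spatially varying resampling functions can be illustrated in one dimension. In the sketch below each input sample carries its own kernel width, and an output value is a normalized weighted average of nearby inputs. Everything here is an assumption made for illustration: the Gaussian kernel shape, the function name `resample_1d`, the `radius` parameter, and the hand-set widths in `sigmas`, which in a learned method would be predicted by a network.

```python
import numpy as np

def resample_1d(signal, sigmas, out_positions, radius=2):
    """Resample a 1D signal with one Gaussian kernel width per input sample.

    sigmas[i] controls how far input sample i spreads its influence
    (hypothetical per-sample hyper-parameter, set by hand here).
    out_positions must lie in [0, len(signal) - 1].
    """
    signal = np.asarray(signal, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    out = np.empty(len(out_positions))
    for k, x in enumerate(out_positions):
        base = int(np.floor(x))
        lo = max(base - radius + 1, 0)
        hi = min(base + radius, len(signal) - 1)
        idx = np.arange(lo, hi + 1)  # input samples in the support window
        w = np.exp(-((x - idx) ** 2) / (2.0 * sigmas[idx] ** 2))
        out[k] = np.sum(w * signal[idx]) / np.sum(w)  # normalized weights
    return out

signal = np.array([0.0, 1.0, 4.0, 9.0])
sigmas = np.array([0.5, 1.0, 0.7, 0.5])
positions = np.linspace(0.0, 3.0, 7)  # ×2 upsampling grid: 0, 0.5, ..., 3
upsampled = resample_1d(signal, sigmas, positions)
```

Because the weights are normalized, a constant signal is reproduced exactly regardless of the widths, and very narrow kernels reduce to nearest-sample lookup at integer positions.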

Abstract:
Spatiotemporal (ST) learning has become a crucial technique to enable smart cities and sustainable urban development. Current ST learning models capture the heterogeneity via various spatial convolution and temporal evolution blocks. However, rapid urbanization leads to fluctuating distributions in urban data and city structures, resulting in existing methods suffering generalization and data adaptation issues. Despite efforts, existing methods fail to deal with newly arrived observations, and the limitation of those methods with generalization capacity lies in the repeated training that leads to inconvenience, inefficiency and resource waste. Motivated by complementary learning in neuroscience, we introduce a prompt-based complementary spatiotemporal learning termed ComS2T, to empower the evolution of models for data adaptation. We first disentangle the neural architecture into two disjoint structures, a stable neocortex for consolidating historical memory, and a dynamic hippocampus for new knowledge update. Then we train the dynamic spatial and temporal prompts by characterizing distribution of main observations to enable prompts adaptive to new data. This data-adaptive prompt mechanism, combined with a two-stage training process, facilitates fine-tuning of the neural architecture conditioned on prompts, thereby enabling efficient adaptation during testing. Extensive experiments validate the efficacy of ComS2T in adapting various spatiotemporal out-of-distribution scenarios while maintaining effective inferences.

Abstract:
Spatiotemporal systems are ubiquitous in a large number of scientific areas, representing underlying knowledge and patterns in the data. Here, a fundamental question usually arises as how to understand and characterize these spatiotemporal systems with a certain data-driven machine learning framework. In this work, we introduce an unsupervised pattern discovery framework, namely, dynamic autoregressive tensor factorization. Our framework is essentially built on the fact that the spatiotemporal systems can be well described by the time-varying autoregression on multivariate or even multidimensional data. In the modeling process, tensor factorization is seamlessly integrated into the time-varying autoregression for discovering spatial and temporal modes/patterns from the spatiotemporal systems in which the spatial factor matrix is assumed to be orthogonal. To evaluate the framework, we apply it to several real-world spatiotemporal datasets, including fluid flow dynamics, international import/export merchandise trade, and urban human mobility. On the international trade dataset with dimensions country/region, product type, year, our framework can produce interpretable import/export patterns of countries/regions, while the low-dimensional product patterns are also important for classifying import/export merchandise and understanding systematical differences between import and export. On the ridesharing mobility dataset with dimensions origin, destination, time, our framework is helpful for identifying the shift of spatial patterns of urban human mobility that changed between 2019 and 2022. Empirical experiments demonstrate that our framework can discover interpretable and meaningful patterns from the spatiotemporal systems that are both time-varying and multidimensional.

Affiliations: Great Bay University, Dongguan, China; Harbin Institute of Technology (Shenzhen), Shenzhen, China; The Hong Kong Polytechnic University, Kowloon, Hong Kong; Guangdong Key Lab of Information Security, School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, China; School of Electrical and Information Engineering, Tianjin University, Tianjin, China; University of Oxford, Oxford, U.K.; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China

Abstract:
Multimodal Large Language Models (MLLMs) have gained significant attention due to their rich internal implicit knowledge for cross-modal learning. Although advances in bringing audio-visuals into LLMs have resulted in boosts for a variety of Audio-Visual Question Answering (AVQA) tasks, they still face two crucial challenges: 1) audio-visual ambiguity, and 2) audio-visual hallucination. Existing MLLMs can respond to audio-visual content, yet sometimes fail to describe specific objects due to the ambiguity or hallucination of responses. To overcome the two aforementioned issues, we introduce the CAT+, which enhances MLLM to ensure more robust multimodal understanding. We first propose the Sequential Question-guided Module (SQM), which combines tiny transformer layers and cascades Q-Formers to realize a solid audio-visual grounding. After feature alignment and high-quality instruction tuning, we introduce Ambiguity Scoring Direct Preference Optimization (AS-DPO) to correct the problem of CAT+ bias toward ambiguous descriptions. To explore the hallucinatory deficits of MLLMs in dynamic audio-visual scenes, we build a new Audio-visual Hallucination Benchmark, named AVHbench. This benchmark detects the extent of MLLM’s hallucinations across three different protocols in the perceptual object, counting, and holistic description tasks. Extensive experiments across video-based understanding, open-ended, and close-ended AVQA demonstrate the superior performance of our method. The AVHbench is released at https://github.com/rikeilong/Bay-CAT.

Abstract:
The significance of depth estimation has spurred recent endeavors to enhance it through Multi-Sensor Fusion (MSF). However, prevailing MSF methods exhibit limitations concerning accuracy and resilience when confronted with sensor degradations. While certain forms of degradation, such as suboptimal lighting and adverse weather conditions, can be mitigated by collecting pertinent data in data-driven learning, this approach proves ineffective for Out-of-Distribution (OOD) sensor degradations. In this paper, we propose a novel approach termed Combinable and Separable Multi-Sensor Fusion (CSMSF) designed to bolster depth estimation robustness against multiple sensor degradations. CSMSF hinges on four core principles: i) improved performance is achieved with an increased number of valid sensors, ii) a single valid sensor can independently enable its own depth estimation, iii) maintaining a judicious equilibrium between accuracy and model complexity, and iv) autonomous diagnosis of sensor observation failure. Leveraging these advantages, CSMSF identifies and rejects degraded sensors, allowing autonomous selection of valid sensors for scene depth estimation. The experimental results demonstrate the superior robustness of the proposed CSMSF, underscoring its efficacy in addressing challenges associated with sensor degradations across diverse environmental conditions.

Affiliations: School of Electronics and Control Engineering, Chang’an University, Xi’an, China; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, China; School of Mechanical Engineering, Sichuan University, Chengdu, China; Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China; College of Communication Engineering, JiLin University, Changchun, China

Abstract:
Deep neural networks (DNNs) are potent in LiDAR-based 3D object detection (LiDAR-3DOD), yet their deployment remains daunting due to their cumbersome parameters and computations. Knowledge distillation (KD) is promising for compressing DNNs in LiDAR-3DOD. However, most existing KD methods transfer inadequate knowledge between homogeneous detectors, and do not thoroughly explore optimal student architectures, resulting in insufficient gains for compact student detectors. To this end, we propose a category knowledge-driven compression framework to achieve efficient LiDAR-based 3D detectors. Firstly, we distill knowledge from two-stage teacher detectors to one-stage student detectors, overcoming the limitations of homogeneous pairs. To conduct KD in these heterogeneous pairs, we explore the gap between heterogeneous detectors, and introduce category knowledge-driven KD (CaKD), which includes both student-oriented distillation and two-stage-oriented label assignment distillation. Secondly, to search for the optimal architecture of compact student detectors, we introduce a masked category knowledge-driven structured pruning scheme. This scheme evaluates filter importance by analyzing the changes in category predictions related to foreground regions before and after filter removal, and prunes the less important filters accordingly. Finally, we propose a modified IoU-aware redundancy elimination module to remove redundant false positive samples, thereby further improving the accuracy of detectors. Experiments on various point cloud datasets demonstrate that our method delivers impressive results. For example, on KITTI, several compressed one-stage detectors outperform two-stage detectors in both efficiency and accuracy. Besides, on WOD-mini, our framework reduces the memory footprint of CenterPoint by 5.2× and improves the L2 mAPH by 0.55%.

Affiliations: Department of Electrical and Electronic Engineering, University of Hong Kong, Hong Kong, SAR, China; College of Computing and Data Science, Nanyang Technological University, Singapore; School of Automation, Guangdong University of Technology, Guangzhou, China; School of Electronics, Electrical Engineering and Computer Science (EEECS), Queen’s University Belfast, Belfast, U.K.; School of Science and Engineering (SSE) and the Future Network of Intelligence Institute (FNii), Chinese University of Hong Kong (Shenzhen), Shenzhen, China; Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada; Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea

Abstract:
Distributed Artificial Intelligence-Generated Content (AIGC) has attracted significant attention, but two key challenges remain: maximizing subjective Quality of Experience (QoE) and improving energy efficiency, which are particularly pronounced in widely adopted Generative Diffusion Model (GDM)-based image generation services. In this paper, we propose a novel user-centric Interactive AI (IAI) approach for service management, with a distributed GDM-based AIGC framework that emphasizes efficient and cooperative deployment. The proposed method restructures the GDM inference process by allowing users with semantically similar prompts to share parts of the denoising chain. Furthermore, to maximize the users’ subjective QoE, we propose an IAI approach, i.e., Reinforcement Learning With Large Language Models Interaction (RLLI), which utilizes Large Language Model (LLM)-empowered generative agents to replicate user interactions, providing real-time and subjective QoE feedback aligned with diverse user personalities. Lastly, we present the GDM-based Deep Deterministic Policy Gradient (G-DDPG) algorithm, adapted to the proposed RLLI framework, to allocate communication and computing resources effectively while accounting for subjective user traits and dynamic wireless conditions. Simulation results demonstrate that G-DDPG improves total QoE by 15% compared with the standard DDPG algorithm.

Abstract:
This paper proposes a theoretical framework to evaluate and compare the performance of stochastic gradient algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work discovers three interesting results. First, it shows that decentralized learning strategies are able to escape faster from local minima and favor convergence toward flatter minima relative to the centralized solution. Second, in decentralized methods, the consensus strategy has a worse excess-risk performance than diffusion, giving it a better chance of escaping from local minima and favoring flatter minima. Third, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimum but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. In this regard, since diffusion has a lower excess-risk than consensus, when both algorithms are trained starting from random initial points, diffusion enhances the classification accuracy. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies deliver in general enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance compared to the centralized solution.
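The abstract does not state the update recursions, so the sketch below uses the standard textbook forms of the two decentralized strategies it compares: consensus, and diffusion in its adapt-then-combine form. Everything else is an assumption for illustration: scalar parameters, exact gradients instead of stochastic ones, a toy convex quadratic risk per agent, and the names `consensus_step`, `diffusion_step`, and `run`. It illustrates only the structural difference between the two updates, not the paper's nonconvex analysis.

```python
from typing import Callable, List

Gradient = Callable[[float], float]
Matrix = List[List[float]]


def consensus_step(w: List[float], grads: List[Gradient], A: Matrix, mu: float) -> List[float]:
    """Consensus: combine neighbors' previous iterates, then subtract the
    gradient evaluated at the agent's own previous iterate."""
    n = len(w)
    return [sum(A[l][k] * w[l] for l in range(n)) - mu * grads[k](w[k]) for k in range(n)]


def diffusion_step(w: List[float], grads: List[Gradient], A: Matrix, mu: float) -> List[float]:
    """Diffusion (adapt-then-combine): take a local gradient step first,
    then combine the neighbors' intermediate iterates."""
    n = len(w)
    psi = [w[k] - mu * grads[k](w[k]) for k in range(n)]
    return [sum(A[l][k] * psi[l] for l in range(n)) for k in range(n)]


def run(step, targets: List[float], A: Matrix, mu: float = 0.1, iters: int = 500) -> List[float]:
    """Iterate `step` on toy risks J_k(w) = 0.5 * (w - t_k)^2 (hypothetical example)."""
    grads = [lambda w, t=t: w - t for t in targets]
    w = [0.0] * len(targets)
    for _ in range(iters):
        w = step(w, grads, A, mu)
    return w
```

In this toy setting with a fully averaging combination matrix, the diffusion agents all settle at the mean of the targets, whereas the consensus agents keep a residual spread around that mean. This is a small-scale view of the difference in steady-state optimization error between the two strategies.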

Abstract:
Convolutional Neural Networks (CNNs) have shown significant success in the low-light image enhancement task. However, most existing works struggle to balance quality and efficiency. This limitation hinders practical applicability in real-world scenarios and downstream vision tasks. To overcome these obstacles, we propose a Self-Calibrated Illumination (SCI) learning scheme, introducing a new perspective to boost the model’s capability. Based on a weight-sharing illumination estimation process, we construct an embedded self-calibrator to accelerate stage-level convergence, yielding gains that utilize only a single basic block for inference, which drastically diminishes computation cost. Additionally, by introducing the additivity condition on the basic block, we acquire a reinforced version dubbed SCI++, which disentangles the relationship between the self-calibrator and illumination estimator, providing a more interpretable and effective learning paradigm with faster convergence and better stability. We assess the proposed enhancers on standard benchmarks and in-the-wild datasets, confirming that they can restore clean images from diverse scenes with higher quality and efficiency. The verification on different levels of low-light vision tasks shows the applicability of our approach compared with other methods.

Abstract:
The differential equation-based image restoration approach aims to establish learnable trajectories connecting high-quality images to a tractable distribution, e.g., low-quality images or a Gaussian distribution. In this paper, we reformulate the trajectory optimization of this kind of method, focusing on enhancing both reconstruction quality and efficiency. Initially, we navigate effective restoration paths through a reinforcement learning process, gradually steering potential trajectories toward the most precise options. Additionally, to mitigate the considerable computational burden associated with iterative sampling, we propose cost-aware trajectory distillation to streamline complex paths into several manageable steps with adaptable sizes. Moreover, we fine-tune a foundational diffusion model (FLUX) with 12B parameters by using our algorithms, producing a unified framework for handling 7 kinds of image restoration tasks. Extensive experiments showcase the significant superiority of the proposed method, achieving a maximum PSNR improvement of 2.1 dB over state-of-the-art methods, while also greatly enhancing visual perceptual quality.

Abstract:
Unlike general visual classification (CLS) tasks, certain CLS problems are significantly more challenging as they involve recognizing professionally categorized or highly specialized images. Fine-Grained Visual Classification (FGVC) has emerged as a broad solution to address this complexity. However, most existing methods have been predominantly evaluated on a limited set of homogeneous benchmarks, such as bird species or vehicle brands. Moreover, these approaches often train separate models for each specific task, which restricts their generalizability. This paper proposes a scalable and explainable foundational model designed to tackle a wide range of FGVC tasks from a unified and generalizable perspective. We introduce a novel architecture named Pro-NeXt and reveal that Pro-NeXt exhibits substantial generalizability across diverse professional fields such as fashion, medicine, and art areas, previously considered disparate. Our basic-sized Pro-NeXt-B surpasses all preceding task-specific models across 12 distinct datasets within 5 diverse domains. Furthermore, we find that Pro-NeXt scales well: scaling it up in depth and width with increasing GFlops consistently enhances its accuracy. Beyond scalability and adaptability, the intermediate features of Pro-NeXt achieve reliable object detection and segmentation performance without extra training, highlighting its solid explainability. We will release the code to promote further research in this area.

Abstract:
As one of the most popular and sought-after generative models in recent years, diffusion models have sparked the interests of many researchers and steadily shown excellent advantage in various generative tasks such as image synthesis, video generation, bioinformatics engineering, 3D scene rendering and multimodal generation, relying on their dense theoretical principles and reliable application practices. The remarkable success of these recent efforts on diffusion models comes largely from progressive design principles and efficient architecture, training, inference, and deployment methodologies. However, there has not been a comprehensive and in-depth review to summarize these principles and practices to help the rapid understanding and application of diffusion models. In this survey, we provide a new efficiency-oriented perspective on these existing efforts, which mainly focuses on the profound principles and efficient practices in architecture designs, model training, fast inference and reliable deployment, to guide further theoretical research, algorithm migration and model application for new scenarios in a reader-friendly way.

Abstract:
Recent learning-based multi-view stereo (MVS) still exhibits insufficient accuracy in large occlusion cases, such as environments with significant inter-camera distance or when capturing objects with complex shapes. This is because incorrect image features extracted from occluded areas serve as significant noise in the cost volume construction. To address this, we propose a visibility-aware MVS using surface normal weighting (SnowMVSNet) based on explicit 3D geometry. It selectively suppresses mismatched features in the cost volume construction by computing inter-view visibility. Additionally, we present a geometry-guided cost volume regularization that enhances true depth among depth hypotheses using a surface normal prior. We also propose intra-view visibility that distinguishes geometrically more visible pixels within a reference view. Using intra-view visibility, we introduce the visibility-weighted training and depth estimation methods. These methods enable the network to achieve accurate 3D point cloud reconstruction by focusing on visible regions. Based on simple inter-view and intra-view visibility computations, SnowMVSNet accomplishes substantial performance improvements relative to computational complexity, particularly in terms of occlusion robustness. To evaluate occlusion robustness, we constructed a multi-view human (MVHuman) dataset containing general human body shapes prone to self-occlusion. Extensive experiments demonstrated that SnowMVSNet significantly outperformed state-of-the-art methods in both low- and high-occlusion scenarios.

Abstract:
In this article, we present two fast and interpretable decomposition methods for 2D homography, which are named Similarity-Kernel-Similarity (SKS) and Affine-Core-Affine (ACA) transformations respectively. Under the minimal 4-point configuration, two similarity transformations in SKS are computed by two anchor points on source and target planes, respectively. Then, the other two point correspondences can be exploited to compute the middle kernel transformation with only four parameters. Furthermore, ACA uses three anchor points to compute the source and the target affine transformations, followed by computation of the middle core transformation utilizing the other one point correspondence. ACA can compute a homography up to a scale with only 85 floating-point operations (FLOPs), without even any division operations. Therefore, as a plug-in module, ACA facilitates various traditional feature-based Random Sample Consensus (RANSAC) pipelines, as well as deep homography pipelines estimating 4-point offsets. In addition to the advantages of geometric parameterization and computational efficiency, SKS and ACA can express each element of homography by a polynomial of input coordinates (7th degree to 9th degree), extend the existing essential Similarity-Affine-Projective (SAP) decomposition and calculate 2D affine transformations in a unified way.

Abstract:
Face Anti-Spoofing (FAS) is essential for securing face recognition systems against presentation attacks. Recent advances in sensor technology and multimodal learning have enabled the development of multimodal FAS systems. However, existing methods often struggle to generalize to unseen attacks and diverse environments due to two key challenges: (1) Modality unreliability, where sensors such as depth and infrared suffer from severe domain shifts, impairing the reliability of cross-modal fusion; and (2) Modality imbalance, where over-reliance on a dominant modality weakens the model’s robustness against attacks that affect other modalities. To overcome these issues, we propose MMDG++, a multimodal domain-generalized FAS framework built upon the vision-language model CLIP. In MMDG++, we design the Uncertainty-Guided Cross-Adapter++ (U-Adapter++) to filter out unreliable regions within each modality, enabling more reliable multimodal interactions. Additionally, we introduce Rebalanced Modality Gradient Modulation (ReGrad) for adaptive gradient modulation to balance modality convergence. To further enhance generalization, we propose Asymmetric Domain Prompts (ADPs) that leverage CLIP’s language priors to learn generalized decision boundaries across modalities. We also develop a novel multimodal FAS benchmark to evaluate generalizability under various deployment conditions. Extensive experiments across this benchmark show our method outperforms state-of-the-art FAS methods, demonstrating superior generalization capability.

Abstract:
Recent studies on learning-based sound source localization have primarily focused on localization performance. However, prior work and existing benchmarks often overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. This interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or true sound sources among multiple objects. In this work, we comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. We identify the overlooked points of previous studies and make several contributions to address them. First, we propose a learning framework that incorporates retrieval-based and hand-crafted augmentation techniques, enhancing cross-modal interaction through cross-modal alignment. Second, we introduce new evaluation metrics to accurately and rigorously assess localization methods, focusing on both localization performance and cross-modal interaction. Third, to thoroughly analyze interactive sound source localization, we present a new semi-synthetic benchmark with diverse categorical combinations. Finally, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks, benchmarking competing methods alongside our own. Our new benchmark and evaluation metrics reveal that previous methods struggle with interactive sound source localization tasks, largely due to their limited cross-modal interaction capabilities. Our method, which features enhanced cross-modal alignment, demonstrates superior sound source localization and cross-modal interaction performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using both new and standard evaluation metrics.

Abstract:
Neural Radiance Fields (NeRF), initially developed for static scenes, have inspired many video novel view synthesis techniques. However, the challenge for video view synthesis arises from motion blur, a consequence of object or camera movements during exposure, which hinders the precise synthesis of sharp spatio-temporal views. In response, we propose a novel motion deblurring NeRF framework for blurry monocular video, called MoBluRF, consisting of a Base Ray Initialization (BRI) stage and a Motion Decomposition-based Deblurring (MDD) stage. In the BRI stage, we coarsely reconstruct dynamic 3D scenes and jointly initialize the base rays which are further used to predict latent sharp rays, using the inaccurate camera pose information from the given blurry frames. In the MDD stage, we introduce a novel Incremental Latent Sharp-rays Prediction (ILSP) approach for the blurry monocular video frames by decomposing the latent sharp rays into global camera motion and local object motion components. We further propose two loss functions for effective geometry regularization and decomposition of static and dynamic scene components without any mask supervision. Experiments show that MoBluRF outperforms qualitatively and quantitatively the recent state-of-the-art methods with large margins.

Abstract:
Lifelong person re-identification (LReID) aims to learn from streaming data sources step by step, which suffers from the catastrophic forgetting problem. In this paper, we investigate the exemplar-free LReID setting where no previous exemplar is available during the new step training. Existing exemplar-free LReID methods primarily adopt knowledge distillation to transfer knowledge from an old model to a new one without selection, inevitably introducing erroneous and detrimental information that hinders new knowledge learning. Furthermore, not all critical knowledge can be transferred due to the absence of old data, leading to the permanent loss of undistilled knowledge. To address these limitations, we propose a novel exemplar-free LReID method named Long Short-Term Knowledge Decomposition and Consolidation (LSTKC++). Specifically, an old knowledge rectification mechanism is developed to rectify the old model predictions based on new data annotations, ensuring correct knowledge transfer. Besides, a long-term knowledge consolidation strategy is designed, which first estimates the degree of old knowledge forgetting by leveraging the output difference between the old and new models. Then, a knowledge-guided parameter fusion strategy is developed to balance new and old knowledge, improving long-term knowledge retention. Building on these designs, and considering that LReID models tend to be biased toward the most recently seen domains, the fusion weights generated by this process often lead to sub-optimal knowledge balancing. To settle this, we further propose to decompose a single old model into two parts: a long-term old model containing multi-domain knowledge and a short-term model focusing on the latest short-term old knowledge. Then, the incoming new data are explored as an unbiased reference to adjust the old models’ fusion weight to achieve backward optimization. Furthermore, an extended complementary knowledge rectification mechanism is developed to mine and retain the correct knowledge in the decomposed models. Extensive experimental results demonstrate that LSTKC++ significantly outperforms state-of-the-art methods by large margins.

Abstract:
Realizing unified 3D object detection, including both indoor and outdoor scenes, holds great importance in applications like robot navigation. However, involving various scenarios of data to train models poses challenges due to their significantly distinct characteristics, e.g., diverse geometry properties and heterogeneous domain distributions. In this work, we propose to address the challenges from two perspectives, the algorithm perspective and data perspective. In terms of the algorithm perspective, we first build a monocular 3D object detector based on the bird’s-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity. In this detector, we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by geometry difference between scenarios. Besides, we develop a sparse BEV feature projection strategy to reduce the computational cost and a unified domain alignment method to handle heterogeneous domains. From the data perspective, we propose to incorporate depth information to improve training robustness. Specifically, we build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version, which is the first unified multi-modal 3D object detector. We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively. The experimental results reveal several insightful findings highlighting the benefits of multi-modal data and confirm the effectiveness of all the proposed strategies.

Abstract:
Graph Neural Network (GNN) is a popular semi-supervised graph representation learning method, whose performance strongly relies on the quality and quantity of labeled nodes. Given the insufficiency of labeled nodes in many real applications, many multi-channel GNNs have been developed to extract self-supervised information by leveraging consistency and complementarity among augmented graphs from different channels. However, these methods often struggle to balance conflicting self-supervised constraints, enhancing certain types of information at the expense of others. To tackle this problem, we propose a Multi-channel Disentangled Graph Neural Network (MD-GraphNet), which effectively classifies self-supervised constraints by learning disentangled representations. Specifically, our model enforces consistency constraints for shared representations, graph reconstruction constraints for complementary (or private) representations, and aligning constraints for fused representations. Our model overcomes the confusion and loss problems of different types of self-supervised signals. Experimental results on benchmark datasets demonstrate the effectiveness of MD-GraphNet for semi-supervised node classification.

Abstract:
Existing fusion methods empirically design elaborate fusion losses to retain the specific features from source images. Since image fusion has no ground truth, the hand-crafted losses may not make the fused images cover all the vital features, which in turn affects the performance of high-level tasks. Here, there are two main challenges: domain discrepancy among source images and semantic mismatch at different-level tasks. This paper proposes an infrared and visible image fusion method via cross reconstruction learning, which does not use any hand-crafted fusion losses, but prompts the network to adaptively fuse complementary information of source images. Firstly, we design a cross reconstruction learning model that decouples the fusion features to reconstruct another-modality source image. Thus, the fusion network is forced to learn the domain-adaptive representations of two modal features, which enables their domain alignment in a latent space. Secondly, we propose a dynamic interactive fusion strategy that builds a correlation matrix between fusion features and object semantic features to overcome the semantic mismatch. Further, we enhance the strong correlation features and suppress the weak correlation features to improve the interactive ability. Extensive experiments on three datasets demonstrate superior fusion performance compared to the state-of-the-art methods, while also improving segmentation accuracy.

Abstract:
Large-scale graph data poses a training scalability challenge, which is generally treated by employing batch sampling methods to divide the graph into smaller subgraphs and train them in batches. However, such an approach introduces a topological bias in the local batches compared with the complete graph structure, missing either node features or edges. This topological bias is empirically shown to affect the generalization capabilities of graph neural networks (GNNs). To address this issue, we propose adaptive subgraph contrastive learning (AdaGCL), which bridges the gap between large-scale batch sampling and the poor generalization it causes. Specifically, AdaGCL augments graphs depending on the sampled batches and leverages a subgraph-granularity contrastive loss to learn node embeddings that are invariant among the augmented imperfect graphs. To optimize the augmentation strategy for each downstream application, we introduce a node-centric information bottleneck (Node-IB) to control the trade-off between similarity and diversity of the original and augmented graphs. This enhanced version of AdaGCL, referred to as AdaGCL+, automates the graph augmentation process by dynamically adjusting graph perturbation parameters (e.g., edge dropping rate) to minimize the downstream loss. Extensive experimental results showcase the scalability of AdaGCL+ to graphs with millions of nodes using batch sampling methods. AdaGCL+ consistently outperforms existing methods on numerous benchmark datasets in terms of node classification accuracy and runtime efficiency.

Abstract:
3D Single Object Tracking (SOT) plays an important role in real-world visual applications such as autonomous driving and planning. How to realize effective 3D SOT is still a valuable challenge due to its carrier-sparse point clouds and its role-complex influencing factors. Inspired by the remote modeling of popular transformers, we propose a Versatile Point Tracking Transformer (VPTT) method for 3D SOT, with object guidance from the template point cloud to the search area point cloud under the siamese-based tracking paradigm. Specifically, VPTT employs self- and cross-attention mechanisms and extends four matching operations, thereby leveraging the contextual information of consecutive frames to improve the tracking results. We construct a deep network, VerFormer, consisting of four successive transformer layers that perform matching operations involving fusional transformation, separative discrimination, intersectional interaction, and unidirectional propagation from shallow to deep. Considering that the tracking task involves multiple processes, VPTT further learns how to forecast intermediate outputs including mask probability, trailing distance, and heading angle at each stage. Such a specialized design allows our VPTT to revisit the end-to-end training paradigm used for 3D tracking while developing a versatile transformer that is a perfect fit for the 3D SOT task. Experiments on three benchmarks, KITTI, nuScenes, and Waymo, show that VPTT achieves state-of-the-art tracking performance among siamese-based trackers while running at ~62 FPS.

Abstract:
The computational approach to imaging around the corner, or non-line-of-sight (NLOS) imaging, is becoming a reality thanks to major advances in imaging hardware and reconstruction algorithms. In a recent development towards practical NLOS imaging, Nam et al. (2021) demonstrated a high-speed non-confocal imaging system that operates at 5 Hz, 100x faster than the prior art. This enormous gain in acquisition rate, however, necessitates numerous approximations in light transport, breaking many existing NLOS reconstruction methods that assume an idealized image formation model. To bridge the gap, we present a novel deep model that incorporates the complementary physics priors of wave propagation and volume rendering into a neural network for high-quality and robust NLOS reconstruction. This orchestrated design regularizes the solution space by relaxing the image formation model, resulting in a deep model that generalizes well on real captures despite being exclusively trained on synthetic data. Further, we devise a unified learning framework that enables our model to be flexibly trained using diverse supervision signals, including target intensity images or even raw NLOS transient measurements. Once trained, our model renders both intensity and depth images at inference time in a single forward pass, and is capable of processing more than 5 captures per second on a high-end GPU. Through extensive qualitative and quantitative experiments, we show that our method outperforms prior physics-based and learning-based approaches on both synthetic and real measurements. We anticipate that our method along with the fast capturing system will accelerate future development of NLOS imaging for real-world applications that require high-speed imaging.

Abstract:
Open set recognition (OSR) requires models to classify known samples while detecting unknown samples for real-world applications. Existing studies show impressive progress using unknown samples from auxiliary datasets to regularize OSR models, but they have proved to be sensitive to the selection of such known outliers. In this paper, we discuss the aforementioned problem from a new perspective: Can we regularize OSR models without elaborately selecting auxiliary known outliers? We first empirically and theoretically explore the role of foregrounds and backgrounds in open set recognition and disclose that: 1) backgrounds that correlate with foregrounds would mislead the model and cause failures when it encounters ‘partially’ known images; 2) backgrounds unrelated to foregrounds can serve as auxiliary known outliers and provide regularization via global average pooling. Based on the above insights, we propose a new method, Background Mix (BackMix), that mixes the foreground of an image with different backgrounds to remove the underlying fore-background priors. Specifically, BackMix first estimates the foreground with class activation maps (CAMs), then randomly replaces image patches with backgrounds from other images to obtain mixed images for training. With backgrounds de-correlated from foregrounds, the open set recognition performance is significantly improved. The proposed method is quite simple to implement, requires no extra operation for inference, and can be seamlessly integrated into almost all of the existing frameworks.
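The patch replacement step described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `backmix`, the patch size, the CAM threshold, the replacement probability `p`, and the rule that a patch counts as background only when its maximum CAM value is below the threshold are all assumptions made for illustration.

```python
import numpy as np

def backmix(image, cam, donor, donor_cam, patch=4, thresh=0.5, p=0.5, rng=None):
    """Replace background patches of `image` with background patches of `donor`.

    image, donor: arrays of shape (H, W, C).
    cam, donor_cam: class activation maps of shape (H, W), scaled to [0, 1].
    All parameter names and defaults are hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    mixed = image.copy()
    h, w = cam.shape
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            rows = slice(top, top + patch)
            cols = slice(left, left + patch)
            # A patch is treated as background when no pixel has high activation.
            is_bg = cam[rows, cols].max() < thresh
            donor_is_bg = donor_cam[rows, cols].max() < thresh
            if is_bg and donor_is_bg and rng.random() < p:
                mixed[rows, cols] = donor[rows, cols]
    return mixed
```

Foreground patches (high CAM response) are never overwritten, so the label of the mixed image stays that of the original image under this sketch.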

Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong; School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia; State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi’an, China; State Key Laboratory of High Performance Computing, School of Computer, National University of Defense Technology, Changsha, China; Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Camperdown, NSW, Australia

Abstract:
By abusing access to a well-trained classifier, model inversion (MI) attacks pose a significant threat as they can recover the original training data, leading to privacy leakage. Previous studies mitigated MI attacks by imposing regularization to reduce the dependency between input features and outputs during classifier training, a strategy known as unilateral dependency optimization. However, this strategy contradicts the objective of minimizing the supervised classification loss, which inherently seeks to maximize the dependency between input features and outputs. Consequently, there is a trade-off between improving the model’s robustness against MI attacks and maintaining its classification performance. To address this issue, we propose the bilateral dependency optimization strategy (BiDO), a dual-objective approach that minimizes the dependency between input features and latent representations, while simultaneously maximizing the dependency between latent representations and labels. BiDO is remarkable for its privacy-preserving capabilities. However, models trained with BiDO exhibit diminished capabilities in out-of-distribution (OOD) detection compared to models trained with standard classification supervision. Given the open-world nature of deep learning systems, this limitation could lead to significant security risks, as encountering OOD inputs—whose label spaces do not overlap with the in-distribution (ID) data used during training—is inevitable. To address this, we leverage readily available auxiliary OOD data to enhance the OOD detection performance of models trained with BiDO. This leads to the introduction of an upgraded framework, unknown-aware BiDO (BiDO+), which mitigates both privacy and security concerns. As a highlight, with comparable model utility, BiDO-HSIC+ reduces the FPR95 by 55.02% and enhances the AUCROC by 9.52% compared to BiDO-HSIC, while also providing superior MI robustness.
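The name BiDO-HSIC suggests that the dependency terms are measured with the Hilbert-Schmidt Independence Criterion. The sketch below shows the standard biased empirical HSIC estimator with Gaussian kernels and a hypothetical bilateral regularizer built from it. The weights `lam_x` and `lam_y`, the kernel bandwidth, and the function `bido_regularizer` are assumptions for illustration, not the authors' settings.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Gaussian kernel matrix between the rows of x (shape (n, d))."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    K = gaussian_kernel(x, sigma)
    L = gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

def bido_regularizer(inputs, latents, labels_onehot, lam_x=0.1, lam_y=1.0):
    """Hypothetical bilateral term: penalize input-latent dependency,
    reward latent-label dependency. Added to the classification loss."""
    return lam_x * hsic(inputs, latents) - lam_y * hsic(latents, labels_onehot)
```

Minimizing this term pushes latent representations to carry less information about the inputs while keeping them informative about the labels, which is the trade-off the abstract describes.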

Abstract:
Residual networks have shown great success and become indispensable in recent deep neural network models. In this work, we aim to re-investigate the training process of residual networks from a novel perspective of loafing, and further propose a new training scheme as well as three improved strategies for boosting residual networks beyond their performance limits. Previous research has suggested that residual networks can be considered as ensembles of shallow networks, which implies that the final performance of a residual network is influenced by a group of subnetworks. Furthermore, we identify a previously overlooked problem, where subnetworks within a residual network are prone to exert less effort when working as part of a group compared to working alone. We define this problem as network loafing. Since network loafing may inevitably cause the sub-par performance of the residual network, we propose a novel training scheme called stimulative training, which randomly samples a residual subnetwork and calculates the KL divergence loss between the sampled subnetwork and the given residual network for extra supervision. In order to unleash the potential of stimulative training, we further propose three simple-yet-effective strategies, including a novel KL- loss that only aligns the network logits direction, random smaller inputs for subnetworks, and inter-stage sampling rules. Comprehensive experiments and analysis verify the effectiveness of stimulative training as well as its three improved strategies. For example, the proposed method can boost the performance of ResNet50 on ImageNet to 80.5% Top1 accuracy without using any extra data, model, trick, or changing the structure. With only uniform augment, the performance can be further improved to 81.0% Top1 accuracy, better than the best training recipes provided by Timm library and PyTorch official version. 
We also verify its superiority on various typical models, datasets, and tasks and give some theoretical analysis. As such, we advocate utilizing the proposed method as a general and next-generation technology to train residual networks.
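The core of the training scheme described above, sampling a residual subnetwork and adding a KL divergence loss against the full network, can be sketched with a toy residual model in NumPy. The toy blocks, the 0.5 keep probability, and the choice of the full network as the reference distribution are assumptions for illustration; the paper's sampling rules and its KL- loss variant are not reproduced here.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_divergence(main_logits, sub_logits):
    """Mean KL(p_main || p_sub) over the batch."""
    log_p = log_softmax(main_logits)
    log_q = log_softmax(sub_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

def forward(x, blocks, head, keep):
    """Toy residual network: a skipped block reduces to the identity shortcut."""
    h = x
    for w, k in zip(blocks, keep):
        if k:
            h = h + np.tanh(h @ w)
    return h @ head

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
blocks = [0.1 * rng.standard_normal((16, 16)) for _ in range(4)]
head = rng.standard_normal((16, 10))

full = np.ones(4, dtype=bool)
keep = rng.random(4) < 0.5          # randomly sampled subnetwork
main_logits = forward(x, blocks, head, full)
sub_logits = forward(x, blocks, head, keep)
extra_loss = kl_divergence(main_logits, sub_logits)  # extra supervision term
```

In actual training this extra term would be added to the usual classification loss and backpropagated through the sampled subnetwork.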

Abstract:
This paper proposes an efficient HOT algorithm for solving the optimal transport (OT) problems with finite supports. We particularly focus on an efficient implementation of the HOT algorithm for the case where the supports are in $\mathbb{R}^2$ with ground distances calculated by the $L_2^2$-norm. Specifically, we design a Halpern accelerating algorithm to solve the equivalent reduced model of the discrete OT problem. Moreover, we derive a novel procedure to solve the involved linear systems in the HOT algorithm in linear time complexity. Consequently, we can obtain an $\varepsilon$-approximate solution to the optimal transport problem with $M$ supports in $O(M^{1.5}/\varepsilon)$ flops, which significantly improves the best-known computational complexity. We further propose an efficient procedure to recover an optimal transport plan for the original OT problem based on a solution to the reduced model, thereby overcoming the limitations of the reduced OT model in applications that require the transport plan. We implement the HOT algorithm in PyTorch and extensive numerical results show the superior performance of the HOT algorithm compared to existing state-of-the-art algorithms for solving the OT problems.
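For readers unfamiliar with Halpern acceleration, the generic Halpern iteration for a nonexpansive operator $T$ is $x_{k+1} = \lambda_k x_0 + (1-\lambda_k) T(x_k)$ with $\lambda_k = 1/(k+2)$. The sketch below applies it to a 90-degree rotation, a simple nonexpansive map whose only fixed point is the origin. It is a toy illustration only; the operator that the HOT algorithm actually iterates is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def halpern(T, x0, iters):
    """Generic Halpern iteration anchored at x0 with step 1 / (k + 2)."""
    x = x0.copy()
    for k in range(iters):
        lam = 1.0 / (k + 2)
        x = lam * x0 + (1.0 - lam) * T(x)
    return x

# Toy nonexpansive operator: rotation by 90 degrees (fixed point at the origin).
R = np.array([[0.0, -1.0], [1.0, 0.0]])

def T(x):
    return R @ x

x0 = np.array([1.0, 0.0])
x = halpern(T, x0, 1000)
residual = np.linalg.norm(x - T(x))  # fixed-point residual
```

Plain fixed-point iteration $x_{k+1} = T(x_k)$ would circle the origin forever on this example, whereas the anchored iteration drives the residual toward zero.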

Abstract:
Most micro- and macro-expression spotting methods in untrimmed videos suffer from the burden of video-wise collection and frame-wise annotation. Weakly supervised expression spotting (WES) based on video-level labels can potentially mitigate the complexity of frame-level annotation while achieving fine-grained frame-level spotting. However, we argue that existing weakly supervised methods are based on multiple instance learning (MIL) involving inter-modality, inter-sample, and inter-task gaps. The inter-sample gap is primarily from the sample distribution and duration. Therefore, we propose a novel and simple WES framework, MC-WES, using multi-consistency collaborative mechanisms that include modal-level saliency, video-level distribution, label-level duration and segment-level feature consistency strategies to implement fine frame-level spotting with only video-level labels to alleviate the above gaps and merge prior knowledge. The modal-level saliency consistency strategy focuses on capturing key correlations between raw images and optical flow. The video-level distribution consistency strategy utilizes the difference of sparsity in temporal distribution. The label-level duration consistency strategy exploits the difference in the duration of facial muscles. The segment-level feature consistency strategy emphasizes that features under the same labels maintain similarity. Experimental results on three challenging datasets (CAS(ME)$^2$, CAS(ME)$^3$, and SAMM-LV) demonstrate that MC-WES is comparable to state-of-the-art fully supervised methods.

Abstract:
Event cameras are an emerging imaging technology that offers advantages over conventional frame-based imaging sensors in dynamic range and sensing speed. Complementing the rich texture and color perception of traditional image frames, the hybrid camera system of event and frame-based cameras enables high-performance imaging. With the assistance of event cameras, high-quality image/video enhancement methods make it possible to break the limits of traditional frame-based cameras, especially exposure time, resolution, dynamic range, and frame rate limits. This paper focuses on five event-aided image and video enhancement tasks (i.e., event-based video reconstruction, event-aided high frame rate video reconstruction, image deblurring, image super-resolution, and high dynamic range image reconstruction), and provides an analysis of the effects of different event properties, a real-captured and ground truth labeled benchmark dataset, a unified benchmarking of state-of-the-art methods, and an evaluation of two mainstream event simulators. In detail, this paper collects a real-captured evaluation dataset, EventAid, for five event-aided image/video enhancement tasks, using an “Event-RGB” multi-camera hybrid system and taking into account scene diversity and spatiotemporal synchronization. We further perform quantitative and visual comparisons for state-of-the-art algorithms, provide a controlled experiment to analyze the performance limit of event-aided image deblurring methods, and discuss open problems to inspire future research.

Abstract:
Class-incremental learning (CIL) aims to continually recognize new classes while preserving the discriminability of previously learned ones. Most existing CIL methods are exemplar-based, relying on the storage and replay of a subset of old data during training. Without access to such data, these methods typically suffer from catastrophic forgetting. In this paper, we identify two fundamental causes of forgetting in CIL: representation bias and classifier bias. To address these challenges, we propose a simple yet effective dual-bias reduction framework, which leverages self-supervised transformation (SST) in the input space and prototype augmentation (protoAug) in the feature space. On one hand, SST mitigates representation bias by encouraging the model to learn generic, diverse representations that generalize across tasks. On the other hand, protoAug tackles classifier bias by explicitly or implicitly augmenting the prototypes of old classes in the feature space, thereby imposing stronger constraints to preserve decision boundaries. We further enhance the framework with hardness-aware prototype augmentation and multi-view ensemble strategies, yielding significant performance gains. The proposed framework can be easily integrated with pre-trained models. Without storing any samples of old classes, our method performs comparably to state-of-the-art exemplar-based approaches that rely on extensive data storage. We hope to draw the attention of researchers back to non-exemplar CIL by rethinking the necessity of storing old samples.
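The explicit form of prototype augmentation mentioned above can be sketched as sampling pseudo-features around stored class prototypes. In the sketch below, isotropic Gaussian noise with a single scale `radius` is an assumption, as are the function name and the idea of feeding the sampled features to the classifier alongside new-class features; the abstract does not specify these details.

```python
import numpy as np

def augment_prototypes(prototypes, radius, n_per_class, rng):
    """Sample pseudo-features for old classes around their prototypes.

    prototypes: array of shape (num_classes, dim), one mean feature per old class.
    radius: hypothetical noise scale controlling the spread of sampled features.
    Returns (features, labels) with num_classes * n_per_class rows.
    """
    num_classes, dim = prototypes.shape
    noise = rng.standard_normal((num_classes, n_per_class, dim))
    feats = prototypes[:, None, :] + radius * noise
    labels = np.repeat(np.arange(num_classes), n_per_class)
    return feats.reshape(-1, dim), labels
```

Because only one prototype per old class is stored, no raw samples of old classes need to be kept, which matches the non-exemplar setting the abstract targets.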

Abstract:
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It is becoming increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. This enables the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then executed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively.

Abstract:
Clustering is an essential analytical tool across a wide range of scientific fields, including biology, chemistry, astronomy, and pattern recognition. This paper introduces a novel clustering algorithm, called Torque Clustering, as a competitive alternative to existing methods, based on the intuitive principle that a cluster should merge with its nearest neighbor with a higher mass, unless both clusters have relatively large masses and the distance between them is also substantial. By identifying peaks in mass and distance, the algorithm effectively detects and removes incorrect mergers. The proposed method is entirely parameter-free, enabling it to autonomously recognize various cluster types, determine the optimal number of clusters, and identify noise. Extensive experiments on synthetic and real-world data sets demonstrate the algorithm's versatility and consistently strong performance compared to other state-of-the-art methods.

Abstract:
Open set recognition (OSR) effectively enhances the reliability of pattern recognition systems by accurately identifying samples of unknown classes. However, the decision-making process in most existing OSR methods adheres to an ill-considered pipeline, where classification probabilities are inferred directly from overall feature representations, neglecting the reasoning about inherent relations. Besides, the handling of identified unknown samples is typically restricted to the assignment of a generic “unknown” class label but fails to explore underlying category information. To tackle the above challenges, we propose a new paradigm for OSR, entitled Reason and Discovery (RAD), which comprises two main modules: the Reason Module and the Discovery Module. Specifically, in the Reason Module, the distinction between known and unknown is performed from the perspective of reasoning the matching relations between topological information and appearance characteristics of discriminative regions. Then, the mixture and recombination of relation representations across classes are employed to provide diverse estimations of unknown distribution, thereby recalibrating OSR decision boundaries. Moreover, in the Discovery Module, the identified unknown samples are semantically grouped through a biased deep clustering process for discovering novel category information. Experimental results on various datasets indicate that the proposed method can achieve outstanding OSR performance and good novel category discovery efficacy.

Abstract:
Survival prediction on histopathology whole slide images (WSIs) involves the analysis of multi-level complex correlations, such as inter-correlations among patients and intra-correlations within gigapixel histopathology images. However, the current graph-based methods for WSI analysis mainly focus on the exploration of pairwise correlations, resulting in the loss of high-order correlations. Hypergraph-based methods can handle such high-order correlations, while existing hypergraph-based methods fail to integrate multi-level high-order correlations into a unified framework, which limits the representation capability of WSIs. In this work, we propose an inter-intra hypergraph computation (I$^2$HGC) framework to address this issue. The I$^2$HGC framework implements multi-level hypergraph computation for survival prediction on WSIs, namely intra-hypergraph computation and inter-hypergraph computation. Specifically, the intra-hypergraph computation considers each patch sampled from the histopathology WSI as a vertex of the intra-hypergraph and models the high-order correlations among all patches of an individual WSI in both topology and semantic feature spaces using a hypergraph structure. Then, the intra-hypergraph module generates the intra-embedding and intra-risk for each patient. Subsequently, the inter-hypergraph computation employs these intra-embeddings as features for each patient to form the population-level high-order correlations using data- and knowledge-driven hypergraph modeling strategies. Finally, the intra-risks and the inter-risks are fused for the final survival prediction of each patient. Extensive experimental results on four widely used TCGA carcinoma datasets are presented. We demonstrate that the hypergraph structure captures significantly richer correlations than the graph structure, encompassing all pairwise correlations as well as higher-order interactions through hyperedges.
For WSIs with a vast number of pixels and complex correlations, hypergraph-based methods effectively capture topological and semantic information while mitigating the exponential growth of pairwise edges, offering practical advantages for large-scale medical image analysis.

Abstract:
Generalized category discovery (GCD) is a pragmatic but underexplored problem, which requires models to automatically cluster and discover novel categories by leveraging the labeled samples from old classes. The challenge is that unlabeled data contain both old and new classes. Early works leveraging pseudo-labeling with parametric classifiers handle old and new classes separately, which brings about imbalanced accuracy between them. Recent methods employing contrastive learning neglect potential positives and are decoupled from the clustering objective, leading to biased representations and sub-optimal results. To address these issues, we introduce a unified and unbiased prototype learning framework, namely ProtoGCD, wherein old and new classes are modeled with joint prototypes and unified learning objectives, enabling unified modeling between old and new classes. Specifically, we propose a dual-level adaptive pseudo-labeling mechanism to mitigate confirmation bias, together with two regularization terms to collectively help learn more suitable representations for GCD. Moreover, for practical considerations, we devise a criterion to estimate the number of new classes. Furthermore, we extend ProtoGCD to detect unseen outliers, achieving task-level unification. Comprehensive experiments show that ProtoGCD achieves state-of-the-art performance on both generic and fine-grained datasets.

Abstract:
Tracking the 6-DoF pose of objects is crucial for various downstream robot tasks and real-world applications. In this paper, we investigate the real-world robot task of aerial vision guidance for aerial robotics manipulation, utilizing category-level 6-DoF pose tracking. Aerial conditions inevitably introduce special challenges, such as rapid viewpoint changes in pitch and roll and inter-frame differences. To address these challenges, we first introduce a robust category-level 6-DoF pose tracker (Robust6DoF). This tracker leverages shape and temporal prior knowledge to explore optimal inter-frame keypoint pairs, generated under a priori structural adaptive supervision in a coarse-to-fine manner. Notably, our Robust6DoF employs a Spatial-Temporal Augmentation module to deal with the problems of inter-frame differences and intra-class shape variations through both temporal dynamic filtering and shape-similarity filtering. We further present a Pose-Aware Discrete Servo strategy (PAD-Servo), serving as a decoupling approach to implement the final aerial vision guidance task. It contains two servo action policies to better accommodate the structural properties of aerial robotics manipulation. Exhaustive experiments on four well-known public benchmarks demonstrate the superiority of our Robust6DoF. Real-world tests directly verify that our Robust6DoF along with PAD-Servo can be readily used in real-world aerial robotic applications. The project homepage is released at Robust6DoF.

Abstract:
Image completion is a challenging task, particularly when ensuring that generated content seamlessly integrates with existing parts of an image. While recent diffusion models have shown promise, they often struggle with maintaining coherence between known and unknown (missing) regions. This issue arises from the lack of explicit spatial and semantic alignment during the diffusion process, resulting in content that does not smoothly integrate with the original image. Additionally, diffusion models typically rely on global learned distributions rather than localized features, leading to inconsistencies between the generated and existing image parts. In this work, we propose ConFill, a novel framework that introduces a Context-Adaptive Discrepancy (CAD) model to ensure that intermediate distributions of known and unknown regions are closely aligned throughout the diffusion process. By incorporating CAD, our model progressively reduces discrepancies between generated and original images at each diffusion step, leading to contextually aligned completion. Moreover, ConFill uses a new Dynamic Sampling mechanism that adaptively increases the sampling rate in regions with high reconstruction complexity. This approach enables precise adjustments, enhancing detail and integration in restored areas. Extensive experiments demonstrate that ConFill outperforms current methods, setting a new benchmark in image completion.

Abstract:
Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. How to effectively utilize these weaker single-point and horizontal annotations to train an oriented object detector (OOD) is attractive yet remains an open question. We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart on remote sensing and other areas, significantly reducing the tedious efforts on labor-intensive annotation for oriented objects.

Abstract:
With the prevalence of short texts in various forms such as news headlines, tweets, and reviews, short text analysis has gained significant interest in recent times. However, modeling short texts remains a challenging task due to its sparse and noisy nature. In this paper, we propose a new Spherical Correlated Topic Model (SCTM), which takes into account the correlation between topics. Our model integrates word and knowledge graph embeddings to better capture the semantic relationships among short texts. We adopt the von Mises-Fisher distribution to model the high-dimensional word and entity embeddings on a hypersphere, enabling better preservation of the angular relationships between topic vectors. Moreover, knowledge graph embeddings are incorporated to further enrich the semantic meaning of short texts. Experimental results on several datasets demonstrate that our proposed SCTM model outperforms existing models in terms of both topic coherence and document classification. In addition, our model is capable of providing interpretable topics and revealing meaningful correlations among short texts.

Abstract:
With growing concerns about information security, protecting the privacy of user-sensitive data has become crucial. The rapid development of multi-modal retrieval technologies poses new threats, making sensitive data more vulnerable to leakage and malicious mining. To address this, we introduce a Proactive Adversarial Multi-modal Learning (PAML) approach that transforms sensitive data into adversarial counterparts, evading malicious multi-modal retrieval and ensuring privacy. Our method starts by sending queries to a knowledge-agnostic retrieval system and analyzing the results to understand the retrieval feedback mechanism. Using a U-Net-based diffusion model, we create a semantic perturbation network that subtly alters the implicit semantics of sensitive data. This, combined with multi-modal retrieved results and random noise, shifts the data's semantics towards outliers, preventing retrieval as neighbors to relevant queries. Additionally, a discriminator and pre-trained model enhance the visual realism and outlier generalization of protected data. Extensive experiments show that PAML outperforms potential baselines in data privacy protection. Ablation analysis validates each component's effectiveness, and our approach's variants are applicable to diverse retrieval systems.

Abstract:
Currently, deep neural networks (DNNs) are widely adopted in different applications. Despite their commercial value, training a well-performing DNN is resource-consuming. Accordingly, the well-trained model is valuable intellectual property for its owner. However, recent studies revealed the threats of model stealing, where the adversaries can obtain a function-similar copy of the victim model, even when they can only query the model. In this paper, we propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously, without introducing new security risks. In general, we conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features. Specifically, we embed the external features by modifying a few training samples with style transfer. We then train a meta-classifier to determine whether a model is stolen from the victim. This approach is inspired by the understanding that the stolen models should contain the knowledge of features learned by the victim model. In particular, we develop our MOVE method under both glass-box and closed-box settings and analyze its theoretical foundation to provide comprehensive model protection. Extensive experiments on benchmark datasets verify the effectiveness of our method and its resistance to potential adaptive attacks.

Abstract:
Understanding human emotions is crucial for a myriad of applications, from psychological research to advancements in Natural Language Processing (NLP). Traditionally, emotions are categorized into distinct basic groups, which has led to the development of various emotion detection tasks within NLP. However, these tasks typically rely on one-hot vectors to represent emotions, a method that fails to capture the relations between different emotion categories. In this study, we challenge the assumption that emotion categories are mutually exclusive and argue that the connections and boundaries between them are complex and often blurred. To better represent these nuanced interconnections, we introduce an innovative framework as well as two algorithms to learn distributed representations of emotion categories by leveraging soft labels from trained neural network models. For the first time, our approach enables the detection of emotion relations across different languages through an NLP lens, a feat unattainable with traditional one-hot representations. Validation experiments confirm the superior ability of our distributed representation algorithms to articulate these emotional connections. Moreover, application experiments corroborate several interdisciplinary insights into cross-linguistic emotion relations, findings that align with research in psychology and linguistics. This work not only presents a breakthrough in emotion detection but also bridges the gap between computational models and humanistic understanding of emotions.
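
The idea of deriving distributed emotion representations from soft labels can be illustrated with a minimal sketch. It assumes that each sample has a softmax output from a trained classifier and a gold category, and that every category has at least one sample; the averaging rule and the function names `category_embeddings` and `cosine_similarity_matrix` are illustrative assumptions, not the two algorithms proposed in the paper.

```python
import numpy as np

def category_embeddings(soft_labels, gold, num_classes):
    """Average the soft-label vectors of all samples sharing a gold category.

    soft_labels: (num_samples, num_classes) array of predicted probabilities.
    gold:        (num_samples,) array of integer gold categories.
    Assumes every category occurs at least once in `gold`.
    """
    emb = np.zeros((num_classes, soft_labels.shape[1]))
    for c in range(num_classes):
        emb[c] = soft_labels[gold == c].mean(axis=0)
    return emb

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between category embeddings."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

# Toy data (hypothetical): categories 0=joy, 1=anger, 2=disgust.
soft = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.60, 0.30],
    [0.05, 0.55, 0.40],
    [0.05, 0.35, 0.60],
    [0.10, 0.30, 0.60],
])
gold = np.array([0, 0, 1, 1, 2, 2])
sim = cosine_similarity_matrix(category_embeddings(soft, gold, 3))
```

Unlike one-hot vectors, whose pairwise similarities are all zero, the averaged soft labels place anger closer to disgust than to joy in this toy data.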

Abstract:
To address the communication burden issues associated with Federated Learning (FL), Decentralized Federated Learning (DFL) discards the central server and establishes a decentralized communication network, where each client communicates only with neighboring clients. However, existing DFL methods still suffer from two major challenges that they have not fundamentally addressed: local inconsistency and local heterogeneous overfitting. To tackle these issues, we propose novel DFL algorithms, DFedADMM and its enhanced version DFedADMM-SAM, to improve the performance of DFL. The DFedADMM algorithm employs primal-dual optimization (ADMM), utilizing dual variables to control the model inconsistency arising from decentralized heterogeneous data distributions. The DFedADMM-SAM algorithm further improves on DFedADMM by employing a Sharpness-Aware Minimization (SAM) optimizer, which uses gradient perturbations to generate locally flat models and searches for models with uniformly low loss values to mitigate local heterogeneous overfitting. Theoretically, we derive convergence rates of $\mathcal{O}\left(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}\right)$ and $\mathcal{O}\left(\frac{1}{\sqrt{KT}}+\frac{1}{KT(1-\psi)^2}+\frac{1}{T^{3/2}K^{1/2}}\right)$ in the non-convex setting for DFedADMM and DFedADMM-SAM, respectively, where $1-\psi$ represents the spectral gap of the gossip matrix. Empirically, extensive experiments on MNIST, CIFAR10, and CIFAR100 datasets demonstrate that our algorithms exhibit superior performance in terms of generalization, convergence speed, and communication overhead compared to existing state-of-the-art (SOTA) optimizers in DFL.
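
To make the spectral gap of the gossip matrix concrete, the following sketch builds a mixing matrix for a ring of clients and computes the gap as one minus the second-largest eigenvalue magnitude. The ring topology, the uniform 1/3 weights, and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def ring_gossip_matrix(n: int) -> np.ndarray:
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 clients.

    Each client averages its own model with those of its two ring
    neighbours, using weight 1/3 for each (an assumed weighting).
    """
    if n < 3:
        raise ValueError("a ring needs at least 3 clients")
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def spectral_gap(W: np.ndarray) -> float:
    """Return 1 - psi, where psi is the second-largest eigenvalue magnitude."""
    eigvals = np.linalg.eigvalsh(W)          # W is symmetric
    mags = np.sort(np.abs(eigvals))[::-1]    # descending; mags[0] is 1
    return 1.0 - mags[1]

gap_small = spectral_gap(ring_gossip_matrix(4))
gap_large = spectral_gap(ring_gossip_matrix(16))
```

The gap shrinks as the ring grows, so a term that scales with the inverse squared gap becomes larger on sparsely connected topologies.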

Affiliations: School of Information and Communication Technology, Griffith University, Southport, QLD, Australia; Department of Chemical and Biological Engineering, Faculty of Engineering, Monash University, Melbourne, VIC, Australia; Department of Data Science and AI, Monash University, Melbourne, VIC, Australia; International Max Planck Research School for Intelligent Systems, University of Stuttgart, Stuttgart, Germany; Alibaba Group, Hangzhou, China; School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia; Department of Computer Science, Emory University, Atlanta, GA, USA; Anytime.AI, New York, NY, USA; Squirrel AI Learning, Bellevue, WA, USA

Abstract:
Time series forecasting has remained a focal point due to its vital applications in sectors such as energy management and transportation planning. The spectral-temporal graph neural network is a promising abstraction underlying most time series forecasting models that are based on graph neural networks (GNNs). However, little is known about the theoretical underpinnings of this branch of methods. In this paper, we establish a theoretical framework that unravels the expressive power of spectral-temporal GNNs. Our results show that linear spectral-temporal GNNs are universal under mild assumptions, and their expressive power is bounded by our extended first-order Weisfeiler–Leman algorithm on discrete-time dynamic graphs. To make our findings useful in practice on valid instantiations, we discuss related constraints in detail and outline a theoretical blueprint for designing spatial and temporal modules in spectral domains. Building on these insights, and to demonstrate how powerful spectral-temporal GNNs are under our framework, we propose a simple instantiation named Temporal Graph Gegenbauer Convolution (TGGC), which significantly outperforms most existing models with only linear components and shows better model efficiency.
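
As a rough illustration of what a Gegenbauer polynomial graph filter computes, the sketch below applies the standard three-term Gegenbauer recurrence to a graph operator. The abstract does not specify the TGGC layer, so the operator scaling (eigenvalues assumed to lie in [-1, 1], for example the normalized Laplacian minus the identity), the coefficients `theta`, and the function names are assumptions.

```python
import numpy as np

def gegenbauer_basis(L_hat, x, order, alpha):
    """Return [C_0(L_hat) x, ..., C_order(L_hat) x].

    Uses the recurrence
        n C_n(t) = 2 t (n + alpha - 1) C_{n-1}(t) - (n + 2 alpha - 2) C_{n-2}(t)
    with C_0(t) = 1 and C_1(t) = 2 alpha t, applied to the matrix L_hat.
    """
    terms = [x]
    if order >= 1:
        terms.append(2.0 * alpha * (L_hat @ x))
    for n in range(2, order + 1):
        nxt = (2.0 * (n + alpha - 1.0) * (L_hat @ terms[-1])
               - (n + 2.0 * alpha - 2.0) * terms[-2]) / n
        terms.append(nxt)
    return terms

def gegenbauer_filter(L_hat, x, theta, alpha=1.0):
    """Linear spectral filter: sum_k theta[k] * C_k(L_hat) x."""
    terms = gegenbauer_basis(L_hat, x, len(theta) - 1, alpha)
    return sum(t * b for t, b in zip(theta, terms))

# Path graph on 3 nodes; L_hat = normalized Laplacian minus identity.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L_hat = -(d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
signal = np.array([1.0, 0.0, -1.0])
filtered = gegenbauer_filter(L_hat, signal, theta=[0.5, 0.3, 0.2], alpha=1.0)
```

The filter is linear in the signal and in the coefficients, which is the kind of component the abstract refers to when it mentions models built only from linear parts.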

Abstract:
In this paper, we propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, SVBRDF, and 3D spatially-varying lighting. While multi-view images have been widely used for object-level inverse rendering, scene-level inverse rendering has primarily been studied using single-view images due to the lack of a dataset containing high dynamic range multi-view images with ground-truth geometry, material, and spatially-varying lighting. To improve the quality of scene-level inverse rendering, a novel framework called Multi-view Attention Inverse Rendering (MAIR) was recently introduced. MAIR performs scene-level multi-view inverse rendering by expanding the OpenRooms dataset, designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Although MAIR showed impressive results, its lighting representation is fixed to spherical Gaussians, which limits its ability to render images realistically. Consequently, MAIR cannot be directly used in applications such as material editing. Moreover, its multi-view aggregation networks have difficulties extracting rich features because they only focus on the mean and variance between multi-view features. In this paper, we propose its extended version, called MAIR++. MAIR++ addresses the aforementioned limitations by introducing an implicit lighting representation that accurately captures the lighting conditions of an image while facilitating realistic rendering. Furthermore, we design a directional attention-based multi-view aggregation network to infer more intricate relationships between views. Experimental results show that MAIR++ not only outperforms MAIR and single-view-based methods but also demonstrates robust performance on unseen real-world scenes.

Abstract:
In recent years, sparse voxel-based methods have become the state of the art for 3D semantic segmentation of indoor scenes, thanks to powerful 3D CNNs. Nevertheless, being oblivious to the underlying geometry, voxel-based methods suffer from ambiguous features on spatially close objects and struggle with complex and irregular geometries due to the lack of geodesic information. In view of this, we present Voxel-Mesh Network (VMNet), a novel 3D deep architecture that operates on the voxel and mesh representations, leveraging both Euclidean and geodesic information. Intuitively, the Euclidean information extracted from voxels can offer contextual cues representing interactions between nearby objects, while the geodesic information extracted from meshes can help separate objects that are spatially close but have disconnected surfaces. To incorporate such information from the two domains, we design an intra-domain attentive module for effective feature aggregation and an inter-domain attentive module for adaptive feature fusion. Experimental results validate the effectiveness of VMNet: specifically, on the challenging ScanNet dataset for large-scale segmentation of indoor scenes, it outperforms the state-of-the-art SparseConvNet and MinkowskiNet (74.6% versus 72.5% and 73.6% in mIoU) with a simpler network structure (17M versus 30M and 38M parameters).

Abstract:
We propose a novel method applicable to many scene understanding problems that adapts the Monte Carlo Tree Search (MCTS) algorithm, originally designed to learn to play games of high state complexity. From a generated pool of proposals, our method jointly selects and optimizes proposals that minimize the objective term. In our first application, floor plan reconstruction from point clouds, our method selects and refines the room proposals, modelled as 2D polygons, by optimizing an objective function that combines the fitness predicted by a deep network with regularizing terms on the room shapes. We also introduce a novel differentiable method for rendering the polygonal shapes of these proposals. Our evaluations on the recent and challenging Structured3D and Floor-SP datasets show significant improvements over the state of the art in both speed and quality of reconstructions, without imposing hard constraints or assumptions on the floor plan configurations. In our second application, we extend our approach to reconstruct general 3D room layouts from a color image and obtain accurate room layouts. We also show that our differentiable renderer can easily be extended to render 3D planar polygons and polygon embeddings. Our method shows high performance on the Matterport3D-Layout dataset, without introducing hard constraints on room layout configurations.

Abstract:
In numerous reinforcement learning (RL) problems involving safety-critical systems, a key challenge lies in balancing multiple objectives while simultaneously meeting all stringent safety constraints. To tackle this issue, we propose a primal-based framework that orchestrates policy optimization between multi-objective learning and constraint adherence. Our method employs a novel natural policy gradient manipulation method to optimize multiple RL objectives and overcome conflicting gradients between them, since the simple weighted-average gradient direction may not benefit specific objectives when their gradients are misaligned. When a hard constraint is violated, our algorithm steps in to rectify the policy and minimize the violation. In particular, we establish theoretical convergence and constraint violation guarantees, and our proposed method also outperforms prior state-of-the-art methods on challenging safe multi-objective RL tasks.
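
The problem with averaging misaligned gradients can be seen in a two-objective toy example. The abstract does not describe the paper's natural policy gradient manipulation, so the sketch below substitutes a generic PCGrad-style projection as a stand-in; the function names and the example gradients are assumptions.

```python
import numpy as np

def average_direction(grads):
    """Plain average of the per-objective gradients."""
    return np.mean(grads, axis=0)

def project_conflicts(grads):
    """PCGrad-style surgery (a stand-in, not the paper's method).

    For each gradient, remove its component along any other gradient
    it conflicts with (negative inner product), then average.
    """
    adjusted = []
    for i, g in enumerate(grads):
        g = g.astype(float).copy()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = g @ other
            if dot < 0.0:
                g -= dot / (other @ other) * other
        adjusted.append(g)
    return np.mean(adjusted, axis=0)

g1 = np.array([1.0, 0.0])    # gradient of objective 1 (hypothetical)
g2 = np.array([-3.0, 1.0])   # gradient of objective 2 (hypothetical)

naive = average_direction([g1, g2])      # has a negative inner product with g1
combined = project_conflicts([g1, g2])   # non-negative inner product with both here
```

In this example the averaged direction would decrease objective 1 under gradient ascent, while the projected direction does not oppose either objective. For more than two objectives this simple projection gives no such guarantee.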

Abstract:
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches from a single natural image. The default approach of previous GAN-based methods to this problem is to train multiple models at progressively growing scales, which leads to the accumulation of errors and causes characteristic artifacts in generated results. In this paper, we show that multiple models at progressively growing scales are not essential for learning from a single image and propose SinDiffusion, a single diffusion-based model trained on a single scale, which is better suited for this task. Furthermore, we identify that a patch-level receptive field is crucial and effective for diffusion models to capture the image’s patch statistics; we therefore redesign a patch-wise denoising network for SinDiffusion. Coupling these two designs enables SinDiffusion to generate more photorealistic and diverse images from a single image compared with GAN-based approaches. SinDiffusion can also be applied to tasks beyond the capability of SinGAN, such as text-guided image generation and image outpainting. Extensive experiments on a wide range of images demonstrate the superiority of SinDiffusion for modeling the patch distribution.

Abstract:
Estimating the 3DoF rotation from a single RGB image is an important yet challenging problem. As a popular approach, probabilistic rotation modeling additionally carries prediction uncertainty information, compared to single-prediction rotation regression. For modeling probabilistic distributions over $\mathrm{SO}(3)$, it is natural to use the Gaussian-like Bingham and matrix Fisher distributions; however, they are shown to be sensitive to outlier predictions, e.g., $180^\circ$ errors, and thus are unlikely to converge with optimal performance. In this paper, we draw inspiration from the multivariate Laplace distribution and propose a novel rotation Laplace distribution on $\mathrm{SO}(3)$. Our rotation Laplace distribution is robust to the disturbance of outliers and directs much of the gradient to the low-error region, where predictions can still be improved. In addition, we show that our method also exhibits robustness to small noise and thus tolerates imperfect annotations. With this benefit, we demonstrate its advantages in semi-supervised rotation regression, where the pseudo labels are noisy. To further capture the multi-modal rotation solution space for symmetric objects, we extend our distribution to a rotation Laplace mixture model and demonstrate its effectiveness. Our extensive experiments show that our proposed distribution and the mixture model achieve state-of-the-art performance in all the rotation regression experiments over both probabilistic and non-probabilistic baselines.

Abstract:
Existing scene text recognition methods leverage large-scale labeled synthetic data (LSD) to reduce reliance on labor-intensive annotation tasks and improve recognition capability in real-world scenarios. However, the synth-to-real domain gap still limits their efficiency and robustness. Consequently, harvesting the meaningful intrinsic qualities of unlabeled real data (URD) is of great importance, given the prevalence of text-laden images. To this end, recent efforts have focused on pre-training on URD through sequence-to-sequence self-supervised learning, followed by fine-tuning on LSD via supervised learning. Nevertheless, they encounter three important issues: coarse representation learning units, inflexible data augmentation, and an emerging real-to-synth domain drift. To overcome these challenges, we propose CCDPlus, an accurate character-to-character distillation method for scene text recognition with a joint supervised and self-supervised learning framework. Specifically, tailored for text images, CCDPlus delineates the fine-grained character structures on URD as representation units by transferring knowledge learned from LSD online. Without requiring extra bounding box or pixel-level annotations, this process allows CCDPlus to perform character-to-character distillation flexibly with versatile data augmentation, which effectively extracts general real-world character-level feature representations. Meanwhile, the unified framework combines self-supervised learning on URD with supervised learning on LSD, effectively resolving the domain inconsistency and enhancing recognition performance. Extensive experiments demonstrate that CCDPlus outperforms previous state-of-the-art (SOTA) supervised, semi-supervised, and self-supervised methods by an average of 1.8%, 0.6%, and 1.1% on standard datasets, respectively. Additionally, it achieves a 6.1% improvement on the more challenging Union14M-L dataset.

Abstract:
Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects, displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones like 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis leveraging Multimodal Large Language Models (MLLMs), such as GPT-4V and ShareGPT4v, as evaluation metrics. Our experiments benchmark 11 text-to-image models, including state-of-the-art models such as FLUX.1, SD3, DALLE-3, Pixart-$\alpha$, and SD-XL, on T2I-CompBench++. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.

Abstract:
Low-level texture feature/knowledge is also of vital importance for characterizing the local structural pattern and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high-level deep features. In this paper, we aim to re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features with iterative Laplacian pyramid and directional filter bank to mine the structural texture knowledge, and Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks, namely STLNet++ and U-SSNet, to enable existing segmentation networks to harvest the structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets, respectively.

Abstract:
Recent advancements in learning techniques that employ coordinate-based neural representations have yielded remarkable results in multi-view 3D reconstruction tasks. However, these approaches often require a substantial number of input views (typically several tens) and computationally intensive optimization procedures to be effective. In this paper, we address these limitations specifically for the problem of few-shot full 3D head reconstruction. We accomplish this by incorporating a probabilistic shape and appearance prior into coordinate-based representations, enabling faster convergence and improved generalization when working with only a few input images (as few as a single image). During testing, we leverage this prior to guide the fitting process of a signed distance function using a differentiable renderer. By incorporating the statistical prior alongside parallelizable ray tracing and dynamic caching strategies, we achieve an efficient and accurate approach to few-shot full 3D head reconstruction. Moreover, we extend the H3DS dataset, which now comprises 60 high-resolution 3D full-head scans and their corresponding posed images and masks, and use it for evaluation. On this dataset, we demonstrate that our approach achieves state-of-the-art results in geometry reconstruction while being an order of magnitude faster than previous approaches.

Abstract:
Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in an intensity frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. Our tracker is designed to operate in two distinct configurations: solely with events or in a hybrid mode incorporating both events and frames. The hybrid model offers two setups: an aligned configuration where the event and frame cameras share the same viewpoint, and a hybrid stereo configuration where the event camera and the standard camera are positioned side-by-side. This side-by-side arrangement is particularly valuable as it provides depth information for each feature track, enhancing its utility in applications such as visual odometry and simultaneous localization and mapping.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality retrieval task due to the large modality gap. While numerous efforts have been devoted to the supervised setting with a large amount of labeled cross-modality correspondences, few studies have tried to mitigate the modality gap by mining cross-modality correspondences in an unsupervised manner. However, existing works failed to capture the intrinsic relations among samples across two modalities, resulting in limited performance outcomes. In this paper, we propose a novel Progressive Graph Matching (PGM) approach to globally model the cross-modality relationships and instance-level affinities. PGM formulates cross-modality correspondence mining as a graph matching procedure, aiming to integrate global information by minimizing global matching costs. Considering that samples in wrong clusters cannot find reliable cross-modality correspondences by PGM, we further introduce a robust Dual-Level Matching (DLM) mechanism, combining the cluster-level PGM and Nearest Instance-Cluster Searching (NICS) with instance-level affinity optimization. Additionally, we design an Outlier Filter Strategy (OFS) to filter out unreliable cross-modality correspondences based on the dual-level relation constraints. To mitigate false accumulation in cross-modal correspondence learning, an Alternate Cross Contrastive Learning (ACCL) module is proposed to alternately adjust the dominated matching, i.e., visible-to-infrared or infrared-to-visible matching. Empirical results demonstrate the superiority of our unsupervised solution, achieving comparable performance with supervised counterparts.
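
The difference between per-sample nearest-neighbour association and matching that minimizes a global cost can be shown on a tiny cost matrix. This is only the basic min-cost matching idea, not the progressive procedure of PGM; the cost values, the square-matrix restriction, and the function names are assumptions, and the exhaustive search is practical only for very small inputs (a Hungarian-algorithm solver would be used in practice).

```python
from itertools import permutations

def greedy_nearest(cost):
    """Each row independently picks its cheapest column.

    Several rows may pick the same column (many-to-one).
    """
    return [min(range(len(row)), key=row.__getitem__) for row in cost]

def global_matching(cost):
    """Exhaustive one-to-one assignment minimizing the total cost.

    Assumes a square cost matrix; complexity is factorial in its size.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)

# Rows: visible clusters, columns: infrared clusters (hypothetical distances).
cost = [
    [1, 2, 9],
    [1, 5, 9],
    [8, 7, 3],
]
greedy = greedy_nearest(cost)     # [0, 0, 2]: two visible clusters collide
matched = global_matching(cost)   # [1, 0, 2]: one-to-one, total cost 6
```

The greedy result leaves one infrared cluster unmatched, whereas the global assignment trades a slightly worse match for the first row against a consistent one-to-one correspondence.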

Abstract:
The goal of continual learning (CL) is to learn from a series of continuously arriving new tasks without forgetting previously learned old tasks. To avoid catastrophic forgetting of old tasks, orthogonal gradient projection (OGP) based CL methods constrain the gradients of new tasks to be orthogonal to the space spanned by old tasks. This strict gradient constraint will limit the learning ability of new tasks, resulting in lower performance on new tasks. In this paper, we first establish a unified framework for OGP-based CL methods. We then revisit OGP-based CL methods from a new perspective on the loss landscape, where we find that when relaxing projection constraints to improve performance on new tasks, the unflatness of the loss landscape can lead to catastrophic forgetting of old tasks. Based on our findings, we propose a new Dual Flatness-aware OGD framework that optimizes the flatness of the loss landscape from both data and weight levels. Our framework consists of three modules: data and weight perturbation, flatness-aware optimization, and gradient projection. Specifically, we first perform perturbations on the task's data and current model weights to make the task's loss reach the worst-case. Next, we optimize the loss and loss landscape on the original data and the worst-case perturbed data to obtain a flatness-aware gradient. Finally, the flatness-aware gradient will update the network in directions orthogonal to the space spanned by the old tasks. Extensive experiments on four benchmark datasets show that the framework improves the flatness of the loss landscape and performance on new tasks, and achieves state-of-the-art (SOTA) performance on average accuracy across all tasks.
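
The projection step shared by OGP-based methods can be sketched in a few lines. What exactly spans the old-task space (stored gradients, input representations, and so on) differs between methods, so the sketch simply takes a matrix of old-task directions as given; the function names are illustrative assumptions.

```python
import numpy as np

def orthonormal_basis(old_directions, tol=1e-10):
    """Orthonormal basis (as columns) for the span of the old-task directions.

    old_directions: (dim, k) array whose columns span the old-task space.
    """
    U, s, _ = np.linalg.svd(old_directions, full_matrices=False)
    return U[:, s > tol]

def project_orthogonal(grad, basis):
    """Remove from grad its component inside the old-task space."""
    return grad - basis @ (basis.T @ grad)

# Old tasks span the first two coordinate axes (hypothetical example).
old = np.array([[1.0, 1.0],
                [0.0, 1.0],
                [0.0, 0.0]])
basis = orthonormal_basis(old)
new_grad = np.array([1.0, 2.0, 3.0])
update = project_orthogonal(new_grad, basis)
```

Only the component along the third axis survives, which illustrates why a strict constraint can limit learning on new tasks: any part of the gradient lying in the old-task space is discarded.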

Abstract:
This paper studies kernel PCA in a decentralized setting, where data are distributively observed with full features in local nodes, and a fusion center is prohibited. Compared with linear PCA, the use of kernels brings challenges to the design of decentralized consensus optimization: the local projection directions are data-dependent. As a result, the consensus constraint in distributed linear PCA is no longer valid. To overcome this problem, we propose a projection consensus constraint and obtain an effective decentralized consensus framework, where local solutions are expected to be the projection of the global solution on the column space of the local dataset. We also derive a fully non-parametric, fast, and convergent algorithm based on the alternating direction method of multipliers, in which each iteration is analytic and communication-efficient. Experiments on a truly parallel architecture are conducted on real-world data, showing that the proposed decentralized algorithm effectively utilizes information from other nodes and offers a substantial running-time advantage over central kernel PCA.

Abstract:
Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model, while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper, we propose WeCLIP and its advanced version WeCLIP+, to build the single-stage pipeline for weakly supervised semantic segmentation. For WeCLIP, the frozen CLIP model is applied as the backbone for semantic feature extraction, and a new light decoder is designed to interpret extracted semantic features for final prediction. Meanwhile, we utilize the above frozen backbone to generate pseudo labels for training the decoder. Such labels are fixed during training. We then propose a refinement module (RFM) to optimize them dynamically. For WeCLIP+, we introduce the frozen DINO model to achieve more comprehensive semantic feature extraction. The frozen DINO is combined with the frozen CLIP as the backbone, followed by a shared decoder to make predictions with less training cost. Moreover, a strengthened refinement module (RFM+) is designed to revise online pseudo labels with extra guidance from DINO features. Extensive experiments show that both WeCLIP and WeCLIP+ significantly outperform other approaches with less training cost. Particularly, WeCLIP+ gets mIoU of 83.9% on VOC 2012 test set and 56.3% on COCO val set. Additionally, these two approaches also obtain promising results for fully supervised settings.

Abstract:
The rising importance of 3D representation learning, pivotal in computer vision, autonomous driving, and robotics, is evident. However, a prevailing trend, which straightforwardly resorted to transferring 2D alignment strategies to the 3D domain, encounters three distinct challenges: (1) Information Degradation: This arises from the alignment of 3D data with mere single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: These strategies align 3D representations to image and text features individually, hampering the overall optimization for 3D models. (3) Underutilization: The fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss in detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), enriching vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), combining language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach.

Abstract:
Consistency and complementarity are two key ingredients for boosting multi-view clustering (MVC). Recently, with the introduction of contrastive learning, the consistency learning of views has been further enhanced in MVC, leading to promising performance. By contrast, complementarity has not received sufficient attention except in the feature facet, where the Hilbert-Schmidt Independence Criterion term or an independent encoder-decoder network is usually adopted to capture view-specific information. This motivates us to reconsider the complementarity learning of views comprehensively from multiple facets, including the feature, view-label, and contrast facets, while maintaining view consistency. We empirically find that all the facets contribute to complementarity learning, especially the view-label facet, which is usually neglected by existing methods. Based on this, a simple yet effective Multifacet Complementarity learning framework for Multi-View Clustering (MCMVC) is naturally developed, which fuses multifacet complementarity information and, in particular, explicitly embeds the view-label information. To the best of our knowledge, this is the first time view-labels have been used explicitly to guide the complementarity learning of views. Compared with the SOTA baselines, MCMVC achieves remarkable improvements, e.g., by average margins over 5.00% and 7.00% respectively in complete and incomplete MVC settings on Caltech101-20 in terms of three evaluation metrics.

Abstract:
Recent methods for synthesizing 3D-aware face images have achieved rapid development thanks to neural radiance fields, allowing for high quality and fast inference speed. However, existing solutions for editing facial geometry and appearance independently usually require retraining and are not optimized for the recent work of generation, thus tending to lag behind the generation process. To address these issues, we introduce NeRFFaceEditing, which enables editing and decoupling geometry and appearance in the pretrained tri-plane-based neural radiance field while retaining its high quality and fast inference speed. Our key idea for disentanglement is to use the statistics of the tri-plane to represent the high-level appearance of its corresponding facial volume. Moreover, we leverage a generated 3D-continuous semantic mask as an intermediary for geometry editing. We devise a geometry decoder (whose output is unchanged when the appearance changes) and an appearance decoder. The geometry decoder aligns the original facial volume with the semantic mask volume. We also enhance the disentanglement by explicitly regularizing rendered images with the same appearance but different geometry to be similar in terms of color distribution for each facial component separately. Our method allows users to edit via semantic masks with decoupled control of geometry and appearance. Both qualitative and quantitative evaluations show the superior geometry and appearance control abilities of our method compared to existing and alternative solutions.

Abstract:
The deep unfolding network represents a promising research avenue in image restoration. However, most current deep unfolding methodologies are anchored in first-order optimization algorithms, which suffer from sluggish convergence speed and unsatisfactory learning efficiency. In this paper, to address this issue, we first formulate an improved second-order semi-smooth Newton (ISN) algorithm, transforming the original nonlinear equations into an optimization problem amenable to network implementation. After that, we propose an innovative network architecture based on the ISN algorithm for blind image restoration, namely DeepSN-Net. To the best of our knowledge, DeepSN-Net is the first successful endeavor to design a second-order deep unfolding network for image restoration, filling a gap in this area. Furthermore, it offers several distinct advantages: 1) DeepSN-Net provides a unified framework for a variety of image restoration tasks in both synthetic and real-world contexts, without imposing constraints on the degradation conditions. 2) The network architecture is meticulously aligned with the ISN algorithm, ensuring that each module possesses robust physical interpretability. 3) The network exhibits high learning efficiency, superior restoration accuracy, and good generalization ability across 11 datasets on three typical restoration tasks. The success of DeepSN-Net on image restoration may inspire subsequent work centered on second-order optimization algorithms.

Abstract:
Tensorial Multi-view Clustering (TMC), a prominent approach in multi-view clustering, leverages low-rank tensor learning to capture high-order correlation among views for consistent clustering structure identification. Despite its promising performance, the TMC algorithms face three key challenges: 1) The severe computational burden makes it difficult for TMC methods to handle large-scale datasets. 2) The convex surrogate of the tensor rank causes an estimation bias problem. 3) An explicit balance between consistency and complementarity is lacking. Being aware of these, we propose a basic framework, Efficient and Scalable Tensorial Multi-View Subspace Clustering (ESTMC), for large-scale multi-view clustering. ESTMC integrates anchor representation learning and non-convex function-based low-rank tensor learning with a Generalized Non-convex Tensor Rank (GNTR) into a unified objective function, which enhances the efficiency of the existing subspace-based TMC framework. Furthermore, a novel model, ESTMC-C^2, with the proposed Enhanced Tensor Rank (ETR), Consistent Geometric Regularization (CGR), and Tensorial Exclusive Regularization (TER), is extended to balance the learning of consistency and complementarity among views, delivering divisible representations for the clustering task. Efficient iterative optimization algorithms are designed to solve the proposed ESTMC and ESTMC-C^2, which enjoy time-economical complexity and exhibit theoretical convergence. Extensive experimental results on various datasets demonstrate the superiority of the proposed algorithms as compared to state-of-the-art methods.

Abstract:
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities due to its minimal data needs and high time efficiency. However, many current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially. The first step, Activation quantization error reduction (Aqer), first applies Reparameterization Initialization aimed at mitigating initial quantization errors in high-variance activations. Then, it further mitigates the errors by formulating a Ridge Regression problem, which updates the weights maintained at full-precision using a closed-form solution. The second step, Weight quantization error reduction (Wqer), first applies Dual Uniform Quantization to handle weights with numerous outliers, which arise from adjustments made during Reparameterization Initialization, thereby reducing initial weight quantization errors. Then, it employs an iterative approach to further tackle the errors. In each iteration, it adopts Rounding Refinement that uses an empirically derived, efficient proxy to refine the rounding directions of quantized weights, complemented by a Ridge Regression solver to reduce the errors. Comprehensive experimental results demonstrate ERQ’s superior performance across various ViT variants and tasks. For example, ERQ surpasses the state-of-the-art GPTQ by a notable 36.81% in accuracy for W3A4 ViT-S.
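
The closed-form Ridge Regression update mentioned above can be illustrated with a minimal sketch. It is not the ERQ implementation: the objective, the uniform rounding quantizer, the function name `ridge_weight_update`, and the regularization strength `lam` are all assumptions made for illustration. The sketch picks a weight correction dW that minimizes ||(W + dW) X_q - W X||^2 + lam ||dW||^2, where X_q holds quantized activations.

```python
import numpy as np

def ridge_weight_update(W, X, X_q, lam=1e-2):
    """Closed-form dW minimizing ||(W + dW) X_q - W X||_F^2 + lam * ||dW||_F^2.

    W: (out, d) full-precision weights; X, X_q: (d, n) original and
    quantized activations. All names are hypothetical.
    """
    dX = X_q - X
    gram = X_q @ X_q.T + lam * np.eye(X_q.shape[0])
    rhs = W @ dX @ X_q.T
    # Setting the gradient to zero gives dW @ gram = -rhs; gram is symmetric,
    # so solve the transposed system instead of forming an explicit inverse.
    return -np.linalg.solve(gram, rhs.T).T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 256))
X_q = np.round(X / 0.5) * 0.5  # toy uniform quantizer (an assumption)

dW = ridge_weight_update(W, X, X_q)
err_before = np.linalg.norm(W @ X_q - W @ X)
err_after = np.linalg.norm((W + dW) @ X_q - W @ X)
print(err_before, err_after)
```

Because dW = 0 is a feasible choice, the output error after the update cannot exceed the error before it under this objective.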

Abstract:
Existing methods for integerized training speed up deep learning by using low-bitwidth integerized weights, activations, gradients, and optimizer buffers. However, they overlook the issue of full-precision latent weights, which consume excessive memory to accumulate gradient-based updates for optimizing the integerized weights. In this paper, we propose the first latent weight quantization schema for general integerized training, which minimizes quantization perturbation to the training process via residual quantization with an optimized dual quantizer. We leverage residual quantization to eliminate the correlation between the latent weight and the integerized weight for suppressing quantization noise. We further propose a dual quantizer with an optimal nonuniform codebook to avoid frozen weights and ensure the same statistically unbiased training trajectory as with full-precision latent weights. The codebook is optimized to minimize the disturbance on weight update under importance guidance and achieved with a three-segment polyline approximation for hardware-friendly implementation. Extensive experiments show that the proposed schema allows integerized training with latent weights as low as 4 bits for various architectures including ResNets, MobileNetV2, and Transformers, and yields negligible performance loss in image classification and text generation. Furthermore, we successfully fine-tune Large Language Models with up to 13 billion parameters on one single GPU using the proposed schema.

Abstract:
Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an “impossible trinity” among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully refine the original annotations by leveraging useful information in multimodal video content (frames, tags, ASR transcripts, etc.). Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel method that requires weaker assumptions on noise distribution. This method proves more effective in large datasets and offers theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.

Abstract:
Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in Machine Learning (ML), as it necessitates the Incremental Learning (IL) of new classes from sparsely labeled training samples without forgetting previous knowledge. While this field has seen recent progress, it remains an active exploration area. This paper aims to provide a comprehensive and systematic review of FSCIL. In our in-depth examination, we delve into various facets of FSCIL, encompassing the problem definition, the discussion of the primary challenges of unreliable empirical risk minimization and the stability-plasticity dilemma, general schemes, and relevant problems of IL and Few-shot Learning (FSL). Besides, we offer an overview of benchmark datasets and evaluation metrics. Furthermore, we introduce the Few-shot Class-incremental Classification (FSCIC) methods from data-based, structure-based, and optimization-based approaches and the Few-shot Class-incremental Object Detection (FSCIOD) methods from anchor-free and anchor-based approaches. Beyond these, we present several promising research directions within FSCIL that merit further investigation.

Abstract:
Detecting 3D objects from a monocular camera in mobile applications, such as on a vehicle, drone, or robot, is a crucial but challenging task. The monocular vision’s near-far disparity and the camera’s constantly changing position make it difficult to achieve high accuracy, especially for distant objects. In this paper, we propose a new Mono3D framework named MoGDE, which takes inspiration from the observation that an object’s depth can be inferred from the ground’s depth underneath it. MoGDE estimates the corresponding ground depth of an image and utilizes this information to guide Mono3D. We use a pose detection network to estimate the camera’s orientation and construct a feature map that represents pixel-level ground depth based on the 3D-to-2D perspective geometry. To further improve Mono3D with the estimated ground depth, we design an RGB-D feature fusion network based on transformer architecture. The long-range self-attention mechanism is utilized to identify ground-contacting points and pin the corresponding ground depth to the image feature map. We evaluate MoGDE on the KITTI dataset, and the results show that it significantly improves the accuracy and robustness of Mono3D for both near and far objects. MoGDE outperforms state-of-the-art methods and ranks first among the pure image-based methods on the KITTI 3D benchmark.

Abstract:
Sample efficiency remains a key challenge for the deployment of deep reinforcement learning (RL) in real-world scenarios. A common approach is to learn efficient representations through future prediction tasks, facilitating the agent to make farsighted decisions that benefit its long-term performance. Existing methods extract predictive features by predicting multi-step future state signals. However, they do not fully exploit the structural information inherent in sequential state signals, which can potentially improve the quality of long-term decision-making but is difficult to discern in the time domain. To tackle this problem, we introduce a new perspective that leverages the frequency domain of state sequences to extract the underlying patterns in time series data. We theoretically show that state sequences contain structural information closely tied to policy performance and signal regularity and analyze the fitness of the frequency domain for extracting these two types of structural information. Inspired by that, we propose a novel representation learning method, State Sequences Prediction via Fourier Transform (SPF), which extracts long-term features by predicting the Fourier transform of infinite-step future state sequences. The appealing features of our frequency prediction objective include: 1) simple to implement due to a recursive relationship; 2) providing an upper bound on the performance difference between the optimal policy and the latent policy in the representation space. Experiments on standard and goal-conditioned RL tasks demonstrate that the proposed method outperforms several state-of-the-art algorithms in terms of both sample efficiency and performance.
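
The recursive relationship mentioned above can be sketched for a scalar state sequence. This is an illustration under stated assumptions, not the SPF training objective: it assumes a discounted discrete-time Fourier transform F_t(w) = sum_k gamma^k exp(-i w k) s_{t+1+k}, truncates the infinite sequence to a finite list (states beyond the end count as zero), and uses hypothetical function names.

```python
import cmath

def discounted_dtft(states, t, omega, gamma):
    """Direct sum F_t(w) = sum_k gamma^k * exp(-1j*w*k) * s_{t+1+k}."""
    return sum(
        (gamma ** k) * cmath.exp(-1j * omega * k) * s
        for k, s in enumerate(states[t + 1:])
    )

def recursive_dtft(states, omega, gamma):
    """Backward recursion F_t = s_{t+1} + gamma * exp(-1j*w) * F_{t+1}."""
    n = len(states)
    F = [0j] * n  # F[n-1] stays 0: no future states follow the last one
    for t in range(n - 2, -1, -1):
        F[t] = states[t + 1] + gamma * cmath.exp(-1j * omega) * F[t + 1]
    return F

states = [0.5, -1.0, 2.0, 0.25, 1.5]
F = recursive_dtft(states, omega=0.7, gamma=0.9)
print(abs(F[0] - discounted_dtft(states, 0, 0.7, 0.9)))
```

The recursion has the same shape as a Bellman backup, which is why a prediction target of this kind can be bootstrapped from the next step's prediction instead of summing over the whole future.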

Abstract:
Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. To tackle the inherent intricacies of transformer structures, we introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. These strategies incorporate adaptive dual clipping and boundary refinement. To further amplify the versatility of our proposed approach, we extend our PTQ strategy to function as a general quantization method for transformer-based SISR techniques. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods, both in full-precision and quantization scenarios. These results underscore the efficacy and universality of our PTQ strategy.

Abstract:
Weakly-supervised semantic segmentation (WSSS) methods, reliant on image-level labels indicating object presence, lack explicit correspondence between labels and regions of interest (ROIs), posing a significant challenge. Despite this, WSSS methods have attracted attention due to their much lower annotation costs compared to fully-supervised segmentation. Leveraging reinforcement learning (RL) self-play, we propose a novel WSSS method that gamifies image segmentation of a ROI. We formulate segmentation as a competition between two agents that compete to select ROI-containing patches until exhaustion of all such patches. The score at each time-step, used to compute the reward for agent training, represents likelihood of object presence within the selection, determined by an object presence detector pre-trained using only image-level binary classification labels of object presence. Additionally, we propose a game termination condition that can be called by either side upon exhaustion of all ROI-containing patches, followed by the selection of a final patch from each. Upon termination, the agent is incentivised if ROI-containing patches are exhausted or disincentivised if a ROI-containing patch is found by the competitor. This competitive setup ensures minimisation of over- or under-segmentation, a common problem with WSSS methods. Extensive experimentation across four datasets demonstrates significant performance improvements over recent state-of-the-art methods.

Abstract:
Geometry plays a significant role in monocular 3D object detection. It can be used to estimate object depth by using the perspective projection between an object’s physical size and its 2D projection in the image plane, which can introduce mathematical priors into deep models. However, this projection process also introduces error amplification, where the error of the estimated height is amplified and reflected into the projected depth. It leads to unreliable depth inferences and also impairs training stability. To tackle this problem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++) by modeling geometry projection in a probabilistic manner. This ensures depth predictions are well-bounded and associated with a reasonable uncertainty. The significance of introducing such geometric uncertainty is two-fold: (1) It models the uncertainty propagation relationship of the geometry projection during training, improving the stability and efficiency of the end-to-end model learning. (2) It can be used to derive a highly reliable confidence score that indicates the quality of the 3D detection result, enabling more reliable detection inference. Experiments show that the proposed approach not only obtains state-of-the-art (SOTA) performance in image-based monocular 3D detection but also demonstrates superiority in efficacy with a simplified framework. The code and model will be released at https://github.com/SuperMHP/GUPNet_Plus.
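
The error amplification described above can be made concrete with a small sketch. It is a simplification, not the GUPNet++ model: it assumes a pinhole camera with depth = f * H_3d / h_2d, treats the 2D height as exact, and propagates only the standard deviation of the estimated 3D height. The function name and the numbers are hypothetical.

```python
def project_depth(focal_px, height_3d_m, sigma_height_3d_m, height_2d_px):
    """Depth from pinhole projection plus the propagated standard deviation.

    Depth is linear in the 3D height, so a height uncertainty sigma_H maps
    to a depth uncertainty (f / h_2d) * sigma_H.
    """
    scale = focal_px / height_2d_px
    depth_m = scale * height_3d_m
    sigma_depth_m = scale * sigma_height_3d_m
    return depth_m, sigma_depth_m

# Same object (1.5 m tall, 0.1 m height uncertainty) seen near and far.
near = project_depth(700.0, 1.5, 0.1, 60.0)
far = project_depth(700.0, 1.5, 0.1, 15.0)
print(near, far)
```

With the height uncertainty held fixed, the depth uncertainty grows as the object's 2D height shrinks, so distant objects receive the largest amplification.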

Abstract:
Graph-based methods have demonstrated exceptional performance in semi-supervised classification. However, existing graph-based methods typically construct either a predefined graph in the original space or an adaptive graph within the output space, which often limits their ability to fully utilize prior information and capture the optimal intrinsic data distribution, particularly in high-dimensional data with abundant redundant and noisy features. This paper introduces a novel approach: Semi-Supervised Classification with Optimized Graph Construction (SSC-OGC). SSC-OGC leverages both predefined and adaptive graphs to explore intrinsic data distribution and effectively employ prior information. Additionally, a graph constraint regularization term (GCR) and a collaborative constraint regularization term (CCR) are incorporated to further enhance the quality of the adaptive graph structure and the learned subspace, respectively. To eliminate the negative effect of constructing a predefined graph in the original data space, we further propose a Hybrid Subspace Ensemble-enhanced framework based on the proposed Optimized Graph Construction method (HSE-OGC). Specifically, we construct multiple hybrid subspaces, which consist of meticulously chosen features from the original data to achieve high-quality and diverse space representations. Then, HSE-OGC constructs multiple predefined graphs within hybrid subspaces and trains multiple SSC-OGC classifiers to complement each other, significantly improving the overall performance. Experimental results conducted on various high-dimensional datasets demonstrate that HSE-OGC exhibits outstanding performance.

Abstract:
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA or protein sequences data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to five different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, homologous RNA sequences from specific taxonomies and real classical piano pieces classified by their composer.

Abstract:
Monocular 3D object detection plays a crucial role in the field of self-driving cars, estimating the size and location of objects solely based on input images. However, a notable disparity exists between the training and inference of 3D object detectors. This discrepancy arises because during inference, monocular 3D detectors depend solely on images captured by cameras, while during training, these methods require 3D ground truths labeled on point cloud data, which is obtained using specialized devices like LiDAR. This discrepancy creates a break in the data loop, preventing the feedback data from production cars from being utilized to enhance the robustness of the detectors. To address this issue and establish a connection in the data loop, we present a weakly-supervised solution that trains monocular 3D object detectors solely using 2D labels, eliminating the requirement for 3D ground truths. Our approach considers two types of view consistency, spatial and temporal, which play a crucial role in regulating the prediction of 3D bounding boxes. Spatial view consistency is achieved by employing projection and multi-view consistency techniques to guide the optimization of the target’s location and size. We leverage temporal viewpoint consistency to provide temporal multi-view image pairs, and we further introduce temporal movement consistency to tackle the challenge of dynamic scenes. With only 2D ground truths, our method achieves comparable performance to fully supervised methods. Additionally, our method can be employed as a pre-training method and achieves significant improvement when fine-tuned with a small proportion of fully supervised labels.

Abstract:
Although data-driven methods usually perform well on disease diagnosis and treatment, they raise concerns about privacy leakage because they collect data for model training. Recently, federated learning has provided a secure and trustworthy alternative that collaboratively trains a model without any exchange of medical data among multiple institutes. Therefore, it has drawn much attention due to its natural merit in privacy protection. However, when heterogeneous medical data exists across different hospitals, federated learning usually suffers from performance degradation. In this paper, we propose a new personalized federated learning framework to handle the problem. It successfully yields personalized models based on awareness of similarity between local data, and achieves a better tradeoff between generalization and personalization than existing methods. After that, we further design a differentially sparse regularizer to improve communication efficiency during model training. Additionally, we propose an effective method to reduce the computational cost, which improves computation efficiency significantly. Furthermore, we collect five real medical datasets, including two public medical image datasets and three private multi-center clinical diagnosis datasets, and evaluate the framework's performance by conducting nodule classification, tumor segmentation, and clinical risk prediction tasks. Compared with 14 existing related methods, the proposed method achieves the best model performance, along with up to 60% improvement in communication efficiency.

Abstract:
The goal of RGB-Thermal (RGB-T) tracking is to utilize the synergistic and complementary strengths of RGB and TIR modalities to enhance tracking in diverse situations, with cross-modal interaction being a crucial element. Earlier methods often simply combine the features of the RGB and TIR search frames, leading to a coarse interaction that also introduced unnecessary background noise. Many other approaches sample candidate boxes from search frames and apply different fusion techniques to individual pairs of RGB and TIR boxes, which confines cross-modal interactions to local areas and results in insufficient context modeling. Additionally, mining video temporal contexts is also under-explored in RGB-T tracking. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module that exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. An Illumination Guided Fusion (IGF) module is designed to adaptively fuse RGB and TIR search region tokens with a global illumination factor. Furthermore, in the inference stage, we also propose an efficient Target-Preserved Template Updating (TPTU) strategy, leveraging the temporal context within video sequences to accommodate the target’s appearance change. Our proposed modules are integrated into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate our method achieves new state-of-the-art performances.

Abstract:
Deep networks for 3D point clouds have achieved remarkable success in classification tasks but remain vulnerable to geometric variations resulting from inconsistent data acquisition procedures. This leads to significant performance degradation when models trained on a source domain are tested on out-of-distribution target domains, highlighting the challenges of 3D domain generalization and adaptation. In this paper, we introduce a novel Multi-Scale Part-based feature Representation, dubbed MSPR, as a generalizable representation for point cloud domain generalization and adaptation. Rather than relying on global shape features, we align the part-level features of shapes at different scales to a set of learnable part-template features that encode local geometric structures shared between the source and the target domains. Specifically, shapes from different domains are organized into part-level features at various scales and then aligned to the part-template features. To balance the generalization and discrimination abilities of parts at different scales, we further design a cross-scale feature fusion module to exchange information between aligned part-based features at different scales. The fused part-based representations are finally aggregated by a part-based feature aggregation module. To improve the robustness of the aligned part-based representations and global shape representation to geometry variations, we further propose a Contrastive Learning framework on Shape Representation (CLSR). Extensive experiments on 3D domain generalization and adaptation benchmarks for point cloud classification demonstrate that the proposed approach outperforms previous state-of-the-art methods in both tasks. Ablation studies confirm the effectiveness of the components in our model.

Abstract:
In this paper, we propose a novel image retrieval network named Correlation Verification Network (CVNet) to replace the conventional geometric re-ranking with a 4D convolutional neural network that learns diverse geometric matching possibilities. To enable efficient cross-scale matching, we construct feature pyramids and establish cross-scale feature correlations in a single inference, thereby replacing the costly multi-scale inference. Additionally, we employ curriculum learning with the Hide-and-Seek strategy to handle challenging samples. Our proposed CVNet demonstrates state-of-the-art performance on several image retrieval benchmarks by a large margin. From an implementation perspective, however, CVNet has one drawback: it requires high memory usage because it needs to store dense features of all database images. This high memory requirement can be a significant limitation in practical applications. To address this issue, we introduce an extension of CVNet called Dense-to-Sparse CVNet (CVNet^DS), which can significantly reduce memory usage by sparsifying the features of the database images. The sparsification module in CVNet^DS learns to select the relevant parts of image features end-to-end using a Gumbel estimator. Since the sparsification is performed offline, CVNet^DS does not increase online extraction and matching times. CVNet^DS dramatically reduces the memory footprint while preserving performance levels nearly identical to CVNet.

Abstract:
The significant success of graph learning has provoked a meaningful but challenging task of extracting the precise causal subgraphs that can interpret and improve the predictions. Unfortunately, current works merely center on partially eliminating either the spurious or the noisy parts, while overlooking the fact that in more practical and general situations, both the spurious and noisy subgraphs coexist with the causal one. This brings great challenges and makes previous methods fail to extract the true causal substructure. Unlike existing studies, in this paper, we propose a more reasonable problem formulation that hypothesizes the graph is a mixture of causal, spurious, and noisy subgraphs. In this regard, an Information Bottleneck-constrained denoised Causal Subgraph (IBCS) learning model is developed, which is capable of simultaneously excluding the spurious and noisy parts. Specifically, for the spurious correlation, we design a novel causal learning objective, in which, beyond minimizing the empirical risks of causal and spurious subgraph classification, intervention is further conducted on spurious features to cut off their correlation with the causal part. On this basis, we further impose the information bottleneck constraint to filter out label-irrelevant noise information. Theoretically, we prove that the causal subgraph extracted by our IBCS can approximate the ground-truth. Empirically, extensive evaluations on nine benchmark datasets demonstrate our superiority over state-of-the-art baselines.

Abstract:
In this paper, we focus on the challenging task of monocular 3D lane detection. Previous methods typically adopt inverse perspective mapping (IPM) to transform the Front-Viewed (FV) images or features into the Bird-Eye-Viewed (BEV) space for lane detection. However, IPM's dependence on flat ground assumption and context information loss in BEV representations lead to inaccurate 3D information estimation. Though efforts have been made to bypass BEV and directly predict 3D lanes from FV representations, their performances still fall behind BEV-based methods due to a lack of structured modeling of 3D lanes. In this paper, we propose a novel BEV-free method named Anchor3DLane++ which defines 3D lane anchors as structural representations and makes predictions directly from FV features. We also design a Prototype-based Adaptive Anchor Generation (PAAG) module to generate sample-adaptive sparse 3D anchors dynamically. In addition, an Equal-Width (EW) loss is developed to leverage the parallel property of lanes for regularization. Furthermore, camera-LiDAR fusion is also explored based on Anchor3DLane++ to leverage complementary information. Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane++ outperforms previous state-of-the-art methods.

Abstract:
In graph-based semi-supervised learning, the Green-function method is a classical method that works by computing the Green's function in the graph space. However, when applied to large graphs, especially sparse ones, this method performs unstably and unsatisfactorily. We analyze it in detail and propose a novel method from the perspective of optimization. On fully connected graphs, the method is equivalent to the Green-function method and can be seen as another interpretation with physical meanings, while on non-fully connected graphs, it helps to explain why the Green-function method fails on large sparse graphs. To solve this dilemma, we propose a workable approach to improve our proposed method. Unlike the original method, our improved method can also apply two accelerating techniques, Gaussian Elimination and Anchored Graphs, to become more efficient on large graphs. Finally, extensive experiments support our conclusions and demonstrate the efficiency, accuracy, and stability of our improved Green's function method.
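
For orientation, the classical Green-function baseline can be sketched as follows. This shows only the classical idea under common assumptions, not the improved method of this paper: the Green's function is taken to be the pseudo-inverse of the combinatorial graph Laplacian L = D - W, labels are propagated as F = G Y, and the toy graph and function name are hypothetical.

```python
import numpy as np

def green_function_labels(W, labeled_idx, labeled_classes, n_classes):
    """Predict a class for every node from a few labeled nodes.

    W: symmetric (n, n) edge-weight matrix of the graph.
    """
    L = np.diag(W.sum(axis=1)) - W      # combinatorial graph Laplacian
    G = np.linalg.pinv(L)               # Green's function (zero mode dropped)
    Y = np.zeros((W.shape[0], n_classes))
    Y[labeled_idx, labeled_classes] = 1.0
    return (G @ Y).argmax(axis=1)       # class with the largest propagated score

# Toy graph: two triangles joined by one weak edge between nodes 2 and 3.
W = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]
for i, j, w in edges:
    W[i, j] = W[j, i] = w

pred = green_function_labels(W, [0, 5], [0, 1], 2)
print(pred)
```

The dense pseudo-inverse costs cubic time in the number of nodes, which is one reason acceleration techniques matter on large graphs.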

Abstract:
Facial recognition (FR) technology offers convenience in our daily lives, but it also raises serious privacy issues due to unauthorized FR applications. To protect facial privacy, existing methods have proposed adversarial face examples that can fool FR systems. However, most of these methods work only in the digital domain and do not consider natural physical protections. In this paper, we present NatMask, a 3D-based method for creating natural and realistic adversarial face masks that can preserve facial identity in the physical world. Our method utilizes 3D face reconstruction and differentiable rendering to generate 2D face images with natural-looking facial masks. Moreover, we propose an identity-aware style injection (IASI) method to improve the naturalness and transferability of the mask texture. We evaluate our method on two face datasets to verify its effectiveness in protecting face identity against four state-of-the-art (SOTA) FR models and three commercial FR APIs in both digital and physical domains under black-box impersonation and dodging strategies. Experiments show that our method can generate adversarial masks with superior naturalness and physical realizability to safeguard face identity, outperforming SOTA methods by a large margin.

Abstract:
Comprehending natural language instructions is a desirable property for both 2D and 3D layout synthesis systems. Existing methods model object joint distributions and express object relations only implicitly, hindering the controllability of generation. We introduce InstructLayout, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 2D and 3D layout synthesis. The proposed semantic graph prior learns layout appearances and object distributions simultaneously, demonstrating versatility across various downstream tasks in a zero-shot manner. To facilitate the benchmarking for text-driven 2D and 3D scene synthesis, we curate two high-quality datasets of layout-instruction pairs, one for each setting, from public Internet resources with large language and multimodal models. Extensive experimental results reveal that the proposed method outperforms existing state-of-the-art approaches by a large margin in both 2D and 3D layout synthesis tasks. Thorough ablation studies confirm the efficacy of crucial design components.

Abstract:
Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in Simultaneous Localization and Mapping (SLAM) for photo-realistic rendering, particularly when using high-quality video sequences as input. However, existing methods struggle with motion-blurred frames, which are common in real-world scenarios like low-light or long-exposure conditions. This often results in a significant reduction in both camera localization accuracy and map reconstruction quality. To address this challenge, we propose a dense visual SLAM pipeline (i.e., MBA-SLAM) to handle severe motion-blurred inputs. Our approach integrates an efficient motion blur-aware tracker with either a NeRF-based or a Gaussian Splatting-based mapper. By accurately modeling the physical image formation process of motion-blurred images, our method simultaneously learns the 3D scene representation and estimates the camera’s local trajectory during exposure time, enabling proactive compensation for motion blur caused by camera movement. In our experiments, MBA-SLAM surpasses previous state-of-the-art methods in both camera localization and map reconstruction across a range of datasets, including synthetic and real datasets featuring sharp images as well as those affected by motion blur, highlighting the versatility and robustness of our approach.

Abstract:
In practical scenarios, time series forecasting necessitates not only accuracy but also efficiency. Consequently, the exploration of model architectures remains a perennially trending topic in research. To address these challenges, we propose a novel backbone architecture named Time Evidence Fusion Network (TEFN) from the perspective of information fusion. Specifically, we introduce the Basic Probability Assignment (BPA) Module based on evidence theory to capture the uncertainty of multivariate time series data from both channel and time dimensions. Additionally, we develop a novel multi-source information fusion method to effectively integrate the two distinct dimensions from BPA output, leading to improved forecasting accuracy. Lastly, we conduct extensive experiments to demonstrate that TEFN achieves performance comparable to state-of-the-art methods while maintaining significantly lower complexity and reduced training time. Our experiments also show that TEFN exhibits high robustness, with minimal error fluctuations during hyperparameter selection. Furthermore, because BPA is derived from fuzzy theory, TEFN offers a high degree of interpretability. Therefore, the proposed TEFN balances accuracy, efficiency, stability, and interpretability, making it a desirable solution for time series forecasting.

Abstract:
Humans can easily deduce the relative pose of a previously unseen object, without labeling or training, given only a single query-reference image pair. This is arguably achieved by incorporating i) 3D/2.5D shape perception from a single image, ii) render-and-compare simulation, and iii) rich semantic cue awareness to furnish (coarse) reference-query correspondence. Motivated by this, we propose a novel 3D generalizable relative pose estimation method by elaborating 3D/2.5D shape perception with a 2.5D shape from an RGB-D reference, fulfilling the render-and-compare paradigm with an off-the-shelf differentiable renderer, and leveraging the semantic cues from a pretrained model like DINOv2. Specifically, our differentiable renderer takes the 2.5D rotatable mesh textured by the RGB and the semantic maps (obtained by DINOv2 from the RGB input), then renders new RGB and semantic maps (with back-surface culling) under a novel rotated view. The refinement loss comes from comparing the rendered RGB and semantic maps with the query ones, back-propagating the gradients through the differentiable renderer to refine the 3D relative pose. As a result, our method can be readily applied to unseen objects, given only a single RGB-D reference, without labeling or training. Extensive experiments on LineMOD, LM-O, and YCB-V show that our training-free method significantly outperforms the state-of-the-art supervised methods, especially under the rigorous Acc@$5/10/15^\circ$ metrics and the challenging cross-dataset settings.

Abstract:
The need for improved diagnostic methods in ophthalmology is acute, especially in the underdeveloped regions with limited access to specialists and advanced equipment. Therefore, we introduce VisionUnite, a novel vision-language foundation model for ophthalmology enhanced with clinical knowledge. VisionUnite has been pretrained on an extensive dataset comprising 1.24 million image-text pairs, and further refined using our proposed MMFundus dataset, which includes 296,379 high-quality fundus image-text pairs and 889,137 simulated doctor-patient dialogue instances. Our experiments indicate that VisionUnite outperforms existing generative foundation models such as GPT-4V and Gemini Pro. It also demonstrates diagnostic capabilities comparable to junior ophthalmologists. VisionUnite performs well in various clinical scenarios including open-ended multi-disease diagnosis, clinical explanation, and patient interaction, making it a highly versatile tool for initial ophthalmic disease screening. VisionUnite can also serve as an educational aid for junior ophthalmologists, accelerating their acquisition of knowledge regarding both common and underrepresented ophthalmic conditions. VisionUnite represents a significant advancement in ophthalmology, with broad implications for diagnostics, medical education, and understanding of disease mechanisms.

Abstract:
Modern generative models, particularly denoising diffusion probabilistic models (DDPMs), provide high-quality synthetic images, enabling users to generate diverse images and videos that are realistic. However, in a number of situations, edge devices or individual institutions may possess locally collected data that is highly sensitive and must be kept private, such as in the fields of healthcare and finance. Under such federated learning (FL) settings, various methods for training generative models have been studied, but most of them assume generative adversarial networks (GANs), and the algorithms are specific to GANs and do not apply to other forms of generative models such as DDPMs. This paper proposes a new algorithm for training DDPMs under federated learning settings, VQ-FedDiff, which provides a personalized algorithm for training diffusion models that can generate higher-quality images, as measured by FID, while still keeping the risk of breaching sensitive information as low as that of locally-trained secure models. We demonstrate that VQ-FedDiff shows state-of-the-art performance in federated learning of diffusion models in both IID and non-IID settings, and on benchmark photorealistic and medical image datasets. Our results show that diffusion models can efficiently learn with decentralized, sensitive data, generating high-quality images while preserving data privacy.

Abstract:
Image restoration (IR) seeks to recover high-quality images from degraded observations caused by a wide range of factors, including noise, blur, compression, and adverse weather. While traditional IR methods have made notable progress by targeting individual degradation types, their specialization often comes at the cost of generalization, leaving them ill-equipped to handle the multifaceted distortions encountered in real-world applications. In response to this challenge, the all-in-one image restoration (AiOIR) paradigm has recently emerged, offering a unified framework that adeptly addresses multiple degradation types. These innovative models enhance the convenience and versatility by adaptively learning degradation-specific features while simultaneously leveraging shared knowledge across diverse corruptions. In this survey, we provide the first in-depth and systematic overview of AiOIR, delivering a structured taxonomy that categorizes existing methods by architectural designs, learning paradigms, and their core innovations. We systematically categorize current approaches and assess the challenges these models encounter, outlining research directions to propel this rapidly evolving field. To facilitate the evaluation of existing methods, we also consolidate widely-used datasets, evaluation protocols, and implementation practices, and compare and summarize the most advanced open-source models. As the first comprehensive review dedicated to AiOIR, this paper aims to map the conceptual landscape, synthesize prevailing techniques, and ignite further exploration toward more intelligent, unified, and adaptable visual restoration systems.

Abstract:
Black-Box Knowledge Distillation (B2KD) is a conservative task in cloud-to-edge model compression, emphasizing the protection of data privacy and model copyrights on both the cloud and edge. With invisible data and models hosted on the server, B2KD aims to utilize only the API queries of the teacher model’s inference results in the cloud to effectively distill a lightweight student model deployed on edge devices. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity in data distribution. To address these issues, we theoretically provide a new optimization direction from logits to cell boundary, different from direct logits alignment, and formalize a workflow comprising deprivatization, distillation, and adaptation at test time. Guided by this, we propose a method, Mapping-Emulation KD (MEKD), to enhance the robust prediction and anti-interference capabilities of the student model on edge devices for any unknown data distribution in real-world scenarios. Our method treats soft and hard responses alike and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator, 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points, and 3) adaptation: correcting the student’s online prediction bias through a graph propagation-based forward-only test-time adaptation algorithm. Our method demonstrates promising performance for edge model distillation and adaptation across different teacher-student pairs. We validate the effectiveness of our method on multiple image recognition benchmarks and various Deep Neural Network models, achieving state-of-the-art performance and showcasing its practical value in remote sensing image recognition applications.

Abstract:
Graph data in real-world scenarios undergo rapid and frequent changes, making it challenging for existing graph models to effectively handle the continuous influx of new data and accommodate data withdrawal requests. Frequently retraining graph models is resource-intensive and impractical. To address this pressing challenge, this paper introduces a new concept of graph memory learning. Its core idea is to enable a graph model to selectively remember new knowledge but forget old knowledge. Building on this approach, the paper presents a novel graph memory learning framework, Brain-inspired Graph Memory Learning (BGML), inspired by brain network dynamics and function-structure coupling strategies. BGML incorporates a multi-granular hierarchical progressive learning mechanism rooted in feature graph grain learning to mitigate potential conflict between memorization and forgetting in graph memory learning. This mechanism allows for a comprehensive and multi-level perception of local details within evolving graphs. In addition, to tackle the issue of unreliable structures in newly added incremental information, the paper introduces an information self-assessment ownership mechanism. This mechanism not only facilitates the propagation of incremental information within the model but also effectively preserves the integrity of past experiences. We design five types of graph memory learning tasks (regular, memory, unlearning, data-incremental, and class-incremental) to evaluate BGML. Its excellent performance is confirmed through extensive experiments on multiple node classification datasets.

Abstract:
Machine Learning, particularly Generative Adversarial Networks (GANs), has revolutionised Super-Resolution (SR). However, generated images often lack physical meaningfulness, which is essential for scientific applications. Our approach, PC-SRGAN, enhances image resolution while ensuring physical consistency for interpretable simulations. PC-SRGAN significantly improves both the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure compared to conventional SR methods, even with limited training data (e.g., only 13% of training data is required to achieve performance similar to SRGAN). Beyond SR, PC-SRGAN augments physically meaningful machine learning, incorporating numerically justified time integrators and advanced quality metrics. These advancements promise reliable and causal machine-learning models in scientific domains. A significant advantage of PC-SRGAN over conventional SR techniques is its physical consistency, which makes it a viable surrogate model for time-dependent problems. PC-SRGAN advances scientific machine learning by improving accuracy and efficiency, enhancing process understanding, and broadening applications to scientific research.

Abstract:
Recently, deep clustering methods have achieved remarkable results compared to traditional clustering approaches. However, their performance remains constrained by the absence of annotations. A thought-provoking observation is that there is still a significant gap between deep clustering and semi-supervised classification methods. Even with only a few labeled samples, the accuracy of semi-supervised learning is much higher than that of clustering. Given that we can annotate a small number of samples in a certain unsupervised way, the clustering task can be naturally transformed into a semi-supervised setting, thereby achieving comparable performance. Based on this intuition, we propose ClusMatch, a unified positive and negative pseudo-label learning based semi-supervised learning framework, which is pluggable and can be applied to existing deep clustering methods. Specifically, we first leverage the pre-trained deep clustering network to compute predictions for all samples, and then design specialized selection strategies to pick out a few high-quality samples as labeled samples for supervised learning. For the unselected samples, the novel unified positive and negative pseudo-label learning is introduced to provide additional supervised signals for semi-supervised fine-tuning. We also propose an adaptive positive-negative threshold learning strategy to further enhance the confidence of generated pseudo-labels. Extensive experiments on six widely-used datasets and one large-scale dataset demonstrate the superiority of our proposed ClusMatch. For example, ClusMatch achieves a significant accuracy improvement of 5.4% over the state-of-the-art method ProPos on average across these six datasets.

Abstract:
Recent years have witnessed the rapid development of general human action understanding. However, when applied to real-world applications such as sports analysis, most existing datasets are still unsatisfactory, because of their limitations in rich labels on multiple tasks, language instructions, high-quality 3D data, and diverse environments. In this paper, we present FLAG3D++, a large-scale benchmark for 3D fitness activity comprehension, which contains 180K sequences of 60 activity categories with language instruction. FLAG3D++ features the following four aspects: 1) fine-grained annotations of the temporal intervals of actions in the untrimmed long sequences and how well these actions are performed, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) accurate and dense 3D human pose captured from an advanced MoCap system to handle the complex activity and large movement, 4) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. In light of these features, we present two new practical applications, language-guided repetition action counting (L-RAC) and language-guided action quality assessment (L-AQA), which take the language descriptions as references to count the repetitions of an action and to assess the quality of the action, respectively. Furthermore, we propose a Hierarchical Language-Guided Graph Convolutional Network (HL-GCN) model to better fuse the language information and skeleton sequences for L-RAC and L-AQA. To be specific, the HL-GCN performs cross-modal alignments by the early fusion of the linguistic feature and the hierarchical node features of the skeleton-based sequences encoded by the multiple intermediate graph convolutional layers.
Extensive experiments show the superiority of our HL-GCN on both L-RAC and L-AQA, as well as the great research value of FLAG3D++ for various challenges, such as dynamic human mesh recovery and cross-domain human action recognition. Our dataset, source code, and trained models are made publicly available at FLAG3D++.

Abstract:
Image fusion aims to merge image pairs collected by different sensors over the same scene, preserving their distinct features. Recent works have often focused on designing various image fusion losses, developing different network architectures, and leveraging downstream tasks (e.g., object detection) for image fusion. However, few studies have explored how language and semantic masks can serve as guidance to aid image fusion. In this paper, we investigate how the combination of language and masks can guide image fusion tasks, discarding the previously complex frameworks, which rely on downstream tasks, GAN-based cycle training, diffusion models, or deep image priors. Additionally, we exploit a recurrent neural network-like architecture to build a lightweight network that avoids the quadratic cost of traditional attention mechanisms. To adapt the receptance weighted key value (RWKV) model to an image modality, we modify it into a bidirectional version using an efficient scanning strategy (ESS). To guide image fusion by language and mask features, we introduce a multi-modal fusion module (MFM) to facilitate information exchange. Comprehensive experiments show that the proposed framework achieves state-of-the-art results in various image fusion tasks (i.e., visible-infrared image fusion, multi-focus image fusion, multi-exposure image fusion, medical image fusion, hyperspectral and multispectral image fusion, and pansharpening).

Abstract:
Neural Architecture Search (NAS) has been extensively studied due to its ability in automatic architecture engineering. Existing NAS methods rely heavily on the gradients and data labels, which either incur immense computational costs or suffer from discretization discrepancy due to the supernet structure. Moreover, the majority of them are limited in generating diverse architectures. To alleviate these issues, in this paper, we propose a novel zero-cost proxy called $\mathsf{MeCo}$ based on the Pearson correlation matrix of the feature maps. Unlike the previous work, the computation of $\mathsf{MeCo}$ as well as its variant $\mathsf{MeCo_{opt}}$ requires only one random data sample and a single forward pass. Based on the proposed zero-cost proxy, we further craft a new zero-shot NAS scheme called $\mathsf{FLASH}$, which harnesses a new proxy-based operation scoring function and a greedy heuristic. Compared to the existing methods, $\mathsf{FLASH}$ is highly efficient and can construct diverse model architectures instead of repeated cells. We design comprehensive experiments and extensively evaluate our designs on multiple benchmarks and datasets. The experimental results show that our method is one to six orders of magnitude more efficient than the state-of-the-art baselines with the highest model accuracy.
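
As a rough illustration of the ingredient named in this abstract, the sketch below computes a Pearson correlation matrix over the channels of a feature map obtained from one random input and one forward pass. The abstract does not define how the proxy turns this matrix into a score, so the minimum-eigenvalue summary used here is an assumption, and the 1x1 "layer" with random weights is a hypothetical stand-in for a real network.

```python
import numpy as np


def pearson_matrix(feature_maps: np.ndarray) -> np.ndarray:
    """Pearson correlation between channels of a (C, H, W) feature map."""
    channels = feature_maps.shape[0]
    flat = feature_maps.reshape(channels, -1)  # one row per channel
    return np.corrcoef(flat)


def min_eig_score(feature_maps: np.ndarray) -> float:
    """Assumed scalar summary: smallest eigenvalue of the correlation matrix."""
    corr = pearson_matrix(feature_maps)
    return float(np.linalg.eigvalsh(corr)[0])  # eigvalsh sorts ascending


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))   # a single random input, no labels
w = rng.standard_normal((4, 3))      # hypothetical 1x1 convolution weights
fmap = np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)  # one forward pass + ReLU

corr = pearson_matrix(fmap)
score = min_eig_score(fmap)
print(corr.shape, score)
```

No gradients or labels are involved, which is the property the abstract emphasizes; how the score is aggregated across layers is left out because the abstract does not say.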

Abstract:
The pursuit of model explainability has prompted the selective rationalization (aka, rationale extraction) which can identify important features (i.e., rationales) from the original input to support prediction results. Existing methods typically involve a cascaded approach with a selector responsible for extracting rationales from the input, followed by a predictor that makes predictions based on the selected rationales. However, these approaches often neglect the information contained in the non-rationales, underutilizing the input. Therefore, in our prior work, we introduce the Disentanglement-Augmented Rationale Extraction (DARE) method, which disentangles the input into rationale and non-rationale components, and enhances rationale representations by minimizing the mutual information between them. While DARE demonstrates strong performance in rationalization, it may still rely on shortcuts in the training distribution, leading to unfaithful rationales. To this end, in this paper, we propose Faith-DARE, an extension of DARE that aims to extract more reliable rationales by mitigating shortcut dependencies. Specifically, we treat the non-rationale features identified by DARE as environments that are decorrelated from the predictions. By shuffling and recombining these environments with rationales, we generate counterfactual samples and identify invariant rationales that remain predictive across shifted distributions. Extensive experiments on graph and textual datasets validate the effectiveness of Faith-DARE.

Abstract:
The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high-fidelity and low-cost requirements. Their main bottleneck lies in the additional prompt image encoder (i.e., CLIP vision encoder), which produces weak alignment signals with the text-to-image model that may lose face information and is not well ‘absorbed’ by the text-to-image model. To this end, we propose Inv-Adapter, which first introduces a more reasonable and efficient token representation of ID image features and then introduces a lightweight parameter adaptor to inject ID features. Specifically, our Inv-Adapter extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without an additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then introduce a lightweight attention adapter to embed them efficiently into the base text-to-image model. We conduct extensive experiments on different text-to-image models to assess ID fidelity, generation loyalty, speed, training costs, model scale, and generalization ability in general-object scenarios, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.

Abstract:
Due to the rich information and original data distribution, RAW data are widely used in many computer vision applications. However, the use of RAW video remains limited because of the high storage costs associated with data collection. Previous works have attempted to reconstruct RAW frames from sRGB data using small sampled metadata from the original RAW frames. Yet, these algorithms struggle with RAW video reconstruction due to the high computational cost of sampling metadata on cameras. To address these issues, we propose a new RAW video reconstruction pipeline that de-renders high-quality RAW videos from sRGB data using only one initial RAW frame as a reference. Specifically, we introduce three new models to achieve this goal. First, we present the Temporal-Affinity Guided De-rendering Network. This network leverages the temporal affinity between adjacent frames to construct a reference RAW image from previous RAW pixels. The corresponding RAW pixels in the previous frame provide valuable information about the original RAW data distribution, aiding in the precise reconstruction of the current frame. Second, to recover the missing RAW pixels caused by camera and foreground movement, we fully exploit the rich prior information from a pre-trained diffusion model and propose the RAW In-painting Model. This model can accurately fill in hollow regions in a RAW image based on the corresponding sRGB image and the surrounding RAW context. Lastly, we present a lightweight content-aware video clipper that automatically adjusts the clip length used for RAW video reconstruction, thereby balancing storage requirements with reconstruction quality. To better evaluate the performance of the proposed framework across different devices, we introduce the first RAW video reconstruction benchmark that comprises RAW videos from six types of camera devices with challenging scenarios. 
Experimental results demonstrate that our algorithm can accurately reconstruct RAW videos across all the scenarios.

Abstract:
Conventional Multi-Source Free Domain Adaptation (MSFDA) assumes that each source domain provides a single source model, and all source models adopt a uniform architecture. This paper introduces Zoo-MSFDA, a more general setting that allows each source domain to offer a zoo of multiple source models with different architectures. While it enriches the source knowledge, Zoo-MSFDA risks being dominated by suboptimal/harmful models. To address this issue, we theoretically analyze the model selection problem in Zoo-MSFDA, and introduce two principles: the transferability principle and the diversity principle. Recognizing the challenge of measuring transferability, we subsequently propose a novel Source-Free Unsupervised Transferability Estimation (SUTE). It enables assessing and comparing transferability across multiple source models with different architectures under domain shift, without requiring target labels or source data. Based on the above, we introduce a Selection, Ensemble, and Adaptation (SEA) framework to address Zoo-MSFDA, which consists of: 1) source model selection based on the proposed principles and SUTE; 2) ensemble construction based on SUTE-estimated transferability; 3) target-domain adaptation of the ensemble model. Evaluations demonstrate that our SEA framework, with the introduced Zoo-MSFDA setting, significantly improves adaptation performance in 2D image classification tasks. Additionally, our SUTE achieves state-of-the-art performance in transferability estimation.

Abstract:
Equivariant quantum graph neural networks (EQGNNs) offer a potentially powerful method to process graph data. However, existing EQGNN models only consider the permutation symmetry of graphs and fail to fully exploit the geometric and non-geometric information in graphs, resulting in suboptimal performance when processing 3D graph data. To address these limitations, we derive constraints of rotation and permutation equivariance, and then propose a novel rotation- and permutation-equivariant quantum graph neural network (RP-EQGNN). An equivariant module is designed to extract the geometric information. Then, a convolution and entanglement module is constructed to extract non-geometric information. To improve the performance of our model, an edge entanglement strategy is designed to perform distinguishable entanglement operations based on edge heterogeneity. The experimental results demonstrate that RP-EQGNN is significantly better for graph regression on the QM9 dataset and the OC20 dataset than Q3DGL and EQC in MAE and achieves results comparable to those of EquiformerV2, Geoformer, SO3KRATES and HEGNN. It also has an advantage for point cloud classification on the ModelNet40 dataset over quantum models, including sQCNN-3D and PI-QSVM. RP-EQGNN introduces an innovative approach for processing 3D graph data, establishing a basis for future investigations into symmetries within graph neural networks.

Abstract:
We propose and demonstrate the event-based visual microphone (EBVM), a passive electro-optical technique for remotely capturing audio signals using an event-based camera without any use of a conventional microphone. The event-based camera records local angular deformations of a surface induced by the sound propagation by observing the changes in the specular reflections at each pixel. By interpreting the timings of the specular incidences deduced from the event stream as signal level-crossings, we reconstruct the audio signal by imposing short-time Fourier sparsity conditions. The recovered audio signal is qualitatively comparable to or better than the prior art (intensity-based visual microphone), while simultaneously expanding the field of view by approximately 25 times and reducing data volume by three orders of magnitude. The proposed EBVM was tested on speech signal reconstruction as well as novel event-based acousto-optical passive ranging.

Abstract:
Human intelligence is characterized by our ability to absorb and apply knowledge from the world around us, especially in rapidly acquiring new concepts from minimal examples, underpinned by prior knowledge. Few-shot learning (FSL) aims to mimic this capacity by enabling significant generalizations and transferability. However, traditional FSL frameworks often rely on assumptions of clean, complete, and static data, conditions that are seldom met in real-world environments. Such assumptions falter in the inherently uncertain, incomplete, and dynamic contexts of the open world. This paper presents a comprehensive review of recent advancements designed to adapt FSL to open-world environments. We categorize existing methods into three distinct types of FSL in the open world: those involving varying instances, varying classes, and varying distributions. Each category is discussed in terms of its specific challenges and methods, as well as its strengths and weaknesses. We standardize experimental settings and metric benchmarks across scenarios and provide a comparative analysis of the performance of various methods. In conclusion, we outline potential future research directions for this evolving field. It is our hope that this review will catalyze further development of effective solutions to these complex challenges, thereby advancing the field of artificial intelligence.

Abstract:
Recently, a tensor-on-tensor (ToT) regression model has been proposed to generalize tensor recovery, encompassing scenarios like scalar-on-tensor regression and tensor-on-vector regression. However, the exponential growth in tensor complexity poses challenges for storage and computation in ToT regression. To overcome this hurdle, tensor decompositions have been introduced, with the tensor train (TT)-based ToT model proving efficient in practice due to reduced memory requirements, enhanced computational efficiency, and decreased sampling complexity. Despite these practical benefits, a disparity exists between theoretical analysis and real-world performance. In this paper, we delve into the theoretical and algorithmic aspects of the TT-based ToT regression model. Assuming the regression operator satisfies the restricted isometry property (RIP), we conduct an error analysis for the solution to a constrained least-squares optimization problem. This analysis includes an upper error bound and a minimax lower bound, revealing that such error bounds polynomially depend on the order $N+M$. To efficiently find solutions meeting such error bounds, we propose two optimization algorithms: the iterative hard thresholding (IHT) algorithm (employing gradient descent with TT-singular value decomposition (TT-SVD)) and the factorization approach using the Riemannian gradient descent (RGD) algorithm. When RIP is satisfied, spectral initialization facilitates proper initialization, and we establish the linear convergence rate of both IHT and RGD. Notably, compared to the IHT, which optimizes the entire tensor in each iteration while maintaining the TT structure through TT-SVD and poses a challenge for storage memory in practice, the RGD optimizes factors in the so-called left-orthogonal TT format, enforcing orthonormality among most of the factors, over the Stiefel manifold, thereby reducing the storage complexity of the IHT.
However, this reduction in storage memory comes at a cost: the recovery of RGD is worse than that of IHT, while the error bounds of both algorithms depend on N+MN+M polynomially. Experimental validation substantiates the validity of our theoretical findings.
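As a reading aid, the IHT iteration described above (a gradient step on the least-squares objective followed by TT-SVD truncation) is commonly written in the following generic form. The symbols are assumptions of this sketch and are not taken from the paper: \mathcal{A} is the linear regression operator, \mathcal{A}^* its adjoint, \mathcal{Y} the observed responses, \mu a step size, and \mathrm{SVD}^{tt}_{\mathbf{r}} the TT-SVD truncation to TT ranks \mathbf{r}.

```latex
% Generic IHT step for TT-constrained least squares (notation assumed, not the paper's)
\min_{\mathcal{X}:\ \mathrm{rank}_{tt}(\mathcal{X}) \le \mathbf{r}}
  \ \tfrac{1}{2}\,\bigl\| \mathcal{A}(\mathcal{X}) - \mathcal{Y} \bigr\|_F^2,
\qquad
\mathcal{X}^{(t+1)} = \mathrm{SVD}^{tt}_{\mathbf{r}}\!\Bigl(
  \mathcal{X}^{(t)} - \mu\, \mathcal{A}^{*}\bigl(\mathcal{A}(\mathcal{X}^{(t)}) - \mathcal{Y}\bigr)
\Bigr)
```

The full tensor \mathcal{X}^{(t)} appears explicitly in every step, which is consistent with the storage concern the abstract raises for IHT.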

Abstract:
Transformers based on the Self-Attention (SA) mechanism have demonstrated unrivaled superiority in numerous areas. Compared to RNN-based networks, Transformers can learn the temporal dependency representation of an entire sequence in parallel, while efficiently dealing with long-range dependencies. However, the O(L^2) computational complexity of the SA mechanism (L denotes the length of the sequence) and the high memory usage make the construction cost of Transformer-based models prohibitively expensive. To address these challenges, we propose a Transformer-like model, HPformer: Low-Parameter Transformer with Temporal Dependency Hierarchical Propagation. HPformer first chunks the sequence into K sequence segments (K = ⌈log L⌉ + 1, where ⌈·⌉ denotes the ceiling operation), then leverages the hierarchical propagation mechanism with O(L) computational complexity to learn the temporal dependencies between the segments and within the segments, and ultimately generates K vectors as Key matrices. This reduces the complexity of the SA mechanism from O(L^2) to O(L log L). In addition, we employ a strategy of sharing Key and Value matrices between layers to build the HPformer, thus reducing memory usage. Extensive experiments based on a public health informatics benchmark and the Long-Range Arena (LRA) benchmark have demonstrated that HPformer has advantages over Transformer-based models in terms of memory usage and efficiency.
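The complexity claim can be illustrated with a small counting sketch: with L queries and only K key vectors, the number of attention scores is L·K instead of L·L. The function names are hypothetical, and the base-2 logarithm is an assumption, since the abstract does not state the base.

```python
import math


def num_key_vectors(seq_len: int) -> int:
    """K = ceil(log L) + 1. The base-2 logarithm is an assumption of this sketch."""
    return math.ceil(math.log2(seq_len)) + 1


def attention_score_counts(seq_len: int) -> tuple[int, int]:
    """Return (scores for full self-attention, scores with K key vectors)."""
    k = num_key_vectors(seq_len)
    return seq_len * seq_len, seq_len * k


if __name__ == "__main__":
    for length in (256, 1024, 4096):
        full, reduced = attention_score_counts(length)
        print(f"L={length}: full={full}, reduced={reduced}")
```

This only counts query-key score evaluations; it does not model the hierarchical propagation step itself.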

Abstract:
Thanks to advances in deep learning techniques, Human Pose Estimation (HPE) has achieved significant progress in natural scenarios. However, these models perform poorly in artificial scenarios such as painting and sculpture due to the domain gap, constraining the development of virtual reality and augmented reality. With the growth of model size, retraining the whole model on both natural and artificial data is computationally expensive and inefficient. Our research aims to bridge the domain gap between natural and artificial scenarios with efficient tuning strategies. Leveraging the potential of language models, we enhance the adaptability of traditional pose estimation models across diverse scenarios with a novel framework called VLPose. VLPose leverages the synergy between language and vision to extend the generalization and robustness of pose estimation models beyond the traditional domains. Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively, compared to state-of-the-art tuning strategies.

Affiliations: School of Artificial Intelligence, Shenzhen University, Shenzhen, China; School of Artificial Intelligence, Beijing Normal University, Beijing, China; Department of Computer Vision Technology (VIS), Baidu Inc., Beijing, China; Shenzhen Institute of Advanced Study, University of Electronic Science and Technology of China (UESTC), Chengdu, China; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong-Macao Greater Bay Area, Shenzhen Polytechnic University, Shenzhen, China; Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China

Abstract:
Gait recognition, a rapidly advancing vision technology for person identification from a distance, has made significant strides in indoor settings. However, evidence suggests that existing methods often yield unsatisfactory results when applied to newly released real-world gait datasets. Furthermore, conclusions drawn from indoor gait datasets may not easily generalize to outdoor ones. Therefore, the primary goal of this paper is to present a comprehensive benchmark study aimed at improving practicality rather than solely focusing on enhancing performance. To this end, we developed OpenGait, a flexible and efficient gait recognition platform. Using OpenGait, we conducted in-depth ablation experiments to revisit recent developments in gait recognition. Surprisingly, we detected some imperfect parts of some prior methods and thereby uncovered several critical yet previously neglected insights. These findings led us to develop three structurally simple yet empirically powerful and practically robust baseline models: DeepGaitV2, SkeletonGait, and SkeletonGait++, which represent the appearance-based, model-based, and multi-modal methodologies for gait pattern description, respectively. In addition to achieving state-of-the-art performance, our careful exploration provides new perspectives on the modeling experience of deep gait models and the representational capacity of typical gait modalities. In the end, we discuss the key trends and challenges in current gait recognition, aiming to inspire further advancements towards better practicality.

Abstract:
The major challenge in learning-based RF sensing is acquiring high-quality, large-scale annotated datasets. Unlike visual datasets, RF signals are inherently non-intuitive and non-interpretable, making their annotation both time-consuming and labor-intensive. To address this challenge, we propose RF-URL 2.0, a novel unsupervised representation learning (URL) framework for RF sensing, which enables pre-training on easily collected, large-scale unannotated RF datasets to make downstream tasks easier to solve. Existing URL techniques, such as contrastive learning, are primarily designed for natural images and are prone to learning shortcuts rather than meaningful information when applied to RF signals. RF-URL 2.0 is the first framework to overcome these limitations by constructing positive and negative pairs through well-established RF signal processing algorithms. In addition, it introduces a novel signal-model-driven augmentation technique, which augments signal representations by identifying and perturbing physically meaningful parameters of signal processing models. Moreover, RF-URL 2.0 is carefully designed to take into account the heterogeneous characteristics of different RF signal processing representations. We show the universality of RF-URL 2.0 in three typical RF sensing tasks using two general RF devices (WiFi and radar), including human gesture recognition, 3D pose estimation, and silhouette generation. Extensive experiments on the HIBER and WiDAR 3.0 datasets demonstrate that RF-URL 2.0 takes a significant step toward learning-based solutions for RF sensing.

Abstract:
Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging tasks that involve identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose a unified and effective framework called NICE that can jointly learn these two panoptic narrative recognition tasks. Existing visual grounding tasks use a two-branch paradigm, but applying this directly to PND and PNS can result in prediction conflict due to their intrinsic many-to-many alignment property. To address this, we introduce two cascading modules based on the barycenter of the mask, which are Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection, respectively. By linking PNS and PND in series with the barycenter of segmentation as the anchor, our approach naturally aligns the two tasks and allows them to complement each other for improved performance. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL’s reliance on a large number of candidate boxes. BDL leverages its excellent properties to distinguish different instances, which improves the performance of CGA for segmentation. Extensive experiments demonstrate that NICE surpasses all existing methods by a large margin, achieving 4.1% for PND and 2.9% for PNS over the state-of-the-art. These results validate the effectiveness of our proposed collaborative learning strategy.

Abstract:
With advancements in robust stereo matching and optical flow estimation networks, models pre-trained on synthetic data demonstrate strong robustness to unseen domains. However, their robustness can be seriously degraded when fine-tuning them in real-world scenarios. This paper investigates fine-tuning stereo matching and optical flow estimation networks without compromising their robustness to unseen domains. Specifically, we divide the pixels into consistent and inconsistent regions by comparing Ground Truth (GT) with Pseudo Label (PL) and demonstrate that the imbalanced learning of consistent and inconsistent regions in GT causes robustness degradation. Based on our analysis, we propose the DKT framework, which utilizes PL to balance the learning of different regions in GT. The core idea is to utilize an exponential moving average (EMA) teacher to measure what the student network has learned and dynamically adjust the learning regions. We further propose the DKT++ framework, which improves target-domain performance and network robustness by applying slow-fast update teachers to generate more accurate PL and by introducing unlabeled data and synthetic data. We integrate our frameworks with state-of-the-art networks and evaluate their effectiveness on several real-world datasets. Extensive experiments show that our method effectively preserves the robustness of stereo matching and optical flow networks during fine-tuning.
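For readers unfamiliar with EMA teachers, the generic update rule is sketched below on plain lists of floats. This is the standard exponential moving average, not the DKT implementation; the function name and the momentum value are assumptions for illustration.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight a small step toward the matching student weight.

    teacher, student: equal-length lists of floats (stand-ins for network weights).
    momentum: fraction of the old teacher weight that is kept (assumed value).
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [0.0, 0.0]
    student = [1.0, 2.0]
    # The teacher drifts slowly toward the student over repeated updates.
    for _ in range(3):
        teacher = ema_update(teacher, student, momentum=0.9)
    print(teacher)
```

A momentum close to 1 gives a slowly changing teacher, while a smaller momentum tracks the student more quickly, which is one plausible reading of the "slow-fast update teachers" mentioned for DKT++.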

Abstract:
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, namely NCLR, focuses on 2D-3D neural calibration, a novel pretext task that estimates the rigid pose aligning camera and LiDAR coordinate systems. First, we propose the learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and point cloud with the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid pose. The framework not only learns fine-grained matching from points to pixels but also achieves alignment of the image and point cloud at a holistic level, understanding the LiDAR-to-camera extrinsic parameters. We demonstrate the efficacy of NCLR by applying the pre-trained backbone to downstream tasks, such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network’s understanding abilities and effectiveness of learned representation.

Abstract:
Image outpainting aims to generate the content of an input sub-image outside its boundaries, which remains open for existing generative models. This paper explores image outpainting in three directions that have not been achieved in literature to our knowledge: outpainting 1) with continuous multiples (in contrast to the discrete ones by existing methods); 2) with arbitrary resolutions; and 3) in a single step (for any multiples and resolutions). The arbitrary multiple outpainting is achieved by utilizing randomly cropped views from the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and relative positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The continuous-resolution outpainting is achieved by introducing the multi-scale training strategy into generative models. Specifically, by disentangling the image resolution and the number of patches, it can generate images with arbitrary resolutions without post-processing. Meanwhile, we propose a query-based contrastive objective to make our method not rely on a pre-trained backbone network which is otherwise often required in peer methods. The comprehensive experimental results on public benchmarks show its superior performance over state-of-the-art approaches.

Abstract:
Optical computing systems provide high-speed and low-energy data processing but face deficiencies in computationally demanding training and simulation-to-reality gaps. We propose a gradient-based model-free optimization (G-MFO) method based on a Monte Carlo gradient estimation algorithm for computationally efficient in situ training of optical computing systems. This approach treats an optical computing system as a black box and back-propagates the loss directly to the optical computing weights’ probability distributions, circumventing the need for a computationally heavy and biased system simulation. Our experiments on diffractive optical computing systems show that G-MFO outperforms hybrid training on the MNIST and FMNIST datasets. Furthermore, we demonstrate image-free and high-speed classification of cells from their marker-free phase maps. Our method’s model-free and high-performance nature, combined with its low demand for computational resources, paves the way for accelerating the transition of optical computing from laboratory demonstrations to practical, real-world applications.
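The idea of back-propagating a loss to the parameters of a weight distribution while treating the system as a black box can be illustrated with a generic score-function (Monte Carlo) gradient estimator. The sketch below is not the G-MFO algorithm: the one-dimensional quadratic `black_box_loss`, the Gaussian weight distribution, the batch-mean baseline, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def black_box_loss(w: float) -> float:
    """Hypothetical stand-in for a measured system loss; only its values are used."""
    return (w - 3.0) ** 2


def estimate_mean_gradient(mu, sigma, n_samples, rng):
    """Score-function estimate of d/d(mu) E[loss(w)] for w ~ N(mu, sigma^2)."""
    samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    losses = [black_box_loss(w) for w in samples]
    baseline = sum(losses) / n_samples  # batch-mean baseline to reduce variance
    # d/d(mu) log N(w; mu, sigma^2) = (w - mu) / sigma^2
    return sum(
        (loss - baseline) * (w - mu) / sigma**2 for w, loss in zip(samples, losses)
    ) / n_samples


def train(steps=300, lr=0.05, sigma=0.5, n_samples=64, seed=0):
    """Gradient descent on the distribution mean using only loss evaluations."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        mu -= lr * estimate_mean_gradient(mu, sigma, n_samples, rng)
    return mu


if __name__ == "__main__":
    print(train())  # should end near the minimizer w = 3
```

No derivative of `black_box_loss` is ever computed, which mirrors the model-free property the abstract emphasizes.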

Abstract:
Current prevailing vision-language models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting and with full labels. The major bottleneck of current 3D scene recognition approaches for robotic applications is that these models cannot recognize unseen novel classes beyond the training categories in diverse real-world applications such as robot manipulation and robot navigation. Meanwhile, current state-of-the-art 3D scene understanding approaches require a large number of high-quality labels to train neural networks, and perform well only in a fully supervised manner. Therefore, there is an urgent need for a framework that is applicable to both 3D point cloud segmentation and detection, particularly when labels are scarce. This work presents a generalized and straightforward framework for 3D scene understanding when labeled scenes are limited. To extract knowledge of novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that extracts and distills meaningful information from large-scale vision-language models, which benefits open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel energy-based loss with boundary awareness that benefits from region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark on both semantic segmentation and instance segmentation. WS3D++ also achieves state-of-the-art data-efficient learning performance on the other large-scale real-scene indoor and outdoor datasets S3DIS and SemanticKITTI. Extensive experiments with both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.

Abstract:
Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.

Abstract:
Most existing CoSOD models focus solely on extracting co-saliency cues while neglecting explicit exploration of background regions, potentially leading to difficulties in handling interference from complex background areas. To address this, this paper proposes a Discriminative co-saliency and background Mining Transformer framework (DMT) to explicitly mine both co-saliency and background information and effectively model their discriminability. DMT first learns two types of tokens by disjointly extracting co-saliency and background information from segmentation features, then performs discriminability within the segmentation features guided by these well-learned tokens. In the first phase, we propose economic multi-grained correlation modules for efficient detection information extraction, including Region-to-Region (R2R), Contrast-induced Pixel-to-Token (CtP2T), and Co-saliency Token-to-Token (CoT2T) correlation modules. In the subsequent phase, we introduce Token-Guided Feature Refinement (TGFR) modules to enhance discriminability within the segmentation features. To further enhance the discriminative modeling and practicality of DMT, we first upgrade the original TGFR’s intra-image modeling approach to an intra-group one, thus proposing Group TGFR (G-TGFR), which is more suitable for the co-saliency task. Subsequently, we designed a Noise Propagation Suppression (NPS) mechanism to apply our model to a more practical open-world scenario, ultimately presenting our extended version, i.e. DMT+O. Extensive experimental results on both conventional CoSOD and open-world CoSOD benchmark datasets demonstrate the effectiveness of our proposed model.

Abstract:
Bayesian neural networks (BNNs) treat neural network weights as random variables, aiming to provide posterior uncertainty estimates and avoid overfitting by performing inference on the posterior weights. However, selection of appropriate prior distributions remains a challenging task, and BNNs may suffer from catastrophically inflated variance or poor predictive performance when poor choices are made for the priors. Existing BNN designs apply different priors to weights, but the behaviours of these priors make it difficult to sufficiently shrink noisy signals, or they are prone to overshrinking important signals in the weights. To alleviate this problem, we propose a novel R2D2-Net, which imposes the R^2-induced Dirichlet Decomposition (R2D2) prior on the BNN weights. The R2D2-Net can effectively shrink irrelevant coefficients towards zero, while preventing key features from over-shrinkage. To approximate the posterior distribution of weights more accurately, we further propose a variational Gibbs inference algorithm that combines the Gibbs updating procedure and gradient-based optimization. This strategy enhances stability and consistency in estimation when the variational objective involving the shrinkage parameters is non-convex. We also analyze the evidence lower bound (ELBO) and the posterior concentration rates from a theoretical perspective. Experiments on both natural and medical image classification and uncertainty estimation tasks demonstrate satisfactory performance of our method.

Abstract:
Eye-tracking is a reliable method for quantifying visual information processing and holds significant potential for group recognition, such as identifying autism spectrum disorder (ASD). However, eye-tracking research typically faces the heterogeneity of stimuli and is time-consuming due to the large number of observed stimuli. To address these issues, we first mathematically define the stimulus selection problem and introduce the concept of stimulus discrimination ability to reduce the computational complexity of the solution. Then, we construct a scanpath-based recognition model to mine the stimulus discrimination ability. Specifically, we propose cross-subject entropy and cross-subject divergence scores for quantitatively evaluating stimulus discrimination ability, effectively capturing differences in intra-group collective trends and inter-subject consistency within a group. Furthermore, we propose an iterative learning mechanism that employs stimulus-wise attention to focus on discriminative stimuli for discrimination purification. In the experiment, we construct an ASD eye-tracking dataset with diverse stimulus types and conduct extensive tests on three representative models to validate our approach. Remarkably, our method demonstrates superior performance using only 10 selected stimuli compared to models utilizing 220 stimuli. Additionally, we perform experiments on another eye-tracking task, gender prediction, to further validate our method. We believe that our approach is both simple and flexible for integration into existing models, promoting large-scale ASD screening and extending to other eye-tracking research domains.

Abstract:
Despite the effectiveness of convolutional neural networks (CNNs) in visual categorization, the logic behind their predictions is not human-understandable. While existing concept-based explainability methods reveal what a CNN sees, there is a need to understand how a specific concept is chosen (rather than another concept) for a prediction, aligning more closely with human perception. To address this challenge, we propose a novel contrastive paradigm to bridge the critical gap in global concept discovery by leveraging contrasts from cognitive sciences for discriminative concept retrieval. A new multiple-case concept retrieval method is proposed for improved local understanding of (dis)similar classification cases. We argue that a contrastive paradigm for concept retrieval and sanity checks is essential to an explainer’s trustworthiness and integrate these missing ingredients into state-of-the-art concept-based explanation frameworks to foster a better human understanding through contrast. The proposed Contrastive Perceptual Inference and Sanity Checks for Concept-based CNN Explanations (CoPISan) framework accelerates salient concept retrieval. It evaluates explainer trustworthiness via sanity checks conducted under Frontdoor and Poisoning adversarial attacks. Experimental results demonstrate CoPISan’s encouraging performance, mitigating issues related to duplication, entanglement, diminishing returns, and ambiguity of concept explanations. CoPISan is motivated by cognition and perception, offers theoretical justification and resilience, and is computationally efficient.

Abstract:
We propose a compact snapshot monocular depth estimation technique that relies on an engineered point spread function (PSF). Traditional approaches used in microscopic super-resolution imaging such as the Double-Helix PSF (DHPSF) are ill-suited for scenes that are more complex than a sparse set of point light sources. We show, using the Cramér-Rao lower bound, that separating the two lobes of the DHPSF and thereby capturing two separate images leads to a dramatic increase in depth accuracy. A special property of the phase mask used for generating the DHPSF is that a separation of the phase mask into two halves leads to a spatial separation of the two lobes. We leverage this property to build a compact polarization-based optical setup, where we place two orthogonal linear polarizers on each half of the DHPSF phase mask and then capture the resulting image with a polarization-sensitive camera. Results from simulations and a lab prototype demonstrate that our technique achieves up to 50% lower depth error compared to state-of-the-art designs including the DHPSF and the Tetrapod PSF, with little to no loss in spatial resolution.

Affiliations: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China; Key Laboratory of the Ministry of Education for Mathematical Foundations and Applications of Digital Technology, School of Mathematical Sciences, University of Science and Technology of China, Hefei, China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; College of Computing and Data Science, Nanyang Technological University, Singapore

Abstract:
Constrained optimization problems are pervasive in various fields, and while conventional techniques offer solutions, they often struggle with scalability. Leveraging the power of deep neural networks (DNNs) in optimization, we present a novel learning-based approach, the Constraint Boundary Wandering Framework (CBWF), to address these challenges. Our contributions include introducing a boundary wandering strategy inspired by the active-set method, enhancing equality constraint feasibility, and treating the Lipschitz constant as a learnable parameter. Additionally, we evaluate the regularization term, illustrating that the nonsmooth L2 norm yields superior results. Extensive testing on synthetic datasets and the ACOPT dataset demonstrates CBWF's superiority, outperforming existing deep learning-based solvers in terms of both objective and constraint loss.

Abstract:
3D scene graphs have emerged as a powerful high-level representation of the environment and are regarded as a prerequisite for long-term autonomous robotic operations. A practical research problem here is to predict the 3D scene graph from sequentially captured data. However, existing methods neglect the polysemy of semantic roles: coarse feature vectors are insufficient to represent entities under different relationship semantics. This severely limits their capability to predict relationships. We propose an approach to tackle this challenge by introducing a novel representation, the hyperrectangle embedding, which represents each entity using distinctive geometry for more effective scene understanding, rather than learning vector-based features with blindly increasing dimensions. By representing an entity with two affine-transformed embeddings, one for the subject role and one for the object role, each characterized by a separate learnable transformation, we capture the polysemy of semantic roles. The intersections of affine-transformed hyperrectangle embeddings represent the bidirectional relationship between two entities. We identify bias and reliability as two challenges impeding the model learning process. In response to the bias that arises from long-tailed distributions in the data, we propose a history-guided debiasing strategy that utilizes a confusion history block composed of previous hyperrectangle embeddings. This strategy mitigates inherent biases by extracting pertinent information and facilitating knowledge transfer from dominant categories to rare ones. To enhance the reliability of predictions, we introduce predictive uncertainty into the 3D scene graph prediction task. We develop a post-hoc reliability enhancement strategy to identify potentially unreliable predictions and subsequently enhance the model's predictive accuracy. Extensive experiments on the 3DSSG dataset show the effectiveness of the proposed method in this challenging task, outperforming existing state-of-the-art methods.

Affiliations: Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education, School of Computer Science and Information Engineering (School of Artificial Intelligence), Hefei University of Technology (HFUT), and Intelligent Interconnected Systems Laboratory of Anhui Province (HFUT), Hefei, China; Department of Chemistry and Centre for Atomic Engineering of Advanced Materials, Anhui University, Hefei, China; Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China, Hefei, China

Abstract:
Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global comprehensive interaction between the video segments and queries, achieving more effective semantic enhancement/supplement. For dynamic modeling, we effectively exploit three characteristics of the persistent activity mechanism in our network design for better video context comprehension. Specifically, we construct a diffusely connected video clip graph on the basis of 2D sparse temporal masking to reflect the “short-term effect” relationship. We innovatively consider the temporal distance and relevance as the joint “auxiliary evidence clues” and design a multi-kernel Temporal Gaussian Filter to expand the context clue into high-dimensional space, simulating the “complex visual perception”, and then conduct element-level filtering convolution operations on neighbour clip nodes in the message passing stage to finally generate and rank the candidate proposals. Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks. Our UniSDNet achieves SOTA performance on three widely used datasets for NLVG, as well as three datasets for SLVG, e.g., reporting new records at 38.88% R@1, IoU@0.7 on ActivityNet Captions and 40.26% R@1, IoU@0.5 on TACoS. To facilitate this field, we collect two new datasets (Charades-STA Speech and TACoS Speech) for the SLVG task. Meanwhile, the inference speed of our UniSDNet is 1.56× faster than the strong multi-query benchmark.
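The reported metric, R@1 at a given IoU threshold, is the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. A minimal sketch of that computation follows, assuming segments are `(start, end)` pairs in seconds; the function names are hypothetical and this is not the authors' evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time segments given as (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, iou_threshold):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return hits / len(ground_truths)


if __name__ == "__main__":
    predictions = [(0.0, 8.0), (0.0, 2.0)]
    targets = [(0.0, 10.0), (0.0, 10.0)]
    print(recall_at_1(predictions, targets, iou_threshold=0.7))
```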

Abstract:
Quantifying the discrepancy between 3D geometric models, which could be represented with either point clouds or triangle meshes, is a pivotal issue with broad applications. Existing methods mainly focus on directly establishing the correspondence between two models and then aggregating point-wise distances between corresponding points, which makes them either inefficient or ineffective. In this paper, we propose DDM, an efficient, effective, robust, and differentiable distance metric for 3D geometry data. Specifically, we construct DDM based on the proposed implicit representation of 3D models, namely the directional distance field (DDF), which defines the directional distances of 3D points to a model to capture its local surface geometry. We then recast the discrepancy between two 3D geometric models as the discrepancy between their DDFs defined on an identical domain, naturally establishing model correspondence. To demonstrate the advantage of our DDM, we explore various distance metric-driven 3D geometric modeling tasks, including template surface fitting, rigid registration, non-rigid registration, scene flow estimation, and human pose optimization. Extensive experiments show that our DDM achieves significantly higher accuracy under all tasks. As a generic distance metric, DDM has the potential to advance the field of 3D geometric modeling.
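For context, the correspondence-based family of metrics that the abstract contrasts with can be illustrated by the widely used Chamfer distance, which pairs each point with its nearest neighbour in the other set and averages the squared distances. The abstract does not name a specific baseline, so this choice is an assumption, and the sketch is not DDM itself.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using squared nearest-neighbour distances.

    Brute-force O(|A| * |B|) search, which hints at the efficiency concern
    raised for correspondence-based metrics.
    """
    a_to_b = sum(
        min(squared_distance(p, q) for q in points_b) for p in points_a
    ) / len(points_a)
    b_to_a = sum(
        min(squared_distance(q, p) for p in points_a) for q in points_b
    ) / len(points_b)
    return a_to_b + b_to_a


if __name__ == "__main__":
    cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    cloud_b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    print(chamfer_distance(cloud_a, cloud_b))
```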

Abstract:
Time Series Forecasting (TSF) has been researched extensively, yet predicting time series with big variances and extreme events remains a challenging problem. Extreme events in reservoirs occur rarely but tend to cause huge problems, e.g., flooding entire towns or neighborhoods, which makes accurate reservoir water level prediction exceedingly important. In this work, we develop a novel extreme-adaptive forecasting approach to accommodate the big variance in hydrologic datasets. We model the time series data distribution as a mixture of both point-wise and segment-wise Gaussian distributions. In particular, we develop a novel End-To-End Mixture Clustering Attention Neural Network (MC-ANN) model for univariate time series forecasting, which we show is able to predict future reservoir water levels effectively. MC-ANN consists of two modules: 1) a grouped Auto-Encoder-based Forecaster (AEF) and 2) a mixture clustering-based learnable Weights Attention Network (WAN) with an attention mechanism. The WAN component is crucial, skillfully adjusting weights to distinguish data with varying distributions, enabling each AEF to concentrate on clusters of data with similar characteristics. Through extensive experiments on real-world datasets, we show MC-ANN’s effectiveness (10–45% root mean square error reductions over state-of-the-art methods), underlining its notable potential for practical applications in univariate, skewed, long-term time series prediction tasks.

Abstract:
Recent years have witnessed significant advances in image deraining due to the progress of effective image priors and deep learning models. As each deraining approach has individual settings (e.g., training and test datasets, evaluation criteria), how to fairly evaluate existing approaches comprehensively is not a trivial task. Although existing surveys aim to thoroughly review image deraining approaches, few of them focus on unifying evaluation settings to examine the deraining capability and practicality evaluation. In this paper, we provide a comprehensive review of existing image deraining methods and provide a unified evaluation setting to evaluate their performance. Furthermore, we construct a new high-quality benchmark named HQ-RAIN to conduct extensive evaluations, consisting of 5,000 paired high-resolution synthetic images with high harmony and realism. We also discuss existing challenges and highlight several future research opportunities worth exploring. To facilitate the reproduction and tracking of the latest deraining technologies for general users, we build an online platform to provide the off-the-shelf toolkit, involving the large-scale performance evaluation.

Abstract:
How to effectively explore spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate the spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information. As adjacent frames usually contain different contents, directly stacking features of adjacent frames without discrimination may affect the latent clear frame restoration. Therefore, we develop a simple yet effective discriminative temporal feature fusion module to obtain useful temporal features for latent frame restoration. Moreover, to utilize the information from long-range frames, we develop a wavelet-based feature propagation method that takes the discriminative temporal feature fusion module as the basic unit to effectively propagate main structures from long-range frames for better video deblurring. Experimental results show that the proposed method performs favorably against state-of-the-art ones on benchmark datasets in terms of accuracy and model complexity.

Abstract:
While existing causal discovery methods mostly focus on continuous time series, causal discovery for mixed time series encompassing both continuous variables (CVs) and discrete variables (DVs) is a fundamental yet underexplored problem. Together with nonlinearity and high dimensionality, mixed time series pose significant challenges for causal discovery. This study addresses the aforementioned challenges based on the following recognitions: 1) DVs may originate from latent continuous variables (LCVs) and undergo discretization processes due to measurement limitations, storage requirements, and other reasons. 2) LCVs contain fine-grained information and interact with CVs. By leveraging these interactions, the intrinsic continuity of DVs can be recovered. Thereupon, we propose a generic deep mixed time series temporal causal discovery framework. Our key idea is to adaptively recover LCVs from DVs with the guidance of CVs and perform causal discovery in a unified continuous-valued space. Technically, a new contextual adaptive Gaussian kernel embedding technique is developed for latent continuity recovery by adaptively aggregating temporal contextual information of DVs. Accordingly, two interdependent model training stages are devised for learning the latent continuity recovery with self-supervision and causal structure learning with sparsity-induced optimization. Experimentally, extensive empirical evaluations and in-depth investigations validate the superior performance of our framework.

Abstract:
Current prevailing Video Object Segmentation (VOS) methods follow the pipeline of extraction-then-matching, which first extracts features on current and reference frames independently, and then performs dense matching between them. This decoupled pipeline limits information propagation between frames to high-level features, and fails to capture fine-grained details for matching. Furthermore, the pixel-wise matching lacks holistic target understanding, making it prone to disturbance by similar distractors. To address these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling feature extraction, correspondence matching, and a compressed memory. The Joint Modeling Block leverages attention operations to simultaneously extract and propagate the target information from the reference frame to the current frame and a compressed memory token. This joint modeling scheme enables extensive multi-layer propagation beyond high-level feature space and facilitates robust instance-distinctive feature learning. In addition, to incorporate the long-term and holistic target information, we introduce a compressed memory token with a customized online updating mechanism, which aggregates target features and performs temporal information propagation in a frame-wise manner, enhancing the global modeling consistency. Our JointFormer achieves a new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) benchmarks and the YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks. To demonstrate the generalizability of JointFormer, it is further evaluated on four new benchmarks with various challenges, including MOSE for complex scenes, VISOR for egocentric videos, VOST for complex transformations, and LVOS for long-term videos. Without specific design to address these unusual difficulties, our model achieves the best performance across all benchmarks when compared with several current best models, illustrating its excellent generalization and robustness. Further extensive ablations and visualizations indicate our JointFormer enables more comprehensive and effective feature learning and matching.

Abstract:
Knowledge distillation (KD) has been shown to be effective in boosting the performance of graph neural networks (GNNs), where the typical objective is to distill knowledge from a deeper teacher GNN into a shallower student GNN. However, it is often quite challenging to train a satisfactory deeper GNN due to the well-known over-parametrized and over-smoothing issues, leading to invalid knowledge transfer in practical applications. In this paper, we propose the first Free-direction Knowledge Distillation framework via reinforcement learning for GNNs, called FreeKD, which no longer requires a deeper, well-optimized teacher GNN. Our core idea is to collaboratively learn two shallower GNNs in an effort to exchange knowledge between them via reinforcement learning in a hierarchical way. As we observe that one typical GNN model often exhibits better and worse performances at different nodes during training, we devise a dynamic and free-direction knowledge transfer strategy that involves two levels of actions: 1) node-level action determines the directions of knowledge transfer between the corresponding nodes of two networks; and then 2) structure-level action determines which of the local structures generated by the node-level actions are to be propagated. Additionally, considering that different augmented graphs can potentially capture distinct perspectives or representations of the graph data, we propose FreeKD-Prompt that learns undistorted and diverse augmentations based on prompt learning for exchanging varied knowledge. Furthermore, instead of confining knowledge exchange within two GNNs, we develop FreeKD++ and FreeKD-Prompt++ to enable free-direction knowledge transfer among multiple shallow GNNs. Extensive experiments on five benchmark datasets demonstrate that our approaches outperform the base GNNs by a large margin, and show their efficacy across various GNNs. More surprisingly, our FreeKD has comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.

Abstract:
Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems. Even though growing efforts have been devoted to cross-view recognition, academia is restricted by current existing databases captured in the controlled environment. In this paper, we contribute a new benchmark and strong baseline for Gait REcognition in the Wild (GREW). The GREW dataset is constructed from natural videos, which contain hundreds of cameras and thousands of hours of streams in open systems. With tremendous manual annotations, the GREW consists of 26 K identities and 128 K sequences with rich attributes for unconstrained gait recognition. Moreover, we add a distractor set of over 233 K sequences, making it more suitable for real-world applications. Compared with prevailing predefined cross-view datasets, the GREW has diverse and practical view variations, as well as more naturally challenging factors. To the best of our knowledge, this is the first large-scale dataset for gait recognition in the wild. Equipped with this benchmark, we dissect the unconstrained gait recognition problem, where representative appearance-based and model-based methods are explored. The proposed GREW benchmark proves to be essential for both training and evaluating gait recognizers in unconstrained scenarios. In addition, we propose the Single Path One-Shot neural architecture search with uniform sampling for Gait recognition, named SPOSGait, which is the first NAS-based gait recognition model. In experiments, SPOSGait achieves state-of-the-art performance on the CASIA-B, OU-MVLP, Gait3D, and GREW benchmarks, outperforming existing approaches by a large margin.

Abstract:
Semi-supervised learning (SSL) confronts a formidable challenge under class distribution mismatch, wherein unlabeled data contain numerous categories absent in the labeled dataset. Traditional SSL methods undergo performance deterioration in such mismatch scenarios due to the invasion of those instances from unknown categories. Despite some technical efforts to enhance SSL by mitigating the invasion, the profound theoretical analysis of SSL under class distribution mismatch is still under study. Accordingly, in this work, we propose Bi-Objective Optimization Mechanism (BOOM) to theoretically analyze the excess risk between the empirical optimal solution and the population-level optimal solution. Specifically, BOOM reveals that the SSL error is the essential contributor behind excess risk, resulting from both the pseudo-labeling error and invasion error. Meanwhile, BOOM unveils that the optimization objectives of SSL under mismatch are binary: high-quality pseudo-labels and adaptive weights on the unlabeled instances, which contribute to alleviating the pseudo-labeling error and the invasion error, respectively. Moreover, BOOM explicitly discovers the fundamental factors crucial for optimizing the bi-objectives, guided by which an approach is then proposed as a strong baseline for SSL under mismatch. Extensive experiments on benchmark and real datasets confirm the effectiveness of our proposed algorithm.

Abstract:
Single Image Reflection Removal (SIRR) is a canonical blind source separation problem and refers to the issue of separating a reflection-contaminated image into a transmission and a reflection image. The core challenge lies in minimizing the commonalities among different sources. Existing deep learning approaches either neglect the significance of feature interactions or rely on heuristically designed architectures. In this paper, we propose a novel Deep Exclusion unfolding Network (DExNet), a lightweight, interpretable, and effective network architecture for SIRR. DExNet is principally constructed by unfolding and parameterizing a simple iterative Sparse and Auxiliary Feature Update (i-SAFU) algorithm, which is specifically designed to solve a new model-based SIRR optimization formulation incorporating a general exclusion prior. This general exclusion prior enables the unfolded SAFU module to inherently identify and penalize commonalities between the transmission and reflection features, ensuring more accurate separation. The principled design of DExNet not only enhances its interpretability but also significantly improves its performance. Comprehensive experiments on four benchmark datasets demonstrate that DExNet achieves state-of-the-art visual and quantitative results while utilizing only approximately 8% of the parameters required by leading methods.

Abstract:
In scenarios with limited available data, training the function-to-function neural PDE solver in an unsupervised manner is essential. However, the efficiency and accuracy of existing methods are constrained by the properties of numerical algorithms, such as finite difference and pseudo-spectral methods, integrated during the training stage. These methods necessitate careful spatiotemporal discretization to achieve reasonable accuracy, leading to significant computational challenges and inaccurate simulations, particularly in cases with substantial spatiotemporal variations. To address these limitations, we propose the Monte Carlo Neural PDE Solver (MCNP Solver) for training unsupervised neural solvers via the PDEs’ probabilistic representation, which regards macroscopic phenomena as ensembles of random particles. Compared to other unsupervised methods, MCNP Solver naturally inherits the advantages of the Monte Carlo method, which is robust against spatiotemporal variations and can tolerate coarse step size. In simulating the trajectories of particles, we employ Heun’s method for the convection process and calculate the expectation via the probability density function of neighbouring grid points during the diffusion process. These techniques enhance accuracy and circumvent the computational issues associated with Monte Carlo sampling. Our numerical experiments on convection-diffusion, Allen-Cahn, and Navier-Stokes equations demonstrate significant improvements in accuracy and efficiency compared to other unsupervised baselines.
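
As a reminder of what Heun's method does for the convection step mentioned above, the sketch below advances a single particle through a velocity field with a predictor-corrector update and compares it with forward Euler at the same coarse step size. It is a minimal illustration under stated assumptions: the velocity field `toy_velocity`, the helper names, and the step size are hypothetical, and the diffusion step and the expectation over neighbouring grid points described in the abstract are not modeled here.

```python
import math


def euler_step(x, t, dt, velocity):
    """Forward Euler: first-order accurate."""
    return x + dt * velocity(x, t)


def heun_step(x, t, dt, velocity):
    """Heun's method: Euler predictor followed by a trapezoidal corrector."""
    k1 = velocity(x, t)
    x_pred = x + dt * k1               # predictor (plain Euler step)
    k2 = velocity(x_pred, t + dt)      # slope at the predicted end point
    return x + 0.5 * dt * (k1 + k2)    # average of the two slopes


def integrate(x0, t0, dt, n_steps, velocity, step):
    """Apply `step` repeatedly and return the final particle position."""
    x, t = x0, t0
    for _ in range(n_steps):
        x = step(x, t, dt, velocity)
        t += dt
    return x


def toy_velocity(x, t):
    """Hypothetical velocity field dx/dt = x, whose exact solution is x0 * exp(t)."""
    return x


# Integrate from t = 0 to t = 1 with a deliberately coarse step of 0.1.
x_heun = integrate(1.0, 0.0, 0.1, 10, toy_velocity, heun_step)
x_euler = integrate(1.0, 0.0, 0.1, 10, toy_velocity, euler_step)
err_heun = abs(x_heun - math.e)
err_euler = abs(x_euler - math.e)
```

On this toy problem the Heun error is much smaller than the Euler error at the same step size, which is the property that makes a second-order scheme attractive when coarse time steps are desired.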

Abstract:
There has been a growing interest in unsupervised domain adaptation (UDA) to alleviate the data scalability issue, while the existing works usually focus on classifying independently discrete labels. However, in many tasks (e.g., medical diagnosis), the labels are discrete and successively distributed. The UDA for ordinal classification requires inducing a non-trivial ordinal distribution prior in the latent space. To this end, the partially ordered set (poset) is defined for constraining the latent vector. Instead of the typical i.i.d. Gaussian latent prior, in this work, a recursively conditional Gaussian (RCG) set is proposed for ordered constraint modeling, which admits a tractable joint distribution prior. Furthermore, we are able to control the density of content vectors that violate the poset constraint by a simple “three-sigma rule.” We explicitly disentangle the cross-domain images into a shared ordinal-prior-induced ordinal content space and two separate source/target ordinal-unrelated spaces, and the self-training is performed on the shared space exclusively for ordinal-aware domain alignment. Extensive experiments on UDA medical diagnoses and facial age estimation demonstrate its effectiveness.

Abstract:
In structure from motion, the viewing graph is a graph where the vertices correspond to cameras (or images) and the edges represent the fundamental matrices. We provide a new formulation and an algorithm for determining whether a viewing graph is solvable, i.e., uniquely determines a set of projective cameras. The known theoretical conditions either do not fully characterize the solvability of all viewing graphs, or are extremely difficult to compute because they involve solving a system of polynomial equations with a large number of unknowns. The main result of this paper is a method to reduce the number of unknowns by exploiting cycle consistency. We advance the understanding of solvability by (i) finishing the classification of all minimal graphs up to 9 nodes, (ii) extending the practical verification of solvability to minimal graphs with up to 90 nodes, (iii) finally answering an open research question by showing that finite solvability is not equivalent to solvability, and (iv) formally drawing the connection with the calibrated case (i.e., parallel rigidity). Finally, we present an experiment on real data that shows that unsolvable graphs may appear in practice.

Abstract:
With the wide application of image editing tools, forged images (splicing, copy-move, removal, etc.) have become a great public concern. Although existing image forgery localization methods can achieve fairly good results on several public datasets, most of them perform poorly when the forged images are JPEG compressed, as is usually the case on social networks. To tackle this issue, in this paper, a self-supervised domain adaptation network, which is composed of a backbone network with Siamese architecture and a compression approximation network (ComNet), is proposed for JPEG-resistant image forgery detection and localization. To improve the performance against JPEG compression, ComNet is customized to approximate the JPEG compression operation through self-supervised learning, generating JPEG-agent images with general JPEG compression characteristics. The backbone network is then trained with a domain adaptation strategy to localize the tampering boundary and region, and alleviate the domain shift between uncompressed and JPEG-agent images. Extensive experimental results on several public datasets show that the proposed method outperforms or rivals other state-of-the-art methods in image forgery detection and localization, especially for JPEG compression with unknown QFs.

Affiliations: Department of Engineering Science, University of Oxford, Oxford, U.K.; Jarvis Research Center, Tencent YouTu Lab, Shenzhen, China; Department of Computer Science, University of Rochester, Rochester, NY, USA; School of Electronic and Computer Engineering, Peking University, Shenzhen, China; GlaxoSmithKline, London, U.K.; Nuffield Department of Primary Care Health Sciences, Applied Digital Health (ADH), University of Oxford, Oxford, U.K.; Department of Computer Science and Engineering, Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA

Abstract:
Given radiology images, automatic radiology report generation aims to produce informative text that reports diseases. It can benefit current clinical practice in diagnostic radiology. Existing methods typically rely on large-scale medical datasets annotated by clinicians to train desirable models. However, for novel diseases, sufficient training data are typically not available. We propose a prompt-based deep learning framework, i.e., PromptLLM, to align, autoencode, and prompt the (large) language model to generate reports for novel diseases accurately and efficiently. Our method includes three major steps: 1) aligning visual images and textual reports to learn general knowledge across modalities from diseases where labeled data are sufficient, 2) autoencoding the LLM using unlabeled data of novel diseases to learn the specific knowledge and writing styles of the novel disease, and 3) prompting the LLM with learned knowledge and writing styles to report the novel diseases contained in the radiology images. Through the above three steps, with limited labels on novel diseases, we show that PromptLLM can rapidly learn the corresponding knowledge for accurate novel disease reporting. The experiments on COVID-19 and diverse thorax diseases show that our approach, utilizing 1% of the training data, achieves desirable performance compared to previous methods. It shows that our approach allows us to relax the reliance on labeled data that is common to existing methods. It could have a real-world impact on data analysis during the early stages of novel diseases.

Abstract:
Document Image Translation (DIT) aims to translate texts on document images from one language to another. It is a multi-modal task involving cooperation of text and layout. Current approaches either handle layout and translation as separate processes, risking accumulative errors, or use vanilla end-to-end encoder-decoder models to capture layout implicitly, often suffering from inadequate layout incorporation. We argue that a favorable framework should explicitly engage layout-specific modules and properly organize them toward translation. For this, we first revisit two key layouts: the geometric layout reflecting words' spatial positions, and the logical layout depicting words' logical order. Then, a novel pipeline (understand layout → translate text) is determined to prioritize layouts such that preceding layouts contribute to translation. Following this pipeline, we introduce Unified Document Image Translation (UniDIT), a comprehensive framework that unifies layout with translation in one network. It is devised to leverage each module’s advantage, and provide an elaborate feature-conductive flow for module communication globally. A novel bridging mechanism is also introduced to adapt layout features conducive to translation. We further contribute DITransv2, a large-scale fine-grained benchmark that includes heterogeneous and complex document layouts. Extensive experiments on DITransv2 and additional established benchmarks demonstrate that UniDIT outperforms previous state-of-the-art methods in all aspects.

Abstract:
Visual relationships are crucial for visual perception and reasoning, and cover tasks like Scene Graph Generation, Human-Object Interaction, and object affordance. Despite significant efforts, this field still suffers from the following limitations: specialists for a specific task without considering similar ones, strict and complex task formulations with limited flexibility, and underexploited reasoning with language and knowledge. To solve these limitations, we seek to build a new framework, one model for all tasks, over Large Multimodal Models (LMMs). LMMs offer the potential of unifying tasks, flexible forms, and reasoning with language. However, they fail to handle visual relationship tasks well. We find the obstacles include the conflicts between different tasks and insufficient instance-level information. We solve these problems by reforming the data for LMMs, rather than architectures, considering their strong language-in language-out capability. We propose to disassemble tasks into simple and common sub-tasks, verbally estimate instance confidence, and augment instance diversity, all without additional modules. These strategies help us build a visual relationship generalist, RelationLMM, with a simple architecture. Exhaustive experiments demonstrate that RelationLMM is strong, generalizable and flexible across different tasks, with one model and one set of weights.

Abstract:
Point cloud semantic segmentation is essential for understanding 3D scenes. Contemporary techniques often require extensive annotated training data, yet obtaining point-wise annotations for point clouds is time-consuming and laborious. Recent developments in weakly supervised methods seek to mitigate this problem by generating pseudo-labels using limited annotations. However, these pseudo-labels frequently suffer from either insufficient quantity or inferior quality. To overcome these hurdles, we introduce a Quantity-Quality Enhanced Self-training Network for Weakly Supervised Point Cloud Semantic Segmentation (Q2E). Specifically, an image-assisted pseudo-label generator is proposed to exploit 2D images to extend pseudo-labels for point clouds. Additionally, a hierarchical pseudo-label optimizer is developed to refine the quality of the pseudo-labels by hierarchically grouping them into broader categories. Extensive experiments on the ScanNet-v2, S3DIS, Semantic3D, and SemanticKITTI datasets demonstrate that Q2E outperforms state-of-the-art weakly supervised methods and rivals fully supervised approaches for point cloud semantic segmentation. Remarkably, as of the initial submission on February 2, 2024, our method ranked first in various settings of the ScanNet-v2 benchmark.

Abstract:
In this study, we propose Multimodal Fusion-supervised Cross-modality Alignment Perception (MulFS-CAP), a novel framework for single-stage fusion of unregistered infrared-visible images. Traditional two-stage methods depend on explicit registration algorithms to align source images spatially, often adding complexity. In contrast, MulFS-CAP seamlessly blends implicit registration with fusion, simplifying the process and enhancing suitability for practical applications. MulFS-CAP utilizes a shared shallow feature encoder to merge unregistered infrared-visible images in a single stage. To address the specific requirements of feature-level alignment and fusion, we develop a consistent feature learning approach via a learnable modality dictionary. This dictionary provides complementary information for unimodal features, thereby maintaining consistency between individual and fused multimodal features. As a result, MulFS-CAP effectively reduces the impact of modality variance on cross-modality feature alignment, allowing for simultaneous registration and fusion. Additionally, in MulFS-CAP, we advance a novel cross-modality alignment approach, creating a correlation matrix to detail pixel relationships between source images. This matrix aids in aligning features across infrared and visible images, further refining the fusion process. The above designs make MulFS-CAP more lightweight, effective and explicit registration-free. Experimental results from different datasets demonstrate the effectiveness of our proposed method and its superiority over the state-of-the-art two-stage methods.

Abstract:
Dynamic stream learning, which emphasizes high-velocity, single-pass, real-time responses to arriving data, is revealing new challenges to the standard machine learning paradigm. In particular, existing (deep) neural networks perform poorly when learning on data streams, as they often require having access to a large amount of training data. Therefore, to address the limitations of existing neural networks in high-speed data streams with a stationary environment, we propose a novel dynamic neural network, called Concept Neural Network (ConceptNN), by combining concepts and two different online updating strategies. First, we construct a new concept space, where each concept consists of two components: the feature vector (regarded as a concept’s intent) and its weight information (derived from a concept’s extent), for training an initial neural network. During training, the sample weight information directly works on the loss function of ConceptNN. Second, we propose a time-delay regret theory (namely, real-time prediction, then delayed update) based on online optimization theory for data stream learning. Finally, based on time-delay regret theory, we employ two online updating paradigms (i.e., the one-by-one updating strategy and chunk-by-chunk updating strategy) to update our model in the face of new data arriving continuously in a stream, and subsequently present their upper and lower bounds. Experimental results on various datasets demonstrate that the proposed ConceptNN makes it possible to learn fast-evolving data streams with better learning performance (simultaneously considering time-cost and accuracy) than the state-of-the-art dynamic learning algorithms.

Abstract:
Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we adopt the Oogiri game, a creativity-driven task that requires humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations like ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework, to further address some intrinsic risks in standard evaluations, such as information leakage and limited interpretability. The proposed LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench better aligns with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts.

Abstract:
Recent advancements in bird’s eye view (BEV) representations have shown remarkable promise for in-vehicle 3D perception. However, while these methods have achieved impressive results on standard benchmarks, their robustness in varied conditions remains insufficiently assessed. In this study, we present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. This suite incorporates a diverse set of camera corruption types, each examined over three severity levels. Our benchmarks also consider the impact of complete sensor failures that occur when using multi-modal models. Through RoboBEV, we assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our analyses reveal a noticeable correlation between the model’s performance on in-distribution datasets and its resilience to out-of-distribution challenges. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data. Furthermore, we observe that leveraging extensive temporal information significantly improves the model’s robustness. Based on our observations, we design an effective robustness enhancement strategy based on the CLIP model. The insights from this study pave the way for the development of future BEV models that seamlessly combine accuracy with real-world robustness.

Abstract:
We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses (often at the instance level), or reconstructing shapes, but not both. Glissando-Net is composed of two auto-encoders that are jointly trained, one for RGB images and the other for point clouds. We embrace two key design choices in Glissando-Net to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. First, we augment the feature maps of the point cloud encoder and decoder with transformed feature maps from the image decoder, enabling effective 2D-3D interaction in both training and prediction. Second, we predict both the 3D shape and pose of the object in the decoder stage. This way, we better utilize the information in the 3D point clouds, which are present only in the training stage, to train the network for more accurate prediction. We jointly train the two encoder-decoders for RGB and point cloud data to learn how to pass latent features to the point cloud decoder during inference. In testing, the encoder of the 3D point cloud is discarded. The design of Glissando-Net is inspired by codeSLAM. Unlike codeSLAM, which targets 3D reconstruction of scenes, we focus on pose estimation and shape reconstruction of objects, and directly predict the object pose and a pose-invariant 3D reconstruction without the need for a code optimization step. Extensive experiments, involving both ablation studies and comparisons with competing methods, demonstrate the efficacy of our proposed method, which compares favorably with the state of the art.

Abstract:
A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is leveraged for graph filtering. Existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. However, there are many real-world datasets exhibiting strong anti-correlations, and thus a suitable model is a signed graph, containing both positive and negative edge weights. In this paper, we propose the first linear-time method for sampling signed graphs, centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\mathbf{C}}$, we first learn a sparse inverse matrix $\mathcal{L}$, interpreted as a graph Laplacian corresponding to a signed graph $\mathcal{G}$. We approximate $\mathcal{G}$ with a balanced signed graph $\mathcal{G}^b$ via fast edge weight augmentation in linear time, where the eigenpairs of Laplacian $\mathcal{L}^b$ for $\mathcal{G}^b$ are graph frequencies. Next, we select a node subset for sampling to minimize the error of the signal interpolated from samples in two steps. We first align all Gershgorin disc left-ends of Laplacian $\mathcal{L}^b$ at the smallest eigenvalue $\lambda_{\min}(\mathcal{L}^b)$ via similarity transform $\mathcal{L}^s = \mathbf{S} \mathcal{L}^b \mathbf{S}^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}^s$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experiments show that our signed graph sampling method outperformed fast sampling schemes designed for positive graphs on various datasets with anti-correlations.
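As a small illustration of the balance notion this abstract relies on, the sketch below checks whether a signed graph is balanced, using the classical characterization that the nodes can be split into two sides with positive edges inside each side and negative edges across. It is not the paper's edge weight augmentation or sampling procedure; the function name `is_balanced` and the toy edge lists are assumptions made for this example.

```python
from collections import deque


def is_balanced(n, edges):
    """Check balance of a signed graph by two-coloring its nodes.

    n: number of nodes, labeled 0..n-1.
    edges: iterable of (u, v, weight) with nonzero weights.
    Returns (True, sides) with sides[i] in {+1, -1}, or (False, None).
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    side = [0] * n  # 0 means "not visited yet"
    for start in range(n):
        if side[start]:
            continue
        side[start] = 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                # Positive edge: same side. Negative edge: opposite side.
                expected = side[u] if w > 0 else -side[u]
                if side[v] == 0:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False, None  # a cycle with an odd number of negative edges
    return True, side


# Hypothetical toy graphs: two positively coupled pairs linked by negative edges,
# and a triangle with a single negative edge (which cannot be balanced).
balanced_edges = [(0, 1, 1.0), (2, 3, 2.0), (0, 2, -0.5), (1, 3, -0.3)]
frustrated_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, -0.5)]

print(is_balanced(4, balanced_edges))
print(is_balanced(3, frustrated_edges))
```

The check runs in time linear in the number of nodes and edges, which is in the same spirit as the linear-time claims above, although the paper's own steps are more involved.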

Abstract:
Infrared-visible image fusion (IVIF) is a fundamental and critical task in the field of computer vision. Its aim is to integrate the unique characteristics of both infrared and visible spectra into a holistic representation. Since 2018, a growing number and diversity of IVIF approaches have entered the deep-learning era, introducing a broad spectrum of networks or loss functions for improving visual enhancement. As research deepens and practical demands grow, several intricate issues like data compatibility, perception accuracy, and efficiency cannot be ignored. Regrettably, there is a lack of recent surveys that comprehensively introduce and organize this expanding domain of knowledge. Given the current rapid development, this paper aims to fill the existing gap by providing a comprehensive survey that covers a wide array of aspects. Initially, we introduce a multi-dimensional framework to elucidate the prevalent learning-based IVIF methodologies, spanning topics from basic visual enhancement strategies to data compatibility, task adaptability, and further extensions. Subsequently, we delve into a profound analysis of these new approaches, offering a detailed lookup table to clarify their core ideas. Last but not least, we summarize performance comparisons quantitatively and qualitatively, covering registration, fusion, and follow-up high-level tasks. Beyond delving into the technical nuances of these learning-based fusion approaches, we also explore potential future directions and open issues that warrant further exploration by the community.

Abstract:
We introduce FocalPose++, a neural render-and-compare method for jointly estimating the camera-object 6D pose and camera focal length given a single RGB input image depicting a known object. The contributions of this work are threefold. First, we derive a focal length update rule that extends an existing state-of-the-art render-and-compare 6D pose estimator to address the joint estimation task. Second, we investigate several different loss functions for jointly estimating the object pose and focal length. We find that a combination of direct focal length regression with a reprojection loss disentangling the contribution of translation, rotation, and focal length leads to improved results. Third, we explore the effect of different synthetic training data on the performance of our method. Specifically, we investigate different distributions used for sampling the object's 6D pose and the camera's focal length when rendering the synthetic images, and show that a parametric distribution fitted on real training data works best. We show results on three challenging benchmark datasets that depict known 3D models in uncontrolled settings. We demonstrate that our focal length and 6D pose estimates have lower error than the existing state-of-the-art methods.

Abstract:
Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone. As it relies on visual information to model speech, its performance is inherently sensitive to personal lip appearances and movements, which causes VSR models to show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR model on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Different from the previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore different types of prompts (addition-form, padding-form, and concatenation-form prompts) that can be applied to a VSR model that is generally composed of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of the pre-trained VSR model on unseen speakers can be largely improved by using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model is already developed with large speaker variations. Moreover, by analyzing the performance and parameters of different types of prompts, we investigate when prompt tuning is preferred over finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions using knowledge learned from seen attribute-object compositions in the training set. Previous works mainly project an image and its corresponding composition into a common embedding space to measure their compatibility score. However, both attributes and objects share the visual representations learned above, leading the model to exploit spurious correlations and bias towards seen compositions. Instead, we reconsider CZSL as an out-of-distribution generalization problem. If an object is treated as a domain, we can learn object-invariant features to recognize attributes attached to any object reliably, and vice versa. Specifically, we propose an invariant feature learning framework to align different domains at the representation and gradient levels to capture the intrinsic characteristics associated with the tasks. To further facilitate and encourage the disentanglement of attributes and objects, we propose an “encoding-reshuffling-decoding” process to help the model avoid spurious correlations by randomly regrouping the disentangled features into synthetic features. Ultimately, our method improves generalization by learning to disentangle features that represent two independent factors of attributes and objects. Experiments demonstrate that the proposed method achieves state-of-the-art or competitive performance in both closed-world and open-world scenarios.

Abstract:
LiDAR-based fully sparse architecture has gained increasing attention. FSDv1 stands out as a representative work, achieving impressive efficacy and efficiency, albeit with intricate structures and handcrafted designs. In this paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1 and eliminate the ad-hoc heuristics in its handcrafted instance-level representation, thus promoting better universality. To this end, we introduce virtual voxels, taking over the clustering-based instance segmentation in FSDv1. Virtual voxels not only address the notorious issue of the Center Feature Missing in fully sparse detectors but also endow the framework with a more elegant and streamlined approach. Besides, we develop a suite of components to complement the virtual voxel mechanism, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy. We conduct experiments on three large-scale datasets: Waymo Open Dataset, Argoverse 2 dataset, and nuScenes dataset. Our results showcase state-of-the-art performance on all three datasets, highlighting the superiority of FSDv2 in long-range scenarios and its universality in achieving competitive performance across diverse scenarios. Moreover, we provide comprehensive experimental analysis to understand the workings of FSDv2.

Abstract:
Black-box adversarial attacks can be categorized into transfer-based and query-based attacks. The former usually has poor transfer performance due to the mismatch between the architectures of models, while the query-based attacks require massive queries and high dimensional optimization variables. In order to solve the above problems, we propose a novel attack framework integrating the advantages of transfer- and query-based attacks, where the framework is divided into two phases: training the adversarial generator and executing the black-box attacks. In the first stage, a generator is trained by the adversarial loss function so that it can output adversarial perturbation, where the latent variables are designed as the input of the generator to reduce the dimension of the optimization variables. In the second stage, based on the trained generator, we further employ a particle swarm optimization algorithm to optimize the latent variables so that the generator can output the perturbation that can achieve a successful attack. Extensive experiments are performed on the ImageNet dataset, and the results demonstrate that the proposed framework can obtain better attack performance compared with a number of the state-of-the-art black-box adversarial attack methods. In addition, we show the flexibility of the proposed framework by extending the experiment for few-pixel attacks.
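The second stage described above, searching a low-dimensional latent space with particle swarm optimization, can be sketched as follows. This is a generic textbook PSO, not the paper's implementation: `toy_attack_loss` is a hypothetical stand-in for the loss obtained by querying the black-box model on the generator's output, and the inertia and acceleration coefficients are common default choices rather than values from the paper.

```python
import random


def pso_minimize(loss, dim, n_particles=30, n_iters=200, seed=0,
                 inertia=0.729, c_personal=1.494, c_global=1.494):
    """Minimize `loss` over a `dim`-dimensional latent vector with basic PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares a global best.
    best_pos = [p[:] for p in pos]
    best_val = [loss(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    history = [g_val]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = loss(pos[i])  # in a real attack, this is one query to the target model
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
        history.append(g_val)
    return g_pos, g_val, history


# Hypothetical stand-in: squared distance to an arbitrary "successful" latent code.
TARGET = [0.5, -0.3, 0.2, 0.1]


def toy_attack_loss(z):
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))


z_best, loss_best, history = pso_minimize(toy_attack_loss, dim=4)
print(loss_best)
```

Because only the latent vector is optimized, the number of search variables is `dim` rather than the number of image pixels, which is the dimensionality reduction the abstract refers to.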

Abstract:
There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks: action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static information, biased toward dynamic information, or jointly encode a combination of static and dynamic information. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.

Abstract:
Video anomaly detection (VAD) plays a crucial role in intelligent surveillance. However, an essential type of anomaly, the scene-dependent anomaly, has been overlooked. Moreover, the task of video anomaly anticipation (VAA) also deserves attention. To fill these gaps, we build a comprehensive dataset named NWPU Campus, which is the largest semi-supervised VAD dataset and the first dataset for scene-dependent VAD and VAA. Meanwhile, we introduce a novel forward-backward framework for scene-dependent VAD and VAA, in which the forward network individually solves the VAD and jointly solves the VAA with the backward network. Particularly, we propose a scene-dependent generative model in latent space for the forward and backward networks. First, we propose a hierarchical variational auto-encoder to extract scene-generic features. Next, we design a score-based diffusion model in latent space to refine these features to be more compact for the task and to generate scene-dependent features with a scene information auto-encoder, modeling the relationships between video events and scenes. Finally, we develop a temporal loss from key frames to constrain the motion consistency of video clips. Extensive experiments demonstrate that our method can handle both scene-dependent anomaly detection and anticipation well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, and the proposed NWPU Campus datasets.

Abstract:
Developmental plasticity plays a prominent role in shaping the brain’s structure during ongoing learning in response to dynamically changing environments. However, the existing network compression methods for deep artificial neural networks (ANNs) and spiking neural networks (SNNs) draw little inspiration from the brain’s developmental plasticity mechanisms, thus limiting their ability to learn efficiently, rapidly, and accurately. This paper proposes a developmental plasticity-inspired adaptive pruning (DPAP) method, with inspiration from the adaptive developmental pruning of dendritic spines, synapses, and neurons according to the “use it or lose it, gradually decay” principle. The proposed DPAP model considers multiple biologically realistic mechanisms (such as dendritic spine dynamic plasticity, activity-dependent neural spiking trace, and local synaptic plasticity), with an additional adaptive pruning strategy, so that the network structure can be dynamically optimized during learning without any pre-training or retraining. Extensive comparative experiments show consistent and remarkable performance and speed boosts with the extremely compressed networks on a diverse set of benchmark tasks for deep ANNs and SNNs, especially the spatio-temporal joint pruning of SNNs in neuromorphic datasets. This work explores how developmental plasticity enables complex deep networks to gradually evolve into brain-like efficient and compact structures, eventually achieving state-of-the-art (SOTA) performance for biologically realistic SNNs.

Abstract:
Ghosting effects typically appear on glass surfaces, as each piece of glass has two contact surfaces causing two slightly offset layers of reflections. In this paper, we propose to take advantage of this intrinsic property of glass surfaces and apply it to glass surface detection, with two main technical novelties. First, we formulate a ghosting image formation model to describe the intensity and spatial relations among the main reflections and the background transmission within the glass region. Based on this model, we construct a new Glass Surface Ghosting Dataset (GSGD) to facilitate glass surface detection, with $\sim$3.7K glass images and corresponding ghosting masks and glass surface masks. Second, we propose a novel method, called GhostingNet, for glass surface detection. Our method consists of a Ghosting Effects Detection (GED) module and a Glass Surface Detection (GSD) module. The key component of our GED module is a novel Double Reflection Estimation (DRE) block that models the spatial offsets of reflection layers for ghosting effect detection. The detected ghosting effects are then used to guide the GSD module for glass surface detection. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. We will release our code and dataset.

Abstract:
Multi-view multi-human association and tracking (MvMHAT) is an emerging yet important problem for multi-person scene video surveillance. It aims to track a group of people over time in each view and to identify the same person across different views at the same time, which differs from previous MOT and multi-camera MOT tasks that only consider over-time human tracking. As a result, the videos for MvMHAT require more complex annotations while containing more information for self-learning. In this work, we tackle this problem with an end-to-end neural network in a self-supervised learning manner. Specifically, we propose to take advantage of the spatial-temporal self-consistency rationale by considering three properties of reflexivity, symmetry, and transitivity. Besides the reflexivity property that naturally holds, we design the self-supervised learning losses based on the properties of symmetry and transitivity, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote the research on MvMHAT, we build two new large-scale benchmarks for the network training and testing of different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
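The symmetry and transitivity properties can be made concrete with assignment matrices. The sketch below is an illustration under strong simplifying assumptions, not the paper's losses: it uses hard permutation matrices with the same number of people in every view (a trained network would produce soft assignments), and the penalty is a plain squared difference. All function and variable names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def squared_difference(a, b):
    """Sum of squared entrywise differences between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def symmetry_loss(a_ab, a_ba):
    # Matching view a -> view b and then b -> a should return every person to themselves.
    return squared_difference(matmul(a_ab, a_ba), identity(len(a_ab)))


def transitivity_loss(a_ab, a_bc, a_ac):
    # Matching a -> b -> c should agree with matching a -> c directly.
    return squared_difference(matmul(a_ab, a_bc), a_ac)


# Hypothetical assignments for three people seen in views a, b, and c.
a_ab = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
a_ba = [list(col) for col in zip(*a_ab)]   # transpose, i.e. the inverse permutation
a_bc = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
a_ac = matmul(a_ab, a_bc)                  # consistent by construction

print(symmetry_loss(a_ab, a_ba), transitivity_loss(a_ab, a_bc, a_ac))
print(transitivity_loss(a_ab, a_bc, identity(3)))  # an inconsistent a -> c matching
```

Both penalties are zero exactly when the assignments agree around the cycle, so they can serve as a supervision signal without identity labels.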

Abstract:
Hashing technology has exhibited great cross-modal retrieval potential due to its appealing retrieval efficiency and storage effectiveness. Most current supervised cross-modal retrieval methods heavily rely on accurate semantic supervision, which is intractable for annotations with ever-growing sample sizes. By comparison, the existing unsupervised methods rely on accurate sample similarity preservation strategies with intensive computational costs to compensate for the lack of semantic guidance, which causes these methods to lose the power to bridge the semantic gap. Furthermore, both kinds of approaches need to search for the nearest samples among all samples in a large search space, whose process is laborious. To address these issues, this paper proposes an unsupervised dual deep hashing (UDDH) method with semantic-index and content-code for cross-modal retrieval. Deep hashing networks are utilized to extract deep features and jointly encode the dual hashing codes in a collaborative manner with a common semantic index and modality content codes to simultaneously bridge the semantic and heterogeneous gaps for cross-modal retrieval. The dual deep hashing architecture, comprising the head code on semantic index and tail codes on modality content, enhances the efficiency for cross-modal retrieval. A query sample only needs to search for the retrieved samples with the same semantic index, thus greatly shrinking the search space and achieving superior retrieval efficiency. UDDH integrates the learning processes of deep feature extraction, binary optimization, common semantic index, and modality content code within a unified model, allowing for collaborative optimization to enhance the overall performance. Extensive experiments are conducted to demonstrate the retrieval superiority of the proposed approach over the state-of-the-art baselines.
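The retrieval step described above, where a query only searches samples sharing its semantic index, can be sketched as a bucketed lookup followed by Hamming ranking. This covers only the search structure, not how UDDH learns the codes; the class `DualHashIndex`, the item identifiers, and the bit patterns are all hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


class DualHashIndex:
    """Bucket items by head code (semantic index), rank by tail code (content)."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, item_id, head, tail):
        self.buckets[head].append((item_id, tail))

    def search(self, head, tail, k=3):
        # Only the bucket with the same semantic index is scanned,
        # instead of the whole database.
        candidates = self.buckets.get(head, [])
        ranked = sorted(candidates, key=lambda item: (hamming(item[1], tail), item[0]))
        return [item_id for item_id, _ in ranked[:k]]


index = DualHashIndex()
index.add("img_cat_1", 0b01, 0b1100)
index.add("img_cat_2", 0b01, 0b1001)
index.add("img_dog_1", 0b10, 0b1100)

# A hypothetical text query whose head code matches the "cat" semantic index.
print(index.search(head=0b01, tail=0b1100, k=2))
```

With B roughly equal-sized buckets, each query scans about 1/B of the database, which is the search-space reduction the abstract describes.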

Abstract:
Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS$^{\mathbf{+}}$, which stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on the VDW dataset and three public benchmarks. To further prove the versatility, we extend NVDS$^{\mathbf{+}}$ to video semantic segmentation and several downstream applications like bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation.

Abstract:
Robust support vector machine (RSVM) using ramp loss provides a better generalization performance than traditional support vector machine (SVM) using hinge loss. However, the good performance of RSVM heavily depends on proper values of the regularization parameter and the ramp parameter. The traditional model selection technique with grid search has an extremely high computational cost, especially for fine-grained search. To address this challenging problem, in this paper, we first propose solution paths of RSVM (SPRSVM) based on the concave-convex procedure (CCCP), which can track the solutions of the non-convex RSVM with respect to the regularization parameter and the ramp parameter, respectively. Specifically, we use incremental and decremental learning algorithms to deal with the Karush-Kuhn-Tucker violating samples in the process of tracking the solutions. Based on the solution paths of RSVM and the piecewise linearity of the model function, we can compute the error paths of RSVM and find the values of the regularization parameter and the ramp parameter, respectively, that correspond to the minimum cross-validation error. We prove the finite convergence of SPRSVM and analyze the computational complexity of SPRSVM. Experimental results on a variety of benchmark datasets not only verify that our SPRSVM can globally search the regularization and ramp parameters, respectively, but also show a huge reduction of computational time compared with the grid search approach.
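To make the difference between the two losses concrete, the sketch below uses one common parameterization of the ramp loss, in which the hinge loss is clipped at 1 - s for a ramp parameter s < 1. The abstract does not state its exact parameterization, so this form is an assumption. The same function can be written as a difference of two convex hinge functions, which is the structure that CCCP exploits.

```python
def hinge(z, margin=1.0):
    """Hinge loss max(0, margin - z), where z = y * f(x) is the signed margin."""
    return max(0.0, margin - z)


def ramp(z, s=-1.0):
    """Ramp loss: the hinge loss clipped at 1 - s, so outliers have bounded cost."""
    return min(1.0 - s, hinge(z))


def ramp_as_difference(z, s=-1.0):
    """The same ramp loss written as a convex function minus a convex function."""
    return hinge(z, 1.0) - hinge(z, s)


# A badly misclassified point (z = -5) dominates the hinge loss,
# while its ramp loss stays capped at 1 - s.
for z in [-5.0, -1.0, 0.0, 0.5, 1.0, 2.0]:
    print(z, hinge(z), ramp(z), ramp_as_difference(z))
```

The cap is what gives robustness to outliers, and it is also what makes the objective non-convex, hence the need for a procedure such as CCCP.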

Abstract:
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. Firstly, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the super-resolution (SR) of the less zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Secondly, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision information, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the effect of the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we propose a two-stage alignment method, including patch-based optical flow alignment and auxiliary-LR guiding alignment. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss. Furthermore, we take multiple zoomed observations to explore self-supervised RefSR, and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance than state-of-the-art methods.

Abstract:
Forgetting refers to the loss or deterioration of previously acquired knowledge. While existing surveys on forgetting have primarily focused on continual learning, forgetting is a prevalent phenomenon observed in various other research domains within deep learning. Forgetting manifests in research fields such as generative models due to generator shifts, and federated learning due to heterogeneous data distributions across clients. Addressing forgetting encompasses several challenges, including balancing the retention of old task knowledge with fast learning of new tasks, managing task interference with conflicting goals, and preventing privacy leakage. Moreover, most existing surveys on continual learning implicitly assume that forgetting is always harmful. In contrast, our survey argues that forgetting is a double-edged sword and can be beneficial and desirable in certain cases, such as privacy-preserving scenarios. By exploring forgetting in a broader context, we present a more nuanced understanding of this phenomenon and highlight its potential advantages. Through this comprehensive survey, we aspire to uncover potential solutions by drawing upon ideas and approaches from various fields that have dealt with forgetting. By examining forgetting beyond its conventional boundaries, we hope to encourage the development of novel strategies for mitigating, harnessing, or even embracing forgetting in real applications.

Abstract:
The recent advancement of generative foundational models has ushered in a new era of image generation in the realm of natural images, revolutionizing art design, entertainment, environment simulation, and beyond. Despite producing high-quality samples, existing methods are constrained to generating images of scenes at a limited scale. In this paper, we present MetaEarth, a generative foundation model that breaks the barrier by scaling image generation to a global level, exploring the creation of worldwide, multi-resolution, unbounded, and virtually limitless remote sensing images. In MetaEarth, we propose a resolution-guided self-cascading generative framework, which enables the generation of images at any region with a wide range of geographical resolutions. To achieve unbounded and arbitrary-sized image generation, we design a novel noise sampling strategy for denoising diffusion models by analyzing the generation conditions and initial noise. To train MetaEarth, we construct a large dataset comprising multi-resolution optical remote sensing images with geographical information. Experiments have demonstrated the powerful capabilities of our method in generating global-scale images. Additionally, MetaEarth serves as a data engine that can provide high-quality and rich training data for downstream tasks. Our model opens up new possibilities for constructing generative world models by simulating Earth’s visuals from an innovative overhead perspective.

Abstract:
We present a comprehensive survey and benchmark of both traditional and learning-based methods for surface reconstruction from point clouds. This task is particularly challenging for real-world acquisitions due to factors such as noise, outliers, non-uniform sampling, and missing data. Traditional approaches often simplify the problem by imposing handcrafted priors on either the input point clouds or the resulting surface, a process that can require tedious hyperparameter tuning. In contrast, deep learning models have the capability to directly learn the properties of input point clouds and desired surfaces from data. We study the influence of handcrafted and learned priors on the precision and robustness of surface reconstruction techniques. We evaluate various time-tested and contemporary methods in a standardized manner. When both trained and evaluated on point clouds with identical characteristics, the learning-based models consistently produce higher-quality surfaces compared to their traditional counterparts, even in scenarios involving novel shape categories. However, traditional methods demonstrate greater resilience to the diverse anomalies commonly found in real-world 3D acquisitions. For the benefit of the research community, we make our code and datasets available, inviting further enhancements to learning-based surface reconstruction.

Abstract:
This work reports a novel multi-frame Bundle Adjustment (BA) framework called RKHS-BA. It uses continuous landmark representations that encode RGB-D/LiDAR and semantic observations in a reproducing kernel Hilbert space (RKHS). With a correspondence-free pose graph formulation, the proposed system constructs a loss function that achieves more generalized convergence than classical point-wise convergence. We demonstrate its applications in multi-view point cloud registration, sliding-window odometry, and global LiDAR mapping on simulated and real data. It shows highly robust pose estimation in extremely noisy scenes and exhibits strong generalization with various types of semantic inputs.

Abstract:
Aerial Remote Sensing (ARS) vision tasks present significant challenges due to the unique viewing angle characteristics. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes RingMo-Aerial, aiming to fill the gap in foundation model research in the field of ARS vision. A Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism is introduced to strengthen the model’s capacity for small-object representation. Complementarily, an affine transformation-based contrastive learning method improves its adaptability to the tilted viewing angles inherent in ARS tasks. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model’s adaptability and performance in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and efficacy of RingMo-Aerial in enhancing the performance of ARS vision tasks.

Abstract:
Robustly reconstructing 3D hand mesh from a single image is very challenging, due to (i) the lack of diversity in existing real-world datasets and (ii) the ambiguity in occluded hand regions. While data synthesis helps relieve issue (i), the syn-to-real gap still hinders its usage. For issue (ii), most previous works produce deterministic results while other probabilistic methods rely on ground truths to choose the best hypothesis. In this work, we explore the diffusion model to alleviate these problems by collectively considering two perspectives: (i) conditional synthesis and sampling approach for realistic data generation and (ii) probabilistic modeling with progressive multi-hypothesis aggregation. First, we present HandBooster, a new approach to uplift the data diversity by training a conditional generative space on hand-object interactions and sampling the space to synthesize effective data with reliable 3D annotations and diverse hand appearances, poses, views, and backgrounds. Second, we design HandBooster+, a probabilistic diffusion-based model to further boost the 3D hand-mesh reconstruction performance by progressively aggregating the multiple hypotheses. Extensive experimental results show that our method significantly improves several baselines and achieves SOTA on the HO3D and DexYCB benchmarks.

Abstract:
Visual recognition models pretrained on clean images usually do not perform well in the presence of image corruptions, such as blurring or noise, which limits their applicability in real-world scenarios. To solve this problem, existing approaches usually design complex data augmentations to train a robust model from scratch or adapt a pretrained model to corrupted scenarios. These approaches ignore the existence of the large number of deployed models in our community, causing extensive computation and storage costs for making deployed models adapted. Based on this consideration, this paper focuses on solving a practical problem of making many clean-image-pretrained models adapt to unlabeled corrupted images through one training procedure. To this end, we aim to learn a Plug-and-play Image Translator (PIT) that can be directly combined with recognition models after training. Existing approaches, such as vanilla image translation and restoration, are not proper for solving this problem, as they are mostly based on supervised training and are not recognition-oriented. To address this issue, we propose a recognition-oriented unsupervised image translation framework to make PIT produce images with indistinguishable recognition predictions from the clean ones. We verify the effectiveness of PIT on several recognition tasks and show that PIT boosts the performance of clean-image-pretrained models significantly in the presence of image corruptions.

Abstract:
This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematic architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs.

Abstract:
Annotating 3D LiDAR point clouds for perception tasks is fundamental for many applications, e.g., autonomous driving, yet it remains notoriously labor-intensive. The pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets and tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, under such a label-efficient fine-tuning paradigm. SPOT is effective on various public datasets with different downstream tasks, showcasing its general representation power, cross-domain robustness, and data scalability, which are three key factors for real-world application. Specifically, we show both theoretically and empirically, for the first time, that general representation learning can be achieved through the task of occupancy prediction. Then, to address the domain gap caused by different LiDAR sensors and annotation methods, we develop a beam re-sampling technique for point cloud augmentation combined with a class-balancing strategy. Furthermore, scalable pre-training is observed, that is, the downstream performance across all the experiments improves with more pre-training data. Additionally, this pre-training strategy remains compatible with unlabeled data. The hope is that our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.

Abstract:
With the fast development of AI-related techniques, the applications of trajectory prediction are no longer limited to easier scenes and trajectories. More and more trajectories with different forms, such as coordinates, bounding boxes, and even high-dimensional human skeletons, need to be analyzed and forecasted. Among these heterogeneous trajectories, interactions between different elements within a frame of trajectory, which we call “Dimension-wise Interactions”, would be more complex and challenging. However, most previous approaches focus mainly on a specific form of trajectories, and potential dimension-wise interactions have received less attention. In this work, we expand the trajectory prediction task by introducing the trajectory dimensionality $M$, thus extending its application scenarios to heterogeneous trajectories. We first introduce the Haar transform as an alternative to the Fourier transform to better capture the time-frequency properties of each trajectory dimension. Then, we adopt the bilinear structure to model and fuse two factors simultaneously, including the time-frequency response and the dimension-wise interaction, to forecast heterogeneous trajectories via trajectory spectrums hierarchically in a generic way. Experiments show that the proposed model outperforms most state-of-the-art methods on ETH-UCY, SDD, nuScenes, and Human3.6M with heterogeneous trajectories, including 2D coordinates, 2D/3D bounding boxes, and 3D human skeletons.
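
To make the Haar transform mentioned above concrete, the following is a minimal sketch of one level of the standard orthonormal Haar transform applied to a single trajectory dimension. It is a textbook illustration only, not the authors' implementation; the function names `haar_step` and `inverse_haar_step` are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (approx, detail): scaled pairwise sums (low-frequency part) and
    scaled pairwise differences (high-frequency part).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step and recover the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


# Hypothetical x-coordinates of one agent over four time steps.
xs = [1.0, 3.0, 2.0, 2.0]
approx, detail = haar_step(xs)
restored = inverse_haar_step(approx, detail)
```

Unlike Fourier coefficients, each Haar coefficient depends on only a short window of time steps, which is why the transform localizes changes in both time and scale.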

Affiliations: School of Computer Science and Technology, Tongji University, Shanghai, China; School of Automotive Engineering, Tongji University, Shanghai, China; Faculty of Computer Science, Chemnitz University of Technology, Chemnitz, Germany; Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China; College of Computing and Data Science, Nanyang Technological University, Singapore; School of Computer Science and Technology, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China

Abstract:
Deep neural networks often exhibit sub-optimal performance under covariate and category shifts. Source-Free Domain Adaptation (SFDA) presents a promising solution to this dilemma, yet most SFDA approaches are restricted to closed-set scenarios. In this paper, we explore Source-Free Universal Domain Adaptation (SF-UniDA) aiming to accurately classify “known” data belonging to common categories and segregate them from target-private “unknown” data. We propose a novel Global and Local Clustering (GLC) technique, which comprises an adaptive one-vs-all global clustering algorithm to discern between target classes, complemented by a local k-NN clustering strategy to mitigate negative transfer. Despite the effectiveness, the inherent closed-set source architecture leads to uniform treatment of “unknown” data, impeding the identification of distinct “unknown” categories. To address this, we evolve GLC to GLC++, integrating a contrastive affinity learning strategy. We examine the superiority of GLC and GLC++ across multiple benchmarks and category shift scenarios. Remarkably, in the most challenging open-partial-set scenarios, GLC and GLC++ surpass GATE by 16.8% and 18.9% in H-score on VisDA, respectively. GLC++ enhances the novel category clustering accuracy of GLC by 4.1% in open-set scenarios on Office-Home. Furthermore, the introduced contrastive learning strategy not only enhances GLC but also significantly facilitates existing methodologies.
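
The H-score quoted above is commonly defined in universal domain adaptation work as the harmonic mean of the accuracy on known (common) classes and the accuracy on unknown (target-private) samples. The sketch below assumes that common definition; the function name and the sample numbers are illustrative and are not taken from the paper.

```python
def h_score(acc_known, acc_unknown):
    """Harmonic mean of known-class accuracy and unknown-class accuracy.

    Both inputs are fractions in [0, 1]. The score is high only when the
    model does well on both groups at the same time.
    """
    total = acc_known + acc_unknown
    if total == 0:
        return 0.0
    return 2.0 * acc_known * acc_unknown / total


# Illustrative values only: strong on known classes, weaker on unknowns.
score = h_score(0.8, 0.6)
```

Because it is a harmonic mean, a model that labels everything as "known" (or everything as "unknown") receives a score of zero, which a plain average would not penalize as strongly.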

Abstract:
Most clustering validity indexes (CVIs) for fuzzy clustering are based upon the fuzzy c-means (FCM) algorithm, and the effect of these CVIs is limited due to the “uniform effect” of FCM. Besides, the main existing CVIs suffer from an incomplete characterization of separateness and weak performance on noisy datasets. To address these challenges, the multi-granularity fusion (MGF) index is proposed. First, MGF jointly considers the FCM, possibilistic fuzzy c-means, and kernel-based FCM algorithms, which is more comprehensive than considering FCM alone. Second, we add a perturbation to the sum of the partition matrix as the fuzzy cardinality and combine it with the fuzzy weighted distance, which helps capture the compactness. Third, four elements are considered together to characterize the separateness: the minimum distance, the maximum distance, the mean distance, and the sample variance of the cluster centers, where the last one can make the separateness unbiased from the macroscopic perspective. Besides, the convergence of MGF is proved. Finally, we test MGF with five algorithms on 36 datasets against 14 CVIs, validating the accuracy and stability of MGF. It is observed that MGF achieves better results than the other CVIs, especially on high-dimensional and noisy datasets.
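
For readers unfamiliar with the partition matrix that fuzzy CVIs operate on, here is a minimal sketch of the standard FCM membership update for a single point. It shows only the textbook formula and says nothing about how MGF itself is computed; the function name and the fuzzifier default `m=2.0` are assumptions for illustration.

```python
def fcm_memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in each cluster (fuzzifier m > 1).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), with d_i the Euclidean
    distance from the point to center i. Memberships sum to 1.
    """
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
        for center in centers
    ]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]


# A point at the origin with centers at distance 1 and 3 along the x-axis.
u = fcm_memberships([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
```

Each row of the partition matrix is one such membership vector, so summing a column gives the fuzzy cardinality of a cluster referred to in the abstract.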

Abstract:
Shape from focus (SFF) is a technique used to estimate the depth of a scene from a sequence of multifocus images. Existing SFF methods can be categorized into two groups: traditional methods and deep learning-based methods. Traditional methods typically employ a focus measure (FM) operator to assess the sharpness of individual pixels in a single-frame image, often overlooking the associations within the image sequence. Deep learning methods generally rely on labeled datasets, which are often challenging to obtain in real-world scenarios. Based on these observations, we propose a novel sequence association-based (SAS) framework aimed at enhancing the generalizability of SFF methods. In the SAS framework, an image sequence is treated as complete three-dimensional (3D) data throughout the processes of multiview decomposition, selective fusion and multiscale feature aggregation. Furthermore, the framework includes a tighter multiview learning generalization error bound to guide the development of the selective fusion method. This method leverages isomorphisms among multiple views to effectively mitigate the adverse effects of outlier noise on the reconstruction of various scenes. Comprehensive experiments on seven synthetic datasets and two real scenes with unknown labels demonstrate the effectiveness and generalizability of the SAS framework compared to state-of-the-art SFF methods.

Abstract:
Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs.

Abstract:
Cross-view video generation from exocentric (third-person) to egocentric (first-person) perspectives poses a challenging task, due to the significant viewpoint gap and limited overlap between these two views. Previous methods exhibit limitations in capturing long-range temporal context and overlook egocentric semantic priors, leading to degraded performance in cross-view synthesis. To address these challenges, we propose a cue-free video-based approach termed cascaded Dynamic memory Refinement and Semantic Alignment (DRSA), which integrates temporal knowledge over extended periods and learns egocentric semantic information to generate videos. The Dynamic Memory Refinement (DMR) exploits long-horizon temporal dynamics to learn salient information that compensates for the limited overlap between views. Specifically, we devise a dynamic memory that serves as a knowledge repository, and utilize a sliding window to locate the corresponding long-term temporal information, which is subsequently processed with adaptive weighting and a cross-attention transformer to refine feature representations. Furthermore, aware of the considerable viewpoint divergence that hinders semantic learning of the target view, we propose Viewpoint-aware Semantic Alignment (VSA) with dual encoder-decoder learning and semantic alignment, which transfers egocentric semantic details from the egocentric synthesis pipeline to the exocentric synthesis pipeline. In particular, the VSA module narrows the semantic gap between views, further promoting long-range temporal modeling in DMR under alignment constraints. By extending this into a cascaded fashion, the Cascaded Alignment and Refinement (CAR) progressively aligns semantic features and performs feature refinement to facilitate viewpoint learning at different levels of granularity. To overcome the limitations of existing databases known for their limited static scenes and scarcity of interacting objects, we create a new dataset with dynamic exocentric scenes and rich interacting objects to further promote the task. Thorough experimental analysis reveals that our method surpasses current state-of-the-art techniques in terms of both quantitative metrics and qualitative evaluations.

Abstract:
In this paper, we develop the notion of the difference of evidence lower bounds (DELBO), based on which an efficient score algorithm is presented to implement feature selection on latent variables of VAE and its variants. Furthermore, we propose marginalization approximation algorithms to optimize VAE-related models by weighting the selected “more important” latent variables and accordingly increasing the evidence lower bound. We discuss two kinds of Gaussian posteriors for latent variables, mean-field and full-covariance, and provide the corresponding theoretical analyses to support the effectiveness of the algorithms. Extensive comparative experiments on generative tasks are carried out between our algorithms and 9 other feature selection methods on 7 public datasets. The results demonstrate the superior performance of our algorithms. Finally, we extend DELBO to a generalized version and apply it to classification tasks on 5 new public datasets with satisfactory experimental results.

Abstract:
Adversarial training has been empirically demonstrated as an effective strategy to improve the robustness of deep neural networks (DNNs) against adversarial examples. However, the underlying reason of its effectiveness is still non-transparent. In this paper we conduct both extensive theoretical and empirical analysis on the implicit bias induced by adversarial training from a generalized margin perspective. Our results focus on adversarial training for homogeneous DNNs. In particular, (i) For deep linear networks with $\ell_p$-norm perturbation, we show that weight matrices of adjacent layers get aligned and the converged parameters maximize the margin of adversarial examples, which can be further viewed as a generalized margin of the original dataset that can be achieved by an interpolation solution between $\ell_2$-SVM and $\ell_q$-SVM where $1/p + 1/q = 1$. (ii) For general homogeneous DNNs, including both linear and nonlinear ones, we investigate adversarial training with a variety of adversarial perturbations in a unified manner. Specifically, we show that the direction of the limit point of parameters converges to a KKT point of a constrained optimization problem that aims to maximize the margin for adversarial examples. Additionally, as an application of this general result for two special linear homogeneous DNNs, diagonal linear networks and linear convolutional networks, we show that adversarial training with $\ell_p$-norm perturbation equivalently minimizes an interpolation norm that depends on the depth, the architecture, and the value of $p$ in the predictor space. Extensive experiments are conducted to verify theoretical claims. Our results theoretically provide the basis for the longstanding folklore (Madry et al., 2018) that adversarial training modifies the decision boundary by utilizing adversarial examples to improve robustness, and potentially provide insights for designing new robust training strategies.
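
The pairing of $p$ and $q$ with $1/p + 1/q = 1$ comes from norm duality. As a small worked illustration, consider the simplest possible case, a single linear classifier rather than the deep networks studied in the paper: under an $\ell_\infty$ perturbation of size eps, the worst-case margin equals the clean margin minus eps times the $\ell_1$ norm of the weights (here $p = \infty$, $q = 1$). The sketch below checks this standard fact numerically; all function names are hypothetical.

```python
def linear_margin(w, x, y):
    """Signed margin y * <w, x> of a linear classifier on one example."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


def worst_case_linf_margin(w, x, y, eps):
    """Smallest margin over all perturbations with ||delta||_inf <= eps.

    By duality this is the clean margin minus eps * ||w||_1.
    """
    return linear_margin(w, x, y) - eps * sum(abs(wi) for wi in w)


def worst_case_linf_perturbation(w, y, eps):
    """Perturbation attaining the worst case: -eps * y * sign(w), per coordinate."""
    def sign(v):
        return (v > 0) - (v < 0)
    return [-eps * y * sign(wi) for wi in w]


# Toy example with label y = +1.
w, x, y, eps = [1.0, -2.0], [3.0, 1.0], 1, 0.5
delta = worst_case_linf_perturbation(w, y, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
attacked = linear_margin(w, x_adv, y)        # margin after the explicit attack
closed_form = worst_case_linf_margin(w, x, y, eps)  # same value via duality
```

The explicit attack and the closed form agree, which is the elementary mechanism behind the appearance of the dual exponent $q$ in margin statements for $\ell_p$-bounded perturbations.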

Abstract:
As neural networks are trained to perform tasks of increasing complexity, their size increases, which presents several challenges in their deployment on devices with limited resources. To cope with this, a recently proposed approach hinges on substituting the classical Multiply-and-ACcumulate (MAC) neurons in the hidden layers with other neurons called Multiply-And-Max/min (MAM) whose selective behavior helps identify important interconnections, thus allowing aggressive pruning of the others. Hybrid MAM&MAC structures promise a 10x or even 100x reduction in their memory footprint compared to what can be obtained by pruning MAC-only structures. However, a cornerstone of maintaining this promise is the assumption that MAC&MAM architectures have the same expressive power as MAC-only ones. To concretize such a cornerstone, we take here a step in the theoretical characterization of the capabilities of mixed MAM&MAC networks. We prove, with two theorems, that two hidden MAM layers followed by a MAC neuron with possibly a normalization stage is a universal approximator.
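
The abstract does not spell out the neuron formulas, so the following sketch rests on an assumption: that a MAM neuron replaces the sum over all weighted inputs with the sum of only the largest and the smallest weighted input. Under that assumption, the contrast with a classical MAC neuron looks like this; the function names are hypothetical.

```python
def mac_neuron(weights, inputs, bias=0.0):
    """Classical Multiply-and-ACcumulate: sum of all weighted inputs plus bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def mam_neuron(weights, inputs, bias=0.0):
    """Assumed Multiply-And-Max/min form: only the largest and the smallest
    weighted input contribute to the output (this formula is an assumption,
    not stated in the abstract)."""
    products = [w * x for w, x in zip(weights, inputs)]
    return max(products) + min(products) + bias


weights = [1.0, -2.0, 0.5]
inputs = [2.0, 1.0, 4.0]
mac_out = mac_neuron(weights, inputs)
mam_out = mam_neuron(weights, inputs)
```

In this reading, for any given input at most two connections determine the output, which gives an intuition for why such neurons would expose the important interconnections and tolerate aggressive pruning of the rest.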

Abstract:
Event cameras draw inspiration from biological systems, boasting low latency and high dynamic range while consuming minimal power. The most common approach to processing Event Cloud often involves converting it into frame-based representations, which neglects the sparsity of events, loses fine-grained temporal information, and increases the computational burden. In contrast, Point Cloud is a popular representation for processing 3-dimensional data and serves as an alternative method to exploit local and global spatial features. Nevertheless, previous point-based methods show unsatisfactory performance compared to frame-based methods in dealing with spatio-temporal event streams. In order to bridge the gap, we propose EventMamba, an efficient and effective framework based on Point Cloud representation by rethinking the distinction between Event Cloud and Point Cloud, emphasizing vital temporal information. The Event Cloud is subsequently fed into a hierarchical structure with staged modules to process both implicit and explicit temporal features. Specifically, we redesign the global extractor to enhance explicit temporal extraction among a long sequence of events with temporal aggregation and State Space Model (SSM) based Mamba. Our model consumes minimal computational resources in the experiments and still exhibits SOTA point-based performance on six different scales of action recognition datasets. It even outperforms all frame-based methods on both Camera Pose Relocalization (CPR) and eye-tracking regression tasks.

Abstract:
The objective of few-shot object detection (FSOD) is to detect novel objects with few training samples. The key challenge is constructing a generalized feature space for novel categories with limited data, leveraging the base category space to adapt the detection model. Most fine-tuning methods address this by pre-training on base categories and fine-tuning on novel ones. However, limited novel samples lead to two issues: (1) the features of the novel category are easily implicitly represented by the features of the base category, leading to inseparable classifier boundaries, (2) novel categories with fewer data are not enough to fully represent the distribution, where the model fine-tuning is prone to overfitting. To address these issues, we propose a generalized feature learning method for FSOD by leveraging side information. Specifically, we first construct a knowledge matrix from embedded side information to model semantic relations between base and novel categories. Then, to strengthen the discrimination between semantically similar categories, we further develop contextual semantic supervised contrastive learning which embeds side information. To mitigate overfitting from limited samples, we introduce a side-information guided region-aware masking module that improves sample diversity by removing biased information through counterfactual explanations. Extensive experiments using ResNet and ViT backbones on PASCAL VOC, MS COCO, LVIS V1, FSOD-1K, and FSVOD-500 benchmarks demonstrate that our model outperforms the previous state-of-the-art methods, significantly improving the ability of FSOD in most shots/splits.

Abstract:
Although supervised deep normal estimators have recently shown impressive results on synthetic benchmarks, their performance deteriorates significantly in real-world scenarios due to the domain gap between synthetic and real data. Building high-quality real training data to boost those supervised methods is not trivial because point-wise annotation of normals for varying-scale real-world 3D scenes is a tedious and expensive task. This paper introduces PointNorm-Net, the first self-supervised deep learning framework to tackle this challenge. The key novelty of PointNorm-Net is a three-stage multi-modal normal distribution estimation paradigm that can be integrated into either deep or traditional optimization-based normal estimation frameworks. Extensive experiments show that our method achieves superior generalization and outperforms state-of-the-art conventional and deep learning approaches across three real-world datasets that exhibit distinct characteristics compared to the synthetic training data.

Abstract:
Exogenous variables are specially used in Structural Causal Models (SCM), which, however, have some characteristics that are still useful under the property of the Bayesian network. In this paper, we propose a novel causal discovery learning algorithm called Endogenous and Exogenous Markov Blankets Intersection (EEMBI), which combines the properties of Bayesian networks and SCM. Through intersecting the Markov blankets of exogenous variables and endogenous variables (the original variables), EEMBI can remove the irrelevant connections and find the true causal structure theoretically. Furthermore, we propose an extended version of EEMBI, named EEMBI-PC, which integrates the last step of the PC algorithm into EEMBI. This extension enhances the algorithm's performance by leveraging the strengths of both approaches. Extensive experiments show that EEMBI has state-of-the-art performance on continuous datasets, and EEMBI-PC outperforms other algorithms on discrete datasets.

Abstract:
Room layout estimation seeks to infer the overall spatial configuration of indoor scenes using perspective or panoramic images. As the layout is determined by the dominant indoor planes, this problem inherently requires the reconstruction of these planes. Some studies reconstruct indoor planes from perspective images by learning pixel-level or instance-level plane parameters. However, directly learning these parameters has the problems of susceptibility to occlusions and position dependency. In this paper, we introduce the Comprehensive depth map to Planar depth (C2P) conversion, which reformulates planar depth reconstruction into the prediction of a comprehensive depth map and planar visibility confidence. Based on the parametric representation of planar depth we propose, the C2P conversion is applicable to both panoramic and perspective images. Accordingly, we present an effective framework for room layout estimation that jointly learns the comprehensive depth map and planar visibility confidence. Due to the differentiability of the C2P conversion, our network autonomously learns planar visibility confidence by constraining the estimated plane parameters and reconstructed planar depth map. We further propose a novel approach for 3D layout generation through sequential planar depth map integration. Experimental results demonstrate the superiority of our method across all evaluated panoramic and perspective datasets.

Abstract:
Detecting keypoints on diverse objects is essential for fine-grained visual understanding and analysis. This paper introduces Enhanced Explicit Box Detection (ED-Pose++), an end-to-end framework that leverages cascade box regression to realize both conventional and interactive multi-object keypoint detection. Unlike traditional one-stage methods, ED-Pose++ innovatively redefines multi-object keypoint detection as a dual-phase explicit box detection, achieving a unified representation and regression optimization process. Specifically, an object detection decoder first extracts each object’s position and global features, establishing a good initialization for subsequent keypoint detection. To bring in contextual information near keypoints, we also regard each keypoint as a small box to learn both positions and their related local contents. In practice, an object-to-keypoint detection decoder adopts a collaborative learning strategy between object and keypoint features, facilitating efficient information propagation between global and local perspectives. Building on this architecture, we further equip dual-phase box detection with an interactive mechanism that enables the model to refine its predictions based on limited user feedback. During training, we incorporate an error correction scheme to equip the model with an adept self-correction capability for use during inference. Comprehensive experiments demonstrate ED-Pose++’s superior performance in conventional multi-object keypoint detection tasks. For the first time, ED-Pose++ outperforms heatmap-based top-down approaches across various benchmarks, despite operating within a fully end-to-end architecture. The interactive variant also reduces the labeling effort of 2D keypoint annotation by more than 10 times compared with manual-only annotation.

Abstract:
Point cloud semantic segmentation can enhance the understanding of the production environment and is a crucial component of vision tasks. The efficacy and generalization prowess of deep learning-based segmentation models are inherently contingent upon the quality and nature of the data employed in their training. However, it is often challenging to obtain data with inter-class balance, and training an intelligent segmentation network with the imbalanced data may cause cognitive bias. In this paper, a network framework InvSpaceNet is proposed, which generates an inverse feature space to alleviate the cognitive bias caused by imbalanced data. Specifically, we design a dual-branch training architecture that combines the superior feature representations derived from instance-balanced sampling data with the cognitive corrections introduced by the proposed inverse sampling data. In the inverse feature space of the point cloud generated by the auxiliary branch, the central points aggregated by class are constrained by the contrastive loss. To refine the class cognition in the inverse feature space, features are used to generate point cloud class prototypes through momentum update. These class prototypes from the inverse space are utilized to generate feature maps and structure maps that are aligned with the positive feature space of the main branch segmentation network. The training of the main branch is dynamically guided through gradients back propagated from different losses. Extensive experiments conducted on four large benchmarks (i.e., S3DIS, ScanNet v2, Toronto-3D, and SemanticKITTI) demonstrate that the proposed method can effectively mitigate point cloud imbalance issues and improve segmentation performance.

Abstract:
Recent neural-network-based surface reconstruction can be roughly divided into two categories, one warping templates explicitly and the other representing 3D surfaces implicitly. To enjoy the advantages of both, we propose a novel 3D representation, Neural Vector Fields (NVF), which adopts the explicit learning process to manipulate meshes and implicit unsigned distance function (UDF) representation to break the barriers in resolution and topology. This is achieved by directly predicting the displacements from surface queries and modeling shapes as Vector Fields, rather than relying on network differentiation to obtain direction fields as most existing UDF-based methods do. In this way, our approach is capable of encoding both the distance and the direction fields so that the calculation of direction fields is differentiation-free, circumventing the non-trivial surface extraction step. Furthermore, building upon NVFs, we propose to incorporate two types of shape codebooks, i.e., NVFs (Lite or Ultra), to promote cross-category reconstruction through encoding cross-object priors. Moreover, we propose a new regularization based on analyzing the zero-curl property of NVFs, and implement this through the fully differentiable framework of our NVF (Ultra). We evaluate both NVFs on four surface reconstruction scenarios, including watertight vs non-watertight shapes, category-agnostic reconstruction vs category-unseen reconstruction, category-specific, and cross-domain reconstruction.

Abstract:
The latest trends in the research field of single-view human reconstruction are devoted to learning deep implicit functions constrained by explicit body shape priors. Despite the remarkable performance improvements compared with traditional processing pipelines, existing learning approaches still exhibit limitations in terms of flexibility, generalizability, robustness, and/or representation capability. To comprehensively address the above issues, in this paper, we investigate an explicit point-based human reconstruction framework named HaP, which utilizes point clouds as the intermediate representation of the target geometric structure. Technically, our approach features fully explicit point cloud estimation (exploiting depth and SMPL), manipulation (SMPL rectification), generation (built upon diffusion), and refinement (displacement learning and depth replacement) in the 3D geometric space, instead of an implicit learning process that can be ambiguous and less controllable. Extensive experiments demonstrate that our framework achieves quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results. Our promising results may indicate a paradigm rollback to the fully-explicit and geometry-centric algorithm design. In addition, we newly contribute a real-scanned 3D human dataset featuring more intricate geometric details.

Abstract:
Hyperspectral super-resolution is commonly accomplished by fusing a hyperspectral image of low spatial resolution with a multispectral image of high spatial resolution, and many tensor-based approaches to this task have been recently proposed. Yet, it is assumed in such tensor-based methods that the spatial-blurring operation that creates the observed hyperspectral image from the desired super-resolved image is separable into independent horizontal and vertical blurring. Recent work has argued that such separable spatial degradation is ill-equipped to model the operation of real sensors which may exhibit, for example, anisotropic blurring. To accommodate this fact, a generalized tensor formulation based on a Kronecker decomposition is proposed to handle any general spatial-degradation matrix, including those that are not separable as previously assumed. Analysis of the generalized formulation reveals conditions under which exact recovery of the desired super-resolved image is guaranteed, and a practical algorithm for such recovery, driven by a blockwise-group-sparsity regularization, is proposed. Extensive experimental results demonstrate that the proposed generalized tensor approach outperforms not only traditional matrix-based techniques but also state-of-the-art tensor-based methods; the gains with respect to the latter are especially significant in cases of anisotropic spatial blurring.

Abstract:
The research of class-incremental semantic segmentation (CISS) seeks to enhance semantic segmentation methods by enabling the progressive learning of new classes while preserving knowledge of previously learned ones. A significant yet often neglected challenge in this domain is class imbalance. In CISS, each task focuses on different foreground classes, with the training set for each task exclusively comprising images that contain these currently focused classes. This results in an overrepresentation of these classes within the single-task training set, leading to a classification bias towards them. To address this issue, we propose a novel CISS method named STAR, whose core principle is to reintegrate the missing proportions of previous classes into current single-task training samples by replaying their prototypes. Moreover, we develop a prototype deviation technique that enables the deduction of past-class prototypes, integrating the recognition patterns of the classifiers and the extraction patterns of the feature extractor. With this technique, replay can be accomplished without using any storage to save prototypes. Complementing our method, we devise two loss functions to enforce cross-task feature constraints: the Old-Class Features Maintaining (OCFM) loss and the Similarity-Aware Discriminative (SAD) loss. The OCFM loss is designed to stabilize the feature space of old classes, thus preserving previously acquired knowledge without compromising the ability to learn new classes. The SAD loss aims to enhance feature distinctions between similar old and new class pairs, minimizing potential confusion. Our experiments on two public datasets, Pascal VOC 2012 and ADE20K, demonstrate that STAR achieves state-of-the-art performance.

Abstract:
Neural Radiance Fields (NeRFs) have shown promising results in novel view synthesis. While achieving state-of-the-art rendering results, NeRF usually encodes all properties related to geometry and appearance of the scene together into several MLP (Multi-Layer Perceptron) networks, which hinders downstream manipulation of geometry, appearance and illumination. Recently, researchers have attempted to edit geometry, appearance and lighting for NeRF. However, they fail to render view-consistent results after editing the appearance of the input scene. Moreover, many approaches use Spherical Gaussian (SG) or Spherical Harmonic (SH) functions, or low-resolution environment maps to model lighting. These methods, however, struggle with high-frequency environmental relighting. While some approaches utilize high-resolution environment maps, the strategy of jointly optimizing geometry, material, and lighting introduces additional ambiguity. To solve the above problems, we propose VD-NeRF, a visibility-aware approach to decoupling view-independent appearance and view-dependent appearance in the scene with a hybrid lighting representation. Specifically, we first train a signed distance function to reconstruct an explicit mesh for the input scene. Then a decoupled NeRF learns to attach view-independent appearance to the reconstructed mesh by defining learnable disentangled features representing geometry and view-independent appearance on its vertices. For lighting, we approximate it with an explicit learnable environment map and an implicit lighting network to support both low-frequency and high-frequency relighting. By modifying the view-independent appearance, rendered results are consistent across different viewpoints. Our method also supports high-frequency environmental relighting by replacing the explicit environment map with a novel one and fitting the implicit lighting network to the novel environment map. We further take visibility into consideration when rendering and decoupling the input 3D scene, which improves the quality of decomposition and relighting results and also enables more downstream applications such as scene composition where occlusions between scenes are common. Extensive experiments show that our method achieves better editing and relighting performance both quantitatively and qualitatively compared to previous methods.

Abstract:
Recently, some weakly supervised 3D point cloud segmentation methods have been proposed to develop effective models with minimum annotation efforts. Our previous work, W4DTS, proposes a challenging task that utilizes only 0.001% of the points in outdoor point cloud datasets to achieve an effective segmentation model. However, under an extremely limited annotation budget, the quality of pseudo labels generated by W4DTS is unsatisfactory, which limits the segmentation performance in such scenarios. To solve this issue, we propose a progressive 4D grouping approach to group the annotated and unannotated points across space and time, which can generate high-quality pseudo labels with very sparse annotated points. Moreover, we design cross-frame contrastive learning and local consistency learning to further improve the quality of our 4D grouping. Experimental results reveal that with only 0.001% annotations, our solution significantly outperforms the previous best approach on SemanticKITTI. We also evaluate our framework on the SemanticPOSS dataset and ScribbleKITTI dataset, and achieve performances close to our fully supervised backbone models.

Abstract:
Dictionary learning is an effective tool for pattern recognition and classification of time series data. However, real-world time series data often exhibit temporal misalignment due to temporal delay, scaling or other temporal transformations, which poses significant challenges for effective dictionary learning. Dynamic time warping (DTW) is commonly used for dealing with such misalignment issues. Nevertheless, DTW suffers from overfitting or information loss due to its discrete nature in aligning time series data. To address this issue, we propose a generalized time warping invariant dictionary learning algorithm in this paper. Our approach features a generalized time warping operator, which consists of linear combinations of continuous basis functions for facilitating continuous temporal warping. The integration of the proposed operator and the dictionary learning is formulated as an optimization problem, where the block coordinate descent method is employed to jointly optimize warping paths, dictionaries, and sparse coefficients. The optimized results are then used as hyperspace distance measures to feed classification and clustering algorithms. The superiority of the proposed method in terms of dictionary learning, classification, and clustering is validated on ten public datasets in comparison with various benchmark methods.
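For reference, the discrete alignment that this abstract contrasts itself with can be sketched as the textbook DTW recurrence below. This is the classic algorithm with an absolute-difference local cost (an assumption of this sketch), not the generalized continuous warping operator the paper proposes.

```python
from math import inf

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Each step either repeats a sample of one series or advances both.
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because every sample must be matched to whole samples of the other series, a stretched copy such as `[1, 2, 2, 3]` aligns with `[1, 2, 3]` at zero cost by repeating one sample. That sample-level matching is the discreteness the abstract refers to.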

Abstract:
While achieving tremendous success in various fields, existing multi-agent reinforcement learning (MARL) with a black-box neural network makes decisions in an opaque manner that hinders humans from understanding the learned knowledge and how input observations influence decisions. In contrast, existing interpretable approaches usually suffer from weak expressivity and low performance. To bridge this gap, we propose MIXing Recurrent soft decision Trees (MIXRTs), a novel interpretable architecture that can represent explicit decision processes via the root-to-leaf path and reflect each agent’s contribution to the team. Specifically, we construct a novel soft decision tree using a recurrent structure and demonstrate which features influence the decision-making process. Then, based on the value decomposition framework, we linearly assign credit to each agent by explicitly mixing individual action values to estimate the joint action value using only local observations, providing new insights into interpreting the cooperation mechanism. Theoretical analysis confirms that MIXRTs guarantee additivity and monotonicity in the factorization of joint action values. Evaluations on complex tasks like Spread and StarCraft II demonstrate that MIXRTs compete with existing methods while providing clear explanations, paving the way for interpretable and high-performing MARL systems.
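The additivity and monotonicity properties mentioned above can be illustrated with a toy linear mixer. In this sketch the non-negative `weights` are fixed, hypothetical numbers; in the paper they would come from the learned model, and the function names here are illustrative only. With non-negative weights, raising any agent's value never lowers the joint value, so each agent acting greedily on its own values also maximizes the mixed joint value.

```python
from itertools import product

def mix(q_values, weights, bias=0.0):
    """Linearly mix per-agent action values into one joint value."""
    if any(w < 0 for w in weights):
        raise ValueError("monotonic mixing needs non-negative weights")
    return sum(w * q for w, q in zip(weights, q_values)) + bias

def greedy_joint_action(per_agent_q):
    """Each agent picks the action with its own highest value."""
    return [max(range(len(q)), key=q.__getitem__) for q in per_agent_q]

def brute_force_joint_action(per_agent_q, weights):
    """Search every joint action for the highest mixed value."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return list(max(
        joint_actions,
        key=lambda acts: mix([q[a] for q, a in zip(per_agent_q, acts)], weights),
    ))

# Hypothetical values: agent 0 has two actions, agent 1 has three.
per_agent_q = [[1.0, 3.0], [2.0, 0.5, 1.5]]
weights = [0.7, 0.3]
```

Here the decentralized greedy choice and the exhaustive search agree, and the weights read directly as each agent's share of the joint value.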

Abstract:
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.

Abstract:
High-quality private machine learning (ML) data stored in local data centers becomes a key competitive factor for AI corporations. In this paper, we present a novel insider attack called Matryoshka to reveal the possibility of breaking the privacy of ML data even with no exposed interface. Our attack employs a scheduled-to-publish DNN model as a carrier model for covert transmission of secret models which memorize the information of private ML data that otherwise has no interface to the outsider. At the core of our attack, we present a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Our approach simultaneously achieves: (i) High Capacity – With almost no utility loss of the carrier model, Matryoshka can transmit over 10,000 real-world data samples within a carrier model which has 220× fewer parameters than the total size of the stolen data, and simultaneously transmit multiple heterogeneous datasets or models within a single carrier model under a trivial distortion rate, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency – after downloading the published carrier model, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness – almost all the recovered models either perform similarly to models trained independently on the private data, or can be further used to extract memorized raw training data with low error; (iv) Robustness – Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness – A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.

Abstract:
We developed an intelligent innovative orientation method to improve the accuracy of polarization compasses in harsh conditions: weak skylight polarization patterns resulting from unfavorable weather conditions (e.g., haze, sandstorms) or locally destroyed skylight polarization conditions caused by occlusions (e.g., buildings, trees). First, the skylight polarization status was determined with the degree of linear polarization threshold analysis method and a bionic polarization enhancement sensing model was constructed to simulate the enhanced perception mechanism identified in the Syrphidae visual neural pathway, highly efficient in dark or weakly illuminated environments. The bionic model successfully enhanced the information content extracted from weak polarization patterns. Second, polarization pixel interferences, caused by occlusions under locally destroyed skylight polarization conditions, were removed with a convolutional neural network for image segmentation and the sky area of interest was identified. Finally, the incomplete angle of polarization map derived after image segmentation was fitted using our optimized adaptive antisymmetric ring algorithm. On the basis of the strong angle-of-polarization antisymmetry along the solar meridian, information extracted from the sparse and irregular polarization pixels was analyzed to derive a high-accuracy polarization orientation solution. Overall, the method combines pattern analysis with deep-learning-based processing to manage polarization disorientation efficiently. The experimental results demonstrated the performance of the proposed method in compensating for reduced orientation accuracy under degraded polarization conditions, its robustness against perturbations, and its beneficial impact on the environmental adaptability of bionic polarization compasses.
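The two quantities this abstract relies on, degree of linear polarization (DoLP) and angle of polarization (AoP), have standard definitions in terms of the linear Stokes parameters. The sketch below computes them from four polarizer-channel intensities at 0°, 45°, 90°, and 135°, a common layout for polarization cameras that is assumed here. The `weak_threshold` value is a hypothetical placeholder, not the threshold used in the paper.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-channel intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # 0° versus 90° preference
    s2 = i45 - i135                     # 45° versus 135° preference
    return s0, s1, s2

def dolp_aop(s0, s1, s2):
    """Degree (0..1) and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aop = 0.5 * math.atan2(s2, s1)
    return dolp, aop

def is_weakly_polarized(dolp, weak_threshold=0.1):
    """Flag a pixel as weakly polarized; the threshold is a made-up example."""
    return dolp < weak_threshold
```

For fully polarized light at 45° the channels read `(0.5, 1.0, 0.5, 0.0)`, giving a DoLP of 1 and an AoP of π/4. For unpolarized light all channels are equal and the DoLP is 0.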

Abstract:
Accurate representations of 3D faces are of paramount importance in various computer vision and graphics applications. However, the challenges persist due to the limitations imposed by data discretization and model linearity, which hinder the precise capture of identity and expression clues in current studies. This paper presents a novel 3D morphable face model, named ImFace++, to learn a sophisticated and continuous space with implicit neural representations. ImFace++ first constructs two explicitly disentangled deformation fields to model complex shapes associated with identities and expressions, respectively, which simultaneously facilitate automatic learning of point-to-point correspondences across diverse facial shapes. To capture more sophisticated facial details, a refinement displacement field within the template space is further incorporated, enabling fine-grained learning of individual-specific facial details. Furthermore, a Neural Blend-Field is designed to reinforce the representation capabilities through adaptive blending of an array of local fields. In addition to ImFace++, we devise an improved learning strategy to extend expression embeddings, allowing for a broader range of expression variations. Comprehensive qualitative and quantitative evaluation demonstrates that ImFace++ significantly advances the state-of-the-art in terms of both face reconstruction fidelity and correspondence accuracy.

Abstract:
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++ that effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP) as it has less distortion and meanwhile slits the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, as the distinct projections make it less possible to directly transfer knowledge between domains, we then propose Reliable Panoramic Prototype Adaptation Module (RP$^2$AM) to transfer knowledge at both prediction and prototype levels. RP$^2$AM selects the confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on the synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.

Abstract:
Restoration tasks in low-level vision aim to restore high-quality (HQ) data from their low-quality (LQ) observations. To circumvent the difficulty of acquiring paired data in real scenarios, unpaired approaches that aim to restore HQ data solely on unpaired data are drawing increasing interest. Since restoration tasks are tightly coupled with the degradation model, unknown and highly diverse degradations in real scenarios make learning from unpaired data quite challenging. In this paper, we propose a degradation representation learning scheme to address this challenge. By learning to distinguish various degradations in the representation space, our degradation representations can extract implicit degradation information in an unsupervised manner. Moreover, to handle diverse degradations, we develop degradation-aware (DA) convolutions with flexible adaptation to various degradations to fully exploit the degradation information in the learned representations. Based on our degradation representations and DA convolutions, we introduce a generic framework for unpaired restoration tasks. Based on our framework, we propose UnIRnet and UnPRnet for unpaired image and point cloud restoration tasks, respectively. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on unpaired image and point cloud restoration tasks show that our UnIRnet and UnPRnet achieve state-of-the-art performance.

Abstract:
Federated learning (FL) commonly encourages the clients to perform multiple local updates before the global aggregation, thus avoiding frequent model exchanges and relieving the communication bottleneck between the server and clients. Though empirically effective, the negative impact of multiple local updates on the stability of FL has not been thoroughly studied, which may result in globally unstable and slow convergence. Based on sensitivity analysis, we define in this paper a local-update stability index for the general FL, as measured by the maximum inter-client model discrepancy after the multiple local updates that mainly stems from the data heterogeneity. It makes it possible to determine how much the variation of clients' models with multiple local updates may influence the global model, and can also be linked with the convergence and generalization. We theoretically derive the proposed local-update stability for current state-of-the-art FL methods, providing possible insight into their motivation and limitation from a new perspective of stability. For example, naively executing the parallel acceleration locally at clients would harm the local-update stability. Motivated by this, we then propose a novel accelerated yet stabilized FL algorithm (named FedANAG) based on the server- and client-level Nesterov accelerated gradient (NAG). In FedANAG, the global and local momenta are elaborately designed and alternately updated, while the stability of local update is enhanced with the help of the global momentum. We prove the convergence of FedANAG for strongly convex, general convex and non-convex settings. We then conduct evaluations on both the synthetic and real-world datasets to first validate our proposed local-update stability. The results further show that across various data heterogeneity and client participation ratios, FedANAG not only accelerates the global convergence by reducing the required number of communication rounds to a target accuracy, but also eventually converges to a higher accuracy.
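The idea of measuring stability by the maximum inter-client model discrepancy can be illustrated with a toy one-dimensional problem. In this sketch each client minimizes a quadratic loss 0.5 (w − c)² with its own optimum c, which stands in for data heterogeneity. The quadratic losses, plain gradient descent, and function names are all assumptions made for illustration, not the paper's formal index or the FedANAG algorithm.

```python
from itertools import combinations

def local_updates(w0, center, lr, steps):
    """Run plain gradient descent on 0.5 * (w - center)**2 from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * (w - center)  # gradient of the quadratic is (w - center)
    return w

def max_discrepancy(w0, centers, lr, steps):
    """Largest pairwise gap between client models after local training."""
    models = [local_updates(w0, c, lr, steps) for c in centers]
    return max(abs(a - b) for a, b in combinations(models, 2))

# Two heterogeneous clients starting from the same global model w0 = 0.
centers = [-1.0, 1.0]
gaps = [max_discrepancy(0.0, centers, lr=0.1, steps=k) for k in (1, 5, 10)]
```

In this toy setting the gap has the closed form (1 − (1 − lr)^K) · max|cᵢ − cⱼ|, so it grows with the number of local steps K and with the spread of the client optima, matching the intuition that more local updates on heterogeneous data pull client models further apart.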

Abstract:
In the domain of machine learning, the significance of the loss function is paramount, especially in supervised learning tasks. It serves as a fundamental pillar that profoundly influences the behavior and efficacy of supervised learning algorithms. Traditional loss functions, though widely used, often struggle to handle outlier-prone and high-dimensional data, resulting in suboptimal outcomes and slow convergence during training. In this paper, we address the aforementioned constraints by proposing a novel robust, bounded, sparse, and smooth (RoBoSS) loss function for supervised learning. Further, we incorporate the RoBoSS loss within the framework of support vector machine (SVM) and introduce a new robust algorithm named $\mathcal{L}_{\text{RoBoSS}}$-SVM. For the theoretical analysis, the classification-calibrated property and generalization ability are also presented. These investigations are crucial for gaining deeper insights into the robustness of the RoBoSS loss function in classification problems and its potential to generalize well to unseen data. To validate the potency of the proposed $\mathcal{L}_{\text{RoBoSS}}$-SVM, we assess it on 88 benchmark datasets from KEEL and UCI repositories. Further, to rigorously evaluate its performance in challenging scenarios, we conducted an assessment using datasets intentionally infused with outliers and label noise. Additionally, to exemplify the effectiveness of $\mathcal{L}_{\text{RoBoSS}}$-SVM within the biomedical domain, we evaluated it on two medical datasets: the electroencephalogram (EEG) signal dataset and the breast cancer (BreaKHis) dataset. The numerical results substantiate the superiority of the proposed $\mathcal{L}_{\text{RoBoSS}}$-SVM model, both in terms of its remarkable generalization performance and its efficiency in training time.

Abstract:
Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence’s length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what’s possible in online, skeleton-based action recognition.

Abstract:
Surface reconstruction for point clouds is one of the important tasks in 3D computer vision. The latest methods rely on generalizing the priors learned from large scale supervision. However, the learned priors usually do not generalize well to various geometric variations that are unseen during training, especially for extremely sparse point clouds. To resolve this issue, we present a neural network to directly infer SDFs from single sparse point clouds without using signed distance supervision, learned priors or even normals. Our insight here is to learn surface parameterization and SDF inference in an end-to-end manner. To compensate for the sparsity, we leverage parameterized surfaces as a coarse surface sampler to provide many coarse surface estimations in training iterations, according to which we mine supervision for our thin plate splines (TPS) based network to infer smooth SDFs in a statistical way. Our method significantly improves the generalization ability and accuracy on unseen point clouds. Our experimental results show our advantages over the state-of-the-art methods in surface reconstruction for sparse point clouds under synthetic datasets and real scans.

Abstract:
High-frequency gaze tracking demonstrates significant potential in various critical applications, such as foveated rendering, gaze-based identity verification, and the diagnosis of mental disorders. However, existing eye-tracking systems based on CCD/CMOS cameras either provide tracking frequencies below 200 Hz or employ high-speed cameras, causing high power consumption and bulky devices. While there have been some high-speed eye-tracking datasets and methods based on event cameras, they are primarily tailored for near-eye camera scenarios. They lack the advantages associated with remote camera scenarios, such as the absence of the need for direct contact, improved user comfort and head pose freedom. In this work, we present RGBE-Gaze, the first large-scale and multimodal dataset for high-frequency remote gaze tracking through synchronizing RGB and event cameras. This dataset is collected from 66 participants with diverse genders and age groups. Our setup captures 3.6 million RGB images and 26.3 billion event samples. Additionally, the dataset includes 10.7 million gaze references from the Gazepoint GP3 HD eye tracker and 15,972 sparse points of gaze (PoG) ground truth obtained through manual stimuli clicks by participants. We present dataset characteristics such as head pose, gaze direction, and pupil size. Furthermore, we introduce a hybrid frame-event based gaze estimation method specifically designed for the collected dataset. Moreover, we perform extensive evaluations of different benchmarking methods under various gaze-related factors. The evaluation results illustrate that introducing event stream as a new modality improves gaze tracking frequency and demonstrates greater estimation robustness across diverse gaze-related factors.

Abstract:
Despite significant advancements in simulating the bokeh effect of Digital Single Lens Reflex Camera (DSLR) from an all-in-focus image, challenges remain in processing highlight points, preserving boundary details for in-focus objects and processing high-resolution images efficiently. To tackle these issues, we first develop a ray-tracing-based bokeh simulator. An innovative pipeline with weight redistribution is introduced to handle highlight rendering. By considering the front length of lens barrel, we can simulate realistic cat-eye effect. This bokeh simulator serves as the foundation for creating our training dataset. Building on this dataset, we introduce a hybrid framework BokehMe++, combining a classical renderer and a neural renderer. The classical renderer is implemented by a hierarchical scattering-based method, which suffers from boundary inaccuracies. These erroneous areas will be identified by an error map generator and be corrected by a two-stage neural renderer. Adaptive resizing and iterative upsampling are introduced in the neural renderer to process arbitrary blur size efficiently. Extensive experiments demonstrate that BokehMe++ outperforms existing methods and provides highly customizable rendering features, such as adjustable blur amount, focal plane, highlight mode and cat-eye effect. Furthermore, BokehMe++ can maintain the sharpness of hair details in portraits through an auxiliary alpha map input.

Abstract:
Causal structure learning (CSL) is a prominent technique for encoding cause-and-effect relationships among variables through Bayesian Networks (BNs). Although recovering causal structure solely from data is a challenge, the integration of prior knowledge, revealing partial structural truth, can markedly enhance learning quality. However, current methods based on prior knowledge exhibit limited resilience to errors in the prior, with hard constraint methods disregarding priors entirely, and soft constraints accepting priors based on a predetermined confidence level, which may require expert intervention. To address this issue, we propose a strategy resilient to edge-level prior errors for CSL, thereby minimizing human intervention. We classify prior errors into different types and provide their theoretical impact on the Structural Hamming Distance (SHD) under the presumption of sufficient data. Intriguingly, we discover and prove that the strong hazard of prior errors is associated with a unique acyclic closed structure, defined as “quasi-circle”. Leveraging this insight, a post-hoc strategy is employed to identify the prior errors by their impact on the increment of “quasi-circles”. Through empirical evaluation on both real and synthetic datasets, we demonstrate our strategy’s robustness against prior errors. Specifically, we highlight its substantial ability to resist order-reversed errors while maintaining the majority of correct priors.
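Since the abstract measures the impact of prior errors through the Structural Hamming Distance, a minimal sketch of that metric may help. Conventions differ across the literature. The version below assumes directed graphs given as sets of `(parent, child)` tuples without self-loops, and counts each missing, extra, or reversed edge as one error; it is not necessarily the exact variant used in the paper.

```python
def structural_hamming_distance(true_edges, learned_edges):
    """Count node pairs whose edge status differs between two directed graphs."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # Work on unordered pairs so that a reversed edge is counted once, not twice.
    pairs = {frozenset(edge) for edge in true_edges | learned_edges}
    distance = 0
    for pair in pairs:
        u, v = tuple(pair)  # assumes no self-loops, so each pair has two nodes
        true_status = ((u, v) in true_edges, (v, u) in true_edges)
        learned_status = ((u, v) in learned_edges, (v, u) in learned_edges)
        if true_status != learned_status:
            distance += 1
    return distance

# Toy example: one reversed edge (A-B) and one extra edge (A-C).
true_graph = {("A", "B"), ("B", "C")}
learned_graph = {("B", "A"), ("B", "C"), ("A", "C")}
```

With this convention the toy graphs differ by 2, one for the order-reversed edge and one for the spurious edge.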

Abstract:
Two-view correspondence learning has increasingly focused on the coherence and smoothness of motion fields between image pairs. Conventional methods either regularize the complexity of the field function at substantial computational expense, or apply local filters that prove ineffective for large scene disparities. In this paper, we present DeMatch++, a novel network drawing inspiration from Fourier decomposition principles that decomposes the motion field to retain its primary “low-frequency” and smooth components. This approach achieves implicit regularization with lower computational overhead while exhibiting inherent piecewise smoothness. Specifically, our method decomposes the noise-contaminated motion field into multiple linearly independent basis vectors, generating smooth sub-fields that preserve the main energy of the original field. These sub-fields facilitate the recovery of a cleaner motion field for precise vector derivation. Within this framework, we aggregate local context within each sub-field while enhancing global information across all sub-fields. We also employ a masked decomposition strategy that mitigates the influence of false matches, and construct a compact representation to suppress redundant sub-fields. The complete pipeline is formulated as a discrete learnable architecture, circumventing the need for dense field computation. Extensive experiments demonstrate that DeMatch++ outperforms state-of-the-art methods while maintaining computational efficiency and piecewise smoothness.

Abstract:
In the current scenario, a vast amount of unlabeled high-dimensional data exhibits intrinsic relationships, making it suitable for information extraction through graph-based clustering methods. However, these datasets often lack edge structure information and contain numerous irrelevant features. To address these challenges, we propose a comprehensive solution that involves: (1) applying a feature weighting approach to manage features, (2) constructing edges based on weighted granular-balls, and (3) integrating graph convolutional networks (GCNs) with edge generation to develop an autoencoder network. Our method significantly enhances the extraction of relevant information from high-dimensional, unlabeled data, improving the overall performance and reliability of the clustering process. Extensive experimental results demonstrate that our model, AW-GBGAE, excels in clustering tasks and exhibits strong competitiveness compared to baseline models. The code is publicly available at https://github.com/xjnine/AWGBGAE.

Abstract:
Recently, neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable per-vertex modification color to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method in representation quality and editing ability.

Abstract:
In this paper, we first propose MoE-Adapters, a parameter-efficient training framework to alleviate long-term forgetting issues in incremental learning with Vision-Language Models (VLMs). MoE-Adapters leverages incrementally added routers to activate and integrate exclusive expert adapters from a pre-defined static expert set, enabling the pre-trained CLIP to efficiently adapt to new tasks. To preserve the zero-shot capability of VLMs, a Distribution Discriminative Auto-Selector (DDAS) is introduced that automatically routes in-distribution and out-of-distribution inputs to the MoE-Adapters and the original CLIP, respectively. However, relying on a static expert set and a separate distribution selector can lead to parameter redundancy and increased training complexity. In response, we further extend this work into the MoE-Adapters++ framework by introducing dynamic MoE-Adapters, which allow experts to be adaptively involved during the continual learning process. Additionally, a Latent Embedding Auto-Selector (LEAS) is proposed that incorporates distribution selection within CLIP to create a more unified architecture. Extensive experiments across diverse settings demonstrate that the proposed method consistently surpasses previous state-of-the-art approaches while concurrently improving training efficiency.

Abstract:
Face-swapping techniques have advanced rapidly with the evolution of deep learning, leading to widespread use and growing concerns about potential misuse, especially in cases of fraud. While many efforts have focused on detecting swapped face images or videos, these methods are insufficient for tracing the malicious users behind fraudulent activities. Intrusive watermark-based approaches also fail to trace unmarked identities, limiting their practical utility. To address these challenges, we introduce FaceTracer, the first non-intrusive framework specifically designed to trace the identity of the source person from swapped face images or videos. Specifically, FaceTracer leverages a disentanglement module that effectively suppresses identity information related to the target person while isolating the identity features of the source person. This allows us to extract robust identity information that can directly link the swapped face back to the original individual, aiding in uncovering the actors behind fraudulent activities. Extensive experiments demonstrate FaceTracer’s effectiveness across various face-swapping techniques, successfully identifying the source person in swapped content and enabling the tracing of malicious actors involved in fraudulent activities. Additionally, FaceTracer shows strong transferability to unseen face-swapping methods including commercial applications and robustness against transmission distortions and adaptive attacks.

Abstract:
Graph Neural Networks (GNNs) exhibit satisfactory performance on homophilic networks, where most edges connect two nodes with the same label. However, their effectiveness diminishes as the graphs become heterophilic (low homophily), prompting the exploration of various message-passing schemes. In particular, assigning negative weights to heterophilic edges (signed propagation) for message-passing has gained significant attention, and some studies theoretically confirm its effectiveness. Nevertheless, prior theorems assume binary classification scenarios, which may not hold well for graphs with multiple classes. To solve this limitation, we offer new theoretical insights into GNNs in multi-class environments and identify the drawbacks of employing signed propagation from two perspectives: message-passing and parameter update. We found that signed propagation without considering feature distribution can degrade the separability of dissimilar neighbors, which also increases prediction uncertainty (e.g., conflicting evidence) that can cause instability. To address these limitations, we introduce two novel calibration strategies aiming to improve discrimination power while reducing entropy in predictions. Through theoretical and extensive experimental analysis, we demonstrate that the proposed schemes enhance the performance of both signed and general message-passing neural networks (Choi et al. 2023).
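
As a rough illustration of the two notions this abstract relies on, the sketch below computes an edge homophily ratio (the fraction of edges joining same-label nodes) and performs one step of signed mean aggregation, where heterophilic edges receive weight -1. The function names, the toy graph, and the assumption that edge signs are already known are all hypothetical; the abstract does not describe how the authors estimate signs or calibrate predictions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def signed_propagate(features, edges, signs):
    """One step of signed mean aggregation on an undirected graph.

    Each edge (u, v) carries a sign s in {+1, -1}; node u receives
    s * features[v] and node v receives s * features[u].
    Signs are assumed given here (hypothetical oracle).
    """
    n, dim = len(features), len(features[0])
    out = [[0.0] * dim for _ in range(n)]
    deg = [0] * n
    for (u, v), s in zip(edges, signs):
        for a, b in ((u, v), (v, u)):
            deg[a] += 1
            for d in range(dim):
                out[a][d] += s * features[b][d]
    # Average over neighbors; isolated nodes keep a zero vector.
    return [[x / deg[i] if deg[i] else 0.0 for x in row]
            for i, row in enumerate(out)]


# Toy 4-cycle with two classes; half of the edges are heterophilic.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = [0, 0, 1, 1]
features = [[1.0], [1.0], [-1.0], [-1.0]]
signs = [1 if labels[u] == labels[v] else -1 for u, v in edges]

ratio = edge_homophily(edges, labels)
propagated = signed_propagate(features, edges, signs)
```

In this two-class toy case the negative weights flip the features of dissimilar neighbors, so every node keeps its class-consistent value. The abstract's point is that this clean behavior is not guaranteed once there are more than two classes.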

Affiliations: Department of Data Science, Frontier of Artificial Network (FAN) Lab, City University of Hong Kong, Hong Kong, China; School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin, China; School of Management Science and Engineering, Nanjing University of Information Science and Technology, Nanjing, China; College of Engineering, Peking University, Beijing, China; Institute for Advanced Study, Beijing Normal-Hong Kong Baptist University, Zhuhai, China; Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA

Abstract:
Inspired by neuronal diversity in the biological neural system, a plethora of studies have proposed designing novel types of artificial neurons and introducing neuronal diversity into artificial neural networks. The recently proposed quadratic neuron, which replaces the inner-product operation in conventional neurons with a quadratic one, has achieved great success in many essential tasks. Despite the promising results of quadratic neurons, there is still an unresolved issue: Is the superior performance of quadratic networks simply due to the increased parameters or due to the intrinsic expressive capability? Without clarifying this issue, the performance of quadratic networks is always suspicious. Additionally, resolving this issue is reduced to finding killer applications of quadratic networks. In this paper, with theoretical and empirical studies, we show that quadratic networks enjoy parametric efficiency, thereby confirming that the superior performance of quadratic networks is due to the intrinsic expressive capability. This intrinsic expressive ability comes from the fact that quadratic neurons can easily represent nonlinear interaction, while it is hard for conventional neurons to do so. Theoretically, we derive the approximation efficiency of quadratic networks over conventional ones in terms of real space and manifolds. Moreover, from the perspective of the Barron space, we demonstrate that there exists a functional space whose functions can be approximated by quadratic networks with a dimension-free error, whereas the approximation error of conventional networks depends on the dimension. Empirically, experimental results on synthetic data, classic benchmarks, and real-world applications show that quadratic models broadly enjoy parametric efficiency, and the gain of efficiency depends on the task.
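
To make the contrast between the two neuron types concrete, here is a minimal sketch. The abstract does not give the exact parameterization, so the form used below, a product of two affine maps plus a weighted sum of squared inputs, is an assumption borrowed from one common formulation in the quadratic-neuron literature. All names and parameter values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def relu(z):
    return max(0.0, z)


def conventional_neuron(x, w, b):
    """Standard neuron: activation of an inner product plus bias."""
    return relu(dot(w, x) + b)


def quadratic_neuron(x, w_r, b_r, w_g, b_g, w_b, c):
    """Assumed quadratic form:
    (w_r.x + b_r) * (w_g.x + b_g) + w_b.(x*x) + c, followed by ReLU.
    """
    squared = [xi * xi for xi in x]
    pre = (dot(w_r, x) + b_r) * (dot(w_g, x) + b_g) + dot(w_b, squared) + c
    return relu(pre)


# A single quadratic neuron can express the interaction x1 * x2 exactly.
x = [2.0, 3.0]
product = quadratic_neuron(x, w_r=[1.0, 0.0], b_r=0.0,
                           w_g=[0.0, 1.0], b_g=0.0,
                           w_b=[0.0, 0.0], c=0.0)
linear = conventional_neuron(x, w=[1.0, 1.0], b=0.0)
```

The multiplicative term is what lets one unit capture a pairwise interaction. A single conventional neuron can only produce a thresholded linear combination of its inputs.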

Abstract:
In real-world scenarios, distribution shifts give rise to the importance of two problems: out-of-distribution (OoD) generalization, which focuses on models’ generalization ability against covariate shifts (i.e., the changes of environments), and OoD detection, which aims to be aware of semantic shifts (i.e., test-time unseen classes). Real-world testing environments often involve a combination of both covariate and semantic shifts. While numerous methods have been proposed to address these critical issues, only a few works tackled them simultaneously. Moreover, prior works often improve one problem but sacrifice the other. To overcome these limitations, we delve into boosting OoD detection and OoD generalization from the perspective of information theory, which can be easily applied to existing models and different tasks. Building upon the theoretical bounds for mutual information and conditional entropy, we provide a unified approach, composed of Mutual Information Minimization (MI-Min) and Conditional Entropy Maximizing (CE-Max). Extensive experiments and comprehensive evaluations on multi-label image classification and object detection have demonstrated the superiority of our method. It successfully mitigates trade-offs between the two challenges compared to competitive baselines.

Abstract:
Continual Semantic Segmentation (CSS) extends static semantic segmentation by incrementally introducing new classes for training. To alleviate the catastrophic forgetting issue in this task, replay methods can be adopted, constructing a memory buffer that stores a small number of samples from previous classes for future replay. However, existing replay approaches in CSS often lack a thorough exploration of two critical issues: how to find the most suitable memory samples and how to utilize them for replay more effectively. Common strategies either randomly select samples or rely on hand-crafted, single-factor-driven methods that are unlikely to be optimal, and often employ conventional training techniques for replay that do not account for the class imbalance problem resulting from limited memory capacity. In this work, we tackle these challenges by introducing a novel memory sample selection method that leverages a reinforcement learning framework with innovative state representations and a dual-stage action scheme to automatically learn a selection policy. Additionally, we propose an expert mechanism and a dual-phase training method to address the class imbalance issue, thereby enhancing the effectiveness of replay training by making better use of memory samples. Incorporating the proposed automatic sample selection and effective memory utilization methods, we develop a novel and effective replay-based pipeline for CSS. Our extensive experiments on the Pascal VOC 2012 and ADE20K datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art (SOTA) performance and outperforms previous advanced methods significantly.

Abstract:
Open-vocabulary 3D Object Detection (OV-3DDet) addresses the detection of objects from an arbitrary list of novel categories in 3D scenes, which remains a very challenging problem. In this work, we propose CoDAv2, a unified framework designed to innovatively tackle both the localization and classification of novel 3D objects, under the condition of limited base categories. For localization, the proposed 3D Novel Object Discovery (3D-NOD) strategy utilizes 3D geometries and 2D open-vocabulary semantic priors to discover pseudo labels for novel objects during training. 3D-NOD is further extended with an Enrichment strategy that significantly enriches the novel object distribution in the training scenes, and then enhances the model’s ability to localize more novel objects. The 3D-NOD with Enrichment is termed 3D-NODE. For classification, the Discovery-driven Cross-modal Alignment (DCMA) module aligns features from 3D point clouds and 2D/textual modalities, employing both class-agnostic and class-specific alignments that are iteratively refined to handle the expanding vocabulary of objects. Besides, 2D box guidance boosts the classification accuracy against complex background noises, which is coined as Box-DCMA. Extensive evaluation demonstrates the superiority of CoDAv2. CoDAv2 outperforms the best-performing method by a large margin ($\text{AP}_{\text{Novel}}$ of 9.17 vs. 3.61 on SUN-RGBD and 9.12 vs. 3.74 on ScanNetv2).

Abstract:
Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs). To advance this field, we introduced a new REC dataset in our previous conference paper, characterized by two key features. First, it is designed with controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it incorporates negative text and images generated through fine-grained editing and augmentation, explicitly testing a model’s ability to reject scenarios where the target object is absent—an often-overlooked yet critical challenge in existing datasets. In this extended work, we propose two new methods to tackle the challenges of fine-grained REC by combining the strengths of Specialist Models and MLLMs. The first method adaptively assigns simple cases to faster, lightweight models and reserves complex ones for powerful MLLMs, balancing accuracy and efficiency. The second method lets a specialist generate a set of possible object regions, and the MLLM selects the most plausible one using its reasoning ability. These collaborative strategies lead to significant improvements on our dataset and other challenging benchmarks. Our results show that combining specialized and general-purpose models offers a practical path toward solving complex real-world vision-language tasks.

Abstract:
Thermal infrared detection is widely used in autonomous driving, medical AI, etc., but its security has only attracted attention recently. We propose infrared adversarial clothing designed to evade thermal person detectors in real-world scenarios. The design of the adversarial clothing is based on 3D modeling, which makes it easier than 2D modeling to simulate multi-angle scenes close to the real world. We optimized the patch layout pattern of the 3D clothing based on the adversarial example technique and made physical adversarial clothing using aerogel. The idea is to paste a set of square aerogel patches, which display as black squares in thermal images, on the inner side of the clothing at specific locations with specific orientations. To enhance realism, we propose a method to build infrared 3D models with real infrared photos and develop texture maps for 3D models to simulate varied infrared characteristics over time and location. In physical attacks, we achieved an attack success rate of 80.11% indoors and 76.85% outdoors against YOLOv9. In contrast, randomly placed patches yielded much lower success rates (26.53% indoors and 23.03% outdoors). The adversarial clothing also showed good transferability to unknown detectors with an ensemble attack method, demonstrating the effectiveness of our approach.

Abstract:
Deep neural networks (DNNs) have achieved satisfactory performance in multiple fields. However, recent studies have shown that DNNs can be easily fooled by adversarial examples. To mitigate the threats caused by adversarial attacks, a highly effective strategy is to design detectors to reject adversarial examples. This article proposes an unsupervised class- and classifier-free adversarial detection method. It only takes unlabeled clean data for training to discriminate illegal samples, and does not require any knowledge about the adversarial examples, sample classes, and the original classifier. More specifically, motivated by the idea that adversarial examples may differ significantly from benign data in terms of sample structural information, we develop an adversarial detector that can simultaneously capture the residual information and the variable-wise structural relationships of data. After that, we design an attribute called data identity (ID) that combines the extracted residual and structural information of data to identify adversarial examples. We validate the superiority of the proposed method through detecting adversarial attacks on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that the performance of our model is the best among various state-of-the-art adversarial detectors. Besides, we also conduct visualization experiments to illustrate the role of structural information in detecting adversarial examples.

Abstract:
Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector’s inputs determine the classifier’s inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector’s inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier’s inputs are treated as the confounder, and both the detector’s inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm as the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier’s inputs. Then, the Maximum Information Sampling (MIS) is suggested to enhance the reverse causality estimation further by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show the state-of-the-art mean recall rate.

Abstract:
Recent studies on pose-based gait recognition have underscored the potential of utilizing such fundamental data to achieve superior outcomes. Nonetheless, the development of current pose-based methods faces significant obstacles due to several critical issues: (1) Misaligned Settings, which results in a lack of thorough and unbiased comparative analysis. (2) Inferior Performance, which causes diminished focus on pose-based gait representations. (3) Limited Generalization, which hinders the effective application in real-world scenarios. Focused on tackling the aforementioned challenges, our study introduces a comprehensive benchmark and a versatile approach to bridge the past and future for pose-based gait recognition. First, we revisit previous pose-based methods and make great efforts to establish a unified framework, FastPoseGait, aiming at a fair and comprehensive comparison investigation with consistent experimental settings and a more stable training process. Then, within this framework, we propose GPGait++, a generalized pose-based gait recognition method featuring a human-oriented input and part-aware modeling, intended to enhance the generalization ability and discriminative power across diverse environments and camera viewpoints. Experiments on six public gait recognition datasets reveal that our unified framework significantly enhances the performance of previous approaches, and GPGait++ exhibits state-of-the-art cross-domain capabilities compared to existing pose-based methods, marking a significant advancement in the field of pose-based gait recognition.

Abstract:
Visual place recognition is a challenging task for autonomous driving and robotics, which is usually considered as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most deep learning-based methods in an end-to-end manner cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.

Abstract:
Prototypical-part methods, e.g., ProtoPNet, enhance interpretability in image recognition by linking predictions to training prototypes, thereby offering intuitive insights into their decision-making. Existing methods, which rely on a point-based learning of prototypes, typically face two critical issues: 1) the learned prototypes have limited representation power and are not suitable to detect Out-of-Distribution (OoD) inputs, reducing their decision trustworthiness; and 2) the necessary projection of the learned prototypes back into the space of training images causes a drastic degradation in the predictive performance. Furthermore, current prototype learning adopts an aggressive approach that considers only the most active object parts during training, while overlooking sub-salient object regions which still hold crucial classification information. In this paper, we present a new generative paradigm to learn prototype distributions, termed as Mixture of Gaussian-distributed Prototypes (MGProto). The distribution of prototypes from MGProto enables both interpretable image classification and trustworthy recognition of OoD inputs. The optimisation of MGProto naturally projects the learned prototype distributions back into the training image space, thereby addressing the performance degradation caused by prototype projection. Additionally, we develop a novel and effective prototype mining strategy that considers not only the most active but also sub-salient object parts. To promote model compactness, we further propose to prune MGProto by removing prototypes with low importance priors. Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art image recognition and OoD detection performances, while providing encouraging interpretability results.

Abstract:
The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be elegantly estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., rigorous faithfulness guarantee, lower computational complexity, higher statistical power, and much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional Kullback-Leibler divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely time series clustering and uncertainty-guided exploration for sequential decision making.
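
For orientation, the classic (unconditional) CS divergence that this abstract extends is D_CS(p; q) = -2 log ∫pq + log ∫p² + log ∫q². With Gaussian kernel density estimates, each integral reduces to an average of pairwise kernel values, and the normalization constants cancel. The sketch below implements only this classic estimator for 1D samples. It is not the conditional divergence proposed in the paper, and the function names and the kernel width `sigma` are illustrative assumptions.

```python
import math


def mean_gaussian_kernel(xs, ys, sigma):
    """Average of exp(-(x - y)^2 / (2 sigma^2)) over all sample pairs.

    Up to a constant factor that cancels in the divergence, this
    estimates the integral of the product of the two kernel density
    estimates.
    """
    total = 0.0
    for x in xs:
        for y in ys:
            total += math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))
    return total / (len(xs) * len(ys))


def cs_divergence(xs, ys, sigma=1.0):
    """Empirical classic Cauchy-Schwarz divergence between two 1D samples."""
    pq = mean_gaussian_kernel(xs, ys, sigma)
    pp = mean_gaussian_kernel(xs, xs, sigma)
    qq = mean_gaussian_kernel(ys, ys, sigma)
    return -2.0 * math.log(pq) + math.log(pp) + math.log(qq)


a = [0.0, 0.1, -0.2, 0.3]
b = [2.0, 2.1, 1.8, 2.3]
d_same = cs_divergence(a, a)
d_diff = cs_divergence(a, b)
```

Because the Gaussian kernel is positive definite, the estimate is non-negative by the Cauchy-Schwarz inequality and is zero when both samples are identical. Very distant samples can make the cross term underflow, so a practical implementation would guard the logarithm.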

Abstract:
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of Taylor formula expansion-based Transformer (for short MB-TaylorFormer V2) has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead.
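
The linear-complexity claim can be illustrated with the first-order Taylor term alone. Replacing exp(q·k) by 1 + q·k lets the sums over keys be computed once and reused for every query. The sketch below compares a naive quadratic-cost evaluation with the factorized linear-cost one. It is an assumption-laden toy: queries and keys are L2-normalized so that 1 + q·k stays non-negative, the remainder approximation and multi-branch design described in the abstract are omitted, and all function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def taylor_attention_naive(Q, K, V):
    """O(N^2) reference: weights 1 + q.k, normalized per query."""
    out = []
    for q in Q:
        w = [1.0 + sum(a * b for a, b in zip(q, k)) for k in K]
        z = sum(w)
        out.append([sum(wj * v[d] for wj, v in zip(w, V)) / z
                    for d in range(len(V[0]))])
    return out


def taylor_attention_linear(Q, K, V):
    """O(N) variant: precompute sum(V), sum(K) and K^T V once."""
    n, dk, dv = len(K), len(K[0]), len(V[0])
    v_sum = [sum(v[d] for v in V) for d in range(dv)]
    k_sum = [sum(k[c] for k in K) for c in range(dk)]
    kv = [[sum(K[j][c] * V[j][d] for j in range(n)) for d in range(dv)]
          for c in range(dk)]
    out = []
    for q in Q:
        z = n + sum(qc * kc for qc, kc in zip(q, k_sum))  # normalizer
        out.append([(v_sum[d] + sum(q[c] * kv[c][d] for c in range(dk))) / z
                    for d in range(dv)])
    return out


Q = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]
K = [normalize(v) for v in ([1.0, 2.0], [-1.0, 1.0], [3.0, 0.0])]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

naive = taylor_attention_naive(Q, K, V)
linear = taylor_attention_linear(Q, K, V)
```

Both functions compute the same quantity. The second never forms the N-by-N weight matrix, which is where the saving for high-resolution inputs comes from.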

Abstract:
Object pose estimation and shape reconstruction are inherently coupled tasks, although they have so far been studied separately in most existing approaches. A few recent works addressed the problem of joint pose estimation and shape reconstruction, but they have difficulty handling partial observations and shape ambiguities. An open challenge in this area is to design a mechanism that lets the two tasks benefit each other and boosts the performance and robustness of both. In this work, we advocate the use of diffusion models for joint estimation of category-level object poses and reconstruction of object geometry. Diffusion models formulate shape reconstruction as a generation process conditioned on input observations. This has two main advantages. First, the iterative inference of diffusion models provides a mechanism for iterative optimization for both pose estimation and shape reconstruction. Second, diffusion models allow multiple outputs starting from different input noises, which addresses the problem of ambiguity caused by partial observations. To achieve this, we propose an equivariant diffusion model for joint pose estimation and shape reconstruction. The approach consists of an equivariant feature extractor to aggregate features of the input point cloud and a ShapePose diffusion model to generate object pose and shape simultaneously. To avoid training the model on all possible shape poses in the SO(3) space, we propose to augment the diffusion model with A5-group neurons where the neurons are converted into 5D vectors and can be rotated with the alternating group A5. Based on the A5-group neurons, we implement SO(3)-equivariant 3D point convolution and SO(3)-equivariant concatenation, making the entire network SO(3)-equivariant. Moreover, to select the most plausible combination of pose and shape from the generated ones, we propose a geometry-based measure of plausibility for an estimated pose along with a reconstructed shape. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, our method achieves state-of-the-art results on two public datasets and a new dataset with stacked objects, in terms of shape reconstruction and pose estimation. In particular, we show that the proposed method can provide multiple plausible outputs under partial observations and shape ambiguities.

Abstract:
To address the challenges of long-tailed classification, researchers have proposed several approaches to reduce model bias, most of which assume that classes with few samples are weak classes. However, recent studies have shown that tail classes are not always hard to learn, and model bias has been observed on sample-balanced datasets, suggesting the existence of other factors that affect model bias. In this work, we first establish a geometric perspective for analyzing model fairness and then systematically propose a series of geometric measurements for perceptual manifolds in deep neural networks. Subsequently, we comprehensively explore the effect of the geometric characteristics of perceptual manifolds on classification difficulty and how learning shapes the geometric characteristics of perceptual manifolds. An unanticipated finding is that the correlation between the class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with the curvature gradually increases, implying that curvature imbalance leads to model bias. We thoroughly validate this finding across multiple networks and datasets, providing a solid experimental foundation for future research. We also investigate the convergence consistency between the loss function and curvature imbalance, demonstrating the lack of curvature constraints in existing optimization objectives. Building upon these observations, we propose curvature regularization to encourage the model to learn curvature-balanced and flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show the excellent performance and generality of our approach, especially in achieving significant performance improvements on top of current state-of-the-art techniques. Our work opens up a geometric analysis perspective on model bias and reminds researchers to pay attention to model bias on non-long-tailed and even sample-balanced datasets.

Abstract:
Multi-task dense prediction aims at handling multiple pixel-wise prediction tasks within a unified network simultaneously for visual scene understanding. However, cross-task feature interactions of current methods are still suffering from incomplete levels of representations, less discriminative semantics in feature participants, and inefficient pair-wise task interaction processes. To tackle these under-explored issues, we propose a novel BridgeNet framework, which extracts comprehensive and discriminative intermediate Bridge Features, and conducts interactions based on them. Specifically, a Task Pattern Propagation (TPP) module is first applied to ensure highly semantic task-specific feature participants are prepared for subsequent interactions, and a Bridge Feature Extractor (BFE) is specially designed to selectively integrate both high-level and low-level representations to generate the comprehensive bridge features. Then, instead of conducting heavy pair-wise cross-task interactions, a Task-Feature Refiner (TFR) is developed to efficiently take guidance from bridge features and form final task predictions. To the best of our knowledge, this is the first work considering the completeness and quality of feature participants in cross-task interactions. Extensive experiments are conducted on NYUD-v2, Cityscapes and PASCAL Context benchmarks, and the superior performance shows the proposed architecture is effective and powerful in promoting different dense prediction tasks simultaneously.

Abstract:
AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in recent years. However, existing FRL algorithms have a limitation: they are primarily designed for categorical sensitive attributes and thus cannot be applied to continuous sensitive attributes, such as age or income. In this paper, we propose an FRL algorithm for continuous sensitive attributes. First, we introduce a measure called the Expectation of Integral Probability Metrics (EIPM) to assess the fairness level of the representation space for continuous sensitive attributes. We demonstrate that if the distribution of the representation has a low EIPM value, then any prediction head constructed on top of the representation becomes fair, regardless of the selection of the prediction head. Furthermore, EIPM possesses a distinct advantage in that it can be accurately estimated using our proposed estimator with finite samples. Based on these properties, we propose a new FRL algorithm called Fair Representation using EIPM with MMD (FREM). Experimental evidence shows that FREM outperforms other baseline methods.

Abstract:
Due to the urgent need for robust deep neural networks (DNNs), numerous open-source tools and platforms have been developed to evaluate the robustness of DNN models by integrating the majority of adversarial attack and defense algorithms. Unfortunately, current platforms can optimize neither the DNN architectures nor the configuration of adversarial attacks to further enhance model robustness or attack performance. To alleviate these problems, in this paper, we propose a novel platform called auto-adversarial attack and defense ($A^3D$), which can help search for robust neural network architectures and efficient adversarial attacks. $A^3D$ integrates multiple neural architecture search methods to find robust architectures under different robustness evaluation metrics. Besides, we provide multiple optimization algorithms to search for efficient adversarial attacks. In addition, we combine auto-adversarial attack and defense to form a unified framework. In auto-adversarial defense, the searched efficient attack can be used as a new robustness evaluation to further enhance robustness. In auto-adversarial attack, the searched robust architectures can be utilized as the threat model to help find stronger adversarial attacks. Experiments on CIFAR10, CIFAR100, and ImageNet datasets demonstrate the feasibility and effectiveness of the proposed platform.

Abstract:
Recent source-free domain adaptation (SFDA) methods have focused on learning meaningful cluster structures in feature space, successfully adapting the knowledge from the source domain to the unlabeled target domain without accessing the private source data. However, existing methods rely on pseudo-labels generated by source models that can be noisy due to domain shift, presenting a significant challenge to their efficacy. In this paper, we study SFDA from the perspective of learning with label noise (LLN) and prove that the label noise in SFDA, unlike in conventional LLN scenarios, follows a different distribution assumption. This discrepancy renders some existing LLN methods less effective in SFDA. To address this issue and comprehensively improve adaptation performance, we tackle label noise in SFDA from two perspectives. First, we demonstrate that the early-time training phenomenon (ETP), previously observed in LLN settings, still exists in SFDA. Hence, we introduce a simple yet effective approach to leveraging ETP to improve current SFDA algorithms. Second, we propose a noise and variance control module, mitigating the label noise discrepancy between SFDA and LLN and enhancing the effectiveness of LLN methods in SFDA. Extensive empirical evaluation and analysis of four benchmarks show that our methods substantially outperform existing baselines.

Abstract:
Effective video frame interpolation hinges on the adept handling of motion in the input scene. Prior work acknowledges asynchronous event information for this, but often overlooks whether motion induces blur in the video, limiting its scope to sharp frame interpolation. We instead propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc and thus works both on sharp and blurry input videos. Our model consists of a bidirectional recurrent network that incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. To improve generalization from synthetic data to real event cameras in the wild, we integrate a self-supervised framework with the proposed model. At the dataset level, we introduce a novel real-world high-resolution dataset with events and color videos named HighREV, which provides a challenging evaluation setting for the examined task. Extensive experiments show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both. Experiments on domain transfer reveal that self-supervised training effectively mitigates the performance degradation observed when transitioning from synthetic data to real-world data. Code and datasets are available at https://github.com/AHupuJR/REFID.

Abstract:
In the context of incremental learning, a network is sequentially trained on a stream of tasks, where data from previous tasks are particularly assumed to be inaccessible. The major challenge is how to overcome the stability-plasticity dilemma, i.e., learning knowledge from new tasks without forgetting the knowledge of previous tasks. To this end, we propose two mathematical conditions for guaranteeing network stability and plasticity with theoretical analysis. The conditions demonstrate that we can restrict the parameter update in the null space of uncentered feature covariance at each linear layer to overcome the stability-plasticity dilemma, which can be realized by layerwise projecting gradient into the null space. Inspired by it, we develop two algorithms, dubbed Adam-NSCL and Adam-SFCL respectively, for incremental learning. Adam-NSCL and Adam-SFCL provide different ways to compute the projection matrix. The projection matrix in Adam-NSCL is constructed by singular vectors associated with the smallest singular values of the uncentered feature covariance matrix, while the projection matrix in Adam-SFCL is constructed by all singular vectors associated with adaptive scaling factors. Additionally, we explore adopting self-supervised techniques, including self-supervised label augmentation and a newly proposed contrastive loss, to improve the performance of incremental learning. These self-supervised techniques are orthogonal to Adam-NSCL and Adam-SFCL and can be incorporated with them seamlessly, leading to Adam-NSCL-SSL and Adam-SFCL-SSL respectively. The proposed algorithms are applied to task-incremental and class-incremental learning on various benchmark datasets with multiple backbones, and the results show that they outperform the compared incremental learning methods.
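
The null-space idea described above can be sketched for a single linear layer as follows. This is a minimal illustration under stated assumptions, not the authors' Adam-NSCL implementation: the helper name `null_space_projector`, the fixed count `num_null` of retained singular vectors, the plain gradient step (instead of Adam), and the synthetic features are all hypothetical.

```python
import numpy as np

def null_space_projector(features, num_null):
    """Build a projector onto the (approximate) null space of the
    uncentered feature covariance of a linear layer's inputs.

    features: (n_samples, d) layer inputs collected on previous tasks.
    num_null: how many singular vectors with the smallest singular
              values to keep (an assumed, hand-picked hyperparameter).
    """
    cov = features.T @ features / features.shape[0]  # uncentered covariance
    # Singular values come back in descending order, so the last columns
    # of U are associated with the smallest singular values.
    U, _, _ = np.linalg.svd(cov)
    U0 = U[:, -num_null:]
    return U0 @ U0.T

rng = np.random.default_rng(0)
# Old-task features lying in a 3-dimensional subspace of R^6.
basis = rng.standard_normal((3, 6))
old_feats = rng.standard_normal((200, 3)) @ basis

P = null_space_projector(old_feats, num_null=3)

W = rng.standard_normal((4, 6))     # layer weight (out_dim x in_dim)
grad = rng.standard_normal((4, 6))  # stand-in gradient from a new task
W_new = W - 0.1 * (grad @ P)        # update restricted to the null space

# Outputs on old-task features are (numerically) unchanged.
drift = np.abs(old_feats @ W_new.T - old_feats @ W.T).max()
```

Because the projected update is orthogonal to every old-task input, the layer's responses on previous tasks stay fixed (stability), while the remaining directions are still free to fit the new task (plasticity).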

Abstract:
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in generating cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data.

Abstract:
Deep learning-based low-light image enhancement (LLIE) is a task of leveraging deep neural networks to enhance the image illumination while keeping the image content unchanged. From the perspective of training data, existing methods complete the LLIE task driven by one of the following three data types: paired data, unpaired data and zero-reference data. Each type of these data-driven methods has its own advantages, e.g., zero-reference data-based methods have very low requirements on training data and can meet the human needs in many scenarios. In this paper, we leverage pure Gaussian noise to complete the LLIE task, which further reduces the requirements for training data in LLIE tasks and can be used as another alternative in practical use. Specifically, we propose Noise SElf-Regression (NoiSER), which, without access to any task-related data, simply learns a convolutional neural network equipped with an instance-normalization layer by taking a random noise image, with each pixel drawn from $\mathcal{N}(0,\sigma^2)$, as both input and output for each training pair; the low-light image is then fed to the trained network to predict the normal-light image. Technically, an intuitive explanation for its effectiveness is as follows: 1) the self-regression reconstructs the contrast between adjacent pixels of the input image, 2) the instance-normalization layer may naturally remediate the overall magnitude/lighting of the input image, and 3) the $\mathcal{N}(0,\sigma^2)$ assumption for each pixel enforces the output image to follow the well-known gray-world hypothesis (Buchsbaum, 1980) when the image size is big enough. Compared to current state-of-the-art LLIE methods with access to different task-related data, NoiSER is highly competitive in enhancement quality, yet with a much smaller model size, and much lower training and inference cost.
In addition, the experiments also demonstrate that NoiSER has great potential in overexposure suppression and joint processing with other restoration tasks.

Abstract:
Multi-modality imaging is widely used in clinical practice and biomedical research to gain a comprehensive understanding of an imaging subject. Currently, multi-modality imaging is accomplished by post hoc fusion of independently reconstructed images under the guidance of mutual information or spatially registered hardware, which limits the accuracy and utility of multi-modality imaging. Here, we investigate a data-driven multi-modality imaging (DMI) strategy for synergetic imaging of CT and MRI. We reveal two distinct types of features in multi-modality imaging, namely intra- and inter-modality features, and present a multi-sensor learning (MSL) framework to utilize the crossover inter-modality features for augmented multi-modality imaging. The MSL imaging approach breaks down the boundaries of traditional imaging modalities and allows for optimal hybridization of CT and MRI, which maximizes the use of sensory data. We showcase the effectiveness of our DMI strategy through synergetic CT-MRI brain imaging. The principle of DMI is quite general and holds enormous potential for various DMI applications across disciplines.

Abstract:
Longitudinal data with incomplete entries pose a significant challenge for clinical score regression over multiple time points. Although many methods primarily estimate longitudinal scores with complete baseline features (i.e., features collected at the initial time point), such snapshot features may overlook beneficial latent longitudinal traits for generalization. Alternatively, certain completion approaches (e.g., tensor decomposition technology) have been proposed to impute incomplete longitudinal data before score estimation, most of which, however, are transductive and cannot utilize label semantics. This work presents a tensor coupled learning (TCL) paradigm of incomplete longitudinal features and labels for clinical score regression. The TCL enjoys three advantages: 1) It drives semantic-aware factor matrices and collaboratively deals with incomplete longitudinal entries (of features and labels), during which a dynamic regularizer is designed for adaptive attribute selection. 2) It establishes a closed loop connecting baseline features and the coupled factor matrices, which enables inductive inference of longitudinal scores relying on only baseline features. 3) It reinforces the information encoding of baseline data by preserving the local manifold of longitudinal feature space and detecting the temporal alteration across multiple time points. Extensive experiments demonstrate the remarkable performance improvement of our method on clinical score regression with incomplete longitudinal data.

Abstract:
Recent advancements in deep learning-based compression techniques have demonstrated remarkable performance surpassing traditional methods. Nevertheless, deep neural networks have been observed to be vulnerable to backdoor attacks, where an added pre-defined trigger pattern can induce the malicious behavior of the models. In this paper, we propose a novel approach to launch a backdoor attack with multiple triggers against learned image compression models. Drawing inspiration from the widely used discrete cosine transform (DCT) in existing compression codecs and standards, we propose a frequency-based trigger injection model that adds triggers in the DCT domain. In particular, we design several attack objectives that are adapted for a series of diverse scenarios, including: 1) attacking compression quality in terms of bit-rate and reconstruction quality; 2) attacking task-driven measures, such as face recognition and semantic segmentation in downstream applications. To facilitate more efficient training, we develop a dynamic loss function that dynamically balances the impact of different loss terms with fewer hyper-parameters, which also results in more effective optimization of the attack objectives with improved performance. Furthermore, we consider several advanced scenarios. We evaluate the resistance of the proposed backdoor attack to the defensive pre-processing methods and then propose a two-stage training schedule along with the design of robust frequency selection to further improve resistance. To strengthen both the cross-model and cross-domain transferability on attacking downstream CV tasks, we propose to shift the classification boundary in the attack loss during training. 
Extensive experiments also demonstrate that by employing our trained trigger injection models and making slight modifications to the encoder parameters of the compression model, our proposed attack can successfully inject multiple backdoors accompanied by their corresponding triggers into a single image compression model.

Abstract:
In recent years, a large number of studies have shown that low rank matrix learning (LRML) has become a popular approach in machine learning and computer vision with many important applications, such as image inpainting, subspace clustering, and recommender systems. The latest LRML methods resort to using surrogate functions as convex or nonconvex relaxations of the rank function. However, most of these methods ignore the difference between different rank components and can only yield suboptimal solutions. To alleviate this problem, in this paper we propose a novel nonconvex regularizer called capped reweighting norm minimization (CRNM), which not only considers the different contributions of different rank components, but also adaptively truncates sequential singular values. With it, a general LRML model is obtained. Meanwhile, under some mild conditions, the global optimum of the CRNM-regularized least squares subproblem can be easily obtained in closed form. Through the analysis of the theoretical properties of CRNM, we develop a computationally efficient optimization method with a convergence guarantee to solve the general LRML model. More importantly, by using the Kurdyka-Łojasiewicz (KŁ) inequality, its local and global convergence properties are established. Finally, we show that the proposed nonconvex regularizer, as well as the optimization approach, are suitable for different low rank tasks, such as matrix completion and subspace clustering. Extensive experimental results demonstrate that the constructed models and methods provide significant advantages over several state-of-the-art low rank matrix learning models and methods.

Abstract:
Text-to-image generation (TTI) refers to the use of models that process text input and generate high-fidelity images based on text descriptions. Text-to-image generation using neural networks can be traced back to the emergence of the Generative Adversarial Network (GAN), followed by the autoregressive Transformer. Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noise over repeated steps. As a result of the impressive results of diffusion models on image synthesis, they have been cemented as the major image decoder used by text-to-image models and have brought text-to-image generation to the forefront of machine-learning (ML) research. In the era of large models, scaling up model size and integration with large language models have further improved the performance of TTI models, resulting in generated images that are nearly indistinguishable from real-world images and revolutionizing the way we retrieve images. Our explorative study has led us to think that there are further ways of scaling text-to-image models through the combination of innovative model architectures and prediction enhancement techniques. We have divided this survey into five main sections, in which we detail the frameworks of major literature in order to delve into the different types of text-to-image generation methods. Following this, we provide a detailed comparison and critique of these methods and offer possible pathways of improvement for future work. For future work, we argue that TTI development could yield impressive productivity improvements for creation, particularly in the context of the AIGC era, and could be extended to more complex tasks such as video generation and 3D generation.

Abstract:
Few-shot segmentation (FSS) aims to accurately segment target objects in a query image using only a limited number of annotated support images. Existing approaches typically follow a paradigm that directly leverages category information from the support set to identify target objects in the query. However, these methods often ignore the category information gap between query and support images, leading to suboptimal performance when faced with images containing objects exhibiting significant intra-class diversity. To address this issue, we propose a novel framework that introduces intermediate prototypes to capture both deterministic information from the support images and adaptive knowledge from the query at multiple scales. Our framework, named the K-shot Multi-scale Intermediate Prototype Mining Transformer (KMIPMT), is based on the Transformer architecture and learns intermediate prototypes in an iterative manner, where each KMIPMT layer propagates category information from both K-shot support features and multi-scale query features to intermediate prototypes. This information is then utilized to activate the query feature map. Through repeated iterations, both intermediate prototypes and the query feature are progressively enhanced, and the final refined query feature is used for generating precise segmentation predictions. Despite its simplicity, our method achieves remarkable performance gains on standard benchmarks, including PASCAL-$5^i$, COCO-$20^i$, and FSS-1000, setting new state-of-the-art results. Furthermore, we explore several practical and challenging extensions of our method, including 3D point cloud FSS, zero-shot segmentation, weak-label FSS, and cross-domain FSS. These extensions showcase the versatility and effectiveness of our proposed KMIPMT framework across different domains and scenarios.

Abstract:
Efficient and accurate reconstruction of a relightable, dynamic clothed human avatar from a monocular video is crucial for the entertainment industry. This article presents SGIA (Surfel-based Gaussian Inverse Avatar), which introduces efficient training and rendering for relightable dynamic human reconstruction. SGIA advances previous Gaussian Avatar methods by comprehensively modeling Physically-Based Rendering (PBR) properties for clothed human avatars, allowing for the manipulation of avatars into novel poses under diverse lighting conditions. Specifically, our approach integrates pre-integration and image-based lighting for fast light calculations that surpass the performance of existing implicit-based techniques. To address challenges related to material lighting disentanglement and accurate geometry reconstruction, we propose an innovative occlusion approximation strategy and a progressive training approach. Extensive experiments demonstrate that SGIA not only achieves highly accurate physical properties but also significantly enhances the realistic relighting of dynamic human avatars, providing a substantial speed advantage.

Abstract:
To defend against inference attacks and mitigate sensitive information leakage in Federated Learning (FL), client-level Differentially Private FL (DPFL) is the de-facto standard for privacy protection, which clips local updates and adds random noise. However, existing DPFL methods tend to produce a sharp loss landscape and have poor weight perturbation robustness, resulting in severe performance degradation. To alleviate these issues, we propose a novel DPFL algorithm named DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP. Specifically, DP-FedSAM integrates the Sharpness Aware Minimization (SAM) optimizer to generate locally flat models with improved stability and weight perturbation robustness, which results in a small norm of local updates and robustness to DP noise, thereby improving the performance. To further reduce the magnitude of random noise while achieving better performance, we propose DP-FedSAM-$top_k$ by adopting the local update sparsification technique. From the theoretical perspective, we present a convergence analysis to investigate how our algorithms mitigate the performance degradation induced by DP. Meanwhile, we give rigorous privacy guarantees with Rényi DP, a sensitivity analysis of local updates, and a generalization analysis. Finally, we empirically confirm that our algorithms achieve state-of-the-art (SOTA) performance compared with existing SOTA baselines in DPFL.
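
The clipping, noising, and top-k sparsification steps mentioned above can be sketched generically as follows. This is an illustrative sketch of the standard client-level DP recipe, not the DP-FedSAM algorithm itself: the SAM local optimizer is omitted, and the function names, the order of sparsification and clipping, and the choice to add Gaussian noise to the averaged update with standard deviation `noise_multiplier * clip_norm / m` are assumptions.

```python
import numpy as np

def topk_sparsify(update, k):
    """Keep the k entries with the largest magnitude, zero out the rest."""
    out = np.zeros_like(update)
    idx = np.argpartition(np.abs(update), -k)[-k:]
    out[idx] = update[idx]
    return out

def clip_update(update, clip_norm):
    """Scale the update so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def privatize(updates, clip_norm, noise_multiplier, rng, k=None):
    """Aggregate client updates with per-client clipping and Gaussian noise.

    If k is given, each update is top-k sparsified before clipping.
    """
    processed = []
    for u in updates:
        if k is not None:
            u = topk_sparsify(u, k)
        processed.append(clip_update(u, clip_norm))
    mean = np.mean(processed, axis=0)
    sigma = noise_multiplier * clip_norm / len(updates)  # assumed convention
    return mean + rng.normal(0.0, sigma, size=mean.shape)

rng = np.random.default_rng(0)
client_updates = [rng.standard_normal(10) for _ in range(5)]
noisy_mean = privatize(client_updates, clip_norm=1.0,
                       noise_multiplier=1.0, rng=rng, k=3)
```

Clipping bounds each client's contribution (the sensitivity), so a smaller update norm, which the abstract attributes to flatter local models, means less of the signal is lost to clipping and noise.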

Abstract:
Open World Object Detection (OWOD) is a novel computer vision task with a considerable challenge, bridging the gap between classic object detection (OD) and real-world object detection. In addition to detecting and classifying seen/known objects, OWOD algorithms are expected to localize all potential unseen/unknown objects and incrementally learn them. The large pre-trained vision-language grounding models (VLM, e.g., GLIP) have rich knowledge about the open world, but are limited by text prompts and cannot localize indescribable objects. However, there are many detection scenarios in which pre-defined language descriptions are unavailable during inference. In this paper, we attempt to specialize the VLM model for OWOD tasks by distilling its open-world knowledge into a language-agnostic detector. Surprisingly, we observe that the simple knowledge distillation approach leads to unexpected performance for unknown object detection, even with a small amount of data. Unfortunately, knowledge distillation for unknown objects severely affects the learning of detectors with conventional structures, leading to catastrophic damage to the model’s ability to learn about known objects. To alleviate these problems, we propose the down-weight training strategy for knowledge distillation from vision-language model to single visual modality one. Meanwhile, we propose the cascade decoupled decoders that decouple the learning of localization and recognition to reduce the impact of category interactions of known and unknown objects on the localization learning process. Ablation experiments demonstrate that both of them are effective in mitigating the impact of open-world knowledge distillation on the learning of known objects. 
Additionally, to alleviate the current lack of comprehensive benchmarks for evaluating the ability of the open-world detector to detect unknown objects in the open world, we refine the benchmark for evaluating the performance of unknown object detection by augmenting annotations for unknown objects, which we name “IntensiveSet♠”. Comprehensive experiments performed on OWOD, MS-COCO, and our proposed benchmarks demonstrate the effectiveness of our methods.

Abstract:
Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the core set in fMRI-Shape. Each subject views 3,142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Moreover, we propose MinD-3D++, a novel framework for decoding textured 3D visual information from fMRI signals. The framework evaluates the feasibility of not only reconstructing 3D objects from the human mind but also generating, for the first time, 3D textured meshes with detailed textures from fMRI data. We establish new benchmarks by designing metrics at the semantic, structural, and textured levels to evaluate model performance. Furthermore, we assess the model’s effectiveness in out-of-distribution settings and analyze the attribution of the proposed 3D-paired fMRI dataset in visual regions of interest (ROIs) in fMRI signals. Our experiments demonstrate that MinD-3D++ not only reconstructs 3D objects with high semantic and spatial accuracy but also provides deeper insights into how the human brain processes 3D visual information.

Abstract:
Post-training quantization (PTQ) for transformer-based large foundation models (LFMs) significantly accelerates model inference and relieves memory constraints, without incurring model training. However, existing methods face three main issues: 1) The scaling factors, which are commonly used in scale-reparameterization-based weight-activation quantization to mitigate quantization errors, are mostly hand-crafted, which may lead to suboptimal results; 2) The current formulation of quantization error, defined by the L2-norm, ignores the directional shifts after quantization; 3) Most methods are tailored to a single scenario, i.e., only evaluated on LLMs or only designed for weight-only quantization, and thus lack a comprehensive evaluation on diverse benchmarks and a broad application scope. To address these challenges, this paper introduces a unified Learnable and Robust post-training Quantization framework for transformer-based LFMs and various quantization scenarios, called LRQuant. First, we consider an efficient block-wise learnable paradigm to find optimal scaling factors, which are initialized by logarithmic activation equivalent, and to obtain a suitable clipping range of quantization steps. In addition, we empirically find that relying only on MSE loss can hardly lead to optimal quantization results, so we reformulate the quantization error and propose a novel loss function based on the negative logarithm of cosine similarity (NLC loss) between the outputs of the full-precision and quantized blocks. To fully investigate the potential of our learnable paradigm, we propose an improved version, LRQuant+. Specifically, we first propose a dynamically weighted scheme to balance the MSE and NLC losses, and then devise learnable rotation vectors to further directly reduce directional gaps. In addition, we extend the block-wise optimization framework into a novel two-branch design that jointly considers the error propagation and homologous reconstruction error.
Extensive experiments demonstrate the superiority of our LRQuant and LRQuant+, as well as their unified effectiveness across various LFMs for both weight-activation and weight-only quantization, especially under challenging quantization scenarios, i.e., W4A4 and W2A16 on LLMs, ViTs, and MLLMs.
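
To make the contrast between an MSE-style error and the negative-logarithm-of-cosine-similarity (NLC) loss concrete, here is a small sketch on plain Python lists. It follows only the verbal description in the abstract; the clamping constant `eps`, the function names, and the toy vectors are assumptions, and the actual LRQuant loss operates on block outputs of a transformer.

```python
import math

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def nlc_loss(fp_out, q_out, eps=1e-8):
    """Negative log of cosine similarity; clamped so the log is defined
    even when the similarity is zero or negative (an assumed safeguard)."""
    return -math.log(max(cosine_similarity(fp_out, q_out), eps))

def mse_loss(fp_out, q_out):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(fp_out, q_out)) / len(fp_out)

fp = [1.0, 2.0, 3.0]       # stand-in full-precision block output
q_scaled = [0.5, 1.0, 1.5]  # same direction, different magnitude
q_turned = [3.0, 2.0, 1.0]  # same norm, different direction

scaled_losses = (mse_loss(fp, q_scaled), nlc_loss(fp, q_scaled))
turned_losses = (mse_loss(fp, q_turned), nlc_loss(fp, q_turned))
```

The scaled output has a nonzero MSE but an NLC loss of essentially zero, while the turned output is penalized by NLC, which is the directional shift that a magnitude-based error alone does not isolate.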

Abstract:
Hypergraph-based modeling has gained significant attention for capturing complex higher-order interactions among vertices. While random walks serve as fundamental tools for analyzing hypergraphs, existing approaches either fail to fully leverage edge-dependent vertex weights (EDVWs) or lack sufficient expressiveness to model intricate hypergraph structures. To address these limitations, we propose a unified random walk framework that integrates hyperedge degrees and vertex weights, offering a more robust approach to hypergraph modeling. We establish equivalence conditions between hypergraph and graph random walks, leading to a novel unified random-walk-based hypergraph Laplacian that incorporates EDVWs, ensuring expressiveness and desirable spectral properties. Building on this foundation, we introduce the General Hypergraph Spectral Convolution (GHSC) framework, which extends existing Graph Convolutional Neural Networks (GCNNs) for effective hypergraph learning, supporting both edge-independent and edge-dependent vertex weights. Extensive experiments across diverse datasets, including citation networks, visual objects, and protein modeling tasks, demonstrate state-of-the-art performance, with notable improvements in protein structure modeling using EDVW-hypergraphs. This work advances the theoretical understanding of hypergraph random walks and spectral theory while providing a versatile framework for deep hypergraph learning. Code is available at GHSC_H_GNNs.
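
For readers unfamiliar with edge-dependent vertex weights, the sketch below builds the transition matrix of a commonly used EDVW hypergraph random walk from earlier literature: from vertex v, pick an incident hyperedge e with probability proportional to its weight w(e), then pick a vertex u in e with probability proportional to its edge-dependent weight gamma_e(u). This is not the unified random walk or Laplacian proposed in the paper, whose definition is not given in the abstract; the function name and the toy hypergraph are assumptions.

```python
import numpy as np

def edvw_transition_matrix(incidence, edge_weights, vertex_weights):
    """Transition matrix P = D_V^{-1} W D_E^{-1} R of an EDVW random walk.

    incidence:      (V, E) 0/1 matrix, 1 if vertex v belongs to hyperedge e.
    edge_weights:   (E,) hyperedge weights w(e).
    vertex_weights: (E, V) edge-dependent vertex weights gamma_e(v),
                    zero wherever v is not in e.
    """
    W = incidence * edge_weights[None, :]   # (V, E), w(e) where v is in e
    d_v = W.sum(axis=1)                     # vertex degrees
    delta_e = vertex_weights.sum(axis=1)    # hyperedge degrees
    step_to_edge = W / d_v[:, None]                        # vertex -> hyperedge
    step_to_vertex = vertex_weights / delta_e[:, None]     # hyperedge -> vertex
    return step_to_edge @ step_to_vertex

# Toy hypergraph: 4 vertices, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
incidence = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
edge_weights = np.array([1.0, 2.0])
vertex_weights = np.array([[1.0, 2.0, 1.0, 0.0],
                           [0.0, 0.0, 3.0, 1.0]])

P = edvw_transition_matrix(incidence, edge_weights, vertex_weights)
```

Each row of `P` sums to one, and setting every gamma_e(v) to the same value inside each hyperedge recovers the edge-independent case.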

Abstract:
Recent years have witnessed the rapid development of large language models (LLMs). Multi-modal LLMs (MLLMs) extend modality from text to various domains, attracting widespread attention due to their diverse application scenarios. As LLMs and MLLMs rely on vast amounts of model parameters and data to achieve emergent capabilities, the importance of data is gaining increasing recognition. Reviewing recent data-driven works for MLLMs, we find that the development of models and data is not two separate paths but rather interconnected. Vaster and higher-quality data improve MLLM performance, while MLLMs, in turn, facilitate the development of data. The co-development of multi-modal data and MLLMs requires a clear view of 1) at which development stages of MLLMs specific data-centric approaches can be employed to enhance certain MLLM capabilities, and 2) how MLLMs, using these capabilities, can contribute to multi-modal data in specific roles. To promote data-model co-development for MLLM communities, we systematically review existing works on MLLMs from the data-model co-development perspective.

Abstract:
Deep learning (DL) has advanced the field of dense prediction, while gradually dissolving the inherent barriers between different tasks. However, most existing works focus on designing architectures and constructing visual cues only for the specific task, which ignores the potential uniformity introduced by the DL paradigm. In this paper, we attempt to construct a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks. Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Based on the basic dependence on information complementarity, we propose consistency enhancement and difference awareness components with which ComPtr can evacuate and collect important visual semantic cues from different image sources for diverse tasks, respectively. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer. This task-generic design provides a smooth foundation for constructing the unified model that can simultaneously deal with various bi-source information. In extensive experiments across several representative vision tasks, i.e. remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed method consistently obtains favorable performance.

Abstract:
Scene graph generation (SGG) establishes a structured representation between multiple objects by exploring their relationships for visual perception and reasoning tasks. Existing SGG methods often fit the distribution of relationships by introducing language priors or statistical knowledge. However, the relationships should be the semantic reflection of the interaction between objects, rather than the statistical dependency between their categories. To solve this problem, we propose a novel Causal Features Enhancement Network (CFEN) to mine the essential semantic features between objects and relationships. Specifically, by decomposing the object features into class-generic and object-specific components, a causal graph framework is designed to analyze existing SGG methods. To measure the influence of object-specific features on relationship recognition, we construct a counterfactual training framework that computes the difference between factual and counterfactual logits. In addition, to strengthen the role of object-specific features and learn the interaction between objects, a distribution matching loss is proposed to compute the KL divergence between counterfactual outputs and standard difference distributions and to modulate the relation predictions. Finally, extensive experimental results on the VG150 and VrR-VG datasets demonstrate the effectiveness and superiority of the proposed CFEN compared with current state-of-the-art methods.

Abstract:
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and hence restricts the performance of their models. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the LLM, as the provided textual input is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt an LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. The two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and question, thus generating a more accurate answer. By incorporating the state-of-the-art LLM GPT-3 (Brown et al. 2020), Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. Prophet is general in that it can be instantiated with combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones). Moreover, Prophet can also be integrated with modern large multimodal models at different stages, a variant named Prophet++, to further improve performance on knowledge-based VQA tasks.

Abstract:
The bounded variance, gradient Lipschitz, and unbiased stochastic gradient are three key assumptions for ensuring the convergence and generalization of stochastic methods, especially in nonconvex scenarios. However, it is important to acknowledge that in practical applications, one or more of these assumptions might be violated, which is the main focus of this paper. In this study, we aim to demonstrate that by incorporating simple gradient normalization with momentum, SGD can effectively guarantee convergence and generalization, even in the presence of unbounded noise, weak gradient Lipschitz, and biased stochastic gradient caused by delays. These results significantly broaden the range of applications for stochastic algorithms, as they relax the previous assumptions and provide more flexibility in real-world scenarios.
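As an illustration of the kind of update this abstract refers to, the following is a minimal sketch of SGD with gradient normalization and momentum. The abstract does not give the exact update rule, so the form used here (an exponential moving average of gradients followed by a step of fixed length) is an assumption, as are the function names, the step size, and the Pareto-distributed test noise. Delayed (biased) gradients are not modeled.

```python
import math
import random


def normalized_sgd_momentum(grad_fn, x0, lr=0.05, beta=0.9, steps=500, eps=1e-12):
    """Run normalized SGD with momentum and return the final iterate.

    Assumed update (not taken from the paper):
        m <- beta * m + (1 - beta) * g
        x <- x - lr * m / ||m||
    """
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        # Each step has length at most lr, however large the noise is.
        x = [xi - lr * mi / (norm + eps) for xi, mi in zip(x, m)]
    return x


def heavy_tailed_grad(x, rng):
    """Gradient of 0.5 * ||x||^2 plus symmetric Pareto noise (infinite variance)."""
    return [xi + rng.choice((-1.0, 1.0)) * rng.paretovariate(1.5) for xi in x]


rng = random.Random(0)
x_final = normalized_sgd_momentum(lambda x: heavy_tailed_grad(x, rng), [5.0, -3.0])
print("final iterate under heavy-tailed noise:", x_final)
```

Because the step is normalized, a single extreme gradient sample cannot move the iterate by more than `lr`, which is the intuition behind tolerating unbounded noise.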

Abstract:
We propose a method to reconstruct a personalized hand avatar, representing the user’s hand shape and appearance, from a monocular RGB-D video of a hand performing unknown hand poses under unknown illumination. Our method, HandRT, jointly optimizes hand pose, shape, appearance, and lighting parameters using a physically-based shading model in a differentiable rendering framework incorporating Monte Carlo path tracing. HandRT extends our previous work, Intrinsic Hand Avatar, by relaxing the assumption of a known coarse hand pose and utilizing depth data in the optimization. Specifically, we introduce an articulated registration energy based on iterative closest point over the depth point cloud that enables reconstruction from unknown hand poses via tracking across frames. Thus, we can reconstruct the avatar from arbitrary poses with high accuracy without relying on an off-the-shelf 2D joint detector at each frame. Further, HandRT is capable of precisely tracking the reconstructed avatar from either RGB or RGB-D input. Our evaluation demonstrates that our method outperforms existing hand avatar reconstruction methods on all commonly used metrics while producing significantly more accurate meshes than state-of-the-art hand mesh recovery methods on both public datasets and our captured datasets.

Abstract:
Despite the impressive performance obtained by recent single-image hand modeling techniques, they lack the capability to capture sufficient details of the 3D hand mesh. This deficiency greatly limits their applications when high-fidelity hand modeling is required, e.g., personalized hand modeling. To address this problem, we design a frequency split network to generate 3D hand meshes using different frequency bands in a coarse-to-fine manner. To capture high-frequency personalized details, we transform the 3D mesh into the frequency domain and propose a novel frequency decomposition loss to supervise each frequency component. By leveraging such a coarse-to-fine scheme, hand details that correspond to the higher frequency domain can be preserved. In addition, the proposed network is scalable and can stop the inference at any resolution level to accommodate hardware with varying computational power. To feed the scalable frequency network with frequency split image features, we propose an image-graph ring feature mapping strategy. To train our network with per-vertex supervision, we use a bidirectional registration strategy to generate a topology-fixed ground truth. To quantitatively evaluate the performance of our method in terms of recovering personalized shape details, we introduce a new evaluation metric named Mean-frequency Signal-to-Noise Ratio (MSNR) to measure the mean signal-to-noise ratio of the mesh signal on each frequency component. Extensive experiments demonstrate that our approach generates fine-grained details for high-fidelity 3D hand reconstruction, and that our evaluation metric is more effective than traditional metrics for measuring mesh details.

Abstract:
The act of telling stories is a fundamental part of what it means to be human. This work introduces the concept of narrative information, which we define as the overlap in information space between a story and the items that compose the story. Using contrastive learning methods, we show how modern artificial neural networks can be leveraged to distill stories and extract a representation of the narrative information. We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative template curves and how these—in tandem with a novel curve-fitting algorithm we introduce—can reorder music albums to automatically induce stories in them. In doing so, we give statistically significant evidence that (1) these narrative information template curves are present in existing albums and that (2) people prefer an album ordered through one of these learned template curves over a random one. The premises of our work extend to any form of (largely) independent media, and as evidence, we also show that our method works with image data.

Affiliations: Southwest Jiaotong University and Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Chengdu, China; Harbin Institute of Technology (Shenzhen), Shenzhen, China; Beihang University, Beijing, China; Beijing Academy of Artificial Intelligence, Beijing, China; SenseTime Group Limited, Shanghai, China; College of Computer Science and Technology, and Qingdao Institute of Software, China University of Petroleum (East China), Qingdao, China; Case Western Reserve University, Cleveland, OH, USA

Abstract:
Although weakly-supervised techniques can reduce the labeling effort, it is unclear whether a saliency model trained with weakly-supervised data (e.g., point annotation) can match the performance of its fully-supervised version. This paper attempts to answer this unexplored question by proving a hypothesis: there exists a point-labeled dataset such that saliency models trained on it achieve performance equivalent to that of models trained on the densely annotated dataset. To prove this conjecture, we propose a novel yet effective adversarial spatio-temporal ensemble active learning approach. Our contributions are four-fold: 1) Our proposed adversarial attack triggering uncertainty can conquer the overconfidence of existing active learning methods and accurately locate uncertain pixels. 2) Our proposed spatio-temporal ensemble strategy not only achieves outstanding performance but also significantly reduces the model's computational cost. 3) Our proposed relationship-aware diversity sampling can conquer oversampling while boosting model performance. 4) We provide theoretical proof for the existence of such a point-labeled dataset. Experimental results show that our approach can find such a point-labeled dataset, where a saliency model trained on it obtains 98%–99% of the performance of its fully-supervised version with only ten annotated points per image.

Abstract:
Modeling count data using suitable statistical distributions has been instrumental for analyzing the patterns it conveys. However, failing to address critical aspects, like overdispersion, jeopardizes the effectiveness of such an analysis. In this paper, overdispersed count data is modeled using the Dirichlet Multinomial (DM) distribution by maximizing its likelihood using a fixed-point iteration algorithm. This is achieved by estimating the DM distribution parameters while comparing the recent Languasco-Migliardi (LM), and the Yu-Shaw (YS) procedures, which address the well-known computational difficulties of evaluating its log-likelihood. Experiments were conducted using multiple datasets from different domains spanning polls, images, and IoT network traffic. They all showed the superiority of the LM procedure as it succeeded at estimating the DM parameters at the designated level of accuracy in all experiments, while the YS procedure failed to produce sufficiently accurate results (or any results at all) in several experiments. Moreover, the LM procedure achieved a speedup that ranged from 2-fold to 20-fold over YS.
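To make the estimation step concrete, here is a minimal sketch of maximum-likelihood fitting of Dirichlet Multinomial parameters by fixed-point iteration. It uses a widely known fixed-point update (commonly attributed to Minka); whether the paper uses exactly this update is an assumption. The log-likelihood is evaluated naively with `math.lgamma`, so the sketch implements neither the LM nor the YS procedure, and the function names and example counts are hypothetical.

```python
import math


def digamma_diff(a, n):
    """psi(a + n) - psi(a) for integer n >= 0, written as an exact finite sum."""
    return sum(1.0 / (a + j) for j in range(n))


def dm_log_likelihood(counts, alpha):
    """DM log-likelihood up to the multinomial coefficient (naive lgamma form)."""
    a0 = sum(alpha)
    ll = 0.0
    for row in counts:
        ll += math.lgamma(a0) - math.lgamma(a0 + sum(row))
        ll += sum(math.lgamma(a + c) - math.lgamma(a) for a, c in zip(alpha, row))
    return ll


def dm_fixed_point(counts, alpha0=None, iters=200, tol=1e-10):
    """Iterate alpha_k <- alpha_k * num_k / denom until the change is below tol."""
    k = len(counts[0])
    alpha = list(alpha0) if alpha0 else [1.0] * k
    for _ in range(iters):
        a0 = sum(alpha)
        denom = sum(digamma_diff(a0, sum(row)) for row in counts)
        new = [
            a * sum(digamma_diff(a, row[j]) for row in counts) / denom
            for j, a in enumerate(alpha)
        ]
        if max(abs(x - y) for x, y in zip(new, alpha)) < tol:
            return new
        alpha = new
    return alpha


# Hypothetical overdispersed counts: each row is one observation over 3 categories.
counts = [[10, 0, 2], [1, 8, 3], [0, 1, 11], [7, 5, 0]]
alpha_hat = dm_fixed_point(counts)
print("estimated alpha:", alpha_hat)
print("log-likelihood:", dm_log_likelihood(counts, alpha_hat))
```

The naive `lgamma` differences lose precision when the parameters are large relative to the counts, which is the kind of computational difficulty the abstract says the LM and YS procedures address.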

Abstract:
A universal multiscale conditional coding framework, Unicorn, is proposed to code the geometry and attribute of any given point cloud. Attribute compression is discussed in Part II of this paper, while geometry compression is given in Part I of this paper. We first construct the multiscale sparse tensors of each voxelized point cloud attribute frame. Since attribute components exhibit very different intrinsic characteristics from the geometry element, e.g., 8-bit RGB color versus 1-bit occupancy, we process the attribute residual between lower-scale reconstruction and current-scale data. Similarly, we leverage spatially lower-scale priors in the current frame and (previously processed) temporal reference frame to improve the probability estimation of attribute intensity through conditional residual prediction in lossless mode or enhance the attribute reconstruction through progressive residual refinement in lossy mode for better performance. The proposed Unicorn is a versatile, learning-based solution capable of compressing a great variety of static and dynamic point clouds in both lossy and lossless modes. Following the same evaluation criteria, Unicorn significantly outperforms standard-compliant approaches like MPEG G-PCC, V-PCC, and other learning-based solutions, yielding state-of-the-art compression efficiency with affordable encoding/decoding runtime.

Abstract:
A universal multiscale conditional coding framework, Unicorn, is proposed to compress the geometry and attribute of any given point cloud. Geometry compression is addressed in Part I of this paper, while attribute compression is discussed in Part II. We construct the multiscale sparse tensors of each voxelized point cloud frame and properly leverage lower-scale priors in the current and (previously processed) temporal reference frames to improve the conditional probability approximation or content-aware predictive reconstruction of geometry occupancy in compression. Unicorn is a versatile, learning-based solution capable of compressing static and dynamic point clouds with diverse source characteristics in both lossy and lossless modes. Following the same evaluation criteria, Unicorn significantly outperforms standard-compliant approaches like MPEG G-PCC, V-PCC, and other learning-based solutions, yielding state-of-the-art compression efficiency while presenting affordable complexity for practical implementations.

Abstract:
Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL (SRL) frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches rely on coded primitives with little feature learning, and when applied to high-dimensional visual scenes, they can suffer from scalability issues and perform poorly when images have complex object interactions. To address these challenges, we propose Differentiable Symbolic Expression Search (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more scalable than those of state-of-the-art SRL methods, with a reduced amount of symbolic prior knowledge.

Abstract:
Building fair deep neural networks (DNNs) is a crucial step towards achieving trustworthy artificial intelligence. Delving into deeper factors that affect the fairness of DNNs is paramount and serves as the foundation for mitigating model biases. However, current methods are limited in accurately predicting DNN biases, relying solely on the number of training samples and lacking more precise measurement tools. Here, we establish a geometric perspective for analyzing the fairness of DNNs, comprehensively exploring how DNNs internally shape the intrinsic geometric characteristics of datasets—the intrinsic dimensions (IDs) of perceptual manifolds, and the impact of IDs on the fairness of DNNs. Based on multiple findings, we propose Intrinsic Dimension Regularization (IDR), which enhances the fairness and performance of models by promoting the learning of concise and ID-balanced class perceptual manifolds. In various image recognition benchmark tests, IDR significantly mitigates model bias while improving its performance.

Abstract:
Complex networks serve as abstract models for understanding real-world complex systems and provide frameworks for studying structured dynamical systems. This article addresses limitations in current studies on the exploration of individual birth-death and the development of community structures within dynamic systems. To bridge this gap, we propose a networked evolution model that includes the birth and death of individuals, incorporating reinforcement learning through games among individuals. Each individual has a lifespan following an arbitrary distribution, engages in games with network neighbors, selects actions using Q-learning in reinforcement learning, and moves within a two-dimensional space. The developed theories are validated through extensive experiments. Besides, we observe the evolution of cooperative behaviors and community structures in systems both with and without the birth-death process. The fitting of real-world populations and networks demonstrates the practicality of our model. Furthermore, comprehensive analyses of the model reveal that exploitation rates and payoff parameters determine the emergence of communities, learning rates affect the speed of community formation, discount factors influence stability, and two-dimensional space dimensions dictate community size. Our model offers a novel perspective on real-world community development and provides a valuable framework for studying population dynamics behaviors.
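The per-individual learning rule mentioned above can be sketched as standard tabular Q-learning with epsilon-greedy action selection. The abstract does not specify the state space or the payoff matrix, so the two-state, two-action table here (state = neighbor's last action, action = cooperate or defect) and all names are hypothetical.

```python
import random


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning update; alpha is the learning rate, gamma the discount."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


def select_action(q, state, explore_rate, rng):
    """Epsilon-greedy: explore with probability explore_rate, else exploit."""
    if rng.random() < explore_rate:
        return rng.randrange(len(q[state]))
    values = q[state]
    return values.index(max(values))


# Hypothetical setup: 2 states (neighbor cooperated / defected), 2 actions.
q_table = [[0.0, 0.0], [0.0, 0.0]]
rng = random.Random(0)
action = select_action(q_table, state=0, explore_rate=0.1, rng=rng)
q_update(q_table, state=0, action=action, reward=1.0, next_state=1)
print(q_table)
```

In this sketch the learning rate, discount factor, and exploration probability correspond to the parameters whose effects on community formation the abstract reports.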

Abstract:
Multimodal emotion recognition in conversation (MERC) has garnered substantial research attention recently. Existing MERC methods face several challenges: (1) they fail to fully harness direct inter-modal cues, possibly leading to less-than-thorough cross-modal modeling; (2) they concurrently extract information from the same and different modalities at each network layer, potentially triggering conflicts from the fusion of multi-source data; (3) they lack the agility required to detect dynamic sentimental changes, perhaps resulting in inaccurate classification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., GSF and SDP modules. GSF ingeniously leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, adequately capturing cross-modal cues while effectively circumventing fusion conflicts. SDP is an auxiliary task to explicitly delineate the sentiment dynamics between utterances, promoting the model’s ability to distinguish sentimental discrepancies. GraphSmile is effortlessly applied to multimodal sentiment analysis in conversation (MSAC), thus enabling simultaneous execution of MERC and MSAC tasks. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns, significantly outperforming baseline models.

Abstract:
Adversarial attacks in 3D environments have emerged as a critical threat to the reliability of visual perception systems, particularly in safety-sensitive applications such as identity verification and autonomous driving. These attacks employ adversarial patches and 3D objects to manipulate deep neural network (DNN) predictions by exploiting vulnerabilities within complex scenes. Existing defense mechanisms, such as adversarial training and purification, primarily employ passive strategies to enhance robustness. However, these approaches often rely on pre-defined assumptions about adversarial tactics, limiting their adaptability in dynamic 3D settings. To address these challenges, we introduce Reinforced Embodied Active Defense (Rein-EAD), a proactive defense framework that leverages adaptive exploration and interaction with the environment to improve perception robustness in 3D adversarial contexts. By implementing a multi-step objective that balances immediate prediction accuracy with predictive entropy minimization, Rein-EAD optimizes defense strategies over a multi-step horizon. Additionally, Rein-EAD involves an uncertainty-oriented reward-shaping mechanism that facilitates efficient policy updates, thereby reducing computational overhead and supporting real-world applicability without the need for differentiable environments. Comprehensive experiments validate the effectiveness of Rein-EAD, demonstrating a substantial reduction in attack success rates while preserving standard accuracy across diverse tasks. Notably, Rein-EAD exhibits robust generalization to unseen and adaptive attacks, making it suitable for real-world complex tasks, including 3D object classification, face recognition and autonomous driving. By integrating proactive policy learning with embodied scene interaction, Rein-EAD establishes a scalable and adaptable approach for securing DNN-based perception systems in dynamic and adversarial 3D environments.

Abstract:
Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency for image dense prediction tasks, showing its great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make correspondence discovery unreliable, raising the open problem of large-scale false positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric vIsual deNse sImilarity (GEMINI) learning, which embeds the homeomorphism prior into DCRL and enables reliable correspondence discovery for effective dense contrast. We propose deformable homeomorphism learning (DHL), which models the homeomorphism of medical images and learns to estimate a deformable mapping to predict pixel correspondences under the condition of topological preservation. It effectively reduces the search space of pairing and drives an implicit and soft learning of negative pairs via gradients. We also propose geometric semantic similarity (GSS), which extracts semantic information in features to measure the degree of alignment for correspondence learning. It promotes the learning efficiency and performance of deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our results on seven datasets outperform those of existing methods. We will release our code at a companion website.

Abstract:
Event cameras form a fundamental foundation for visual perception in scenes characterized by high speed and a wide dynamic range. Although deep learning techniques have achieved remarkable success in estimating event-based optical flow, existing methods have not adequately addressed the significance of temporal information in capturing spatiotemporal features. Due to the dynamics of spiking neurons in SNNs, which preserve important information while forgetting redundant information over time, they are expected to outperform analog neural networks (ANNs) with the same architecture and size in sequential regression tasks. In addition, SNNs on neuromorphic hardware achieve advantages of extremely low power consumption. However, present SNN architectures encounter issues related to limited generalization and robustness during training, particularly in noisy scenes. To tackle these problems, this study introduces an innovative spike-based self-supervised learning algorithm known as SeLHIB, which leverages the information bottleneck theory. By utilizing event-based camera inputs, SeLHIB enables robust estimation of optical flow in the presence of noise. To the best of our knowledge, this is the first proposal of a self-supervised information bottleneck learning strategy based on SNNs. Furthermore, we develop spike-based self-supervised algorithms with nonlinear and high-order information bottleneck learning that employs nonlinear and high-order mutual information to enhance the extraction of relevant information and eliminate redundancy. We demonstrate that SeLHIB significantly enhances the generalization ability and robustness of optical flow estimation in various noise conditions. 
In terms of energy efficiency, SeLHIB reduces energy consumption by 90.44% and 45.70% compared to its counterpart ANN and SNN models, respectively, while attaining 33.78% lower AEE (MVSEC), 5.96% lower RSAT (ECD), and 6.21% lower RSAT (HQF) than counterpart ANN implementations of the same size and architecture.

Abstract:
Simple multiple kernel k-means (SMKKM) introduces a new minimization-maximization learning paradigm for multi-view clustering and has achieved remarkable results in some applications. As one of its variants, localized SMKKM (LSMKKM) was recently proposed to capture the variation among samples by focusing on reliable pairs of samples, which should stay together, and cutting off unreliable, more distant pairs. Although LSMKKM is effective, we observe that it utilizes the variation of each sample indiscriminately, resulting in unsatisfactory clustering performance. To overcome this limitation, we propose a sample adaptive localized SMKKM (SAL-SMKKM) algorithm in which the weight of the local alignment for each sample can be adaptively adjusted, resulting in a more challenging tri-level minimization-minimization-maximization problem. To solve it, we reformulate it as a minimization problem of an optimal function characterized by minimization-maximization dynamics, prove its differentiability, and develop a reduced gradient descent method to optimize it. We then theoretically analyze the clustering performance of the proposed SAL-SMKKM by deriving its generalization error bound. In addition, we empirically evaluate the clustering performance of the proposed SAL-SMKKM on several benchmark datasets. Experimental results clearly indicate that the proposed algorithm consistently outperforms state-of-the-art ones. Finally, we apply the proposed SAL-SMKKM to the multi-modal parcellation of the human cerebral cortex, which is essential to understanding brain organization and function. SAL-SMKKM achieves accurate parcellation in an automatic and objective manner without any manual intervention, which once again demonstrates its validity and effectiveness in practical applications.

Abstract:
The Heterogeneous Information Network (HIN) stands out as a prominent tool for depicting interactions in real-world systems. Recently, representation learning on HINs has attracted significant attention, as the structured and compact output embeddings offer great convenience for network analysis and graph machine learning tasks. While existing HIN representation learning methods excel in supervised training or direct proximity reconstruction, yielding satisfactory performance in tasks like node clustering and classification, they often overlook the critical HIN generative process characterized by numerous events. As a result, these methods fail to preserve the higher-order interactions among the nodes and predict potential links in HINs. To address these limitations, we propose a Contrastive Learning method via Events on Heterogeneous Information Networks (CLEH). CLEH delineates the generative process from the local structure (nodes) to the higher-order structure (events) in HINs. We design a novel event-level contrastive learning procedure, endowing representations with the capability to capture higher-order relations among nodes. Moreover, CLEH leverages a normalizing flow model as the encoder to enhance the expressiveness of embeddings. Experimental results on HIN datasets demonstrate the significant superiority of CLEH in link prediction compared to popular baselines.

Abstract:
This paper introduces a new, model-independent metric, called RExQUAL, for quantifying and comparing the quality of explanations provided by attribution-based explainable artificial intelligence techniques. The underlying idea is based on feature attribution, using a subset of the ranking of the attributes highlighted by a model-agnostic explainable method in a forecasting task. Then, association rules are generated using these key attributes as input data. Novel metrics, including global support and confidence, are proposed to assess the joint quality of the generated rules. Finally, the quality of the explanations is calculated based on a comprehensive combination of the global metrics of the association rules. The proposed method integrates local explanations through attribution-based approaches for evaluation and feature selection with global explanations for the entire dataset. This paper rigorously evaluates the new metric by comparing three explainability techniques: the widely used SHAP and LIME, and the novel methodology RULEx. The experimental design includes predicting time series of different natures, both univariate and multivariate, through deep learning models. The results underscore the efficacy and versatility of the proposed methodology as a quantitative framework for evaluating and comparing explainable techniques.
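For readers unfamiliar with the rule metrics mentioned here, the sketch below computes the standard support and confidence of a single association rule. The "global" variants proposed in the paper are not defined in the abstract and are not reproduced; the function names, item labels, and transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent); 0.0 if undefined."""
    s_ante = support(transactions, antecedent)
    if s_ante == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / s_ante


# Hypothetical transactions built from discretized key attributes.
transactions = [
    {"lag1_high", "target_up"},
    {"lag1_high", "lag2_low"},
    {"lag1_high", "lag2_low", "target_up"},
    {"target_up"},
]
print(support(transactions, {"lag1_high", "target_up"}))
print(confidence(transactions, {"lag1_high"}, {"target_up"}))
```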

Abstract:
Typically, deep network-based full-reference image quality assessment (FR-IQA) models compare deep features from reference and distorted images pairwise, overlooking correlations among features from the same source. We propose a dual-branch framework to capture the joint degradation effect among deep network features. The first branch uses kernel representation similarity analysis (KRSA), which compares feature self-similarity matrices via the mean absolute error (MAE). The second branch conducts pairwise comparisons via the MAE, and a training-free logarithmic summation of both branches derives the final score. Our approach contributes in three ways. First, integrating the KRSA with pairwise comparisons enhances the model’s perceptual awareness. Second, our approach is adaptable to diverse network architectures. Third, our approach can guide perceptual image enhancement. Extensive experiments on 10 datasets validate our method’s efficacy, demonstrating that perceptual deformation widely exists in diverse IQA scenarios and that measuring the joint degradation effect can discern appealing content deformations.
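The two branches described above can be sketched as follows. This is only an illustration under stated assumptions: the kernel is taken to be linear (a plain Gram matrix over channels), and the "logarithmic summation" is read as a sum of `log(1 + distance)` terms; the abstract specifies neither, and all function names are hypothetical.

```python
import math


def gram_matrix(features):
    """Channel-by-channel self-similarity using an assumed linear kernel."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def mae(a, b):
    """Mean absolute error between two equally shaped 2D lists."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


def dual_branch_score(ref_features, dist_features):
    """Combine both branches; the log(1 + d) form is an assumption."""
    d_self_sim = mae(gram_matrix(ref_features), gram_matrix(dist_features))
    d_pairwise = mae(ref_features, dist_features)
    return math.log1p(d_self_sim) + math.log1p(d_pairwise)


# Each inner list is one flattened feature channel (hypothetical values).
ref = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
dist = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
print(dual_branch_score(ref, dist))
```

The self-similarity branch reacts to changes in how channels relate to each other, which a purely element-wise comparison does not measure directly.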

Abstract:
Addressing the pervasive challenge of imperfect data in autonomous vehicle (AV) systems, this study pioneers an integrated trajectory prediction model, WAKE, that fuses physics-informed methodologies with sophisticated machine learning techniques. Our model operates in two principal stages: the initial stage utilizes a Wavelet Reconstruction Network to accurately reconstruct missing observations, thereby preparing a robust dataset for further processing. This is followed by the Kinematic Bicycle Model which ensures that reconstructed trajectory predictions adhere strictly to physical laws governing vehicular motion. The integration of these physics-based insights with a subsequent machine learning stage, featuring a Quantum Mechanics-Inspired Interaction-aware Module, allows for sophisticated modeling of complex vehicle interactions. This fusion approach not only enhances the prediction accuracy but also enriches the model's ability to handle real-world variability and unpredictability. Extensive tests using specific versions of MoCAD, NGSIM, HighD, INTERACTION, and nuScenes datasets featuring missing observational data, have demonstrated the superior performance of our model in terms of both accuracy and physical feasibility, particularly in scenarios with significant data loss—up to 75% missing observations. Our findings underscore the potency of combining physics-informed models with advanced machine learning frameworks to advance autonomous driving technologies, aligning with the interdisciplinary nature of information fusion.
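As background for the physics stage, here is a minimal sketch of one Euler step of the textbook kinematic bicycle model with the rear axle as the reference point. The abstract does not say which variant or integration scheme WAKE uses, so this form, the `VehicleState` type, the wheelbase, and the time step are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float        # position along the x axis, in meters
    y: float        # position along the y axis, in meters
    heading: float  # yaw angle, in radians
    speed: float    # longitudinal speed, in m/s


def bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """Advance the state by dt using forward Euler integration."""
    x = state.x + state.speed * math.cos(state.heading) * dt
    y = state.y + state.speed * math.sin(state.heading) * dt
    heading = state.heading + state.speed * math.tan(steer) / wheelbase * dt
    speed = state.speed + accel * dt
    return VehicleState(x, y, heading, speed)


# Roll out one second of a gentle left turn at constant speed.
state = VehicleState(0.0, 0.0, 0.0, 10.0)
for _ in range(10):
    state = bicycle_step(state, accel=0.0, steer=0.05)
print(state)
```

Constraining predictions to states reachable under such a model is one way to keep reconstructed trajectories physically feasible.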

Abstract:
Large language models (LLMs) have garnered substantial attention due to their promising applications in diverse domains. Nevertheless, the increasing size of LLMs comes with a significant surge in the computational requirements for training and deployment. Memristor crossbars have emerged as a promising solution, having demonstrated a small footprint and remarkably high energy efficiency in computer vision (CV) models. Memristors possess higher density compared to conventional memory technologies, making them highly suitable for effectively managing the extreme model size associated with LLMs. However, deploying LLMs on memristor crossbars faces three major challenges. First, the size of LLMs increases rapidly, already surpassing the capabilities of state-of-the-art memristor chips. Second, LLMs often incorporate multi-head attention blocks, which involve non-weight stationary multiplications that traditional memristor crossbars cannot support. Third, while memristor crossbars excel at performing linear operations, they are not capable of executing complex nonlinear operations in LLMs such as softmax and layer normalization. To address these challenges, we present a novel architecture for the memristor crossbar that enables the deployment of state-of-the-art LLMs on a single chip or package, eliminating the energy and time inefficiencies associated with off-chip communication. Our testing on BERT-Large showed negligible accuracy loss. Compared to traditional memristor crossbars, our architecture achieves enhancements of up to 39× in area overhead and 18× in energy consumption. Compared to modern TPU/GPU systems, our architecture demonstrates at least a 68× reduction in the area-delay product and a significant 69% energy consumption reduction.

Affiliations: School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China; Department of Automation, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China; Department of Computer Science, Cornell University, Ithaca, NY, USA; College of Computer and Information Technology, China Three Gorges University, Yichang, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; School of Geosciences and Info-Physics, Central South University, Changsha, China; School of Automation, Southeast University, Nanjing, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Scene graph generation (SGG) in satellite imagery (SAI) helps advance the understanding of geospatial scenarios from perception to cognition. In SAI, objects exhibit great variations in scales and aspect ratios, and there exist rich relationships between objects (even between spatially disjoint objects), which makes it attractive to holistically conduct SGG in large-size very-high-resolution (VHR) SAI. However, such SGG datasets are lacking. Due to the complexity of large-size SAI, mining triplets <subject, relationship, object> heavily relies on long-range contextual reasoning. Consequently, SGG models designed for small-size natural imagery are not directly applicable to large-size SAI. This paper constructs a large-scale dataset for SGG in large-size VHR SAI with image sizes ranging from 512 × 768 to 27,860 × 31,096 pixels, named STAR (Scene graph generaTion in lArge-size satellite imageRy), encompassing over 210K objects and over 400K triplets. To realize SGG in large-size SAI, we propose a context-aware cascade cognition (CAC) framework to understand SAI regarding object detection (OBD), pair pruning, and relationship prediction for SGG. We also release an SAI-oriented SGG toolkit with about 30 OBD and 10 SGG methods, which need further adaptation by our devised modules on our challenging STAR dataset.

Abstract:
In this article, we propose a new method of representing and analyzing music audio recordings. The method is based on the concept of the trajectory of fifths, which was initially developed for the analysis of music represented in MIDI format. To adapt this concept to the needs of audio signal processing, we implement a short-term spectral analysis of a musical piece, followed by a mapping of its subsequent spectral timeframes onto signatures of fifths reflecting relative intensities of sounds associated with each of the 12 pitch classes. Subsequently, the calculation of the characteristic points of the consecutive signatures of fifths enables the creation of the trajectory of fifths. The results of the experiments and statistical analysis conducted on a set of 8996 audio music pieces belonging to 10 genres indicate that this kind of trajectory, just as its MIDI-compliant precursor, is a source of valuable information (i.e., feature coefficients) concerning the harmonic structure of music, which may find use in audio music classification processes.
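The mapping from per-frame pitch-class intensities to a trajectory can be sketched as follows. This is only an illustration under stated assumptions: the characteristic point is taken to be the sum of intensity-weighted unit vectors placed on the circle of fifths, the intensities are normalised per frame, and C is placed at angle zero. The abstract does not give these details, and the names `FIFTHS_ORDER`, `characteristic_point`, and `trajectory_of_fifths` are hypothetical.

```python
import math

# Pitch classes (0 = C, 1 = C#, ...) listed in circle-of-fifths order: C, G, D, A, ...
FIFTHS_ORDER = [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]

def characteristic_point(intensities):
    """Map 12 pitch-class intensities to one 2D point.

    Each pitch class contributes a vector on the circle of fifths whose length
    is its relative intensity; the point is the sum of those vectors (assumed).
    """
    total = sum(intensities)
    if total == 0:
        return (0.0, 0.0)  # silent frame: no harmonic content
    x = y = 0.0
    for position, pitch_class in enumerate(FIFTHS_ORDER):
        angle = 2.0 * math.pi * position / 12.0  # 30-degree steps, C at angle 0 (assumed)
        weight = intensities[pitch_class] / total
        x += weight * math.cos(angle)
        y += weight * math.sin(angle)
    return (x, y)

def trajectory_of_fifths(frames):
    """Sequence of characteristic points, one per spectral time frame."""
    return [characteristic_point(frame) for frame in frames]

only_c = [1.0] + [0.0] * 11   # all energy in pitch class C
uniform = [1.0] * 12          # energy spread evenly over all pitch classes
trajectory = trajectory_of_fifths([only_c, uniform])
```

In this sketch a frame dominated by one pitch class lands near the circle, while a harmonically diffuse frame lands near the origin, so the path traced over time reflects how the tonal centre moves.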

Abstract:
Contactless 3D fingerprint identification systems have emerged to provide more accurate and hygienic alternatives to contact-based conventional systems that acquire hundreds of millions of fingerprints every day. However, the intricate process of acquiring 3D fingerprints presents a significant challenge, acting as a key barrier to fully unlocking the potential of 3D fingerprint biometrics. This paper introduces a novel framework to directly recover the corresponding 3D minutiae template from a single contactless 2D fingerprint image. Billions of contact-based fingerprints have been acquired and are employed every day for e-governance and other applications. Seamless adoption of contactless 3D fingerprint technologies also requires advanced capabilities to accurately match 3D fingerprints with their respective 2D fingerprint templates, which are currently missing in the existing literature. We therefore introduce novel capabilities to accurately align minutiae templates in 3D space and enable compensation for the unknown perspective transformation. This capability significantly enhances the ability to accurately match 3D to 3D and 3D to 2D fingerprint templates. Furthermore, we introduce a new approach to synthesizing realistic contactless fingerprint images, resulting in the generation of a large synthetic database complete with corresponding 3D ground truths of minutiae points. Finally, we provide a detailed theoretical analysis of the uniqueness of recovered 3D minutiae templates, providing a theoretical justification for the superiority of such 3D minutiae templates over their 2D counterparts.

Abstract:
Understanding the effect of hyperparameters of the network structure on the performance of Convolutional Neural Networks (CNNs) remains one of the most fundamental and urgent issues in deep learning, and we attempt to address this issue based on the piecewise linear (PWL) function nature of CNNs in this paper. Firstly, the operations of convolutions, ReLUs, and max pooling in a CNN are represented as the multiplication of multiple matrices for a fixed sample in order to obtain an algebraic expression of CNNs; this expression clearly suggests that CNNs are PWL functions. Although such a representation has high time complexity, it provides a more convenient and intuitive way to study the mathematical properties of CNNs. Secondly, we develop a tight bound on the number of linear regions and upper bounds on the generalization error for CNNs, both taking into account factors such as the number of layers, the dimension of pooling, and the width of the network. These results provide possible guidance for designing and training CNNs.
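The idea that a network reduces to a product of matrices for a fixed sample can be shown on a toy case. The sketch below uses a bias-free two-layer fully connected ReLU network rather than a CNN, which is a simplifying assumption: convolution is itself a linear map, and max pooling can be written as a sample-dependent selection matrix in the same spirit. The helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Matrix-matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu_net(w1, w2, x):
    """Ordinary forward pass: W2 * relu(W1 * x)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)

def local_linear_map(w1, w2, x):
    """Single matrix W2 * D * W1 that the network equals around the sample x.

    D is the diagonal 0/1 matrix recording which ReLUs are active for x;
    it is applied here by zeroing the rows of W1 for inactive units.
    """
    pattern = [1.0 if h > 0 else 0.0 for h in matvec(w1, x)]
    masked_w1 = [[p * w for w in row] for p, row in zip(pattern, w1)]  # D * W1
    return matmul(w2, masked_w1)

w1 = [[1.0, -1.0], [-1.0, 2.0], [0.5, 0.5]]
w2 = [[1.0, 1.0, -2.0]]
x = [2.0, 1.0]

direct = relu_net(w1, w2, x)                         # nonlinear forward pass
linear = matvec(local_linear_map(w1, w2, x), x)      # same result via one matrix
```

Every input that shares the activation pattern of `x` is mapped by the same matrix, so each distinct pattern corresponds to one linear region of the piecewise linear function.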

Abstract:
Humans exhibit complex motions that vary depending on the activity they are performing, the interactions they engage in, as well as subject-specific preferences. Therefore, forecasting a human’s future pose based on the history of his or her previous motion is a challenging task. This paper presents an innovative auxiliary-memory-powered deep neural network framework to improve the modelling of historical knowledge. Specifically, we disentangle subject-specific, action-specific, and other auxiliary information from the observed pose sequences and utilise these factorised features to query the memory. A novel Multi-Head knowledge retrieval scheme leverages these factorised feature embeddings to perform multiple querying operations over the historical observations captured within the auxiliary memory. Moreover, we propose a dynamic masking strategy to make this feature disentanglement process adaptive. Two novel loss functions are introduced to encourage diversity within the auxiliary memory, while ensuring the stability of the memory content such that it can locate and store salient information that aids the long-term prediction of future motion, irrespective of any data imbalances or the diversity of the input data distribution. Extensive experiments conducted on two public benchmarks, Human3.6M and CMU-Mocap, demonstrate that these design choices collectively allow the proposed approach to outperform the current state-of-the-art methods by significant margins: >17% on the Human3.6M dataset and >9% on the CMU-Mocap dataset.

Abstract:
In this work, we have developed a variational Bayesian inference theory of elasticity, which is accomplished by using a mixed Variational Bayesian inference Finite Element Method (VBI-FEM) that can be used to solve the inverse deformation problems of continua. In the proposed variational Bayesian inference theory of continuum mechanics, the elastic strain energy is used as a prior in a Bayesian inference network, which can intelligently recover the detailed continuum deformation mappings given only the deformed and undeformed continuum body shapes, without knowing the interior deformation, the precise actual boundary conditions (neither traction nor displacement boundary conditions), or the actual material constitutive relation. Moreover, we have implemented the related finite element formulation in a computational probabilistic mechanics framework. To numerically solve the mixed variational problem, we developed an operator splitting or staggered algorithm that consists of the finite element (FE) step and the Bayesian learning (BL) step as an analogue of the well-known Expectation-Maximization (EM) algorithm. By solving the mixed probabilistic Galerkin variational problem, we demonstrated that the proposed method is able to inversely predict continuum deformation mappings with strong discontinuity or fracture without knowing the external load conditions. The proposed method provides a robust machine intelligent solution for the long-sought-after inverse problem solution, which has been a major challenge in structure failure forensic pattern analysis in the past several decades. The proposed method may become a promising artificial intelligence-based inverse method for solving general partial differential equations.