TCSVT2024

Abstract:
The news captioning task aims to generate sentences that describe the named entities or concrete events in an image accompanied by its news article. Existing methods have achieved remarkable results by relying on large-scale pre-trained models, which primarily focus on the correlations between the input news content and the output predictions. However, news captioning requires adhering to some fundamental rules of news reporting, such as accurately describing the individuals and actions associated with the event. In this paper, we propose a rule-driven news captioning method, which can generate image descriptions that follow a designated rule signal. Specifically, we first design a news-aware semantic rule for the descriptions. This rule incorporates the primary action depicted in the image (e.g., “performing”) and the roles played by named entities involved in the action (e.g., “Agent” and “Place”). Second, we inject this semantic rule into the large-scale pre-trained model, BART, with the prefix-tuning strategy, where multiple encoder layers are embedded with the news-aware semantic rule. Finally, we can effectively guide BART to generate news sentences that comply with the designated rule. Extensive experiments on two widely used datasets (i.e., GoodNews and NYTimes800k) demonstrate the effectiveness of our method.

Abstract:
Existing research on knowledge distillation has primarily concentrated on helping student networks acquire the complete knowledge imparted by teacher networks. However, recent studies have shown that well-performing networks are not necessarily suitable teachers, and that there is a positive correlation between distillation performance and teacher prediction uncertainty. Motivated by this finding, this paper analyzes in depth why the teacher network affects distillation performance, gives full play to the participation of the student network in the knowledge distillation process, and assists the teacher network in distilling knowledge that is suited to the student’s learning. On this premise, a novel approach called Collaborative Knowledge Distillation (CKD) is introduced, which is founded upon the concept of “Tailoring the Teaching to the Individual”. Compared with the baseline, the proposed method improves student accuracy by an average of 3.42% in CIFAR-100 experiments, and by an average of 1.71% compared with the classical Knowledge Distillation (KD) method. In the ImageNet experiments, the Top-1 accuracy of the students improves by 2.04%.
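For orientation, the classical KD baseline referred to above is commonly written as a temperature-softened KL term plus a cross-entropy term. The sketch below shows that classical objective for a single sample, not the CKD method itself; the function names and the default values of `temperature` and `alpha` are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Classical KD loss for one sample (assumed weighting scheme):
    alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # Cross-entropy of the unsoftened student prediction against the hard label
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains.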

Abstract:
Deep-learning-based methods have achieved promising performance in visual tracking tasks. However, the backbones of the existing trackers normally emanate from the object detection realm, making them inefficient and insufficient in terms of spatial template matching. Moreover, such trackers apply temporal information without considering its authenticity during the online inference step, rendering them prone to error accumulation. To address these two issues, this work proposes OTETrack, a novel visual tracker with overlapped feature extraction and robust trajectory enhancement. The backbone of OTETrack, termed Overlapped ViT, slices the input image into overlapped patches to attain stronger template matching capabilities and sends them to alternating attention modules to maintain high model efficiency. Moreover, the trajectory enhancement mechanism in OTETrack is used to predict the center of the ladder-shaped Hanning window, which mildly penalizes the displacements between the spatial tracking results and the temporal predicted results to maintain the tracking consistency of a video sequence, thus mitigating the influences of spurious temporal information. Extensive experiments conducted on five benchmarks with thirteen baselines demonstrate the state-of-the-art performance of OTETrack. The source code and Appendix are released on https://github.com/OrigamiSL/OTETrack.
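The idea of slicing an image into overlapped patches can be illustrated with a sliding window whose stride is smaller than the patch size. This is only a minimal sketch: the abstract does not state the patch size or stride used by Overlapped ViT, so `patch` and `stride` below are hypothetical parameters.

```python
def overlapped_patches(image, patch, stride):
    """Slice a 2-D image (list of rows) into square patches.
    With stride < patch, neighbouring patches share pixels."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            patches.append([row[left:left + patch]
                            for row in image[top:top + patch]])
    return patches
```

For an 8×8 image and 4×4 patches, a stride of 4 gives 4 disjoint patches, while a stride of 2 gives 9 patches in which each neighbouring pair shares half of its columns or rows.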

Abstract:
Hashing methods have made significant progress in cross-modal retrieval tasks thanks to their fast query speed and low storage cost. Among them, deep learning-based hashing achieves better performance on large-scale data due to its excellent extraction and representation ability for nonlinear heterogeneous features. However, two main challenges remain: catastrophic forgetting when data with new categories arrive continuously, and the time-consuming retraining that non-continuous hashing retrieval requires for updating. To this end, in this paper we propose a novel deep lifelong cross-modal hashing method that achieves lifelong hashing retrieval instead of repeatedly re-training the hash function when new data arrive. Specifically, we design a lifelong learning strategy that updates hash functions by directly training on the incremental data instead of retraining new hash functions using all the accumulated data, which significantly reduces training time. Then, we propose a lifelong hashing loss that enables the original hash codes to participate in lifelong learning while remaining invariant, and that further preserves the similarity and dissimilarity among original and incremental hash codes to maintain performance. Additionally, considering the distribution heterogeneity that arises when new data arrive continuously, we introduce an enhanced semantic similarity to supervise hash learning, and a detailed analysis shows that this similarity improves performance. Experimental results on benchmark datasets show that our proposed method achieves performance comparable to recent state-of-the-art cross-modal hashing methods, and that it yields substantial average improvements of over 20% in retrieval accuracy and reduces training time by over 80% when new data arrive continuously.

Abstract:
Label distribution learning (LDL) trains a model to predict the relevance of a set of labels (called the label distribution (LD)) to an instance. Previous LDL methods all assume that the LDs of the training instances are accurate. However, annotating highly accurate LDs for training instances is time-consuming and extremely expensive, and in reality the collected LDs are often inaccurate. This paper first investigates the inaccurate LDL (ILDL) problem, i.e., learning an LDL model from inaccurate LDs. We assume that the inaccurate LD blends the ground-truth LD and sparse noise. Consequently, the ILDL problem becomes an inverse problem, whose objective is to recover the ground-truth LD and the noise from the inaccurate LD. We hypothesize that the ground-truth LD exhibits low rank due to label correlations. Besides, we leverage the local geometric structure of instances (represented as a graph) to further recover the ground-truth LD. The recovery is thus formulated as a graph-regularized low-rank and sparse decomposition problem. We then induce an LDL predictive model by learning from the recovered LD. Extensive experiments conducted on multiple datasets demonstrate the better performance of our method, especially for the ILDL problem.

Abstract:
Video captioning is a multi-modal task across computer vision and natural language processing. Previous methods generally follow two paradigms, i.e., template-based and sequence-based. Template-based methods can generate relatively accurate elements (e.g., humans, objects, or actions) to complete a template caption, but with a rather limited vocabulary and syntactic structure; in contrast, sequence-based methods generate more natural, human-like descriptions but easily suffer from element errors due to their heavy dependence on visual features that often contain much distracting information. In this work, we draw lessons from the element extraction manner in template-based methods and propose a novel Element-aware video Captioning (EvCap) framework that applies linguistic features beyond general visual features to consolidate model awareness of specific elements under the sequence-based paradigm. In particular, we introduce two new linguistic features, i.e., action-relevant and object-relevant features, from the upstream encoder of the sequence-based paradigm to encode action and object information (in the forms of phrases and words, respectively) that benefits the generation of corresponding elements in the final description. Moreover, to fuse the heterogeneous representations and relieve the noise of inaccurate features, we design a post-operation fusion strategy with semantic interaction and energy weighting to ensure the effective usage of each feature. Experimental results show that our EvCap achieves promising performance compared with baselines under diverse upstream encoder architectures including CNNs, ViT and CLIP, demonstrating good scalability with respect to encoder choices.

Abstract:
Although interactive image segmentation techniques have made significant progress, supervised learning-based methods rely heavily on large-scale labeled data, which is difficult to obtain in certain domains such as medicine and biology. Models trained on natural images also struggle to achieve satisfactory results when directly applied to these domains. To solve this dilemma, we propose a Self-supervised Interactive Segmentation (SIS) method that achieves superior generalization performance. By clustering features from unlabeled data, we obtain classifiers that assign pseudo-labels to pixels in images. After refinement by super-pixel voting, these pseudo-labels are then used to train our segmentation network. To enable our network to better adapt to cross-domain images, we introduce correction learning and anti-forgetting regularization to conduct test-time adaptation. Experimental results on five datasets show that our approach significantly outperforms other interactive segmentation methods across natural image datasets under the same conditions and achieves even better performance than some supervised methods when transferred to the medical image domain. The code and models are available at https://github.com/leal0110/SIS.

Abstract:
Edge-preserving filtering is the basis of many computational photography and image processing applications. It can be achieved by global optimization methods or local filtering methods. Generally, the filtering results of global optimization methods are better than those of local filtering methods, while local filtering methods usually run much faster than global optimization methods. In this paper, a globally optimized method called the iterative self-guided image filter (isGIF) is developed by extending the assumptions of the guided image filter (GIF); it can produce high-quality edge-preserving filtering results by using the input image itself as the guidance image. Comparisons with other edge-aware filters are presented to show the advantages of our method. Extensive experiments demonstrate that our filter generates images with better visual quality while reducing or avoiding halo artifacts in the final image, and that its running time is competitive.
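To make the "input image as its own guidance" idea concrete, the sketch below applies the original local guided image filter to a 1-D signal that guides itself. It is a simplified illustration of the GIF baseline under that assumption, not the globally optimized isGIF; `radius` and `eps` are the usual window and regularization parameters, and the border handling is a choice made here.

```python
def box_mean(values, radius):
    """Mean over a window of +/- radius samples, clipped at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def self_guided_filter_1d(signal, radius, eps):
    """Guided filter on a 1-D signal that uses the signal as its own guide."""
    mean = box_mean(signal, radius)
    mean_sq = box_mean([v * v for v in signal], radius)
    # Local variance, clamped at zero to absorb floating-point round-off
    var = [max(0.0, m2 - m * m) for m2, m in zip(mean_sq, mean)]
    a = [v / (v + eps) for v in var]              # near 1 at edges, near 0 in flat areas
    b = [(1 - ai) * m for ai, m in zip(a, mean)]  # offset of the local linear model
    mean_a, mean_b = box_mean(a, radius), box_mean(b, radius)
    return [ma * s + mb for ma, s, mb in zip(mean_a, signal, mean_b)]
```

With a small `eps`, a step edge passes through almost unchanged, while regions whose local variance is well below `eps` are pulled towards their local mean.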

Abstract:
Online cross-modal hashing has received increasing research attention due to its capability of encoding streaming data and updating hash functions simultaneously. Despite significant progress, there is still room for improving accuracy in two respects: 1) enhancing the discrimination of hash codes with an efficient training process; 2) improving generalization performance by harmonizing the training and retrieval processes. Inspired by this, we propose an Online Discriminative Cross-modal Hashing method, called ODCH. To enlarge the inter-class margin and increase the intra-class similarity, ODCH constructs a discriminative semantic space and integrates bit balance and uncorrelation constraints, discrete optimization, and an asymmetric strategy for embedding the discriminative semantic information into Hamming space. Furthermore, ODCH attempts to boost generalization by bridging the gap between learning and generalization. It develops adaptive bit-wise weights to reflect different learning conditions among bits and transmits them into the generalization process. Besides, the proposed discriminative embedding and adaptive weighting can be adopted by existing supervised cross-modal hashing methods, yielding higher accuracy than the original versions. Extensive experiments on three benchmark datasets show that ODCH achieves average mAP gains of up to 4.17% compared to state-of-the-art online cross-modal hashing methods, indicating its superiority.
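One plausible way bit-wise weights could enter retrieval is a weighted Hamming distance, in which a mismatch on a more reliable bit costs more. The sketch below is an assumption made for illustration; the abstract does not specify how ODCH uses its weights at query time, and the function names are hypothetical.

```python
def weighted_hamming(code_a, code_b, weights):
    """Sum the weights of the bit positions where two hash codes differ."""
    return sum(w for a, b, w in zip(code_a, code_b, weights) if a != b)

def rank_by_weighted_hamming(query, database, weights):
    """Return database indices sorted from nearest to farthest."""
    return sorted(range(len(database)),
                  key=lambda i: weighted_hamming(query, database[i], weights))
```

With all weights equal to 1 this reduces to the ordinary Hamming ranking; non-uniform weights can change the order when candidates differ from the query on different bits.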

Abstract:
Large-scale cross-modal hashing has drawn extensive attention due to its attractive efficiency in both storage and retrieval. Existing methods exhibit poor performance when exploiting the semantic correlations implied in unsupervised and unpaired data during the training process. To deal with this issue, we propose a novel hashing method, named Semi-supervised Semi-paired Cross-modal Hashing (SSCH). By leveraging a general and flexible two-step scheme, the proposed method can handle the complex training data effectively and efficiently, capturing both the common semantics and the modality-specific optimal pseudo semantics. Specifically, the proposed SSCH performs an alignment-free pseudo-labeling process to obtain strengthened semantic information. Furthermore, hash representations for various data are learned via a label-enhanced strategy, through which the cross-modal correlations are strengthened and preserved while taking efficiency into account. The semantic-preserving proof of SSCH is given based on statistical analysis. We also prove the stability of the proposed time-saving algorithm using properties of the Bregman divergence. Experimental results on three benchmark datasets show that SSCH can obtain satisfactory precision and scalability in various scenarios.

Abstract:
Recent studies have shown that deep learning-based classifiers are vulnerable to malicious inputs, i.e., adversarial examples. A practical solution is to construct a perceptible but localized perturbation called a patch, which causes well-trained models to misclassify. However, most existing patch-based adversarial attacks focus on designing patches as localized rectangles, squares, or grids, ignoring the effect of non-local patches. In this paper, we propose a novel cross-shaped patch attack paradigm (CSPA), a simple yet efficient and effective adversarial attack in black-box scenarios. Specifically, the cross-shaped patch consists of two line segments that intersect perpendicularly at their midpoints. These two line segments are designed to be sufficiently thin and long to nearly reach the four corners of the input image. Thus, the patch has a globalized perturbation capacity while preserving its continuity. The content and location of the cross-shaped patch are then iteratively optimized by a carefully designed random search-based algorithm to maximize this global property. Comprehensive experiments are conducted on four benchmark datasets against various victim networks. The results show that the proposed CSPA outperforms the existing patch-based attacks in both attack success rate and query efficiency by a large margin. Specifically, compared with the baselines, CSPA increases the success rate by up to 20% on ImageNet and reaches 100% on the CIFAR-100 and CIFAR-10 datasets. Meanwhile, CSPA reduces the average number of queries by up to a factor of 7. The designed cross-shaped patch is also applicable to the white-box attack scenario, where it achieves state-of-the-art performance.
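The geometry of such a patch can be sketched as a binary mask made of the two diagonals of a square image, which are perpendicular and cross at the centre. This is an illustrative reading of the description only: the diagonal orientation, the `half_width` thickness parameter, and the fact that the arms reach the corners exactly are assumptions made here, and the random-search optimization of content and location is not shown.

```python
def cross_mask(size, half_width=0):
    """Binary mask of a diagonal cross on a size x size image.
    The two diagonals are perpendicular and meet at the image centre."""
    mask = []
    for r in range(size):
        row = []
        for c in range(size):
            on_main = abs(r - c) <= half_width               # top-left to bottom-right
            on_anti = abs(r + c - (size - 1)) <= half_width  # top-right to bottom-left
            row.append(1 if on_main or on_anti else 0)
        mask.append(row)
    return mask
```

For a 9×9 image with `half_width=0`, the mask covers 17 of the 81 pixels yet touches all four corners, which illustrates how a thin patch can still span the whole image.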

Abstract:
Human instance matting aims to estimate an alpha matte for each human instance in an image, which is extremely challenging and has rarely been studied so far. Despite some efforts to use instance segmentation to generate a trimap for each instance and apply trimap-based matting methods, the resulting alpha mattes are often inaccurate due to inaccurate segmentation. In addition, this approach is computationally inefficient due to multiple executions of the matting method. To address these problems, this paper proposes a novel End-to-End Human Instance Matting (E2E-HIM) framework for simultaneous multiple instance matting in a more efficient manner. Specifically, a general perception network first extracts image features and decodes instance contexts into latent codes. Then, a united guidance network exploits spatial attention and semantics embedding to generate united semantics guidance, which encodes the locations and semantic correspondences of all instances. Finally, an instance matting network decodes the image features and united semantics guidance to predict all instance-level alpha mattes. In addition, we construct a large-scale human instance matting dataset (HIM-100K) comprising over 100,000 human images with instance alpha matte labels. Experiments on HIM-100K demonstrate the proposed E2E-HIM outperforms the existing methods on human instance matting with 50% lower errors and 5× faster speed (6 instances in a 640×640 image). Experiments on the PPM-100, RWP-636, and P3M datasets demonstrate that E2E-HIM also achieves competitive performance on traditional human matting.

Abstract:
Privacy protection has become a top priority due to the widespread collection and misuse of personal data. Anonymization and visual identity information hiding are two crucial tasks in face privacy protection, both striving to alter identifying characteristics from face images to prevent privacy information leakage. However, the goals of the two are not entirely the same. Consequently, training a model to simultaneously perform both tasks proves challenging. In this paper, we propose Diff-Privacy, a novel face privacy protection method based on diffusion models that unifies the task of anonymization and visual identity information hiding. Specifically, we present a Multi-Scale image Inversion module (MSI) that, through training, generates a set of Stable Diffusion (SD) format conditional embeddings for the original image. With these conditional embeddings, we design corresponding embedding scheduling strategies and formulate distinct energy functions during the inference process to achieve anonymization and visual identity information hiding, respectively. Extensive experiments demonstrate the effectiveness of the proposed method in protecting face privacy.

Abstract:
Existing effective cover selection methods aim to select complex images as covers to achieve high security with the aid of the embedding distortion computed from a natural image. However, the calculation of the embedding distortion divulges the image content to the steganographer. To overcome this issue, this work proposes a novel cover selection scheme for encrypted images that achieves image content protection and cover selection simultaneously. In the first phase, the content owner encrypts several most significant bits (MSBs) of each image using an encryption key, and the encrypted image is shuffled by block. Meanwhile, with a sampling key, the content owner selects some encrypted blocks and outputs them to the steganographer. In the second phase, the steganographer calculates first-order noise residuals of adjacent pixels of the acquired blocks along different directions. Importantly, we design a texture descriptor named structured local binary pattern (SLBP) to encode all the residuals, and the images having the maximal SLBP values are chosen as the optimal covers. We demonstrate the security of our proposed scheme on multiple steganographic and steganalytic methods, and the extensive results show that our scheme exhibits excellent performance without knowledge of the original image content. Moreover, the results confirm that the designed SLBP evaluates image complexity accurately.

Abstract:
Latent multi-view subspace clustering has been demonstrated to have desirable clustering performance. However, the original latent representation method vertically concatenates the data matrices from multiple views into a single matrix along the direction of dimensionality to recover the latent representation matrix, which may result in incomplete information recovery. To fully recover the latent space representation, in this paper we propose an Enhanced Latent Multi-view Subspace Clustering (ELMSC) method. The ELMSC method involves constructing an augmented data matrix that enhances the representation of multi-view data. Specifically, we stack the data matrices from various views into the block-diagonal locations of the augmented matrix to exploit the complementary information. Meanwhile, the non-block-diagonal entries are composed based on the similarity between different views to capture the consistent information. In addition, we enforce a sparse regularization on the non-diagonal blocks of the augmented self-representation matrix to avoid redundant calculations of consistency information. Finally, a novel iterative algorithm based on the framework of the Alternating Direction Method of Multipliers (ADMM) is developed to solve the optimization problem for ELMSC. In particular, we theoretically analyze the convergence of ELMSC in detail. Extensive experiments on real-world datasets show that our proposed ELMSC achieves higher clustering performance than some state-of-the-art multi-view clustering methods. Moreover, our experiments show that our method remains effective with randomly chosen parameters, demonstrating ELMSC’s practical potential.

Abstract:
Tracking features in image sequences suffers from varying illumination and viewpoints. In recent years, learning-based features have achieved higher repeatability in challenging scenes and are considered to have the potential to solve this problem. However, features with high repeatability are not always easy to track: there is a gap between repeatability and trackability. To obtain features that are easily tracked under varying illumination and viewpoints, we adopt a data-driven approach that expands the definition of good features. Trackability is defined end-to-end as the tracking error. According to this definition, a complete feature tracking process is used to compute the tracking loss and train the network. A four-layer convolutional network is used to extract low-dimensional image information and obtain features. To validate the proposed method, we compare the tracking errors of several mainstream methods on a challenging test dataset, and the proposed method shows significant advantages. Then, fundamental matrix estimation and visual odometry experiments demonstrate that the features excel in practical tasks. Finally, the features are used in a visual inertial odometry system and achieve a 43% improvement in absolute trajectory error on the challenging dataset. All code will be open-sourced for the benefit of the community.

Abstract:
Over the last decades, deep learning (DL) has proven to be a very powerful and successful technique in many real-world applications, e.g., video surveillance or object detection. However, when class label distributions are highly skewed, DL classifiers tend to be biased towards majority classes during training. This leads to poor generalization on minority classes and consequently reduces the overall accuracy. How to effectively deal with this long-tailed class distribution in DL, i.e., deep long-tailed classification (DLC), remains a challenging problem despite many research efforts. Among various approaches, data augmentation, which aims at generating more samples to reduce label imbalance, is the most common and practical one. However, simply relying on existing class-agnostic augmentation strategies without properly considering the label differences would worsen the problem, since more head-class samples than tail-class ones are inevitably augmented. Moreover, none of the existing works consider the quality and suitability of augmented samples during the training process. Our proposed approach, called Long-tailed Classification via Self-Labeling (LCSL), is specifically designed to address these limitations. LCSL fundamentally differs from existing works in that it iteratively exploits the preceding network during the training process to re-label the labeled augmented samples and uses the output confidence to decide whether new samples belong to minority classes before adding them to the data. Not only does this help to reduce imbalance ratios among classes, but it also helps to reduce the uncertainty of class prediction for minority classes by adding more confident samples to the data. This incremental learning and generating scheme thus provides a new robust approach for decreasing model over-fitting, thereby enhancing the overall accuracy, especially for minority classes.
Extensive experiments have demonstrated that LCSL acquires better performance than state-of-the-art long-tailed learning techniques on various standard benchmark datasets. More specifically, our LCSL obtains 85.8%, 54.4%, and 56.2% in terms of accuracy on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT (with moderate to extreme imbalance ratios), respectively. The source code is available at https://github.com/vdquang1991/lcsl/.
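The confidence-based filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released LCSL code: `predict` stands in for the preceding network and is assumed to return class probabilities, and the `threshold` value is hypothetical.

```python
def select_augmented(samples, predict, minority_classes, threshold=0.9):
    """Re-label augmented samples with the current model and keep only
    those confidently assigned to a minority class."""
    kept = []
    for x in samples:
        probs = predict(x)  # class probabilities from the preceding network
        label = max(range(len(probs)), key=probs.__getitem__)
        if label in minority_classes and probs[label] >= threshold:
            kept.append((x, label))
    return kept
```

Samples predicted as head classes, or predicted as tail classes with low confidence, are discarded, so each round adds only confident minority-class samples to the training data.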

Abstract:
Video rescaling helps to fit different display devices. In video rescaling systems, videos are downsampled for easier storage, transmission and preview. The downsampled videos can be upsampled with a neural network to restore the details when needed. Previous group-based video rescaling algorithms benefit from the joint downsampling and joint upsampling of multiple frames, but are restricted by the fully joint operation. In this paper, we propose a recurrent diffusion-based framework for video rescaling. We employ biased joint operation and recurrent diffusion to make better use of the temporal relations among the frames in each image group. We explicitly control the direction of information propagation by arranging the processing order of all frames. In the biased joint operation, we concentrate on restoring one frame, i.e., the middle frame. The other frames in the group are coarsely reconstructed. Our recurrent diffusion compensates for the coarse frames by gradually propagating information from the middle frame to the borders, both backward and forward. The recurrent diffusion module works by fusing the information of adjacent frames. Biased joint operation and recurrent diffusion are jointly trained. We design several propagation variants and find that our recurrent diffusion is the best among them. We also show that recurrent diffusion is better than non-recurrent diffusion in terms of reconstruction quality and model size. In addition, we adopt a high-resolution fine-tuning strategy to further improve the quality of high-resolution frames. Experimental results demonstrate the effectiveness of the proposed method in terms of visual quality, quantitative evaluations, and computational efficiency. The code will be released at https://github.com/5ofwind/RDVR.

Abstract:
Image restoration aims to recover high-quality images from their degraded observations. Since most existing methods are dedicated to removing a single degradation, they may not yield optimal results on other types of degradation, which does not meet the needs of real-world applications. In this paper, we propose a novel data ingredient-oriented approach that leverages prompt-based learning to enable a single model to efficiently tackle multiple image degradation tasks. Specifically, we utilize an encoder to capture features and introduce prompts with degradation-specific information to guide the decoder in adaptively recovering images affected by various degradations. In order to model the local invariant properties and non-local information for high-quality image restoration, we combine CNN operations and Transformers. Simultaneously, we make several key designs in the Transformer blocks (multi-head rearranged attention with prompts and a simple-gate feed-forward network) to reduce computational requirements and selectively determine what information should be preserved to facilitate efficient recovery of potentially sharp images. Furthermore, we incorporate a feature fusion mechanism that further explores the multi-scale information to improve the aggregated features. The resulting tightly interlinked hierarchical architecture is named CAPTNet. Extensive experiments demonstrate that our method performs competitively with the state of the art. The code and the pre-trained models are released at https://github.com/Tombs98/CAPTNet.

Abstract:
Recently, the rapid advancement of generative models has led to their exploitation by malicious actors who employ them to fabricate fake images. These deceptive images are often disseminated on social network platforms, thereby undermining public trust. Although reliable forensic tools have emerged to detect generative fake images, the existing supervised detectors rely excessively on correctly labeled training samples, leading to overwhelming outsourced annotation costs and the potential risk of label flipping attacks. In light of these limitations, we propose an unsupervised detector for generative fake images. In particular, we first assign noisy labels to the training samples. Then, based on the pre-clustered samples with noisy labels, a pre-training and re-training strategy helps us train the feature extractor used to extract discriminative features. Last, the extracted features guide us in clustering the pristine and fake images separately; the fake images are effectively filtered out by employing cosine similarity. Extensive experimental results highlight that our unsupervised detector rivals the baseline supervised methods; moreover, it is better able to defend against label flipping attacks.

Abstract:
To transfer the representation capacity of large pre-trained models to lightweight models, knowledge distillation has been widely explored. However, conventional single-stage distillation methods are prone to getting stuck in the transfer of task-specific knowledge, making it difficult to retain task-agnostic knowledge, which is crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to boost lightweight models with the assistance of large models pre-trained by masked image modeling. In generic distillation, the decoder of a small model is encouraged to align its feature predictions with those of a large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are encouraged to be consistent with those of the large model, to guarantee task performance. G2SD is also applicable to heterogeneous settings (i.e., distilling from ViT to CNN). With G2SD, the ViT-Small model reaches 98.9%, 98.4%, 99.3% and 98.9% of the accuracy of its teacher (ViT-Base) on image classification, object detection, semantic segmentation and video recognition, respectively. The lightweight ResNet models are improved to a new height on the image classification task. The code is available at github.com/pengzhiliang/G2SD.

Abstract:
Video streaming and its applications are growing rapidly, making video optimization a primary target for content providers looking to enhance their services. Enhancing the quality of videos requires the adjustment of different encoding parameters such as bitrate, resolution, and frame rate. To avoid brute force approaches for predicting optimal encoding parameters, video complexity features are typically extracted and utilized. To predict optimal encoding parameters effectively, content providers traditionally use unsupervised feature extraction methods, such as ITU-T’s Spatial Information (SI) and Temporal Information (TI) to represent the spatial and temporal complexity of video sequences. Recently, Video Complexity Analyzer (VCA) was introduced to extract DCT-based features to represent the complexity of a video sequence (or parts thereof). These unsupervised features, however, cannot accurately predict video encoding parameters. To address this issue, this paper introduces a novel supervised feature extraction method named DeepVCA, which extracts the spatial and temporal complexity of video sequences using deep neural networks. In this approach, the encoding bits required to encode each frame in intra-mode and inter-mode are used as labels for spatial and temporal complexity, respectively. Initially, we benchmark various deep neural network structures to predict spatial complexity. We then leverage the similarity of features used to predict the spatial complexity of the current frame and its previous frame to rapidly predict temporal complexity. This approach is particularly useful as the temporal complexity may depend not only on the differences between two consecutive frames but also on their spatial complexity. Our proposed approach demonstrates significant improvement over unsupervised methods, especially for temporal complexity. 
As an example application, we verify the effectiveness of these features in predicting the encoding bitrate and encoding time of video sequences, which are crucial tasks in video streaming. The source code and dataset are available at https://github.com/cd-athena/DeepVCA.
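
For context on the unsupervised baseline mentioned above, the following sketch computes SI and TI for a sequence of luma frames. It assumes the usual formulation, in which SI is the maximum over time of the spatial standard deviation of the Sobel-filtered frame and TI is the maximum over time of the spatial standard deviation of the frame difference. The function names and the `(T, H, W)` input layout are illustrative assumptions.

```python
import numpy as np

def sobel_magnitude(frame):
    """Sobel gradient magnitude on the interior (valid) pixels of a 2-D luma frame."""
    f = frame.astype(np.float64)
    # Horizontal gradient, kernel [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[1:-1, :-2] - f[2:, :-2])
    # Vertical gradient, kernel [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]
          - f[:-2, :-2] - 2 * f[:-2, 1:-1] - f[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

def si_ti(frames):
    """Return (SI, TI) for luma frames with shape (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    si = max(sobel_magnitude(f).std() for f in frames)
    diffs = frames[1:] - frames[:-1]  # differences between consecutive frames
    ti = max(d.std() for d in diffs) if len(diffs) else 0.0
    return float(si), float(ti)
```

A static sequence yields TI = 0 whatever its spatial detail, which illustrates why such hand-crafted features capture only part of what an encoder spends bits on.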

Abstract:
Deep learning has greatly advanced the performance of semantic segmentation; however, its success relies on the availability of large amounts of annotated data for training. Hence, many efforts have been devoted to domain adaptive semantic segmentation, which focuses on transferring semantic knowledge from a labeled source domain to an unlabeled target domain. Existing self-training methods typically require multiple rounds of training, while another popular framework based on adversarial training is known to be sensitive to hyper-parameters. We propose an easy-to-train framework that learns domain-invariant prototypes for domain adaptive semantic segmentation. In particular, we show that domain adaptation shares a common character with few-shot learning in that both aim to recognize some types of unseen data with knowledge learned from large amounts of seen data. Thus, we propose a unified framework for domain adaptation and few-shot learning. The core idea is to use the class prototypes extracted from few-shot annotated target images to classify pixels of both source images and target images. Our method involves only one-stage training and does not need to be trained on large-scale un-annotated target images. Moreover, our method can be extended to variants of both domain adaptation and few-shot learning. Competitive performance on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes adaptation tasks shows the effectiveness of the proposed novel yet simple domain adaptation framework. The source code used in this paper is available at https://github.com/zgyang-hnu/DIP-hunnu.

Abstract:
Few-shot class-incremental learning (FSCIL) is a prominent research topic in the ML community. It faces two significant challenges: forgetting old class knowledge and overfitting to limited new class training examples. In this paper, we present a novel FSCIL approach inspired by the human brain’s analogical learning mechanism, which enables human beings to form knowledge about a target domain from the knowledge of the source domains that are analogical to the target in some aspects. The proposed analogical learning-based FSCIL (ALFSCIL) method consists of two major components: new class classifier constructor (NCCC) and Meta-Analogical training (MAT). The NCCC module utilizes a multi-head cross-attention transformer to compute analogies between new and old classes, generating new class classifiers by blending old class classifiers based on the computed analogies. The MAT module updates the parameters of the CNN feature extractor, the NCCC module, and the knowledge for each encountered class after each round of the FSCIL session. We turn the optimization process into a bi-level optimization problem (BOP) whose theoretical analysis proves the stability and plasticity of our proposed model. Experimental evaluations reveal that the proposed ALFSCIL method achieves state-of-the-art accuracy on three benchmark datasets: CIFAR100, miniImageNet, and CUB200.

Abstract:
Research on gait recognition has rarely paid attention to class-imbalanced learning, although realistic data always exhibits an imbalanced distribution. The main reason lies in the difficulty of collecting cross-clothes sequences, since the collection is usually aided by person re-identification and is more likely to obtain sequences of a subject wearing the same clothes. In this work, we formulate a new problem to tackle the task-specific cloth-imbalanced issue, dubbed Cloth-Imbalanced Gait Recognition, in which the training data consists of two parts denoted as the head set and the tail set. The sequences for a subject in the head set cover the cross-clothes variation, which is scarce in the tail set to mimic the collection difficulty. Along with the problem formulation, we design a new method to deal with the inherent challenges, called Cross-Clothes Hallucination, or CCH for short. Our method is inspired by the observation that certain directions in deep feature space correspond to meaningful semantic transformations, and it tries to generate cross-clothes sequences for the tail set by referring to the cloth-changing transformation in the head set. To evaluate CCH, we build two cloth-imbalanced benchmarks based on the widely-used CASIA-B and Outdoor-Gait. Extensive experiments demonstrate that CCH brings significant improvements over the baselines.

Abstract:
Person re-identification (Re-ID) aims to match images of the same person from different camera views, which demands a view-invariant feature embedding. Recently, intra-camera supervised (ICS) Re-ID has been used to develop Re-ID models without cross-view annotated data. Existing ICS methods rely on assumptions, such as that each person in the training set appears under multiple cameras. However, there is no guarantee that the assumptions are true without cross-view annotations, and their performance degrades when the assumptions are violated. In this work, we generalize ICS Re-ID and develop an ICS Re-ID model without the assumptions. The absence of prior assumptions and cross-view annotations poses a challenge in exploiting the discriminative information among cross-view images. To this end, we propose to mine the view-invariant relations between cross-view images so that the Re-ID model can exploit discriminative information and overcome the cross-view variations. Specifically, we learn composited view-aware features by compositing the identity information with different camera view information in the feature composition module. Then, we exploit the composited features to model various view-aware relations between pairwise images. By mining the common patterns among the view-aware relations, we obtain the view-invariant pairwise relation for learning. Besides, leveraging the composited view-aware features, we develop a view-aware marginal constraint for robust cross-view learning. To facilitate learning the feature composition module, we add an auxiliary network to exploit the camera view information at the feature level. Extensive experimental results show the effectiveness of our method under different scenarios.

Abstract:
Recent research in cross-domain image retrieval has focused on addressing two challenging issues: handling domain variations in the data and dealing with the lack of sufficient training labels. However, these problems have often been studied separately, limiting the practicality and significance of the research outcomes. The existing cross-domain setting is also restricted to cases where domain labels are known during training, and all samples have semantic category information or instance correspondences. In this paper, we propose a novel approach to address a more general and practical problem: fully unsupervised domain-agnostic image retrieval under the domain-unknown setting, where no annotations are provided. Our approach tackles both the domain variation and missing labels challenges simultaneously. We introduce a new fully unsupervised One-Shot Synthesis-based Contrastive learning method (termed OSSCo) to project images from different data distributions into a shared feature space for similarity measurement. To handle the domain-unknown setting, we propose One-Shot unpaired image-to-image Translation (OST) between a randomly selected one-shot image and the rest of the training images. By minimizing the global distance between the original images and the generated images from OST, the model learns domain-agnostic representations. To address the label-unknown setting, we employ contrastive learning with a synthesis-based transform module from the OST training. This allows for effective representation learning without any annotations or external constraints. We evaluate our proposed method on diverse datasets, and the results demonstrate its effectiveness. Notably, our approach achieves comparable performance to current state-of-the-art supervised methods.

Abstract:
Open World Object Detection (OWOD), simulating the real dynamic world where knowledge grows continuously, attempts to detect both known and unknown classes and incrementally learn the identified unknown ones. A few recent studies have introduced and explored the OWOD problem; however, the main challenges of the task, namely distinguishing unknown classes from the background (Unknown Objectness) and from known classes (Unknown Discrimination), have not been well solved, and a systematic analysis of benchmarks and metrics for evaluating the OWOD task is lacking. In this paper, we revisit the OWOD problem and rethink it from benchmark, metrics, and algorithm perspectives. First, we propose five fundamental benchmark principles in line with the OWOD definition and construct two OWOD benchmarks according to the principles for a fair evaluation. Second, we point out that existing metrics neglect the detection performance of unknown classes and further design two additional metrics specific to the OWOD problem, filling the void of evaluating from the perspective of unknown classes. Finally, we introduce a novel and effective OWOD framework with an auxiliary Proposal ADvisor (PAD) and a Class-specific Expelling Classifier (CEC). The non-parametric PAD improves Unknown Objectness by assisting RPN in identifying more accurate unknown proposals based on the class-agnostic property of the object and aggregation through spatial and appearance similarity, while CEC enhances the Unknown Discrimination by calibrating the over-confident activation boundary and suppressing confusing predictions through a class-specific expelling function. Comprehensive experiments conducted on both fair benchmarks based on our OWOD benchmark principles and the original benchmark demonstrate that our method outperforms other state-of-the-art object detection methods in terms of both existing and our new metrics.

Abstract:
Rain streaks bring complicated pixel intensity changes and additional gradients, greatly obstructing the extraction of image features from background. This causes serious performance degradation in feature-based applications. Thus, it is critical to remove rain streaks from a single rainy image to recover image features. Recently, many excellent image deraining methods have made remarkable progress. However, these human visual system-driven approaches mainly focus on improving image quality with pixel recovery as loss function, and neglect how to enhance image feature recovery ability. To address this issue, we propose a task-driven image deraining algorithm to strengthen image feature supply for subsequent feature-based applications. Due to the extensive use and strong practicability of Scale-Invariant Feature Transform (SIFT), we first propose two separate networks using distinct losses and modules to achieve two goals, respectively. One is difference of Gaussian (DoG) pyramid recovery network (DPRNet) for SIFT detection, and the other gradients of Gaussian images recovery network (GGIRNet) for SIFT description. Second, in the DPRNet we propose an alternative interest point loss that directly penalizes scale response extrema to recover the DoG pyramid. Third, we advance a gradient attention module in the GGIRNet to recover those gradients of Gaussian images. Finally, with the recovered DoG pyramid and gradients, we can regain SIFT key points. This divide-and-conquer scheme to set different objectives for SIFT detection and description leads to good robustness. Compared with state-of-the-art methods, experimental results demonstrate that our proposed algorithm achieves better performance in both the number of recovered SIFT key points and their accuracy.

Abstract:
Class incremental learning (CIL) has drawn wide attention in academic research. However, most existing methods cannot be applied to some practical scenarios in which unknown classes occur during the inference stage. To solve this problem, we target a more challenging and realistic setting: Incremental Open Set Learning (IOSL), which needs to reject unknown classes from test data while incrementally learning new classes. IOSL has two coupled key challenges: 1) overcoming the catastrophic forgetting of old classes when learning new classes incrementally due to the rarity of old training samples; and 2) minimizing the empirical classification risk on known classes and the open space risk on unknown classes. To address these challenges, we propose an incremental open-set learning method with a “future-look” ability. This ability reserves embedding space for incrementally arriving new classes and potential unknown classes simultaneously, which indirectly alleviates catastrophic forgetting and helps recognize unknown classes. Specifically, a normalized prototype learning strategy is designed to minimize the empirical classification risk and implicitly reserve some space. Moreover, we design an extra classes synthesizing module to explicitly reserve more suitable space. This further minimizes the empirical classification risk while reducing the open space risk. Furthermore, we develop an adaptive metric learning loss to mitigate the class imbalance between old and new classes, which focuses on exploiting exemplars fully and selects an adaptive margin for pairs of old and new classes. Extensive experiments on representative classification datasets validate the superiority of our method.

Abstract:
Masked language modeling (MLM) has become one of the most successful self-supervised pre-training tasks. Inspired by its success, Point-BERT, as a pioneering work on point clouds, proposed masked point modeling (MPM) to pre-train point transformers on large-scale unannotated datasets. Despite its great performance, we find that the inherent difference between language and point clouds tends to cause ambiguous tokenization for point clouds, and no gold standard is available for point cloud tokenization. Point-BERT uses a discrete Variational AutoEncoder (dVAE) as its tokenizer, but it might generate different token ids for semantically-similar patches and the same token ids for semantically-dissimilar patches. To tackle the above problems, we propose McP-BERT, a pre-training framework with multi-choice tokens. Specifically, we ease the previous single-choice constraint on patch token ids in Point-BERT, and provide multi-choice token ids for each patch as supervision. Moreover, we utilize the high-level semantics learned by the transformer to further refine our supervision signals. Extensive experiments on point cloud classification, few-shot classification and part segmentation tasks demonstrate the superiority of our method, e.g., the pre-trained transformer achieves 94.1% accuracy on ModelNet40, 84.28% accuracy on the hardest setting of ScanObjectNN and new state-of-the-art performance on few-shot learning. Our method improves the performance of Point-BERT on all downstream tasks without extra computational overhead.

Abstract:
Transformers have achieved impressive progress in visual tracking due to their capability of global modeling, which enables them to learn low-frequency features (i.e., high-level semantic information). However, they seem to overlook the high-frequency features (i.e., low-level texture and edge information), which are crucial for identifying different intra-class object instances in the tracking task. To address this issue, we propose a transformer-based tracker from a frequency fusion perspective, investigating whether high-frequency and low-frequency features can be effectively combined to achieve robust tracking. Specifically, we design a simple yet effective two-stage fusion strategy and use an appropriate frequency fusion strategy in the tracking process of each stage so as to make full use of frequency domain information. In the feature extraction stage, we use wavelet decomposition of high-frequency subbands to solve the performance loss caused by the transformer’s catastrophic forgetting of high-frequency information. In the prediction head stage, we use a variety of wavelet decomposition subbands to model the multi-frequency information. The two-stage fusion strategy makes our model extract more balanced and beneficial multi-frequency information, enabling it to effectively capture target texture information and local edge information while also being sensitive to global information. Extensive experiments on six challenging benchmarks (i.e., LaSOT_ext, UAV123, TNL2K, LaSOT, TrackingNet, and GOT-10k) demonstrate the superior performance of our tracker.
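
To make the notion of wavelet subbands concrete, here is a minimal sketch of a one-level 2-D Haar decomposition, which splits a feature map into one low-frequency approximation and three high-frequency detail subbands. The abstract does not say which wavelet the tracker uses, so the Haar basis, the function name, and the subband naming (conventions for LH and HL vary between libraries) are assumptions.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar transform of an array with even height and width.

    Returns (LL, LH, HL, HH): the low-frequency approximation and three
    high-frequency detail subbands, each at half the input resolution.
    """
    x = np.asarray(x, dtype=np.float64)
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a - b + c - d) / 2.0  # differences along the width
    hl = (a + b - c - d) / 2.0  # differences along the height
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh
```

With this normalization the transform is orthonormal, so the total energy of the four subbands equals that of the input, and a flat region produces all-zero detail subbands.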

Abstract:
Cross-domain person re-identification is challenging due to the notorious domain shift problem. Most of the existing unsupervised cross-domain person ReID methods require a large number of unlabeled target-domain samples for adaptation. However, large-scale training data is not always available due to privacy concerns. Domain generalization methods, which never see any target domain data, have inferior adaptation ability. Inspired by the few-shot learning capability of the human vision system, we propose a novel setting, one-shot unsupervised cross-domain person ReID, and study the ability to adapt using the minimum number of images in the target domain during training. Specifically, we first propose a novel Group Normalization (GN) based domain generalizable ReID model. We show that the GN based model strikes a better balance between model discrimination and generalization ability than its Batch Normalization (BN) and Instance Normalization (IN) counterparts, and is more suitable as a domain generalizable ReID baseline model. Then, besides the supervised feature learning task in the source domain, we introduce two self-supervised learning tasks using the one-shot target domain data to further improve the generalization ability of the ReID model. We carefully design the model architecture and training procedure to reduce overfitting to the one-shot target domain. Extensive experiments demonstrate the effectiveness of our approach for one-shot unsupervised cross-domain ReID. Our approach can be extended to the few-shot setting, and increasing the number of shots up to 1,000 images steadily increases the performance, which provides practical value to the community.
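
As a reminder of how Group Normalization differs from its BN and IN counterparts, the sketch below normalizes each sample over groups of channels, so its statistics never depend on the other samples in the batch. Learnable scale and shift are omitted, and the function name and `(N, C, H, W)` layout are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization without learnable scale/shift.

    x has shape (N, C, H, W); C must be divisible by num_groups.
    """
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group of channels.
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)
```

Setting `num_groups` equal to the number of channels recovers Instance Normalization statistics, while `num_groups=1` normalizes over all channels of a sample.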

Abstract:
The visible and infrared image fusion (VIF) method aims to utilize the complementary information between these two modalities to synthesize a new image containing richer information. Although it has been extensively studied, it is difficult to reach consensus on which synthesized image has the best visual results, since users have different opinions. To address this problem, we propose an adjustable VIF framework termed AdjFusion, which introduces a global controlling coefficient into VIF so that it can interact with users. Within AdjFusion, a semantic-aware modulation module is proposed to transform the global controlling coefficient into a semantic-aware controlling coefficient, which provides pixel-wise guidance for AdjFusion considering both interactivity and semantic information within visible and infrared images. In addition, the introduced global controlling coefficient not only can be utilized as an external interface for interaction with users but also can be easily customized by the downstream tasks (e.g., VIF-based detection and segmentation), which can help to select the best fusion result for the downstream tasks. Taking advantage of this, we further propose a lightweight adaptation module for AdjFusion that learns a global controlling coefficient better suited to the downstream tasks. Experimental results demonstrate that the proposed AdjFusion can 1) provide ways to dynamically synthesize images to meet the diverse demands of users; and 2) outperform the previous state-of-the-art methods on both VIF-based detection and segmentation tasks, with the constructed lightweight adaptation method. Our code will be released upon acceptance at https://github.com/BearTo2/AdjFusion.

Abstract:
In weakly supervised video anomaly detection (WVAD), where only video-level labels indicating the presence or absence of abnormal events are available, the primary challenge arises from the inherent ambiguity in temporal annotations of abnormal occurrences. Inspired by the statistical insight that temporal features of abnormal events often exhibit outlier characteristics, we propose a novel method, BN-WVAD, which incorporates BatchNorm into WVAD. In the proposed BN-WVAD, we leverage the Divergence of Feature from the Mean vector (DFM) of BatchNorm as a reliable abnormality criterion to discern potential abnormal snippets in abnormal videos. The proposed DFM criterion is also discriminative for anomaly recognition and more resilient to label noise, serving as the additional anomaly score to amend the prediction of the anomaly classifier that is susceptible to noisy labels. Moreover, a batch-level selection strategy is devised to filter more abnormal snippets in videos where more abnormal events occur. The proposed BN-WVAD model demonstrates state-of-the-art performance on UCF-Crime with an AUC of 87.24%, and XD-Violence, where AP reaches up to 84.93%. Our code implementation is accessible at https://github.com/cool-xuan/BN-WVAD.
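
One plausible reading of the DFM criterion above is a variance-normalized distance between each snippet feature and the mean vector tracked by a BatchNorm layer, with the highest-scoring snippets treated as likely abnormal. The sketch below follows that reading; the exact distance, the function names, and the top-k selection are assumptions, not the authors' implementation.

```python
import numpy as np

def dfm_scores(features, running_mean, running_var, eps=1e-5):
    """Variance-normalized distance of each snippet feature from the BatchNorm mean.

    features: (num_snippets, dim); running_mean and running_var: (dim,)
    """
    diff = features - running_mean
    return np.sqrt(np.sum(diff ** 2 / (running_var + eps), axis=-1))

def top_k_snippets(scores, k):
    """Indices of the k snippets with the largest scores, highest first."""
    return np.argsort(scores)[::-1][:k]
```

Snippets whose features sit far from the batch statistics receive high scores, matching the outlier intuition stated in the abstract.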

Abstract:
With limited labeled samples, few-shot classification poses a challenge to standard deep models and has attracted a surge of interest. Metric learning based approaches stand out for their minimalist and efficient design, aiming to classify query samples using supervision of support sets in the metric space. Prototypical Network pioneered the use of mean feature embeddings to represent each support class and relied on the computed prototypes for query classification. However, the mean prototypes generated from scarce support samples are inherently biased relative to the actual class prototypes, which induces subsequent inference deviation. In this paper, we propose to diminish the bias by leveraging the semantic information of query samples to guide prototype optimization. Specifically, we exploit the semantic correlation between the local features of the initial mean prototypes and the global features of query samples to generate query-guided masks, thus tailoring optimized prototypes that vary by query sample. To our knowledge, this is the first time such correlation has been used to alleviate the prototype bias problem, and the approach is notably concise compared to existing methods. Extensive experiments conducted on three few-shot image classification benchmark datasets demonstrate the effectiveness of our proposed method.
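
The mean-prototype baseline that this abstract starts from can be sketched as follows: each class prototype is the average of its support embeddings, and a query takes the label of the nearest prototype. This shows only the Prototypical Network baseline, not the proposed query-guided refinement. The function names and the squared Euclidean metric are assumptions, and every class is assumed to have at least one support sample.

```python
import numpy as np

def mean_prototypes(support_feats, support_labels, num_classes):
    """Average the support embeddings of each class into one prototype."""
    return np.stack([support_feats[support_labels == k].mean(axis=0)
                     for k in range(num_classes)])

def classify_queries(query_feats, prototypes):
    """Assign each query to the class of its nearest prototype (squared Euclidean)."""
    d2 = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

With only a handful of support samples per class, each mean can sit far from the true class center, which is the bias the abstract sets out to reduce.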

Abstract:
Although deep neural networks (DNNs) show impressive performance across diverse tasks, they suffer from catastrophic forgetting when dealing with continuous data streams. Incremental learning aims to alleviate this phenomenon and enable DNNs to accumulate new knowledge to cope with the ever-changing world. Recently, numerous advanced methods have been developed to enhance the incremental learning capabilities of neural networks. However, these methods mainly focus on large networks, neglecting the unique needs of edge-device applications, which are surprisingly under-investigated in previous literature. In this paper, we propose two strategies for transferring knowledge from large teacher networks to lightweight networks in class incremental learning. Specifically, in cases where the initial task contains a large number of categories, our static teacher strategy involves transferring knowledge from the teacher to the student network on the initial task to enhance the plasticity of the student network, and applying regularization constraints on the subsequent tasks to improve its stability. In a more challenging scenario where each task includes an equal number of categories, the dynamic teacher strategy continuously guides the student network on each task. We evaluate the proposed methods on CIFAR100, Tiny-ImageNet and ImageNet-subset datasets with different types of lightweight networks (MobileNet, ShuffleNet). We observe that effective knowledge transfer results in the student network achieving performance comparable to, or even better than, that of the teacher network. Extensive and detailed experiments conducted on three datasets demonstrate the simplicity and effectiveness of our proposed method. Comprehensive analyses, including different factors and visualization, are also conducted.

Abstract:
Transformer-based tracking methods have been widely studied in the field of visual object tracking. The long-range information capturing ability of the transformer improves the performance of the tracking network. However, the self-attention learning procedure in the transformer module neglects local information, i.e., the target and the background around it, which can help trackers handle background clutter and deformation. In this paper, local-global self-attention (LGSA) learning is proposed for the object tracking task, which obtains local and global information simultaneously in one attention learning block. Based on the LGSA, the encoder and the decoder are designed to fuse the features corresponding to the template and search images. Additionally, two tracking networks, LGSAT-T and LGSAT-B, instantiated with the proposed encoder and decoder, are introduced. Extensive experiments on commonly used datasets, including OTB100, GOT-10K, LaSOT, and TrackingNet, demonstrate the effectiveness of LGSA and indicate the state-of-the-art performance of the proposed tracking network. The code will be released at https://github.com/lgao001/LGSAT.

Abstract:
Image restoration is the process of recovering a clean image from a degraded observation. In order to achieve this, it is essential to refine features at multiple scales. This paper develops an effective omni-kernel modulation module to enhance multi-scale representation learning for image restoration. The module consists of three branches, namely global, large, and local branches, which are designed to learn global-to-local feature representations efficiently. Specifically, the global branch achieves a global perceptive field via the dual-domain channel attention and frequency-gated mechanism. Furthermore, to provide multi-grained receptive fields, the large branch is formulated using different shapes of depth-wise convolutions with unusually large kernel sizes. Moreover, we complement local information with a point-wise depth-wise convolution. Finally, we demonstrate the effectiveness of our omni-kernel modulation module in two cases: general image restoration and all-in-one image restoration tasks. Incorporating our method into a convolutional backbone results in a model that achieves state-of-the-art performance on the 15 datasets for three representative image restoration tasks, including image dehazing, desnowing, and defocus deblurring. Moreover, by integrating our module into a pure Transformer-based backbone, the model demonstrates competitive performance against state-of-the-art algorithms in two all-in-one image restoration settings: the three-task and five-task settings.

Abstract:
The vulnerability of deep neural networks to adversarial perturbations has been widely recognized in the computer vision community. From a security perspective, it poses a critical risk for modern vision systems, e.g., the popular Deep Learning as a Service (DLaaS) frameworks. To protect deep models without modifying them, current algorithms typically detect adversarial patterns through discriminative decomposition of natural and adversarial data. However, these decompositions are biased towards either frequency resolution or spatial resolution, thus failing to capture adversarial patterns comprehensively. Also, when the detector relies on a few fixed features, it is practical for an adversary to fool the model while evading the detector (i.e., defense-aware attack). Motivated by such facts, we propose a discriminative detector relying on a spatial-frequency Krawtchouk decomposition. It extends the above works in two aspects: 1) the introduced Krawtchouk basis provides better spatial-frequency discriminability, capturing the differences between natural and adversarial data comprehensively in both spatial and frequency distributions, w.r.t. the common trigonometric or wavelet basis; 2) the extensive features formed by the Krawtchouk decomposition allow for adaptive feature selection and a secrecy mechanism, significantly increasing the difficulty of the defense-aware attack, w.r.t. the detector with a few fixed features. Theoretical and numerical analyses demonstrate the uniqueness and usefulness of our detector, exhibiting competitive scores on several deep models and image sets against a variety of adversarial attacks.

Abstract:
Talking head generation, aiming to create photo-realistic videos from a single reference image and audio input, has emerged as a vibrant area of interest within the computer vision community. Despite notable advancements, several challenges remain unaddressed. For instance, many existing approaches overlook the nuanced relationship between audio semantics and head movement, such as nodding in agreement during affirmative expressions. Additionally, the visual quality of generated content, particularly in depicting teeth, often falls short of achieving authentic realism. To address these limitations, we introduce a groundbreaking audio-semantic enhanced pose-driven talking head generation method. Our approach encompasses a multimodal 3DMM parameter prediction network alongside a high-fidelity video synthesis network, meticulously designed to produce authentic and high-quality talking head videos. The multimodal 3DMM parameter prediction network harnesses both acoustic and audio-deduced semantic information, facilitating accurate head pose predictions that resonate with the semantics of spoken words. Furthermore, to significantly improve the depiction of the mouth area, especially the teeth, our video synthesis stage incorporates a mouth-enhanced network augmented by both local and global discriminators. Comprehensive evaluations across diverse metrics affirm the superiority of our method.

Abstract:
Due to the prevalence of influenza outbreaks and outdoor scenarios with various obstructing decorations, recognizing faces with occlusions has become a pressing challenge. However, current research mainly focuses on facial recognition with one kind of occlusion and does not provide compatible solutions for different kinds of common occlusions like glasses, sunglasses, and masks. Therefore, this paper proposes an Adaptive Multi-Type Occluded Face Recognition Model (AMOFR) to effectively handle multiple occlusion types simultaneously. In AMOFR, a generator is developed to produce diverse occluded face images for training, achieved by simulating various occlusion types on unoccluded face images. Subsequently, an occlusion type-based adapter is formulated to address a range of occlusion scenarios, guided by prompts from a Visual-Language model. To enhance overall performance by leveraging complete facial information, a feature-level knowledge distillation loss function is implemented, facilitating joint learning of unoccluded-face and occluded-face features. Furthermore, a new sunglasses-wearing dataset (CALFW-SUNGLASSES) is generated for more comprehensive testing of AMOFR and further occlusion recognition research. Experimental results on datasets containing different types of occlusions demonstrate that AMOFR achieves significantly higher accuracy than other advanced face recognition models. The implementation code of AMOFR is available at https://github.com/LIU-YUXI/Adaptive-Multi-occlusion-Face-Recognition.

Abstract:
Pre-trained Vision-Language Models (VLMs), like CLIP, exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques primarily focus on global text and image representations, yet overlook multi-modal attribute characteristics. This limitation hinders the model’s ability to perceive fine-grained visual details and restricts its generalization ability to a broader range of unseen classes. To address this issue, we propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment. The proposed MAP enjoys several merits. First, we introduce learnable visual attribute prompts enhanced by textual attribute semantics to adaptively capture visual attributes for images from unknown categories, boosting fine-grained visual perception capabilities for CLIP. Second, the proposed attribute-level alignment complements the global alignment to enhance the robustness of cross-modal alignment for open-vocabulary objects. To our knowledge, this is the first work to establish cross-modal attribute-level alignment for CLIP-based few-shot adaptation. Extensive experimental results on 11 datasets demonstrate that our method performs favorably against state-of-the-art approaches.

Abstract:
Domain generalization (DG) endeavours to develop robust models that possess strong generalizability while preserving excellent discriminability. Nonetheless, pivotal DG techniques tend to improve the feature generalizability by learning domain-invariant representations, inadvertently overlooking the feature discriminability. On the one hand, the simultaneous attainment of generalizability and discriminability of features presents a complex challenge, often entailing inherent contradictions. This challenge becomes particularly pronounced when domain-invariant features manifest reduced discriminability owing to the inclusion of unstable factors, i.e., spurious correlations. On the other hand, prevailing domain-invariant methods can be categorized as category-level alignment, susceptible to discarding indispensable features possessing substantial generalizability and narrowing intra-class variations. To surmount these obstacles, we rethink DG from a new perspective that concurrently imbues features with formidable discriminability and robust generalizability, and present a novel framework, namely, Discriminative Microscopic Distribution Alignment (DMDA). DMDA incorporates two core components: Selective Channel Pruning (SCP) and Micro-level Distribution Alignment (MDA). Concretely, SCP attempts to curtail redundancy within neural networks, prioritizing stable attributes conducive to accurate classification. This approach alleviates the adverse effect of spurious domain-invariance and amplifies the feature discriminability. Besides, MDA accentuates micro-level alignment within each class, going beyond mere category-level alignment. This strategy accommodates sufficient generalizable features and facilitates within-class variations. Extensive experiments on four benchmark datasets corroborate that DMDA achieves comparable results to state-of-the-art methods in DG, underscoring the efficacy of our method. 
The source code will be available at https://github.com/longshaocong/DMDA.

Abstract:
Due to the limitations of capture devices and scenarios, egocentric videos frequently have low visual quality, mainly caused by high compression and severe motion blur. With the increasing application of egocentric videos, there is an urgent need to enhance the quality of these videos through super-resolution. However, existing Video Super-Resolution (VSR) works, focusing on third-person view videos, are actually unsuitable for handling blurring artifacts caused by rapid ego-motion and object motion in egocentric videos. To this end, we propose EgoVSR, a VSR framework specifically designed for egocentric videos. We explicitly tackle motion blurs in egocentric videos using a Dual Branch Deblur Network (DB2Net) in the VSR framework. Meanwhile, a blurring mask is introduced to guide the DB2Net learning, and can be used to localize blurred areas in video frames. We also design a MaskNet to predict the mask, as well as a mask loss to optimize the mask estimation. Additionally, an online motion blur synthesis model for common VSR training data is proposed to simulate motion blurs as in egocentric videos. In order to validate the effectiveness of our proposed method, we introduce an EgoVSR dataset containing a large amount of fast-motion egocentric video sequences. Extensive experiments demonstrate that our EgoVSR model can efficiently super-resolve low-quality egocentric videos and outperform strong comparison baselines. Our code, pre-trained models and data can be found at https://github.com/chiyich/EGOVSR/.

Abstract:
Existing deep learning-based image compression methods overlook the unique properties of screen content images (SCIs), like limited color values and abundant repetitive patterns, leading to limited compression performance on SCIs. Therefore, a specialized framework, deep screen content image compression (DSCIC) is proposed, which contains a color context generator (CCG) and a region-based block aggregation (RBA) module. The CCG is designed to generate compression-friendly color contexts based on main color components, embedded in the encoder-decoder to remove color representation redundancy. Furthermore, to effectively reduce repetitive block redundancy in SCIs, the RBA captures repetitive patterns and enables adaptive aggregation in the latent space. It leverages region-based block matching and block content-aware aggregation to utilize repetitive features for further improving compression performance. Extensive experimental results demonstrate that the proposed DSCIC outperforms the most advanced traditional codec VVC-SCC, and is significantly superior to other learning-based image compression methods. Using VVC as the anchor, DSCIC exhibits further BD-Rate savings of 12.185% and 4.889% compared to VVC-SCC and the SOTA deep learning-based method, respectively.

Abstract:
Visual grounding is a task that seeks to predict the specific location of an object or region described by a linguistic expression within an image. Despite recent success, existing methods still suffer from two problems. First, most methods use independently pre-trained unimodal feature encoders for extracting expressive feature embeddings, thus resulting in a significant semantic gap between unimodal embeddings and limiting the effective interaction of visual-linguistic contexts. Second, existing attention-based approaches equipped with the global receptive field have a tendency to neglect the local information present in the images. This limitation restricts the semantic understanding required to distinguish between referred objects and the background, consequently leading to inadequate localization performance. Inspired by the recent advance in knowledge distillation, in this paper, we propose a DUal knowlEdge disTillation (DUET) method for visual grounding models to bridge the cross-modal semantic gap and improve localization performance simultaneously. Specifically, we utilize the CLIP model as the teacher model to transfer the semantic knowledge to a student model, in which the vision and language modalities are linked into a unified embedding space. Besides, we design a self-distillation method for the student model to acquire localization knowledge by performing region-level contrastive learning to make the predicted region close to the positive samples. To this end, this work further proposes a Semantics-Location Aware sampling mechanism to generate high-quality self-distillation samples. Extensive experiments on five datasets and ablation studies demonstrate the state-of-the-art performance of DUET and its orthogonality with different student models, thereby making DUET adaptable to a wide range of visual grounding architectures. Code is available on DUET.

Abstract:
Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature fusion stages. However, this paradigm may degrade the tracking performance due to inaccurate language descriptions and lacks further modal interaction. To address these issues, we propose a one-stream vision-language model called One-stream Stepwise Decreasing for Vision-Language Tracking (OSDT). Specifically, we first encode the language description using a language encoder. The obtained language features are then combined with visual images and entered jointly into a visual encoder, in which the encoder’s self-attention mechanism is utilized to facilitate more interactions between language and visual features. Moreover, to mitigate the problems caused by inaccurate language descriptions, we design a stepwise decreasing multi-modal interaction framework, in which a Feature Filter Module (FFM) is introduced to select language features that are more relevant to visual information to provide semantic guidance for visual feature extraction. Furthermore, without additional feature fusion modules, our one-stream model framework can efficiently utilize the proposed feature filtering module for feature selection. Consequently, our tracker can achieve fast tracking speed in the vision-language tracking domain compared to existing state-of-the-art methods. We extensively evaluate our tracker on three benchmarks, i.e., TNL2K, LaSOT, and OTB99, demonstrating competitive performance compared to state-of-the-art vision-language tracking methods.

Abstract:
Recently, RGB-T tracking methods have made significant progress, demonstrating remarkable capabilities in addressing the complexities of tracking tasks within demanding environments. However, these methods overlook the instability of modal validity in real-world scenarios. This limits the model’s ability to understand the correlation between modalities, thereby hindering the model’s ability to fully leverage the synergistic effects of RGB and TIR. To address this challenge, we propose a novel RGB-T tracking model named MCTrack, from the perspective of leveraging correlation among modalities. First, during the feature extraction stage, we design a novel module based on channel matching modeling to construct a bidirectional channel context information flow for the two modalities. By leveraging this information flow, modality-specific correlation information can be transmitted to both modalities, adaptively augmenting the correlation between them. Subsequently, after the feature extraction network, the features of each modality are decoded and transformed to generate more correlated feature representations. During this stage, we extract distinctive and collective features by leveraging the correlation among modalities. These features are then fused to generate search region features specifically for localization. This aids the model in comprehending the correlation between RGB and TIR under complex scenarios, thereby enhancing its ability to capture and utilize key features. Based on extensive experiments conducted on four popular RGB-T tracking benchmarks, our model demonstrates superior performance, particularly showcasing impressive results on the LasHeR dataset with an achieved Precision of 71.6%.

Abstract:
Homography estimation is a common image alignment method. Unsupervised learning, which uses unlabeled training data and exhibits excellent performance, has attracted much attention in this field. When there are multiple planes in the scene, using features over the entire image for matching will lead to compromised results. However, existing methods for learning masks that focus on the principal plane through deep neural networks lack explicit guidance. In this paper, we propose a novel unsupervised method to explicitly model anomaly descriptor removal and mask generation. Specifically, reliable feature descriptors are selected from a novel perspective, in which features that are not responsible for alignment are regarded as outliers. A pixel-level support vector data description (PL-SVDD) module is designed. This module learns the feature representation of image pixels and fits a hypersphere that excludes redundant feature information not responsible for alignment, thereby optimizing the feature descriptor. Based on the optimized image features, a correlation learning (CL) module is designed. This module explicitly generates a mask through mathematical modeling to select reliable areas for homography estimation. Specifically, the feature descriptor of one unaligned image is modeled as a multivariate Gaussian distribution by Gaussian density estimation (GDE). Then, the Mahalanobis distance is computed between this multivariate Gaussian distribution and the feature descriptor of the other image to generate the mask. Experiments show that our method achieves good performance compared with previous methods.
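
The Gaussian density estimation and Mahalanobis-distance step described above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: the 2-D descriptors, the function names, the small ridge term, and the fixed threshold are hypothetical and are not taken from the paper, whose CL module operates on learned pixel-level descriptors.

```python
import math

def fit_gaussian_2d(points):
    """Estimate the mean and 2x2 covariance of a list of 2-D descriptors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    eps = 1e-6  # assumed ridge term that keeps the covariance invertible
    return (mx, my), ((sxx + eps, sxy), (sxy, syy + eps))

def mahalanobis_2d(p, mean, cov):
    """Mahalanobis distance of point p from a 2-D Gaussian (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

def reliability_mask(descs_a, descs_b, threshold=2.0):
    """Mark descriptors of image B that lie close to the Gaussian fitted on image A.

    The hard threshold is an illustrative assumption; a soft mask is equally possible.
    """
    mean, cov = fit_gaussian_2d(descs_a)
    return [1 if mahalanobis_2d(p, mean, cov) <= threshold else 0 for p in descs_b]
```

In this toy setting, a descriptor of the second image that falls far from the distribution fitted on the first image receives a mask value of 0 and would be excluded from homography estimation.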

Abstract:
Compared with traditional knowledge distillation, self-distillation does not require a pre-trained teacher network, which is more concise. Among self-distillation methods, data augmentation-based methods provide an elegant solution without modifying the network structure or consuming additional memory. However, when employing data augmentation in the input space, the forward propagations for augmented data bring additional computation costs and the augmentation methods need to be adapted to the modality of the input data. Meanwhile, we note that from a generalization perspective, under the condition of being able to distinguish from other classes, a dispersed intra-class feature distribution is superior to a compact intra-class feature distribution, especially for categories with larger sample differences. Based on the above considerations, this paper proposes a feature augmentation-based self-distillation method (FASD) based on the idea of feature extrapolation. For each source feature, two augmentations are generated by subtraction between features. One subtracts the temporary class center computed with samples belonging to the same category, and the other subtracts the closest sample feature belonging to other categories. Then, the predicted outputs of the augmented features are constrained to be consistent with that of the source feature. The consistency constraint on the former augmented feature expands the learned class feature distribution, leading to greater overlap with the unknown feature distribution of test samples, thereby improving the generalization performance of the network. The consistency constraint on the latter augmented feature increases the distance between samples from different categories, which enhances the distinguishability between categories. Experimental results on image classification tasks demonstrate the effectiveness and efficiency of the proposed method.
Meanwhile, experiments on text and audio tasks prove the universality of the method for classification tasks with different modalities.
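
The two feature augmentations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extrapolation form `f + lam * (f - reference)`, the step size `lam`, and all function names are hypothetical, since the abstract only states that the augmentations are generated by subtraction between features. The consistency constraint on the predicted outputs is omitted.

```python
import math

def class_center(features):
    """Mean feature of the samples currently assigned to one class."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]

def nearest_other(feature, others):
    """Closest feature (Euclidean distance) among samples of other classes."""
    return min(others, key=lambda o: math.dist(feature, o))

def extrapolate(feature, reference, lam=0.5):
    """Move the feature away from the reference: f + lam * (f - reference).

    The exact extrapolation formula and lam are assumptions for illustration.
    """
    return [x + lam * (x - r) for x, r in zip(feature, reference)]

def augment(feature, same_class, other_class, lam=0.5):
    """Return the two augmented features: away from the class center,
    and away from the nearest sample of another class."""
    center = class_center(same_class)
    neighbor = nearest_other(feature, other_class)
    return extrapolate(feature, center, lam), extrapolate(feature, neighbor, lam)
```

Because both augmentations are computed directly in feature space, no extra forward pass through the backbone is needed, which matches the efficiency argument made in the abstract.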

Abstract:
In recent years, Learned Image Compression (LIC) has undergone rapid evolution. However, it is worth noting that most prevalent LIC methodologies still rely on uniform Scalar Quantization (SQ) for latent features. This overlooks the untapped potential of contextual information, which could be leveraged to significantly reduce statistical redundancies. Prior research has explored the adaptability of Vector Quantization (VQ) to diverse data distributions, yet VQ introduces significant computational complexity into LIC, hindering its practical implementation. Consequently, in this work, we propose the Contextual Sequential Quantization (CSQ) method, which progressively discretizes the latent features of LIC by harnessing content contextual information and image textural priors. Our proposed CSQ signifies progress in LIC by blending the computational efficiency of SQ with a substantial approach towards the adaptability of VQ. We further propose the Center Compensation Module (CCM) based on the proposed CSQ. This module strategically determines adaptive quantization centers, leading to a direct enhancement of reconstruction quality without compromising the bit-rate. Moreover, it is worth noting that existing LIC approaches face challenges in leveraging hyper side information to effectively enhance transformations, which is attributed to the entanglement of the hyperprior generation module with the main transformations. Consequently, we propose to decouple the hyperprior module from the main transformations, and design the Hyperprior-Assisted Transformation (HAT) unit to feed the hyperprior back into the main transformations. This further improves the coding performance. By integrating the proposed CSQ, CCM, and HAT, our proposed Non-uniform quantization-based LIC (NLIC) method attains state-of-the-art rate-distortion (R-D) performance among existing LIC methodologies.

Abstract:
Image-text matching aims to bridge vision and language areas, which is a crucial task in multi-modal intelligence. The core idea is to learn features of each modality and aggregate learned features as holistic representations to measure image-text relevance. Most existing methods involve cross-modal interaction during feature learning by modeling fine-grained relationships between two modalities for better results. However, these methods may obtain wrong attention scores when directly computing similarities between regions and words. Besides, current methods mainly rely on simple pooling operations for feature aggregation, which introduces interference from redundant information, resulting in inaccurate matching results. To alleviate these issues, we propose a novel reference-aware adaptive network for image-text matching by jointly using a reference attention module for feature learning and an adaptive aggregation module for feature aggregation. The proposed model enjoys several merits. First, the designed reference attention module effectively reduces wrong attention scores by introducing a set of references during cross-modal interaction. Second, the proposed adaptive aggregation module highlights useful information adaptively while suppressing redundant information during aggregation. Extensive experiments on two standard benchmarks demonstrate that our method performs favorably against state-of-the-art methods.

Abstract:
Text-based video retrieval is a crucial technology for video and multimodal applications. Although in traditional Text-Video Retrieval caption-video pairs are supposed to be entirely relevant, there is still information missing in text when compared to the video content. In a specific application scenario of Text-Video Retrieval, where the given caption corresponds to only a segment of the target video, the challenge of aligning two modalities becomes particularly difficult. To address this issue, we introduce context information as an auxiliary to enrich text representation and enhance alignment. In this work, we propose an effective Linguistic Hallucination framework, which incorporates context captions during training and replaces them with hallucinated textual representations predicted from the source sentence at inference. Specific hallucination loss and consistency loss are designed to supervise the learning process. Besides, Curriculum Learning is introduced at both data-level and model-level, which makes the training procedure more stable and improves the retrieval performance simultaneously. Extensive comparison experiments and ablation studies on benchmark datasets demonstrate the effectiveness of our framework. Moreover, we also apply our proposed method to other cross-modal tasks and the promising experimental results prove its generalization ability. Our codes and datasets are available in https://github.com/silenceFS/Linguistic-Hallucination.

Abstract:
Spatio-temporal action detection is a fundamental task that detects persons and recognizes their actions from videos. It requires reasoning about the spatial-temporal interactions between persons and their surroundings. Recently, researchers have introduced more modalities, which places higher demands on the reasoning capability of the method, yet a method capable of holistic reasoning is still lacking. To this end, we propose a heterogeneous graph network, which aims to reason about the spatial-temporal interactions among different types of nodes (video entities) and edges (inter-entity relations). Concretely, it includes spatial and temporal graphs, which are alternately updated. The spatial graph contains nodes of person appearance, person pose, object appearance, and hand interaction, and the temporal graph has person nodes at different moments. For information aggregation, we propose a person-centric heterogeneous graph reasoning algorithm, which introduces heterogeneity into the graphs through node-type-specific projections and modulated edge-type-specific representations. We find that the introduction of heterogeneity enriches the model’s ability to understand multi-modality, which facilitates better parsing of complex semantic relations in videos and potentially leads to further mining of spatial-temporal interactions between entities in the future. Experimental results on four public datasets demonstrate the superiority of our method. Code is available at https://github.com/actiondetection.

Abstract:
This paper addresses two vital challenges in Unsupervised Domain Adaptation (UDA) with a focus on harnessing the power of Vision-Language Pre-training (VLP) models. Firstly, UDA has primarily relied on ImageNet pre-trained models. However, the potential of VLP models in UDA remains largely unexplored. The rich representation of VLP models holds significant promise for enhancing UDA tasks. To address this, we propose a novel method called Cross-Modal Knowledge Distillation (CMKD), leveraging VLP models as teacher models to guide the learning process in the target domain, resulting in state-of-the-art performance. Secondly, current UDA paradigms involve training separate models for each task, leading to significant storage overhead and impractical model deployment as the number of transfer tasks grows. To overcome this challenge, we introduce Residual Sparse Training (RST) exploiting the benefits conferred by VLP’s extensive pre-training, a technique that requires minimal adjustment (approximately 0.1%~0.5%) of VLP model parameters to achieve performance comparable to fine-tuning. Combining CMKD and RST, we present a comprehensive solution that effectively leverages VLP models for UDA tasks while reducing storage overhead for model deployment. Furthermore, CMKD can serve as a baseline in conjunction with other methods like FixMatch, enhancing the performance of UDA. Our proposed method outperforms existing techniques on standard benchmarks. Our code will be available at: https://github.com/Wenlve-Zhou/VLP-UDA.
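
The storage argument behind sparse residual adaptation can be illustrated with a small Python sketch. It shows one plausible reading only and is not the authors' RST algorithm: selecting the residual entries by magnitude, the flat list of weights, and the function names are all assumptions introduced here for illustration.

```python
def sparse_residual(pretrained, finetuned, keep_ratio=0.005):
    """Keep only the largest-magnitude parameter changes as an index -> delta map.

    Magnitude-based selection and keep_ratio are illustrative assumptions.
    """
    deltas = [f - p for p, f in zip(pretrained, finetuned)]
    k = max(1, int(len(deltas) * keep_ratio))
    top = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)[:k]
    return {i: deltas[i] for i in top}

def apply_residual(pretrained, residual):
    """Rebuild task-specific weights from the shared base plus a sparse residual."""
    weights = list(pretrained)
    for index, delta in residual.items():
        weights[index] += delta
    return weights
```

Under this reading, each transfer task only needs to store its small residual map, while the pre-trained weights are shared across all tasks.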

Abstract:
Single image dehazing has been actively studied to overcome the quality degradation of hazy images. Most of the existing methods take model-based approaches and the existing learning-based methods usually target specific haze styles only, e.g., daytime, varicolored, and nighttime haze. Therefore, they suffer from limited performance on arbitrary hazy images with diverse characteristics due to the lack of a universal training dataset. In this paper, we first propose a fully data-driven learning-based framework for universal dehazing based on the haze style transfer (HST). We define multiple domains of haze styles by applying K-means clustering to the background light of diverse real hazy images. We design the haze style modulator to extract the scene radiance features and the haze-related features, respectively. We employ the unpaired image-to-image translation methodology to transfer a source hazy image into different hazy images with diverse styles while preserving the scene radiance. The generated diverse hazy images are used to train the universal dehazing network in a semi-supervised manner, where we implement the dehazing as a special instance of HST into the no-haze style. The experimental results show that the proposed framework reliably generates realistic and diverse hazy images, and achieves better performance of universal dehazing regardless of the haze styles compared with the existing state-of-the-art dehazing methods.

Abstract:
Lane detection is a fundamental task in autonomous driving, which lies in the real-time detection of lanes of streaming video during driving. We address the lack of temporal flow understanding of existing video lane detectors, propose a streaming video lane detection training framework, and focus on building a series of inter-frame temporal information conduction structures. Specifically, we propose the Deformable Spatio-Temporal Attention (DSTA) module, which accurately captures the instantaneous changing features and position shifts between frames and incorporates key information under different spatio-temporal conditions. Also, to maintain long-time memory at a very low computational cost, we design instance caches that suggest possible lanes for the current frame and resist short-time lane disappearance based on historical memory. We experimented with the inclusion of background category prediction, which is able to simply filter low-confidence false predictions of lanes, while also conveying a more holistic and uniform relationship between lanes and background to the model. These methods allow our model to achieve a significant lead in the video lane detection dataset VIL-100, reaching an accuracy of 94.9 at a speed of 39 FPS.

Abstract:
Video question generation task aims to generate meaningful questions about a video targeting an answer. Existing methods merely focus on the static appearance features in the image frames or simply identify a motion in the video to ask general questions. However, a video contains dynamically changing visual content that deserves to be questioned, e.g., changes in object motions, object states and relationships among objects, which is more practical and closer to the dynamic world we live in. In this paper, we propose a difference-aware video question generation model that aims to generate questions about temporal differences in the video, i.e., capturing the dynamic changes between image frames of a video to ask questions. To capture the dynamic changes between image frames, we utilize a temporal difference extractor to localize the differences for each frame pair of a video through an attention mechanism. Then, we introduce an answer-aware module to capture the answer-related image frame pair containing their differences for question generation, which aims to guide our model to focus on answer-related content for questioning. Finally, the output of the answer-aware module is sent to a decoder module to generate questions. Extensive experiments on SVQA and MSVD-QA datasets show that the proposed model outperforms state-of-the-art models, e.g., our model achieves at least 17.1% improvement over existing models in the SVQA dataset. This is because our model can generate questions similar to ground truths that involve changes between image frames in videos. Our code is available at https://github.com/Gary-code/D-VQG.

Abstract:
Single-image super-resolution (SISR) is essential for improving the extraction of useful information from images captured in the real world. Most existing super-resolution methods generally assume that low-resolution (LR) images are generated from high-resolution (HR) images through a known degradation model, such as bicubic downsampling. As a result, these methods do not exhibit favorable performance on real-world images with complex authentic degradations, significantly limiting their practicality, especially in application scenarios where image authenticity is strictly enforced. In this paper, we design a frequency separation network (FSN) to separate low-frequency information and generate high-frequency information, which can reconstruct high-resolution real-world images quickly and accurately. We propose using various Gaussian filters as the frequency separation (FS) module to gradually separate the frequencies and route them to their respective feature extraction modules. Subsequently, we aggregate all the different frequency features using the adaptive feature fusion (AFF) module to generate the HR image. Therefore, FSN can focus on high-frequency information to restore image details and ensure stable restoration of important information, such as object contours, without generating false texture details. Extensive experiments demonstrate that our FSN achieves consistently superior visual quality and generalization ability with more realistic and natural textures in various scenarios.
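
The idea of gradually separating frequencies with Gaussian filters can be sketched on a 1-D signal in plain Python. This is a simplified illustration rather than the FS module itself: the 1-D setting, the sigma values, the border handling, and the function names are assumptions made here.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Low-pass filter a signal with a Gaussian kernel, replicating the borders."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * signal[j]
        out.append(acc)
    return out

def separate(signal, sigmas=(1.0, 2.0)):
    """Split a signal into bands, from highest frequency to the low-frequency rest.

    Each step removes the detail lost by one Gaussian blur (assumed sigmas).
    """
    bands = []
    current = list(signal)
    for sigma in sigmas:
        low = blur(current, sigma)
        bands.append([c - l for c, l in zip(current, low)])
        current = low
    bands.append(current)
    return bands
```

Summing the returned bands recovers the input up to floating-point error, so each band could be processed by its own branch and fused afterwards without losing information at the separation step.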

Abstract:
Pixel-wise contrastive learning recently offers a new training paradigm in semantic segmentation by directly shaping the pixel embedding space. Compared with pixel-pixel contrast that often requires large memory and high computation cost, pixel-prototype contrast exploits the semantic correlations among pixels in a more efficient way by pulling positive pixel-prototype pairs close and pushing negative pairs apart. However, most existing work treats pixels as anchors to form contrast, either failing to capture the intra-class variance or introducing extra computational overhead. In this work, we propose Prototype-Anchor Contrast (ProAC), a novel prototypical contrastive learning paradigm that strengthens pixel-prototype associations in a simple yet effective fashion. First, ProAC pre-defines class prototypes (serving as cluster centroids) by exploiting the uniformity on the hypersphere in the feature space and thus requires no prototype updating during network optimization, which greatly simplifies the network training process. Second, by treating prototypes as anchors, ProAC builds a novel prototype-to-pixel learning path, where a large amount of negative pixels can naturally be generated to describe rich semantic information without relying on auxiliary sample augmentation techniques. Finally, as a plug-and-play regularization term, ProAC can be attached to most existing segmentation models and assist the network optimization by directly shaping the pixel embedding space. Extensive experiments on different benchmarks show that our ProAC brings an mIoU increase from 1.4% to 2.0% for fully-supervised models and from 0.9% to 6.0% for domain-adaptive models, respectively. It also leads to a gain of mIoU, ranging from 1.8% to 2.7% in more challenging cases, including different resolutions, diverse illuminations and masked scenarios.

Abstract:
Current trackers rely only on a fixed target template to localize the target in each frame, which is, however, prone to failure in case of fast appearance changes or the presence of distractor objects. Having some historical knowledge about the tracked targets as well as their surrounding scenes can be highly beneficial for robust tracking. This historical information can be propagated through the sequence and used to perceive changes in target appearance in a timely manner and explicitly avoid distractor objects. In this work, we propose a Spatial-Temporal Context Attention (STCA) model which utilizes the appearance and state information of previously tracked targets as well as their surrounding scenes to more accurately localize the real target in the current frame. We embed an improved position encoder into the STCA, which enables the target template, context template and search patch to perform extensive interactional fusion through simultaneous self-attention and cross-attention computation. By embedding the STCA module into Transformer, we construct a target-aware based online tracking network (named TATrack) that has a backbone to extract features better suited to the tracking task, a neck to further suppress distractors and highlight the target, and a classification-regression head to make the tracking scores consistently reflect the quality of the bounding boxes. In addition, we also design a simple yet effective online updating approach to select high-quality context templates. Our tracker achieves state-of-the-art performance on several benchmarks, including LaSOT, TrackingNet, GOT10k, OTB100 and UAV123. The code and trained models are available at https://github.com/hekaijie123/TATrack.

Abstract:
In the realm of unmanned aerial vehicle (UAV) tracking, Siamese-based approaches have gained traction due to their optimal balance between efficiency and precision. However, UAV scenarios often present challenges such as insufficient sampling resolution, fast motion and small objects with limited feature information. As a result, temporal context in UAV tracking tasks plays a pivotal role in target location, overshadowing the target’s precise features. In this paper, we introduce MT-Track, a streamlined and efficient multi-step temporal modeling framework designed to harness the temporal context from historical frames for enhanced UAV tracking. This temporal integration occurs in two steps: correlation map generation and correlation map refinement. Specifically, we unveil a unique temporal correlation module that dynamically assesses the interplay between the template and search region features. This module leverages temporal information to refresh the template feature, yielding a more precise correlation map. Subsequently, we propose a mutual transformer module to refine the correlation maps of historical and current frames by modeling the temporal knowledge in the tracking sequence. This method significantly trims computational demands compared to the raw transformer. The compact yet potent nature of our tracking framework ensures commendable tracking outcomes, particularly in extended tracking scenarios. Comprehensive tests across four renowned UAV benchmarks substantiate the superior efficacy of our approach, delivering real-time performance at 84.7 FPS on a single GPU. Real-world test on the NVIDIA AGX hardware platform achieves a speed exceeding 30 FPS, validating the practicality of our method.

Abstract:
We study a practical domain adaptation task, named source-free object detection (SFOD), which aims to adapt a pre-trained source detector to an unlabeled target domain without access to the original labeled source domain samples. In this paper, we design a new self-training approach for SFOD called Balance Teacher based on the mean teacher model. We target two key issues when using self-training for SFOD: 1) imbalanced label distribution when using pseudo-labels for supervising the model training, and 2) imbalanced image distribution, i.e., significant data variance in the target domain. To address these issues, we first design a Class-balanced Instance Selection (CBIS) module to automatically balance different classes when selecting pseudo-labeled instances during the training process. Then, we propose Progressive Target Variance Minimization (PTVM) to cope with the imbalanced image distribution in the target domain, where the feature distributions of certain and uncertain target samples are progressively aligned to alleviate the data distribution variance. In this way, the teacher model can provide high-quality pseudo-labels and guide the student model to adapt gradually to the target domain. We have conducted extensive experiments on five widely used benchmarks, and the experimental results clearly show the superiority of our method over the state-of-the-art baselines.

Abstract:
We propose a novel Text-to-Image Generation Network, Attention-bridged Modal Interaction Generative Adversarial Network (AMI-GAN), to better explore modal interaction and perception for high-quality image synthesis. The AMI-GAN contains two novel designs: an Attention-bridged Modal Interaction (AMI) module and a Residual Perception Discriminator (RPD). In AMI, we mainly design a multi-scale attention mechanism to exploit semantics alignment, fusion, and enhancement between text and image, to better refine details and context semantics of the synthesized image. In RPD, we design a multi-scale information perception mechanism with our proposed novel information adjustment function, to encourage the discriminator to better perceive visual differences between the real and synthesized image. Consequently, the discriminator will drive the generator to improve the visual quality of the synthesized image. Besides, based on these novel designs, we can design two versions, a single-stage generation framework (AMI-GAN-S), and a multi-stage generation framework (AMI-GAN-M), respectively. The former can synthesize high-resolution images because of its low computational cost; the latter can synthesize images with realistic detail. Experimental results on two widely used T2I datasets showed that our AMI-GANs achieve competitive performance in T2I task.

Abstract:
Face swapping aims to transfer the identity of a source face to a target face image while preserving the target attributes (e.g., facial expression, head pose, illumination, and background). Most existing methods use a face recognition model to extract global features from the source face and directly fuse them with the target to generate a swapping result. However, identity-irrelevant attributes (e.g., hairstyle and facial appearances) contribute a lot to the recognition task, and thus swapping this task-specific feature inevitably interfuses source attributes with target ones. In this paper, we propose an identity-aware variational autoencoder (ID-VAE) based face swapping framework, dubbed VAFSwap, which learns disentangled identity and attribute representations for high-fidelity face swapping. In particular, we overcome the unpaired training barrier of VAE and impose a proxy identity on the latent space by exploiting the weak supervision from an auxiliary image set whose identity is averaged from multiple collected face images. To explicitly guide the identity fusion, we further devise an identity-associated matrix that corresponds different face regions with their identity representations to perform identity-related feature interactions. Finally, we incorporate spatial dimensions into the latent space and exploit the generative priors of a pre-trained face generator, allowing the effective elimination of noticeable swapping artifacts. Extensive experiments on the FaceForensics++ and CelebA-HQ datasets demonstrate that our method outperforms the state-of-the-art significantly.

Abstract:
Benefiting from the rich information provided by different modalities, multi-modal tracking has shown significant improvements compared to single-modal tracking. However, in practical applications, multi-modal tracking still faces two major challenges. Firstly, it is crucial to effectively integrate the complementary information from different modalities in order to improve tracking performance. Secondly, as trackers are often deployed in dynamic environments, it is difficult to ensure complete multi-modal data. Thus, handling modal-missing issues is essential to achieve robust and reliable tracking. To address these challenges, this paper proposes a Knowledge Synergy Network (KSNet) that integrates multi-modal features into a comprehensive representation and incorporates a modal compensation mechanism to handle modal-missing issues. With this framework, a multi-modal tracker (KSTrack) is built and trained using multi-modal data. KSTrack is capable of handling both complete and incomplete multi-modal data during inference. Comprehensive experiments on four large-scale RGB-Thermal (RGB-T) and RGB-Depth (RGB-D) benchmarks show that KSTrack surpasses state-of-the-art multi-modal trackers when using multi-modal data and outperforms single-modal trackers by a large margin when using single-modal data.

Abstract:
Light interference negatively impacts frame-based visual tasks. Phenomena such as overexposure cause the loss of valuable information and reduce task execution efficiency. Event cameras are neuromorphic vision sensors that output sparse, asynchronous streams of events rather than frames. These cameras feature high temporal resolution, high dynamic range, and low power consumption. As a result, they are not susceptible to overexposure and motion blur, and they are able to recognize light interference such as strobe lights, stray lights, and reflections. However, event cameras are highly sensitive to light intensity changes, so light interference still affects event cameras as noise that is easily aliased with events triggered by environmental objects. Therefore, to reduce or eliminate the negative impact of light interference on event cameras, we systematically analyze the optical properties and event-triggering principles of these forms of light interference, and then propose the ELIR (Event-based Light Interference Removal) method for removing light interference signals in event streams under static and dynamic scenes. The proposed method is validated in object detection tasks. Additionally, we launch the LIED datasets to evaluate the effect of light interference removal in event streams and to assist other studies in this field. Experimental results on the LIED datasets show that our proposed method can remove, on average, over 97% of light interference in static scenes and over 86% in dynamic scenes. Finally, the proposed method is verified on the object detection task, achieving an average PRE over 92%. The dataset is available at https://github.com/shicy17/LIED.

Abstract:
Image-text retrieval is a fundamental task that models the connection between images and natural language. Despite flourishing progress in performance, most current methods suffer from N-related time complexity, which hinders their application in practice to a certain extent. Targeting efficiency improvement, we propose a simple and effective keyword-guided pre-screening framework for image-text retrieval. Specifically, we convert the image and text data into keywords and perform keyword matching across the modalities to exclude a large number of irrelevant gallery samples prior to the retrieval network. For keyword prediction, we cast it as a multi-label classification problem and propose a multi-task learning scheme that appends multi-label classifiers to the image-text retrieval network to achieve lightweight and high-performance keyword prediction. For keyword matching, we introduce the inverted index from search engines and thus create a win-win situation on both time and space complexities for the pre-screening. Extensive experiments on two widely-used datasets, i.e., Flickr30K and MS-COCO, verify the effectiveness of the proposed framework. The proposed framework, equipped with only two embedding layers, achieves O(1) querying time complexity while improving the retrieval efficiency and maintaining performance when applied prior to common image-text retrieval methods.
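
As a rough illustration of the keyword pre-screening idea in the abstract above, the sketch below builds an inverted index from keywords to gallery item IDs and keeps only the items that share at least one keyword with the query. The function names, the toy gallery, and the "any shared keyword" matching rule are assumptions made for illustration; the abstract does not specify how keywords are predicted or how matches are scored.

```python
from collections import defaultdict


def build_inverted_index(gallery_keywords):
    """Map each keyword to the set of gallery item IDs that contain it."""
    index = defaultdict(set)
    for item_id, keywords in gallery_keywords.items():
        for kw in keywords:
            index[kw].add(item_id)
    return index


def prescreen(index, query_keywords):
    """Return gallery IDs sharing at least one keyword with the query (assumed rule)."""
    candidates = set()
    for kw in query_keywords:
        # .get avoids inserting unseen keywords into the defaultdict
        candidates |= index.get(kw, set())
    return candidates


# Hypothetical toy gallery: item ID -> predicted keywords
gallery = {
    "img0": {"dog", "grass"},
    "img1": {"car", "street"},
    "img2": {"dog", "beach"},
}
index = build_inverted_index(gallery)
candidates = prescreen(index, {"dog", "ball"})
```

Only the surviving candidates would then be passed to the (more expensive) retrieval network; each keyword lookup is a single dictionary access, independent of gallery size.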

Abstract:
Crowd counting is usually handled in a density map regression fashion, which is supervised via an L2 loss between the predicted density map and ground truth. To effectively regulate models, various improved L2 loss functions have been developed to find a better correspondence between predicted density and annotation positions. In this paper, we propose to predict the density map at one resolution but measure its quality via a derived log-formed loss at multiple resolutions. Unlike existing methods that assume density maps at different resolutions are independent, our loss is obtained by modeling the likelihood function inspired by the relationship of density maps across multi-resolutions. We find that the traditional single-resolution L2 loss is a particular case of our derived log-likelihood. We mathematically prove it is superior to a single-resolution L2 loss. Without bells and whistles, the proposed loss substantially improves several baselines and performs favorably compared to state-of-the-art methods on five crowd counting datasets: NWPU-Crowd, ShanghaiTech A & B, UCF-QNRF, and JHU-Crowd++. The source code and trained models are released at https://github.com/streamer-AP/PML_Loss.git.

Abstract:
Multi-view learning can improve classification performance by combining information between different views. Due to the similarity among different views of the dataset, the features obtained are sometimes highly limited and redundant. At the same time, different views accumulate a large amount of noisy information, which affects the classification performance of the model. To solve these problems, we embed privileged information in the model, introduce dictionary learning, and propose a new dictionary-based multi-view learning method with privileged information (MVDL-PI). First, two sets of dictionaries (synthetic dictionary and analysis dictionary) and sparse representation matrices of different information domains are obtained for each view and for the privileged information through dictionary learning. Then, we obtain consistency information from the regularization terms of the two different sets of synthetic dictionaries and construct a LUPI (learning using privileged information) classifier from the sparse representation. In addition, we use alternating convex optimization and Lagrange multiplier methods to optimize the model and prove its convergence. We conduct a number of experiments comparing this method with similar recent methods. The experimental results show that the MVDL-PI method is superior to other methods in terms of stability and classification accuracy.

Abstract:
Recently, deep-learning-based super-resolution methods have achieved excellent performance, but they mainly focus on training a single generalized deep network by feeding numerous samples. Yet intuitively, each image has its own specific representation and would benefit from an adaptive model. To address this issue, we propose a novel convolution modulation (CoMo) mechanism to build image-specific deep networks. It exploits the principal information of the feature to generate a modulation weight and thereby adaptively modulates the kernel weights of convolution without any additional parameters, outperforming the vanilla convolution and several existing attention mechanisms when embedded into state-of-the-art architectures. To optimize the modulated convolutions in mini-batch training, we introduce an image-specific optimization (IsO) algorithm, which tackles the infeasibility of conventional optimization algorithms in this setting. Furthermore, we investigate the effect of CoMo on state-of-the-art architectures and design a new CoMoNet architecture by employing U-style residual learning and hourglass dense block learning, which is, in theory, an appropriate architecture for maximizing the effectiveness of CoMo. Extensive experiments on benchmarks show that the proposed methods achieve superior performance and higher flexibility compared with state-of-the-art SISR and blind SR methods. The code is available at github.com/YuanfeiHuang/CoMoNet.

Abstract:
Continual learning algorithms that keep the parameters of new tasks close to those of previous tasks are popular for preventing catastrophic forgetting in sequential task learning settings. However, 1) the performance of the new continual learner will be degraded without distinguishing the contributions of previously learned tasks; 2) the computational cost will increase greatly with the number of tasks, since most existing algorithms need to regularize all previous tasks when learning new tasks. To address the above challenges, we propose a self-paced Weight Consolidation (spWC) framework to attain robust continual learning by evaluating the discriminative contributions of previous tasks. To be specific, we develop a self-paced regularization to reflect the priorities of past tasks by measuring difficulty based on a key performance indicator (i.e., accuracy). When encountering a new task, all previous tasks are sorted from “difficult” to “easy” based on the priorities. Then the parameters of the new continual learner are learned by selectively maintaining the knowledge of the more difficult past tasks, which can overcome catastrophic forgetting with less computational cost. We adopt an alternative convex search to iteratively update the model parameters and priority weights in the bi-convex formulation. The proposed spWC framework is plug-and-play and is applicable to most continual learning algorithms (e.g., EWC, MAS and RCIL) in different directions (e.g., classification and segmentation). Experimental results on several public benchmark datasets demonstrate that our proposed framework can effectively improve performance when compared with other popular continual learning algorithms.
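
The "sort past tasks from difficult to easy, then regularize only the harder ones" step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the spWC formulation: a hard top-k selection stands in for the learned priority weights, difficulty is taken to be low accuracy, and a plain squared parameter drift stands in for the importance-weighted penalties of methods such as EWC. All names and numbers are hypothetical.

```python
def select_difficult_tasks(task_accuracy, keep):
    """Sort past tasks from difficult (low accuracy) to easy; keep the first `keep`."""
    ranked = sorted(task_accuracy, key=lambda t: task_accuracy[t])
    return ranked[:keep]


def consolidation_penalty(params, old_params, selected):
    """Sum of squared parameter drift, restricted to the selected past tasks."""
    return sum(
        sum((p - q) ** 2 for p, q in zip(params, old_params[t]))
        for t in selected
    )


# Hypothetical accuracies and stored parameters of three past tasks
accuracy = {"task1": 0.92, "task2": 0.71, "task3": 0.85}
old_params = {"task1": [0.0, 1.0], "task2": [1.0, 1.0], "task3": [0.0, 0.0]}

selected = select_difficult_tasks(accuracy, keep=2)
penalty = consolidation_penalty([1.0, 0.0], old_params, selected)
```

Because the penalty iterates only over the selected tasks, its cost grows with `keep` rather than with the total number of past tasks.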

Abstract:
By replacing the CCD and CMOS image sensors of conventional cameras with digital micromirror devices (DMDs), single-pixel cameras capture images at low cost through compressed measurements and computation. However, the compressed measurements lack explicit spatial information, causing difficulties for high-level tasks such as salient object detection (SOD) that are usually designed for visual inputs. To address this issue, we propose a single-pixel imaging-based SOD network called SPISODNet that predicts saliency maps directly from compressed measurements with high accuracy. Specifically, we first design an underlying feature inversion module (UFIM) to capture the underlying scene information, and then develop a context-aware flow (CAF) consisting of a feature focus module (FFM), three bidirectional attention modules (BAMs), and a spatial information-induced attention module (SIAM) to acquire and polish saliency predictions. Extensive experiments demonstrate that our method achieves superior performance for single-pixel imaging-based SOD.

Affiliations: Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; Univ Rennes, INSA Rennes, CNRS, IETR - UMR , Rennes, France; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Institute of Artificial Intelligence and Advanced Communication, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; College of Mathematics and Statistics, Shenzhen University, Shenzhen, China

Abstract:
Unsupervised domain adaptive semantic segmentation is a powerful solution for the distribution shift problem between the source and target domains. However, such methods need specified target domain data that may be unavailable in actual applications due to excessively expensive collection. Generalizable semantic segmentation has appeared as a new paradigm in recent research, aiming to generalize well to distinct unseen domains using only source domain data. The existing methods focus on learning domain-invariant features by using global distribution alignment strategies, which may decrease the discriminability of the model. To cope with this challenge, we propose a fine-grained self-supervision (FGSS) framework for generalizable semantic segmentation that takes into account both discriminability and generalizability from the perspective of the intra-class relationship. The FGSS framework contains single-view and multi-view versions. In the single-view version, we propose a fine-grained self-supervision strategy to distinguish the sub-parts of the semantic class for better class discriminability. In the multi-view version, we propose a class prototype feature enhancement strategy to generate another view (i.e., another representation of the original representation). Then, we propose a multi-view mutual supervision loss to enforce consistency between different views and further enhance the generalizability of the model. Experimental results on five widely-used datasets, i.e., GTAV, SYNTHIA, BDD100K, Cityscapes, and Mapillary, demonstrate that our FGSS framework achieves superior performance compared to state-of-the-art methods.

Abstract:
Target appearance and motion variations are the primary challenges in visual tracking. To tackle these challenges, top-performing trackers commonly rely on constructing complex appearance or motion models. However, the efficacy of these models in enhancing tracking performance can be limited by the lack of effective and seamless integration. The utilization of simplistic handcrafted fusion methods may even exacerbate the issue, resulting in a decline in tracking performance. To address this issue, we propose an end-to-end coarse-to-fine verifying approach in our motion-driven tracker. At the coarse level, we develop a motion prediction module (MPM) that efficiently extracts and utilizes motion information by leveraging the differences between adjacent frames. The MPM constructs not only a position prior for the decoder but also hybrid features that combine both motion and appearance. At the fine level, we employ a deformable transformer-based appearance model to accurately verify a local region centered on the predicted locations from the MPM. To further enhance the generalization capability of our tracker, we propose the use of an instance domain discriminator (IDD) during the training phase. This discriminator is based on domain adaptation theory and aims to sharpen the distinction between the target and other instances, thereby improving the robustness of tracking. Experimental results on five popular benchmarks, including GOT10k, LaSOT, TrackingNet, OTB, and VOT, validate the effectiveness of our proposed tracker.

Abstract:
Enhancing the accuracy of dense classification with limited labeled data and abundant unlabeled data, known as semi-supervised semantic segmentation, is an essential task in vision comprehension. Due to the lack of annotation in unlabeled data, additional pseudo-supervised signals, typically pseudo-labeling, are required to improve the performance. Although effective, these methods fail to consider the internal representation of neural networks and the inherent class-imbalance in dense samples. In this work, we propose an information transfer theory, which establishes a theoretical relationship between shallow and deep representations. We further apply this theory at both the semantic and pixel levels, referred to as IIT-SP, to align different types of information. The proposed IIT-SP optimizes shallow representations to match the target representation required for segmentation. This limits the upper bound of deep representations to enhance segmentation performance. We also propose a momentum-based Cluster-State bar that updates class status online, along with a HardClassMix augmentation and a loss weighting technique to address class imbalance issues based on it. The effectiveness of the proposed method is demonstrated through comparative experiments on PASCAL VOC and Cityscapes benchmarks, where the proposed IIT-SP achieves state-of-the-art performance, reaching mIoU of 68.34% with only 2% labeled data on PASCAL VOC and mIoU of 64.20% with only 12.5% labeled data on Cityscapes.

Abstract:
This work addresses the problem of cross-domain few-shot classification, which aims at recognizing novel categories in unseen domains with only a few labeled data samples. We argue that pre-trained models contain redundant elements that are useless or even harmful for the downstream tasks. To remedy this drawback, we introduce an L^2-SP regularized dense-sparse-dense (DSD) fine-tuning flow for regularizing the capacity of pre-trained networks and achieving efficient few-shot domain adaptation. Given a pre-trained model from the source domain, we start by carrying out a conventional dense fine-tuning step using the target data. Then we execute a sparse pruning step that prunes the unimportant connections and fine-tunes the weights of the sub-network. Finally, initialized with the fine-tuned sub-network, we retrain the original dense network as the output model for the target domain. The whole fine-tuning procedure is regularized by an L^2-SP term. In contrast to the existing methods that either tune the weights or prune the network structure for domain adaptation, our regularized DSD fine-tuning flow simultaneously exploits the benefits of sparsity regularity and dense network capacity to gain the best of both worlds. Our method can be applied in a plug-and-play manner to improve the existing fine-tuning methods. Extensive experimental results on benchmark datasets demonstrate that our method in many cases outperforms the existing cross-domain few-shot classification methods by significant margins. Our code will be released soon.
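
To make the two ingredients named in the abstract above concrete, the sketch below computes an L^2-SP penalty, which in its usual form penalizes the squared distance between the current weights and the pre-trained starting point, and builds a pruning mask for the sparse step. Using weight magnitude as the importance criterion is an assumption (the abstract only says "unimportant connections"), and the weight values, `alpha`, and `sparsity` are hypothetical.

```python
def l2_sp_penalty(weights, pretrained, alpha):
    """L2-SP term: (alpha / 2) * squared distance to the pre-trained weights."""
    return 0.5 * alpha * sum((w - p) ** 2 for w, p in zip(weights, pretrained))


def magnitude_mask(weights, sparsity):
    """Binary mask zeroing the `sparsity` fraction of smallest-magnitude weights.

    Magnitude-based importance is an assumption made for this sketch.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


pretrained = [0.5, -0.1, 0.8, 0.05]   # hypothetical source-domain weights
dense = [0.6, -0.2, 0.7, 0.01]        # hypothetical weights after dense fine-tuning

mask = magnitude_mask(dense, sparsity=0.5)       # sparse step: choose connections to keep
sparse = [w * m for w, m in zip(dense, mask)]    # pruned sub-network weights
penalty = l2_sp_penalty(dense, pretrained, alpha=0.1)
```

In the final dense step, the pruned positions would be released again and the full network retrained from `sparse`, with the same penalty still pulling the weights toward `pretrained`.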

Abstract:
Incomplete multi-view clustering is an important and challenging task, which has attracted significant attention in recent years. The key objective of incomplete multi-view clustering is to excavate the underlying available consistency of multi-view data, so as to enable the effective reconstruction of missing views for clustering. In this paper, we introduce a completion framework that deeply explores the underlying consistency and effectively completes the missing views. Following that, we propose a novel Twin Reciprocal Completion for Incomplete multi-view clustering, termed TRC-IMC for short. To be specific, TRC-IMC jointly conducts the Completion in Feature space (CF) and the Completion in Subspace (CS) to reciprocally complete the data with missing views. The underlying high-order consistency of multi-view data can be fully explored in both the feature space and subspace to guide the completion process of missing views. Extensive experiments are conducted on eight real-world multi-view datasets, and experimental results indicate the promising performance of our method compared to several state-of-the-art methods.

Abstract:
A robust reversible watermarking (RRW) algorithm enables the extraction of the watermark and the restoration of the cover image without attacks while ensuring the watermark’s extraction when the image is under attack. Existing RRW methods mainly focused on achieving robustness against geometric attacks such as rotation and scaling by embedding watermarks within the global inscribed circle of an image. However, the geometric transformations targeted by the existing methods cannot cope with combined attacks that include cropping, which is a real application scenario for geometric attacks. To extend the robustness of the watermarking algorithm, this paper proposes a local Zernike moments (ZMs) embedding strategy based on feature point extraction and selection. For each local circular domain, the same watermark is embedded in the magnitude of the ZMs. After embedding, all the compensation information used to recover the robust embedded regions is embedded outside these local circular domains in a reversible way. When attacks occur, especially combined attacks involving cropping, by using side information, the local watermarked regions in the image can be localized to extract the robust watermark. Experimental results show the superiority of the proposed method under various combined attacks that include cropping operations.

Abstract:
Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to a performance drop; 2) the disentanglement constraints on representations lead to complicated optimization. To address these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose DisTIB (Transmitted Information Bottleneck for Disentangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.

Abstract:
In large-scale surveillance of urban or rural areas, effective placement of cameras is critical for maximizing surveillance coverage or minimizing the economic cost of cameras. Existing Surveillance Camera Placement (SCP) methods generally focus on physical coverage of surveillance by implicitly assuming a uniform distribution of targets or objects of interest across all blocks, which is, however, uncommon in real-world scenarios. In this paper, we are the first to propose a target-aware SCP (tSCP) model, which prioritizes optimizing the task based on uneven target densities, allowing cameras to preferentially cover blocks with more targets of interest. First, we define target density as the likelihood of targets of interest occurring in a block, which is positively correlated with the importance of the block. Second, we combine aerial imagery with a lightweight object detection network to identify target density. Third, we formulate tSCP as an optimization problem to maximize target coverage in the surveillance area, and solve this problem with a target-guided genetic algorithm. Our method optimizes the rational and economical utilization of cameras in large-scale video surveillance. Compared with the state-of-the-art methods, our tSCP achieves the highest target coverage with a fixed number of cameras (8.31%-14.81% more than its peers), or utilizes the minimum number of cameras to achieve a preset target coverage. Codes are available at https://github.com/wu-hongxin/tSCP_main.
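
A density-weighted coverage objective of the kind described above could look like the following sketch, which scores a camera layout by the fraction of total target density that falls inside covered blocks. The circular field of view, the grid coordinates, and the density values are assumptions for illustration only; the abstract does not specify the camera model, and the target-guided genetic algorithm that would search over layouts is not shown.

```python
def target_coverage(cameras, density, radius):
    """Fraction of total target density lying in blocks covered by some camera.

    A block counts as covered if it lies within `radius` of a camera
    (circular coverage is an assumed, simplified camera model).
    """
    total = sum(density.values())
    covered = sum(
        d
        for (x, y), d in density.items()
        if any((x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 for cx, cy in cameras)
    )
    return covered / total


# Hypothetical block grid: (x, y) -> target density
density = {(0, 0): 5.0, (0, 1): 1.0, (3, 3): 4.0}

coverage_one = target_coverage([(0, 0)], density, radius=1)
coverage_two = target_coverage([(0, 0), (3, 3)], density, radius=1)
```

A function like this could serve as the fitness value when comparing candidate layouts: with uniform densities it reduces to plain area coverage, while uneven densities favor cameras near high-density blocks.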

Abstract:
Coverless steganography, which requires no modification of the cover image and can effectively resist steganalysis, has received widespread attention from researchers in recent years. However, existing coverless image steganographic methods are achieved by constructing a mapping between the secret information and images in a known dataset. This image dataset needs to be sent to the receiver, which consumes substantial resources and poses a risk of information leakage. In addition, existing methods cannot achieve high-accuracy extraction when facing various attacks. To address the aforementioned issues, we propose a robust generative steganography based on image mapping (GSIM). This method first establishes prompts based on the topic and quantity requirements and then generates the candidate image database according to the prompts; the database can be generated independently by both the sender and receiver without the need for transmission. To improve the robustness of the algorithm, our proposed GSIM utilizes prompts and fractional-order Chebyshev-Fourier moments (FrCHFMs) to construct the mapping between the generated images and the predefined binary sequences, and uses speeded-up robust features (SURFs) as auxiliary features in the information extraction phase. The experimental results show that GSIM is superior to existing coverless image steganographic methods in terms of capacity, security, and robustness.

Abstract:
Recently, more and more images are compressed and sent to the back-end devices for machine analysis tasks (e.g., object detection) instead of being purely watched by humans. However, most traditional or learned image codecs are designed to minimize the distortion of the human visual system without considering the increased demand from machine vision systems. In this work, we propose a preprocessing enhanced image compression method for machine vision tasks to address this challenge. Instead of relying on the learned image codecs for end-to-end optimization, our framework is built upon the traditional non-differential codecs, which means it is standard compatible and can be easily deployed in practical applications. Specifically, we propose a neural preprocessing module before the encoder to maintain the useful semantic information for the downstream tasks and suppress the irrelevant information for bitrate saving. Furthermore, our neural preprocessing module is quantization adaptive and can be used in different compression ratios. More importantly, to jointly optimize the preprocessing module with the downstream machine vision tasks, we introduce the proxy network for the traditional non-differential codecs in the back-propagation stage. We provide extensive experiments by evaluating our compression method for several representative downstream tasks with different backbone networks. Experimental results show our method achieves a better trade-off between the coding bitrate and the performance of the downstream machine vision tasks by saving about 20% bitrate.

Abstract:
Multi-view stereo (MVS) aims to reconstruct the dense 3D geometry of a scene by processing and relating images captured from different viewpoints. Despite impressive successes, most existing techniques simply supervise cost volumes or depth maps through conventional classification or regression methods, thereby inadequately exploring the depth representation’s full potential. Moreover, reconstructing areas with occlusions or weak textures continues to be a long-standing challenge within MVS. Another critical issue, frequently neglected, is the potential inaccuracy of ground truth depths, as evidenced in datasets like DTU. To address these problems, we introduce EA-MVSNet, an innovative error-aware MVS framework designed to enhance depth prediction. The key contributions of this work include three parts: (1) We present a novel error-aware depth representation that enhances depth prediction accuracy through error-aware learning, thereby improving reconstruction quality. (2) We develop a Deformable Feature Pyramid Network (DFPN), meticulously designed to augment reconstruction details in occluded and texture-deficient areas. (3) We introduce a cross-view consistency guidance module into the learning process, effectively mitigating the detrimental effects of ground truth depth inaccuracies and fostering faster convergence. Comprehensive experiments on the DTU dataset and Tanks and Temples dataset validate the superiority of our EA-MVSNet. Compared to the preceding UniMVSNet, EA-MVSNet achieves a notable 7.6% decrease in overall reconstruction error on the DTU dataset, and boosts the mean F-score by 3.0% and 4.1% in the intermediate and advanced groups of the Tanks and Temples dataset, respectively, surpassing most recent state-of-the-art methods.

Abstract:
The segmentation of 3D shapes is a critical aspect of shape analysis. However, most existing methods for 3D shape segmentation treat each face of the original mesh model with equal importance. This uniform approach becomes problematic in areas where the faces are smaller but denser, especially around the junctions of different segments. In such regions, greater importance should be assigned compared to the flatter areas. To address this issue, this paper proposes a novel 3D shape segmentation method that incorporates attentive nonuniform sampling into the segmentation pipeline. By leveraging a transformer-based mechanism, our method adaptively identifies the intricate details of 3D shapes, calculating varying degrees of attention to each face. Consequently, the mesh model is downsampled by eliminating faces with lower attention, thereby optimizing the segmentation process. Our approach outperforms most state-of-the-art methods on multiple public datasets, making it a promising avenue for future research.

Abstract:
Few-Shot Learning (FSL) leverages prior knowledge and generalization strategies to quickly adapt to new tasks or recognize new objects with minimal input. Recently, CLIP-based methods, aided by contrastive language-image pre-training, have demonstrated impressive few-shot performance. However, these methods employ only fixed-length uni-modal prompts at the initial encoder layer, neglecting multi-level adaptation and cross-modal interaction for the intermediate features. To address this issue, we propose Hierarchy-Aware Interactive Prompt Learning (HIPL), which jointly explores hierarchical prompt learning and cross-modal prompt interaction for CLIP-based few-shot classification (FSC). The proposed HIPL has several merits. First, we design a hierarchical prompt aggregation module to progressively generate higher-level prompts via attention mechanisms, equipping CLIP with hierarchical adaptation capability. Second, a cross-modal prompt interaction module is proposed to facilitate deep interaction between stage-wise prompts, ensuring mutual synergy between visual and textual features. To the best of our knowledge, this is the first work to learn multi-level prompts by progressive aggregation. Our extensive experiments demonstrate that HIPL outperforms previous methods in few-shot classification and base-to-new generalization. Our code is available at https://github.com/Yxt1212/HIPL

Abstract:
Few-shot classification (FSC) is a challenging task due to limited access to training data. Recent methods often employ highly complex networks to obtain high-quality features, but this may not be suitable for resource-limited applications. To tackle this challenge, we introduce Few-Shot Classification Model Compression (FSC-MC), a new task aimed at enhancing the FSC performance of lightweight and low-capacity models by learning from more complex models. We also propose a novel two-level learning strategy called School Learning, which accomplishes the FSC-MC task by mimicking the learning process of real school life. In this new learning paradigm, the first level performs preview learning, in which each student is equipped with a preparer to perform self-learning on the base set. The second level is team learning, consisting of a complex teacher network and several lightweight student networks organized into a team. One student network is randomly chosen as the leader network, while the remaining student networks serve as member networks. The leader network simultaneously learns knowledge from the teacher network and all member networks. Conversely, each member network receives knowledge from both the teacher network and the leader network. Ultimately, the leader network is deployed for FSC evaluation, resulting in effective model compression. Extensive experiments in the FSC-MC setting demonstrate that School Learning outperforms 17 state-of-the-art knowledge distillation methods, including both offline and online methods, enabling lightweight models to achieve outstanding FSC performance.

Abstract:
Generative Adversarial Network (GAN)-based image cartoonization has made great progress. Such methods usually use a “single-encoding adversarial feedback architecture” to generate a cartoon image in a similar cartoon-style domain. However, this architecture cannot generate a satisfactory cartoonized image with both high style similarity and visual fidelity. In this work, to relieve this problem, we propose a novel dual-encoding matching adversarial learning method, dubbed DEMAL, for image cartoonization. In particular, we first design a dual-encoding matching (DEM) scheme that uses a pair of dual encoders and a statistical matching module (SM) to match the separately extracted content and style feature encodings in the statistical space. We then construct double-structure style discriminators to adversarially learn global and local feature representations of the cartoon style via an improved loss function. Furthermore, we propose a pre-training strategy for DEMAL to achieve the best FID and ArtFID distance. Extensive experiments demonstrate that our proposed DEMAL achieves high visual fidelity and style similarity compared to previous representative baseline cartoonization methods. Code is available at https://github.com/ZYDeeplearning/DEMAL-Model.

Abstract:
Most RGB-T trackers rely heavily on bottom-up attention and thus overlook top-down cross-modal guidance for learning target features. Consequently, the discriminative power of the learnt target features is weak. To address this issue, we propose a novel RGB-T tracker (called TGTrack) that uses a Top-down Cross-modal Guidance mechanism to learn target features in two stages. In the first stage, TGTrack generates top-down cross-modal guidance signals with multi-modal encoder-decoders and prior vectors. In the second stage, these signals are transmitted and integrated by the attention layers of the cross-modal encoders to improve the discriminative power of the target features. Moreover, we introduce an Attention-Driven Spatio-Temporal Updater for updating discriminative target features. Through cross-frame attention guidance, it can effectively eliminate irrelevant features within the search region. As a result, TGTrack avoids complex multi-modal fusion modules and still achieves robust RGB-T tracking. Extensive experiments on three popular RGB-T tracking benchmarks (i.e., LasHeR, RGBT234, and RGBT210) demonstrate that TGTrack achieves new state-of-the-art performance.

Abstract:
Accurate polyp segmentation is crucial for precise diagnosis and prevention of colorectal cancer. However, precise polyp segmentation still faces challenges, mainly due to the similarity of polyps to their surroundings in terms of color, shape, texture, and other aspects, making it difficult to learn accurate semantics. To address this issue, we propose a novel semantic enhanced perceptual network (SEPNet) for polyp segmentation, which enhances polyp semantics to guide the exploration of polyp features. Specifically, we propose the Polyp Semantic Enhancement (PSE) module, which utilizes a coarse segmentation map as a basis and selects kernels to extract semantic information from corresponding regions, thereby enhancing the discriminability of polyp features highly similar to the background. Furthermore, we design a plug-and-play semantic guidance structure for the PSE, leveraging accurate semantic information to guide scale perception and context fusion, thereby enhancing feature discriminability. Additionally, we propose a Multi-scale Adaptive Perception (MAP) module, which enhances the flexibility of receptive fields by increasing the interaction of information between neighboring receptive field branches and dynamically adjusting the size of the perception domain based on the contribution of each scale branch. Finally, we construct the Contextual Representation Calibration (CRC) module, which calibrates contextual representations by introducing an additional branch network to supplement details. Extensive experiments demonstrate that SEPNet outperforms 15 SOTA methods on five challenging datasets across six standard metrics.

Abstract:
Various network compression methods, such as pruning and quantization, have been proposed to synergistically reduce resource requirements. However, existing joint compression works are based on black-box optimization and do not interpret the interaction mechanism between these two compression techniques, leading to a slow and unstable convergence of compression strategy. To address this issue, we present Markov-PQ, the first interpretable pruning-quantization co-compression framework using a Markov Chain. In Markov-PQ, the joint strategy search is modeled as a Markov Chain and decoupled with Bayes Rule into pruning and quantization strategy searching. Specifically, the quantization state accounts for the co-compression state from the last time and is updated by a learnable transition probability matrix. To ensure differentiability, we design a forward-hard and backward-soft quantization. The pruning state is influenced not only by the last co-compression state but also by the concurrent quantization state. In addition, to perceive the current layer-wise bit sensitivity and alleviate the long-tail problem, a complexity-aware regularizer is devised to re-evaluate the filter importance. Extensive experiments demonstrate the superiority of Markov-PQ. For example, with an accuracy loss of only 0.33%, we can achieve a 56.12× acceleration for ResNet-18 on ImageNet2012.
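The abstract does not spell out the forward-hard, backward-soft quantizer, so the following Python is only a sketch of one common way to realize that idea, in the style of a straight-through estimator: the forward pass rounds to a uniform grid, while the backward pass uses the derivative of a smooth staircase surrogate. The sigmoid surrogate, the temperature, the step size, and all function names are assumptions made for illustration and are not taken from Markov-PQ.

```python
import math


def quantize_hard(x, step):
    """Forward pass: snap x to the nearest multiple of step."""
    return step * math.floor(x / step + 0.5)


def quantize_soft_grad(x, step, temperature=4.0):
    """Backward pass: derivative of a smooth staircase surrogate at x.

    The surrogate replaces the jump inside each bin with a sigmoid,
    so its derivative is non-zero everywhere (hypothetical choice).
    """
    frac = x / step - math.floor(x / step)  # position inside the bin, in [0, 1)
    s = 1.0 / (1.0 + math.exp(-temperature * (frac - 0.5)))
    return temperature * s * (1.0 - s)


def quantize_with_grad(x, upstream_grad, step=0.25):
    """Return the hard-quantized value and the soft gradient passed upstream."""
    return quantize_hard(x, step), upstream_grad * quantize_soft_grad(x, step)
```

With this split, the network sees exactly quantized values at inference and during the forward pass, while the optimizer still receives a usable gradient.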

Abstract:
Visual cryptography (VC) schemes provide a distinctive image encryption technique for protecting image security, since the secret image can be decrypted visually by superimposing the encrypted shadows. VC schemes for both threshold access structures and general access structures are generally constructed based on the OR operation to minimize the pixel expansion. However, VC schemes with optimal pixel expansion typically have low contrast. The OR stacking operation frequently produces recovered images with poor visual quality and is never able to deliver flawless recovery of secret images. Therefore, we study XOR-based VC (XVC) schemes that employ linear programming to maximize their contrast. Three schemes for general access structures and three schemes for threshold access structures are designed. The construction of the proposed schemes is reduced to a linear programming problem that maximizes the contrast by determining the ideal combinations of basis matrices in terms of primary column matrices and unit matrices, respectively. The comparison study and experimental results demonstrate that the contrast of previous VC and XVC schemes can be further improved.
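The difference between OR stacking and XOR recovery can be seen in a minimal (2, 2) sketch. The Python below is an illustration only and is not one of the proposed schemes: it uses a plain random-share construction without pixel expansion, and the function names are hypothetical.

```python
import random


def make_xor_shares(secret, rng):
    """Split a binary image (rows of 0/1) into two shares whose XOR is the secret."""
    share1 = [[rng.randint(0, 1) for _ in row] for row in secret]
    share2 = [[s ^ p for s, p in zip(share_row, secret_row)]
              for share_row, secret_row in zip(share1, secret)]
    return share1, share2


def stack(a, b, op):
    """Combine two shares pixel by pixel with the given binary operation."""
    return [[op(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


secret = [[0, 1, 1, 0], [1, 0, 0, 1]]
share1, share2 = make_xor_shares(secret, random.Random(0))
xor_recovered = stack(share1, share2, lambda x, y: x ^ y)  # equals the secret
or_recovered = stack(share1, share2, lambda x, y: x | y)   # may darken white pixels
```

XOR of the two shares reproduces the secret exactly, whereas stacking the same shares with OR can only turn white (0) pixels black (1), so it is not guaranteed to reproduce the secret.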

Abstract:
Existing unsupervised Multi-View Stereo (MVS) methods generally construct supervision on the basis of the photometric consistency loss, which suffers from unreliable supervision and limited scalability. In this paper, a novel unsupervised MVS framework with Self-constructed Stereo Correspondences, termed SSC-MVS, is proposed to provide reliable supervision for the network and improve scalability of unsupervised MVS. Specifically, a pseudo depth-based learning strategy is first presented to supervise the MVS network with a pseudo depth, which is used to characterize the accurate stereo correspondences. Additionally, a consistency-based training mechanism is designed, where the depth consistency between two differently-augmented inputs is constrained to further improve the robustness of the network in real MVS scenes. Experimental results on widely-used MVS datasets demonstrate that the proposed SSC-MVS obtains the state-of-the-art performance among the unsupervised methods and has the potential to outperform the fully-supervised methods. The code is available at https://github.com/jzhu98/ssc-mvs.

Abstract:
Hyperspectral imaging plays a pivotal role across diverse applications, like remote sensing, medicine, and cytology. The utilization of 2D sensors to acquire 3D hyperspectral images (HSIs) via a coded aperture snapshot spectral imaging (CASSI) system has proven successful, owing to its hardware-friendly implementation and fast sampling speed. Nevertheless, for less spectrally sparse scenes, the use of a single snapshot and unreasonable coded aperture design limits the efficacy of CASSI systems and renders HSI reconstruction more ill-posed, leading to compromised spatial and spectral fidelity. This paper proposes a novel Progressive Content-Aware CASSI (PCA-CASSI) framework, which progressively captures HSIs using multiple optimized content-aware coded apertures and fuses all snapshot measurements for reconstruction. By unlocking the full potential of CASSI systems and elevating their performance ceilings, this framework offers researchers new avenues for improving imaging quality. Furthermore, we develop the RndHRNet, a Range-Null space Decomposition (RND)-inspired deep unfolding network with multiple iterative phases for HSI recovery. Each unfolded recovery phase efficiently exploits the physical information within the coded apertures via explicit RND and adaptively explores the spatial-spectral correlation by dual transformer blocks. Through comprehensive experiments, our approach demonstrates superior performance compared to existing state-of-the-art methods in both the multiple- and single-shot compressive HSI imaging tasks with substantial improvements. Code is available at https://github.com/xuanyuzhang21/PCA-CASSI.

Abstract:
A major challenge of the video inpainting task is aggregating spatial and temporal information in the corrupted video effectively. In this paper, we propose a dynamic graph memory bank to address this challenge. To model the long-range temporal dependency, a memory bank is built and updated dynamically with the input visual information flow. The relationships among the memory items are modeled through graph-based message propagation. Thanks to the dynamic graph memory bank, both the contents of the corrupted video and their relationships are well exploited as the inpainting process proceeds. However, spatial misalignment across different frames may degrade the quality of features in the dynamic graph memory bank. To alleviate this issue, we propose a motion-guided feature alignment module. The proposed module cooperates with the dynamic graph memory bank to improve the network’s information aggregation ability in the spatial and temporal dimensions. Extensive experiments on the YouTube-VOS and DAVIS datasets demonstrate the superiority of our approach over state-of-the-art methods.

Abstract:
Capturing displayed images with portable cameras has become common among multimedia pirates, creating an urgent need for camera-shooting resilient watermarking schemes. In this paper, we consider pirates who record only parts of images, and propose a robust watermarking scheme at the image instance level. This scheme consists of an encoding end, a noise layer, and a decoding end. The encoding end first selects specific watermarking regions associated with segmented image instances. Afterwards, an encoder is employed to embed watermark sequences into the RGB color model of these watermarking regions. Finally, templates are embedded to produce the final watermarked images. Specifically, our template-based resynchronization comprises a template embedding module at the encoding end and a geometric correction module at the decoding end. The former embeds templates by a correlation-aware multiplicative spread spectrum with an adaptive amplitude, while the latter learns a calibrator to estimate the perspective projection. Experiments in both simulated and real-world scenarios show that the proposed scheme effectively resists camera-shooting attacks under various shooting conditions, regardless of whether the entire displayed image has been captured.

Abstract:
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain. However, the limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance. To address this, data perturbation (augmentation) has emerged as a crucial method to increase data diversity. Nevertheless, existing perturbation methods often focus on either image-level or feature-level perturbations independently, neglecting their synergistic effects. To overcome these limitations, we propose CPerb, a simple yet effective cross-perturbation method. Specifically, CPerb utilizes both horizontal and vertical operations. Horizontally, it applies image-level and feature-level perturbations to enhance the diversity of the training data, mitigating the issue of limited diversity in single-source domains. Vertically, it introduces multi-route perturbation to learn domain-invariant features from different perspectives of samples with the same semantic category, thereby enhancing the generalization capability of the model. Additionally, we propose MixPatch, a novel feature-level perturbation method that exploits local image style information to further diversify the training data. Extensive experiments on various benchmark datasets validate the effectiveness of our method.

Abstract:
Continual learning strives to acquire knowledge across sequential tasks without forgetting previously assimilated knowledge. Current state-of-the-art methodologies utilize dynamic architectural strategies to increase the network capacity for new tasks. However, these approaches often suffer from a rapid growth in the number of parameters. While some methods introduce an additional network compression stage to address this, they tend to construct complex and hyperparameter-sensitive systems. In this work, we introduce a novel solution to this challenge by proposing the Memory-Boosted transformer (MoBoo) in place of conventional architecture expansion and compression. Specifically, we design a memory-augmented attention mechanism by establishing a memory bank where the “key” and “value” linear projections are stored. This memory integration prompts the model to leverage previously learned knowledge, thereby enhancing stability during training at a marginal cost. The memory bank is lightweight and can be easily managed with a straightforward queue. Moreover, to increase the model’s plasticity, we design a memory-attentive aggregator, which leverages the cross-attention mechanism to adaptively summarize the image representation from the encoder output, which incorporates historical knowledge. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our method. For example, on ImageNet-100 under 10 tasks, our method outperforms the current state-of-the-art methods by +3.74% in average accuracy while using fewer parameters.
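A queue-managed memory bank feeding an attention layer can be sketched as follows. This Python is a minimal illustration, assuming the bank holds projected key and value vectors from earlier tasks and that they are simply appended to the current keys and values. The class and function names, the single-query formulation, and the FIFO eviction policy are assumptions and do not come from MoBoo.

```python
import math
from collections import deque


class KVMemoryBank:
    """Fixed-capacity FIFO store of (key, value) vectors."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first

    def push(self, key, value):
        self.items.append((key, value))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def memory_augmented_attend(query, keys, values, bank):
    """Attend over the current keys/values plus everything stored in the bank."""
    mem_keys = [k for k, _ in bank.items]
    mem_values = [v for _, v in bank.items]
    return attend(query, keys + mem_keys, values + mem_values)
```

Because the bank only stores vectors and evicts the oldest ones, its cost stays bounded by the chosen capacity rather than growing with the number of tasks.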

Abstract:
Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, remains a challenge. In response, this work introduces a novel method leveraging video coding, particularly by harnessing inherent motion vectors (MVs) present in videos. We present CodingHomo, an unsupervised framework for homography estimation. Our framework features a Mask-Guided Fusion (MGF) module that identifies and utilizes beneficial features among the MVs, thereby enhancing the accuracy of homography prediction. Additionally, the Mask-Guided Homography Estimation (MGHE) module is presented for eliminating undesired features in the coarse-to-fine homography refinement process. CodingHomo outperforms existing state-of-the-art unsupervised methods, delivering good robustness and generalizability. The code and dataset are available at: https://github.com/liuyike422/CodingHomo.

Abstract:
Neural Volume Rendering (NVR) has advanced explosively since the advent of Neural Radiance Field (NeRF), a technique for novel view synthesis of complex scenes based on a finite set of input views. Existing ray casting-based NVR approaches process rays concurrently to leverage parallelism but fail to consider the impact on cache locality, which ultimately undermines the efficiency of the corresponding dedicated hardware accelerator designs. We further observe that there is a spatial correspondence between features and voxels in NVR that can be exploited by processing in voxel order rather than ray order. This paper introduces a novel approach that carefully reorders the execution of rays, ensuring that rays with similar memory access patterns are processed in parallel, thereby enhancing cache locality. On this basis, we also propose an efficient backend architecture and a corresponding memory subsystem, facilitating accurate data prefetching to hide off-chip memory latency. To validate the proposed architecture, we implement our design in Verilog HDL and evaluate its performance by post-synthesis simulation with real scene data. The evaluation results demonstrate that our design markedly enhances the efficiency of NVR processing, achieving a considerable speedup (1.62×) compared to the state-of-the-art NVR accelerator, while requiring significantly less silicon area (5.12×) and power (32.79×).
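The idea of processing in voxel order rather than ray order can be illustrated with a small software sketch. The Python below groups rays by the voxel that contains one sample point per ray, so that rays touching the same voxel become adjacent in the schedule. It is a simplified assumption about how such a reordering might look: it is not the paper's hardware scheduler, and the function names and the single-sample-per-ray setup are hypothetical.

```python
from collections import defaultdict


def voxel_of(point, voxel_size):
    """Integer voxel coordinates of a 3D point on a uniform grid."""
    return tuple(int(c // voxel_size) for c in point)


def reorder_rays_by_voxel(sample_points, voxel_size):
    """Return ray indices ordered so rays sampling the same voxel are adjacent.

    sample_points[i] is one sample position along ray i.
    """
    buckets = defaultdict(list)
    for ray_id, point in enumerate(sample_points):
        buckets[voxel_of(point, voxel_size)].append(ray_id)
    order = []
    for voxel in sorted(buckets):  # visit voxels in a fixed, deterministic order
        order.extend(buckets[voxel])
    return order
```

For example, with a voxel size of 1.0, rays sampling at (0.1, 0.1, 0.1) and (0.4, 0.9, 0.2) fall into the same voxel and are scheduled back to back, so the features of that voxel need to be fetched only once for both.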

Abstract:
Existing one-stage object detectors are commonly implemented in a multi-task learning based manner, which simultaneously solves two different sub-tasks: object classification and localization. To achieve this, the detection heads with two independent branches are typically utilized to extract specific image features for each task separately. However, due to the lack of interaction between the parallel branches, the difference in learning objectives of classification and localization will lead to spatial misalignment between the predictions of these two tasks. In this work, we propose a novel Cross-attentive Task-aligned Object Detection (CTOD) method to handle this problem by explicitly promoting the prediction consistency for both tasks. Specifically, we first design a Dual Task Interaction (DTI) module, which generates task-interactive embeddings for each branch from task-specific features by using a task cross-attention mechanism. Then based on these embeddings, we propose a Spatial Feature Aggregation (SFA) module that calculates offsets and weights to aggregate information from nearby feature points at each spatial location of the task-specific feature maps. Meanwhile, we also generate adjustment parameters from the task-interactive embeddings to finally align the prediction results of the two tasks obtained from the enhanced task-specific features described above. Extensive experiments are conducted on the MS-COCO dataset. When using ResNeXt-101-64×4d-DCN as the backbone, our CTOD method achieves a detection result of 51.8 AP with single-model and single-scale testing, outperforming the recently proposed one-stage detectors ATSS, VFNet, LD and TOOD by 4.1, 1.9, 1.3 and 0.7 AP, respectively. The analysis of qualitative results also illustrates the effectiveness and superiority of CTOD in solving the task misalignment problem for object detection. Our code is available at https://github.com/Mr-Bigworth/CTOD.

Abstract:
Visual anomaly detection aims at classifying and locating the regions that deviate from the normal appearance. Embedding-based methods and reconstruction-based methods are the two main approaches for this task. Embedding-based methods typically predict the anomaly by measuring the distances between the deep representations of the test samples and a limited number of nominal samples, which makes these methods efficient but leaves them struggling to provide a fine-grained pixel-level anomaly location. Reconstruction-based methods rely on pixel-level reconstruction errors to locate the anomaly, so their anomaly predictions are fine-grained. However, they involve repeated feature extraction and usually extra modules to guarantee the quality of the reconstructed images, resulting in unsatisfactory detection efficiency. In a nutshell, prior methods are either not efficient enough or not precise enough for industrial detection. To deal with this problem, we propose POUTA (Produce Once Utilize Twice for Anomaly detection), which improves both accuracy and efficiency by reusing the discriminative information already present in the reconstructive network. We observe that the encoder and decoder representations of the reconstructive network can stand for the features of the original and reconstructed image, respectively. Moreover, the discrepancies between the symmetric reconstructive representations provide roughly accurate anomaly information. To refine this information, a coarse-to-fine process is proposed in POUTA, which calibrates the semantics of each discriminative layer by the high-level representations and supervision loss. Equipped with the above modules, POUTA is able to provide a more precise anomaly location than prior art. In addition, reusing the representations removes the need for a feature extraction process in the discriminative network, which reduces the parameters and improves the efficiency. Extensive experiments show that POUTA is superior or comparable to prior methods at even lower cost. Furthermore, POUTA also achieves better performance than the state-of-the-art few-shot anomaly detection methods without any special design, showing that POUTA has a strong ability to learn representations inherent in the training data.

Abstract:
In the field of Joint Photographic Experts Group (JPEG) reversible data hiding (RDH), the weak correlation between adjacent alternating current (AC) coefficients in a JPEG image prevents existing JPEG RDH methods from effectively finding the extension coefficients with high embedding efficiency and prioritizing them for carrying message bits. In this paper, a new convolutional neural network (CNN)-based JPEG RDH scheme is proposed. First, the Laplacian distribution model is applied to roughly pre-estimate the expansion probability of the AC coefficients. Then, the approximate pre-estimated expansion probability and the actual expansion probability of the AC coefficients are used to train the carefully designed CNN-based estimation model, and the embedding efficiency of each AC coefficient is calculated from the output of the CNN model. In the embedding stage, a new adaptive embedding strategy called the coefficient selection strategy is proposed, which is more efficient than previously proposed selection strategies based on block selection and frequency selection. Finally, AC coefficients with greater embedding efficiency are used preferentially for data hiding. Extensive experimental results demonstrate the effectiveness of our proposed CNN-based method compared with state-of-the-art JPEG RDH methods.

Abstract:
Cross-view geo-localization aims to match images of the same target from different platforms, e.g., drone and satellite. It is a challenging task due to the changing appearance of targets and environmental content from different views. Most methods focus on obtaining more comprehensive information through feature map segmentation, which inevitably destroys the image structure, and they are sensitive to the shifting and scale of the target in the query. To address the above issues, we introduce a simple yet effective part-based representation learning method, shifting-dense partition learning (SDPL). We propose a dense partition strategy (DPS), which divides the image into multiple parts to explore contextual information while explicitly maintaining the global structure. To handle scenarios with non-centered targets, we further propose the shifting-fusion strategy, which generates multiple sets of parts in parallel based on various segmentation centers and then adaptively fuses all features to integrate their anti-offset ability. Extensive experiments show that SDPL is robust to position shifting and performs competitively on two prevailing benchmarks, University-1652 and SUES-200. In addition, SDPL shows satisfactory compatibility with a variety of backbone networks (e.g., ResNet and Swin). Code is available at https://github.com/C-water/SDPL_release.

Abstract:
Although few-shot learning aims to address data scarcity, it still requires large, annotated datasets for training, which are often unavailable due to cost and privacy concerns. Previous studies have utilized pre-trained diffusion models either to synthesize auxiliary data besides limited labeled samples or as zero-shot classifiers. However, they are limited to conditional diffusion models that need class prior information (e.g., carefully crafted text prompts) about unseen tasks. To overcome this, we leverage unconditional diffusion models, which need no class information, to train a meta-model capable of generalizing to unseen tasks. The framework contains (1) a meta-learning without data approach that uses synthetic data during training; and (2) a diffusion model-based data augmentation to calibrate the distribution shift during testing. During meta-training, we implement a self-taught class-learner to gradually capture class concepts, guiding unconditional diffusion models to generate a labeled pseudo dataset. This pseudo dataset is then used to jointly train the class-learner and the meta-model, allowing for iterative refinement and clear differentiation between classes. During meta-testing, we introduce a data augmentation that employs the diffusion models used in meta-training to narrow the gap between the meta-training and meta-testing task distributions. This enables the meta-model trained on synthetic images to effectively classify real images in unseen tasks. Comprehensive experiments showcase the superiority and adaptability of our approach in four real-world scenarios. Code is available at https://github.com/WalkerWorldPeace/MLWDUDM.

Abstract:
Generating captions that are more diverse and closer to human language is of paramount importance in image captioning. Recent research has achieved significant advancements, with the majority adopting end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of their model structures, the simplicity or complexity of feature-text fusion, and the uniformity of training objectives have all to some extent affected the diversity and effectiveness of caption generation, thus limiting the potential applications of this task. Therefore, in this paper, we propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning to better integrate information across and within modalities, while also approaching human-like fine-grained semantic perception and relationship reasoning capabilities. Initially, our RCMF preprocesses images using a Swin-Transformer and then an extended encoder with a new intra-modal fusion module, utilizing window-focused linear attention to capture features and leveraging refined grid and global visual features. By combining text features, RCMF employs a cross-modal fusion module and decoder to deeply model the interaction between text and image. Additionally, RCMF is the first to introduce an additional regulatory modal fusion reasoning (MFR) branch, which surpasses the above architectures. Its MFR loss combined with cross-entropy loss forms a new training objective strategy, effectively mining fine-grained relationships between images and text, perceiving the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results based on the MS COCO 2014 dataset, particularly under the same experimental conditions, demonstrate the outstanding performance of our method, especially in terms of METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further confirm the effectiveness of our RCMF method. Source code is available at https://github.com/200084/RCMF-for-image-caption.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task aimed at retrieving the same person across cameras, time, and modalities. Existing methods usually employ dual-stream networks with integrated constraints, or compensate for modality information to reduce the significant modality discrepancies among heterogeneous images. However, the effectiveness of designed constraints is often limited due to substantial cross-modality differences, and methods that compensate for modality information may introduce noise and additional computational cost. In this paper, we propose a novel Multi-Stage Auxiliary Learning strategy called MSALNet. Specifically, in our approach, the training process is divided into two stages: 1) training with auxiliary modality pairs obtained from grayscale histogram equalization, and 2) training with visible and infrared image pairs to gradually extract more discriminative modality-shared features. We propose the Heterogeneous Feature Compensation Learning (HFCL) module for information compensation and fusion between visible and infrared features, generating auxiliary branches to learn more cross-modality-related information. Additionally, we propose the Modality Similarity Reinforcement (MSR) module to improve the consistency of cross-modality feature representation by suppressing interference information and leveraging pixel similarity probability distribution as supervisory information. Lastly, we design the Distance Center Alignment (DCA) loss to reduce intra-class variations within and between modalities, enhancing the distinguishability among different identities. Experimental results on two mainstream VI-ReID datasets demonstrate that MSALNet outperforms most existing methods while effectively reducing computational cost.
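
As a rough illustration of the auxiliary modality used in the first training stage, the sketch below converts an RGB image to grayscale and applies standard histogram equalization. The luminance weights, the 8-bit value range, and the function names are assumptions made for illustration; the abstract does not specify how MSALNet implements this step.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to 8-bit grayscale (BT.601 weights, assumed)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb_image]


def equalize_histogram(gray, levels=256):
    """Standard histogram equalization of a 2-D list of ints in [0, levels - 1]."""
    hist = [0] * levels
    for row in gray:
        for v in row:
            hist[v] += 1
    # Cumulative distribution of intensity values.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in gray]
    scale = (levels - 1) / (total - cdf_min)
    return [[int(round((cdf[v] - cdf_min) * scale)) for v in row] for row in gray]
```

Applying `equalize_histogram(to_gray(image))` to a visible image would give one plausible auxiliary sample; how such samples are paired for training is left to the paper.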

Abstract:
Many RGBT tracking studies primarily focus on modal fusion design, while overlooking the effective handling of target appearance changes. While some approaches introduce historical frames, or fuse and replace initial templates, to incorporate temporal information, they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in a Transformer to handle target appearance changes, for robust RGBT tracking. We introduce independent dynamic template tokens to interact with the search region, embedding temporal information to address appearance changes, while also retaining the involvement of the initial static template tokens in the joint feature extraction process. This preserves the original reliable target appearance information and prevents the deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of multimodal template tokens by incorporating supplementary modal cues, and make the multimodal search region tokens interact with multimodal dynamic template tokens via attention mechanisms, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms while running at 39.1 FPS. The project-related materials are available at: https://github.com/yinghaidada/STMT.

Abstract:
Motion Compensated Temporal Filter (MCTF) has been repeatedly proven to be an effective pre-processing tool that improves coding performance. The philosophy is that smoothing with a temporal filter reduces the noise of the to-be-coded image, thereby shrinking the prediction residuals and improving the rate-distortion (RD) performance. While abundant efforts have been devoted to the design of the MCTF filter weights, how motion vector variance and texture complexity influence MCTF has been relatively under-explored. In this work, we propose an enhanced MCTF method (EMCTF) based on multi-hypothesis reference, motion vector variance, and texture complexity. We take initial steps towards the incorporation of motion vector variance and texture complexity in the filtering weights design. Motion compensation blocks based on multi-hypothesis reference can be efficiently aggregated to obtain the final inference. The proposed method is implemented on top of the Versatile Video Encoder (VVenC). Experimental results show that for the faster, fast, medium, and slow presets, the proposed EMCTF achieves 0.85%, 0.91%, 0.85% and 0.82% Bjøntegaard delta rate (BD-rate) savings, respectively. Moreover, the EMCTF introduces little additional encoding complexity, facilitating its future applications in real-world scenarios.
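
The basic smoothing step can be sketched as a normalized weighted average of a block and its motion-compensated references. This is only a generic temporal-filter form under stated assumptions: the function name is hypothetical, blocks are flat lists of sample values, and the weights are placeholders, whereas EMCTF derives its weights from motion vector variance and texture complexity in a way the abstract does not detail.

```python
def temporal_filter(current_block, reference_blocks, weights):
    """Blend a block with its motion-compensated references.

    Per sample: filtered = (current + sum_i w_i * ref_i) / (1 + sum_i w_i).
    """
    if len(reference_blocks) != len(weights):
        raise ValueError("one weight per reference block is required")
    norm = 1.0 + sum(weights)
    filtered = []
    for idx, cur in enumerate(current_block):
        # Accumulate the weighted reference samples co-located with this sample.
        acc = cur + sum(w * ref[idx] for w, ref in zip(weights, reference_blocks))
        filtered.append(acc / norm)
    return filtered
```

With all weights set to zero the block is returned unchanged, so lowering a weight for an unreliable reference limits how much that reference can alter the source.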

Abstract:
Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and then use them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We then decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for those rare classes in a free-form manner, which is not restricted to a rigid form of a verb and a noun. The proposed FFCN can directly generate new training data samples for rare classes, and hence significantly improves action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.

Abstract:
Portrait stylization is a long-standing task enabling extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as metaverse and games often demand 3D content. On the other hand, the requirement of 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge 2D and 3D domains with 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only texture, shape, or local part of the portrait. While the use of 3D-aware GANs bypasses the requirement of 3D data, we further alleviate the necessity of style images with the CLIP model being the style guidance. We conduct an extensive set of experiments across the style, attribute, and shape, and meanwhile, measure the 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes. Our project page: https://windlikestone.github.io/HyperStyle3D-website/.

Abstract:
Multi-label image classification is an essential yet challenging task that requires recognizing multiple objects in an image. To this end, recent studies have sought to acquire visual representations for each label by attention models, and then train binary classifiers for prediction. However, these methods have two major drawbacks: 1) They rely heavily on precise alignment between the two modalities, which is still challenging for current attention models; 2) They ignore patch-level representations rich in local object features, which are also of great importance for label recognition. In this paper, we propose a semantic-guided representation enhancement framework, which augments patch-level representations with object-level representations for robust label recognition. Concretely, the proposed framework consists of two significant components: 1) an inter-modal attention module that coarsely locates object regions and produces object-level representations for each label; 2) an intra-modal attention module that aggregates object representations to enhance patch representations based on their correlations. In this way, both local clues and global glances of objects are fully exploited simultaneously, rather than relying solely on object-level representations obtained by the inter-modal attention, thus improving the performance of label recognition. Experimental results show that our framework outperforms the state-of-the-art methods by 0.5%, 0.6%, 0.7% and 0.8% in mAP on Pascal VOC 2007, Microsoft COCO, NUS-WIDE and Visual Genome datasets, respectively. Codes and models are available at https://github.com/jasonseu/SGRE.

Abstract:
Optimal transport (OT) studies the most economical transformation of one probability measure into another, attracting attention across diverse fields and inspiring various OT-solving algorithms. However, adjusting the probability measure according to specific application requirements, such as achieving unbiased generated images or generating images with specific attributes, necessitates recalculating the OT mapping. This process may result in inefficiency and limited usage flexibility of existing algorithms. To address this, we propose a measure-driven neural solver for OT, the key of which is to construct a network module to learn Brenier’s height representation, and then compute the gradient of Brenier’s potential to derive the OT mapping. Our algorithm has two main advantages: i) It enables direct calculation or fine-tuning of the OT mapping when the target sample measure changes, enhancing efficiency. ii) For unbiased image generation or attribute-specific face generation, adjusting the posterior probability measure of the latent space in the pre-trained model suffices, without the need for additional auxiliary components, which highlights the flexibility of our algorithm. Extensive experiments demonstrate the excellent performance of our algorithm in debiased generation and controllable generation, as well as its flexibility and efficiency. In addition, both generation modes can enhance the classification performance of minority groups.
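
To make the role of the height vector concrete, the sketch below uses the standard semi-discrete form of Brenier's potential, u_h(x) = max_i (<x, y_i> + h_i), whose gradient at x is the target sample y_i that attains the maximum. It assumes the heights are already known; the function names and toy data are hypothetical, and this is not the authors' network module.

```python
def brenier_potential(x, targets, heights):
    """Evaluate u_h(x) = max_i (<x, y_i> + h_i); return (value, argmax index)."""
    best_value, best_index = None, None
    for i, (y, h) in enumerate(zip(targets, heights)):
        value = sum(xk * yk for xk, yk in zip(x, y)) + h
        if best_value is None or value > best_value:
            best_value, best_index = value, i
    return best_value, best_index


def ot_map(x, targets, heights):
    """Gradient of the piecewise-linear potential: the target whose plane is on top at x."""
    _, index = brenier_potential(x, targets, heights)
    return targets[index]
```

Raising one height enlarges the region of source points mapped to the corresponding target, which is how the height vector controls the mass each target sample receives.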

Abstract:
Person Re-identification (Re-ID) is a crucial technique for public security and has made significant progress in supervised settings. However, the cross-domain (i.e., domain generalization) scenario presents a challenge in Re-ID tasks due to unseen test domains and domain shift between the training and test sets. To tackle this challenge, most existing methods aim to learn domain-invariant or robust features for all domains. In this paper, we observe that the data-distribution gap between the training and test sets is smaller in the sample-pair space than in the sample-instance space. Based on this observation, we propose a Generalizable Metric Network (GMN) to further explore sample similarity in the sample-pair space. Specifically, we add a Metric Network (M-Net) after the main network and train it on positive and negative sample-pair features; the trained M-Net is then employed during the test stage. Additionally, we introduce the Dropout-based Perturbation (DP) module to enhance the generalization capability of the metric network by enriching the sample-pair diversity. Moreover, we develop a Pair-Identity Center (PIC) loss to enhance the model’s discrimination by ensuring that sample-pair features with the same pair-identity are consistent. We validate the effectiveness of our proposed method through extensive experiments on multiple benchmark datasets and confirm the value of each module in our GMN.
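
The sample-pair space can be illustrated with a small sketch that turns instance features into pair features labeled positive (same identity) or negative. The element-wise absolute difference used as the pair feature and the function name are assumptions for illustration; the abstract does not state how GMN builds its pair features.

```python
from itertools import combinations


def build_pair_features(features, identities):
    """Return (pair_feature, label) tuples for every unordered pair of samples.

    pair_feature is the element-wise absolute difference (an assumed choice);
    label is 1 when both samples share an identity and 0 otherwise.
    """
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        diff = [abs(a - b) for a, b in zip(features[i], features[j])]
        label = 1 if identities[i] == identities[j] else 0
        pairs.append((diff, label))
    return pairs
```

A binary classifier trained on such pairs never sees identity labels directly, only whether two samples match, which is one way to read the claim that the pair space transfers better to unseen domains.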

Abstract:
Despite the impressive performance of Multi-view Stereo (MVS) approaches given plenty of training samples, the performance degradation when generalizing to unseen domains has not been clearly explored yet. In this work, we focus on the domain generalization problem in MVS. To evaluate the generalization results, we build a novel MVS domain generalization benchmark including synthetic and real-world datasets. In contrast to conventional domain generalization benchmarks, we consider a more realistic but challenging scenario, where only one source domain is available for training. The MVS problem can be analogized back to the feature matching task, and maintaining robust feature consistency among views is an important factor for improving generalization performance. To address the domain generalization problem in MVS, we propose a novel MVS framework, namely RobustMVS. A Depth-Clustering-guided Whitening (DCW) loss is further introduced to preserve the feature consistency among different views, which decorrelates multi-view features from viewpoint-specific style information based on geometric priors from depth maps. The experimental results further show that our method achieves superior performance on the domain generalization benchmark.

Abstract:
Pedestrian safety is a major concern for deploying autonomous vehicles in urban environments. Accidents involving pedestrians are more severe, sometimes causing serious injuries and fatalities. It is challenging to predict whether a pedestrian will cross the road since they can move in any direction and change motion suddenly. The inherent uncertainty in pedestrian motion has been addressed with probabilistic models in previous works. However, these models are too computationally expensive for real-time predictions. In this paper, we propose a novel reinforcement learning (RL) framework which produces soft labels for the training dataset in order to address the observed data uncertainty. We formulate novel state representations incorporating predictive uncertainty to learn more informative soft labels that improve the model performance and reliability. Finally, we validate the proof of concept with two benchmark datasets and show with extensive experiments on competitive prediction models that our method (even using fewer input modalities) significantly improves the accuracy and F1 score by up to 12% and 13%, respectively. We also show that soft labeling as a form of regularization increases model reliability: the model is more accurate when its confidence is high and signals its limitations through low confidence.
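
For readers unfamiliar with soft labels, the sketch below computes cross-entropy against a probability-valued target instead of a one-hot label. The soft label values in any usage are made up; in the paper they are produced by the RL framework, which is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, soft_label):
    """Cross-entropy -sum_k q_k * log p_k with a soft target distribution q."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(soft_label, probs) if q > 0)
```

A target such as `[0.8, 0.2]` for (crossing, not crossing) penalizes over-confident predictions on ambiguous samples, which is the regularizing effect the abstract refers to.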

Affiliations: Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia; School of Information Science and Engineering, Shandong Normal University, Jinan, China; School of Electronic and Information Engineering, Tongji University, Shanghai, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology, Shenzhen, China

Abstract:
Unsupervised cross-modal hashing presents significant advantages in heterogeneous modality retrieval, offering label scalability, high retrieval efficiency, and low storage costs. However, the lack of explicit semantic supervision in this process results in a noticeable semantic deficit, impacting retrieval performance. In this paper, we address this challenge with a dual-pronged approach: Cross-Domain Transfer Hashing (CDTH), a lightweight weakly-supervised cross-modal hashing model. Our method leverages a semantically rich auxiliary domain to augment the target unsupervised cross-modal hash learning process. Simultaneously, we design a lightweight target cross-modal hashing network to reduce semantic requirements, lessening the burden of parameter optimization. Within the auxiliary domain, we perform direct semantic transfer with hashing network parameter transfer and indirect correlation semantic transfer by constructing an auxiliary semantic correlation graph with the identified cross-domain semantic consistent samples. In the target domain, we generate pseudo-labels using CLIP and establish a target weak semantic correlation graph. These two graphs collaborate to bolster the target cross-modal hashing training process. Extensive experiments on three publicly available datasets affirm the superiority of our approach in both retrieval accuracy and training efficiency. The source code for our method is accessible at: https://github.com/WangBowen7/CDTH.

Abstract:
Existing feature matching methods tend to extract feature descriptors by relying on visual appearance alone, leading to matches that are obviously false from the geometric perspective. This paper proposes ContextMatcher, which goes beyond the visual appearance representation by introducing the geometric context to guide the feature matching. Specifically, our ContextMatcher includes visual descriptor generation, the neighborhood consensus module, and the geometric context encoder. To learn visual descriptors, Transformers situated in different branches are leveraged to obtain feature descriptors. In one branch, convolutions are integrated into self-attention layers to compensate for the lack of local structure information. In another branch, a cross-scale Transformer is proposed by injecting heterogeneous receptive field sizes into tokens. To leverage and aggregate the geometric contextual information, a neighborhood consensus mechanism is proposed that re-ranks initial pixel-level matches to impose a geometric consensus constraint on neighborhood feature descriptors. Moreover, local feature descriptors are boosted by combining them with the geometric properties of keypoints to refine matches to the sub-pixel level. Extensive experiments on relative pose estimation and image matching show that our proposed method outperforms existing state-of-the-art methods by a large margin.

Abstract:
Visual object tracking in natural scenes is a popular but challenging task, owing to the difficulties of feature representation from various changes of the targets, such as size change, deformation, illumination change, rotations, motion blur, background clutter, etc. High-speed hyperspectral imaging systems capture hyperspectral videos (HSVs) in wide spectral ranges and provide abundant spectral and spatial information to tell targets apart from backgrounds, alleviating the model drift in appearance-based tracking methods. However, different hyperspectral imagers, such as near-infrared (NIR), red-to-near-infrared (RedNIR), and visible (VIS), obtain heterogeneous types of data that cannot be handled by common object trackers. In this paper, a domain adaptive Transformer framework is proposed for hyperspectral object tracking. Considering that the HSVs are from different types of sensors, their heterogeneous features are learned in an adversarial way by domain label reverse learning with a gradient reversal layer. To fully utilize the spectral information in HSV frames, a band-wise spatial attention module (BSAM) is designed to emphasize the salient area near the target of interest. We adopt a Siamese-like Transformer tracker as the main structure for tracking. Our tracker outperforms top-ranking methods on a hyperspectral object tracking benchmark dataset containing 87 hyperspectral videos of three types. The comparison experiments validate the effectiveness of the proposed method. The source code and trained models of this work will be publicly available soon at https://github.com/LianYi233/Trans-DAT.
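
The adversarial step relies on gradient reversal, which can be sketched without an autograd framework as a layer that is the identity in the forward pass and negates (and scales) gradients in the backward pass. The class below is a hand-written illustration; the scaling factor `lam` and the list-based interface are assumptions, not details taken from the paper.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The sign flip pushes the feature extractor to confuse the domain classifier.
        return [-self.lam * g for g in grad_output]
```

Placed between a feature extractor and a sensor-type classifier, such a layer lets the classifier minimize its loss while the extractor is driven toward features that do not reveal the sensor type.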

Abstract:
Existing pyramid-based upsamplers (e.g., SemanticFPN), although efficient, usually produce less accurate results compared to dilation-based models when using the same backbone. This is partially caused by the contaminated high-level features since they are fused and fine-tuned with noisy low-level features on limited data. To address this issue, we propose to use powerful pre-trained high-level features as guidance (HFG) so that the upsampler can produce robust results. Specifically, only the high-level features from the backbone are used to train the class tokens, which are then reused by the upsampler for classification, guiding the upsampler features to more discriminative backbone features. One crucial design of the HFG is to protect the high-level features from being contaminated by using proper stop-gradient operations so that the backbone does not update according to the noisy gradient from the upsampler. To push the upper limit of HFG, we introduce a context augmentation encoder (CAE) that can efficiently and effectively operate on the low-resolution high-level feature, resulting in improved representation and thus better guidance. We name our complete solution the High-Level Feature Guided Decoder (HFGD). We evaluate the proposed HFGD on three benchmarks: Pascal Context, COCOStuff164k, and Cityscapes. HFGD achieves state-of-the-art results among methods that do not use extra training data, demonstrating its effectiveness and generalization ability.

Abstract:
Point cloud registration is a critical issue in 3D reconstruction and computer vision. It is particularly challenging in cases of low overlap and across different datasets, where algorithm generalization and robustness are pressing concerns. In this paper, we propose a point cloud registration algorithm called Neighborhood Multi-compound Transformer (NMCT). To capture local information, we introduce Neighborhood Position Encoding for the first time. By employing a nearest neighbor approach to select spatial points, this encoding enhances the algorithm’s ability to extract relevant local feature information and local coordinate information from dispersed points within the point cloud. Furthermore, NMCT utilizes the Multi-compound Transformer as the interaction module for point cloud information. In this module, the Spatial Transformer phase engages in local-global fusion learning based on Neighborhood Position Encoding, facilitating the extraction of internal features within the point cloud. The Temporal Transformer phase, based on Neighborhood Position Encoding, performs local position-local feature interaction, achieving local and global interaction between two point clouds. The combination of these two phases enables NMCT to better address the complexity and diversity of point cloud data. The algorithm is extensively tested on different datasets (3DMatch, ModelNet, KITTI, MVP-RG), demonstrating outstanding generalization and robustness.
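
The nearest-neighbor selection behind a neighborhood-based position encoding can be sketched as follows: for each point, find its k nearest neighbors and record their coordinates relative to that point. Using raw offsets as the encoding, brute-force search, and the function name are all assumptions for illustration; the abstract does not describe the actual form of Neighborhood Position Encoding.

```python
import math


def knn_relative_positions(points, k):
    """For each 3-D point, return the offsets to its k nearest neighbors (excluding itself)."""
    encoded = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point; fine for a small illustration.
        others = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        others.sort()
        neighbors = [points[j] for _, j in others[:k]]
        encoded.append([tuple(qc - pc for qc, pc in zip(q, p)) for q in neighbors])
    return encoded
```

Because the offsets are relative, translating the whole cloud leaves them unchanged, which is one reason local relative coordinates are attractive for registration.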

Abstract:
Few-shot action recognition aims to recognize novel action classes with limited labeled samples and has recently received increasing attention. The core objective of few-shot action recognition is to enhance the discriminability of feature representations. In this paper, we propose a novel multi-view representation learning network (MRLN) to model intra-video and inter-video relations for few-shot action recognition. Specifically, we first propose a spatial-aware aggregation refinement module (SARM), which mainly consists of a spatial-aware aggregation sub-module and a spatial-aware refinement sub-module to explore the spatial context of samples at the frame level. Second, we design a temporal-channel enhancement module (TCEM), which captures the temporal-aware and channel-aware features of samples with a temporal-aware enhancement sub-module and a channel-aware enhancement sub-module. Third, we introduce a cross-video relation module (CVRM), which explores the relations across videos by utilizing the self-attention mechanism. Moreover, we design a prototype-centered mean absolute error loss to improve the feature learning capability of the proposed MRLN. Extensive experiments on four prevalent few-shot action recognition benchmarks show that the proposed MRLN can significantly outperform a variety of state-of-the-art few-shot action recognition methods. In particular, in the 5-way 1-shot setting, MRLN achieves 75.7%, 86.9%, 65.5% and 45.9% on the Kinetics, UCF101, HMDB51 and SSv2 datasets, respectively.
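
One plausible reading of a prototype-centered mean absolute error loss is sketched below: class prototypes are the mean of the support features, and the loss is the mean absolute difference between a query feature and the prototype of its own class. This formulation and the function names are assumptions; the abstract does not give the exact definition used by MRLN.

```python
def class_prototypes(support_features, support_labels):
    """Average the support features of each class to obtain one prototype per class."""
    sums, counts = {}, {}
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def prototype_mae(query_feature, query_label, prototypes):
    """Mean absolute error between a query feature and the prototype of its own class."""
    proto = prototypes[query_label]
    return sum(abs(q - p) for q, p in zip(query_feature, proto)) / len(proto)
```

In a 5-way 1-shot episode each prototype is simply the single support feature of its class, so the loss pulls every query toward that one example.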

Abstract:
Effectively utilizing abundant spectral signatures and spatial contexts is the key to hyperspectral image (HSI) classification. Existing convolutional neural networks (CNNs) only focus on local spatial context information and lack the ability to learn global spectral sequence representations, whereas the transformer performs well in learning the global dependence of sequential data. To solve this issue, inspired by the transformer, we propose an interactive global spectral and local spatial feature fusion transformer called ISSFormer. Specifically, we integrate self-attention and convolution in a parallel design, i.e., the multi-head self-attention mechanism (MHSA) and the local spatial perception mechanism (LSP). ISSFormer can learn both local spatial feature representation and global spectral feature representation simultaneously. More significantly, we propose a bi-directional interaction mechanism (BIM) of features across the parallel branches to provide complementary clues. The local spatial features and the global spectral features interact through the BIM, which emphasizes the local spatial details and adds spatial constraints to overcome spectral variability, further improving classification performance. Extensive experiments on three benchmark datasets, including Indian Pines, Pavia University, and WHU-Hi-HanChuan, show that ISSFormer achieves superior classification accuracy and visualization performance.

Abstract:
Windowed six degrees of Freedom (Windowed-6DoF) virtual reality (VR) content that provides users an immersive feeling of walking through a 3D 360 VR space with constrained rotational movements around X and Y axes and constrained translational movements along Z axis is important for the development of VR. To facilitate this windowed-6DoF immersive feeling, light fields (LFs) from multiple perspectives within the windowed 6-DoF space are captured. In contrast to employing a large-scale camera array for LF capture, utilizing hand-held plenoptic cameras offers a more portable and versatile solution, thereby promoting practical applications. However, how to stitch the LFs at different rotational angles containing motion parallax is challenging. In this paper, a novel LF stitching method is proposed to generate windowed-6DoF LFs. First, multi-concentric spherical modeling is proposed to parameterize the recorded LFs to eliminate projection biases in the registration process. Then, a global-local adaptive LF registration is proposed by developing incremental multi-layer global-local adaptive homographies based on the 4D light field feature (LiFF), incremental strategy and depth layer maps (DLMs) to eliminate parallax errors. Testing on the LFs captured in both indoor and outdoor scenes with different focal lengths, quantities of LFs and scales of translation and rotation, the proposed method outperforms the existing approaches in terms of subjective quality, objective quality, light field consistency and content production robustness, which can produce VR content of superior quality more reliably.

Abstract:
Video object segmentation (VOS) plays an important role in video analysis and understanding, which in turn facilitates a number of diverse applications, including video editing, video rendering, and augmented reality / virtual reality. However, existing deep learning-based approaches rely heavily on a large number of pixel-wise annotated video frames to achieve promising results, which is notoriously laborious and costly. To address this, in this paper, we formulate unsupervised video object detection by exploring simulated dense labels and explicit motion clues. Specifically, we first propose an effective video label generator network based on the sparsely annotated frames and the flow motion between them. It can largely alleviate our dependence and limitation on the sparse labels. Furthermore, we propose a transformer-based architecture to model the appearance and motion clues simultaneously with the cross-attention module, in order to maximally overcome non-linear motion with potential occlusions. Extensive experiments show that the proposed method outperforms recent VOS methods on four popular benchmarks (i.e., DAVIS-16, FBMS, Youtube-VOS and SegTrack-v2). Moreover, the proposed method can be further applied to a wide range of wild scenes such as wild forests and animals. Because of its effectiveness and generalization, we believe that our method could serve as a useful basis for alleviating the dependence on dense annotation in video data.

Abstract:
Cross-modal content generation has become very popular in recent years. To generate high-quality and realistic content, a variety of methods have been proposed. Among these approaches, visual content generation has attracted significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Last, we outline possible future directions for synthesizing visual content from other modalities including the exploration of new modalities, and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.

Abstract:
Event cameras, offering high temporal resolution and high dynamic range, have brought a new perspective to addressing common challenges in monocular depth estimation (e.g., motion blur and low light). However, existing CNN-based methods insufficiently exploit global spatial information from asynchronous events, while RNN-based methods show a limited capacity to utilize temporal cues effectively for event-based monocular depth estimation. To this end, we propose an event-based monocular depth estimator with recurrent transformers, namely EReFormer. Technically, we first design a transformer-based encoder-decoder that utilizes multi-scale features to model global spatial information from events. Then, we propose a Gate Recurrent Vision Transformer (GRViT), introducing a recursive mechanism into transformers, to leverage rich temporal cues from events. Finally, we present a Cross Attention-guided Skip Connection (CASC), performing cross attention to fuse multi-scale features, to improve global spatial modeling capabilities. The experimental results show that our EReFormer outperforms state-of-the-art methods by a margin on both synthetic and real-world datasets. Our open-source code is available at https://github.com/liuxu0303/EReFormer.

Abstract:
Popular convolutional neural networks mainly use paired images in a supervised way for image watermark removal. However, watermarked images do not have reference images in the real world, which results in poor robustness of image watermark removal techniques. In this paper, we propose a self-supervised convolutional neural network (CNN) for image watermark removal (SWCNN). SWCNN constructs reference watermarked images in a self-supervised way, according to the watermark distribution, rather than relying on given paired training samples. A heterogeneous U-Net architecture is used to extract more complementary structural information via simple components for image watermark removal. Taking texture information into account, a mixed loss is exploited to improve the visual effects of image watermark removal. In addition, a watermark dataset is constructed. Experimental results show that the proposed SWCNN is superior to popular CNNs in image watermark removal. Code can be obtained at https://github.com/hellloxiaotian/SWCNN.

Abstract:
High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel dimension, while Dite-HRNet employs dynamic convolution and pooling to capture long-range interactions with low computational complexity in the spatial dimension. The core idea behind both approaches is to efficiently capture information in either the channel or spatial dimension. However, shuffle operations and dynamic operations are not hardware-friendly. As a result, neither Lite-HRNet nor Dite-HRNet can achieve the desired inference speed on specialized devices, including Neural Processing Units (NPUs) and Graphics Processing Units (GPUs). To overcome these limitations, we present a simple Hardware-Friendly Lightweight High-resolution Network (HF-HRNet) based on our proposed Hardware-Friendly Uniform-sized Mug (HUM) block. The HUM block mainly consists of the Cascaded Depthwise (CAD) block and the Multi-Scale Context Embedding (MCE) block. The CAD block cascades depthwise convolutions to obtain a larger receptive field in the spatial dimension, while the MCE block aggregates multi-scale spatial feature information and adjusts channel features. Extensive experiments are conducted on human pose estimation (COCO, MPII) and semantic segmentation (Cityscapes), resulting in a better trade-off between inference speed and accuracy on both NPUs and GPUs. It is noteworthy that on the COCO test-dev set, HF-HRNet-30 outperforms Dite-HRNet-30 and Lite-HRNet-30 by 1.9 AP and 2.8 AP, respectively, while running about 13 times faster and 9 times faster on NPUs, respectively. Our code is publicly available at https://github.com/zhanghao5201/HF-HRNet.

Abstract:
Superpixel segmentation divides an original image into mid-level regions to reduce the number of computational primitives for subsequent tasks. Among the existing deep superpixel algorithms, the two-stage approaches work better but have high computational complexity. In contrast, the FCN-style approaches cannot extract specific image features for the superpixel task. To combine the advantages of both types of methods, we propose a carefully designed framework termed Efficient Superpixel Network (ESNet) to explicitly enhance the capability of the network to describe clustering-friendly features and simultaneously preserve the simple network structure. Concretely, ESNet addresses two points. First, meaningful features need to be constructed for effective superpixel clustering; hence we propose the Pyramid-gradient Superpixel Generator (PSG) to decouple the ESNet into two joint parts, i.e., the feature extractor and the superpixel generator. Second, the superpixel generator is designed in an efficient manner: it performs multi-scale sampling of input images and can work independently by replacing the introduced feature extractor with two initial convolutional layers. Extensive experiments show that our framework achieves state-of-the-art performance on multiple datasets and is 5.3× smaller at inference than the best existing one-stage FCN-based methods.

Abstract:
In this paper, we investigate the dynamics-aware adversarial attack problem of adaptive neural networks. Most existing adversarial attack algorithms are designed under a basic assumption – the network architecture is fixed throughout the attack process. However, this assumption does not hold for many recently proposed adaptive neural networks, which adaptively deactivate unnecessary execution units based on inputs to improve computational efficiency. It results in a serious issue of lagged gradient, making the learned attack at the current step ineffective due to the architecture change afterward. To address this issue, we propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. More specifically, we reformulate the gradients to be aware of the potential dynamic changes of network architectures, so that the learned attack better leads the next step than the dynamics-unaware methods when network architecture changes dynamically. Extensive experiments on representative types of adaptive neural networks for both 2D images and 3D point clouds show that our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods. Code is available at https://github.com/antao97/LGM.

Abstract:
Existing caricature-visual face recognition methods train the models based on caricature-visual image pairs from the same identities. Unfortunately, in many real-world applications, facial caricatures and visual facial images are usually unpaired in the training set due to the difficulty of collecting facial caricatures drawn by artists. In this paper, we study caricature-visual face recognition under the practical setting that only unpaired facial caricature and visual facial images are available as training samples, and define this setting as unpaired caricature-visual face recognition. To this end, we develop a novel feature decomposition-restoration-decomposition method (FDRD), which mainly consists of a backbone network, an identity-oriented feature decomposition module, and a modality-oriented feature restoration module, to extract modality-irrelevant identity features. To effectively train FDRD in the case of limited facial caricature training samples, we develop a two-stage learning framework. In the first stage, we perform single-modality restoration, enabling the model to have the basic ability of feature decomposition and restoration for each modality. In the second stage, we perform cross-modality recognition by exchanging new modality features between the two modalities, facilitating the model to focus on the decoupling of identity features and modality features. Experimental results demonstrate that our method performs favorably against several state-of-the-art face recognition methods and cross-modality methods. Our code is available at https://github.com/Capricorn-Karma/FDRD.

Abstract:
Shadow removal in a single image has received increasing attention in recent years. However, removing shadows over dynamic scenes remains largely under-explored. In this paper, we propose the first data-driven video shadow removal model, termed PSTNet, by exploiting three essential characteristics of video shadows, i.e., physical property, spatial relation, and temporal coherence. Specifically, a dedicated physical branch is established to conduct local illumination estimation, which is more applicable to scenes with complex lighting and textures, and the physical features are then enhanced via a mask-guided attention strategy. Then, we develop a progressive aggregation module to enhance the spatial and temporal characteristics of feature maps and effectively integrate the three kinds of features. Furthermore, to tackle the lack of datasets of paired shadow videos, we synthesize a dataset (SVSRD-85) with the aid of the popular game GTAV by controlling the switch of the shadow renderer. Experiments against 9 state-of-the-art models, including image shadow removers and image/video restoration methods, show that our method improves on the best state-of-the-art result in terms of shadow-area RMSE by 14.7%. In addition, we develop a lightweight model adaptation strategy to make our synthetic-driven model effective in real-world scenes. The visual comparison on the public SBU-TimeLapse dataset verifies the generalization ability of our model in real scenes.

Abstract:
The sparse collaborative tracking (SCT) method has been developed for object tracking recently, and it is very efficient and robust to various occlusions. In SCT, sparse representation (SR) plays an essential role because SCT needs to perform several manipulations of sparse matrix representation (SMR) or nonnegative SMR in each iteration. One of the most challenging problems in SCT is therefore how to solve the SMR efficiently. However, existing SR algorithms are developed solely for vector-based SR. They partition SMR into a set of vector-based SR problems and solve them in the level-2 BLAS (Basic Linear Algebra Subprograms) manner, i.e., with matrix-vector operations, which is computationally much less efficient than the direct level-3 BLAS manner (direct matrix-matrix operations). To solve this problem, this paper first develops BLAS3-based Sparse Learning (BLAS3-SpaL) by extending the standard SR algorithm from the vector version to the matrix version for SMR and nonnegative SMR, and then develops the corresponding BLAS3-SpaL-based SCT method (FastSCT-BLAS3SpaL) for fast, robust object tracking. The experiments verify that it achieves robust object tracking by reducing accumulation errors and more than doubles the tracking speed.
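
The difference between the level-2 and level-3 formulations can be illustrated with a small sketch. It is not the BLAS3-SpaL algorithm; it only shows, in plain Python, that processing each signal column with its own matrix-vector product gives the same result as one matrix-matrix product over all columns, which is the form an optimized level-3 BLAS routine can execute far more efficiently. The names `D_t`, `X`, and both helper functions are hypothetical.

```python
# Illustrative only: pure-Python stand-ins for level-2 and level-3 BLAS calls.

def matvec(A, x):
    """Level-2 style: one matrix-vector product A @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matmul(A, B):
    """Level-3 style: one matrix-matrix product A @ B."""
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def correlations_columnwise(D_t, X):
    """Handle each signal (column of X) separately, as a vector-based solver would."""
    cols = [list(c) for c in zip(*X)]
    per_col = [matvec(D_t, c) for c in cols]   # one matvec per column
    return [list(r) for r in zip(*per_col)]    # reassemble the columns as a matrix


def correlations_batched(D_t, X):
    """Handle all signals at once with a single matrix-matrix product."""
    return matmul(D_t, X)


# Hypothetical data: a transposed dictionary with 3 atoms in 2 dimensions,
# and two signals stored as the columns of X.
D_t = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
X = [[2.0, 4.0], [6.0, 8.0]]

# Both formulations compute the same atom-signal correlations.
assert correlations_columnwise(D_t, X) == correlations_batched(D_t, X)
```

In a real implementation the batched form would map to a single GEMM call, whereas the column-wise form issues one GEMV call per signal.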

Abstract:
This article studies change detection within pairs of optical images remotely sensed from overhead views. We consider that a high-performance solution to this task entails highly effective multi-level feature interaction. With that in mind, we propose a novel approach characterized by two attentive feature aggregation schemes that handle cross-level features in different processes. For the Siamese-based feature extraction of the bi-temporal image pair, we place emphasis on constructing semantically strong and contextually rich pyramidal feature representations to enable comprehensive matching and differencing. To this end, we leverage a feature pyramid network and re-formulate its cross-level feature merging procedure as top-down modulation with multiplicative channel attention and additive gated attention. For the multi-level difference feature fusion, we progressively fuse the derived difference feature pyramid in an attend-then-filter manner. This makes the high-level fused features and the adjacent lower-level difference features constrain each other, and thus allows steady feature fusion for specifying change regions. In addition, we build an upsampling head as a replacement for the normal heads followed by static upsampling. Our implementation contains a stack of upsampling modules that allocate features for each pixel. Each has a learnable branch that produces attentive residuals for refining the statically upsampled results. We conduct extensive experiments on four public datasets, and the results show that our approach achieves state-of-the-art performance. Code is available at https://github.com/xingronaldo/CLAFA.

Abstract:
Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviours in everyday life. In this study, we collect a new micro-action dataset designated as Micro-action-52 (MA-52), and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides the whole-body perspective including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body part labels, and encompasses a full array of realistic and natural micro-actions, accounting for 205 participants and 22,422 video instances collated from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and temporal shift module (TSM) into the ResNet architecture for modeling the spatiotemporal characteristics of micro-actions. Then a joint-embedding loss is designed for semantic matching between video and action labels; the loss is used to better distinguish between visually similar yet distinct micro-action categories. The extended application in emotion recognition demonstrates one of the important values of our proposed dataset and method. In the future, further exploration of human behaviour, emotion, and psychological assessment will be conducted in depth. The dataset and source code are released at https://github.com/VUT-HFUT/Micro-Action.

Abstract:
Thriving ocean applications bring explosive growth of underwater images (UWIs), which urgently need to be compressed efficiently for transmission over the narrow underwater acoustic channel. However, existing image compression networks achieve suboptimal performance on UWIs. More efficient UWI compression can be achieved by utilizing two characteristics of UWIs: (1) within a UWI, the distribution of details is associated with the underwater imaging transmission map (T-map); (2) different UWIs are more highly correlated than terrestrial images because they often present a gauzy-covered, indistinct appearance and share some universal ocean objects that appear widely in different underwater scenes. This paper fully exploits the two characteristics in terms of quantization and entropy coding, two key components of an image compression network. Specifically, we propose an efficient underwater image compression network (EUICN) including underwater T-map-based quantization (UTMQ) and mixture entropy coding (MEC). UTMQ extracts the imaging features from the T-map, which are integrated with latent features by a novel dual-spatial attention module (DSAM) to generate a feature-reserved mask, adaptively reserving a reasonable number of features for different regions. For more efficient entropy coding, MEC is designed, which includes three correlation information extraction modules (i.e., hyperprior, local, and novel universal information) and a probability prediction module. In particular, the universal information extraction module utilizes a comprehensive underwater feature dictionary, which covers various universal ocean objects, to match with latent features and select the universal correlation features. After that, the probability prediction module consolidates the hyperprior, local, and universal information to predict a more accurate probability of latent features. Extensive experiments show that our EUICN achieves better performance than SOTA learned and conventional codecs in terms of PSNR and MS-SSIM.

Abstract:
Motion compression technologies can significantly reduce the redundant information of motion data and increase the efficiency of storage and transmission. Current methods mainly utilize some ready-made universal algorithms, such as signal processing and dimensionality reduction, to model the statistical characteristics of motion data, while the individual structure of motion data is ignored. In this paper, we propose to use a deep neural network with specially designed architecture to represent motion data considering the similarity between the articulated structure of a human skeleton and the architecture of neural networks. The network parameters are then taken as the compressed data. We design a structurally connected network which just looks like a human skeleton. Within the network, only the neurons corresponding to the joints connected to each other in a human skeleton are connected. It effectively exploits the correlations between connected joints to cut down the unnecessary connections between the neurons, which leads to the significant improvement of compression efficiency. Additionally, we extract the two inherent DOFs instead of the original three DOFs of each joint by representing its movement on a sphere according to the rigidity of the articulated human skeleton. This actually achieves the theoretically lossless pre-compression with the ratio of 3:2. Extensive experiment results demonstrate the superior performances of the proposed model at the high compression ratios over other state-of-the-art methods.
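
The 3:2 pre-compression step can be made concrete with a small sketch. Because a bone is rigid, a joint stays at a fixed distance from its parent, so its offset lies on a sphere and two angles are enough once the bone length is stored. The abstract does not state which parameterization is used; the standard polar/azimuth convention and the names `to_two_dof`, `from_two_dof`, and `bone_length` below are assumptions for illustration.

```python
import math


def to_two_dof(offset):
    """Map a 3D joint offset (x, y, z) to (polar, azimuth) angles on the bone sphere."""
    x, y, z = offset
    length = math.sqrt(x * x + y * y + z * z)  # assumed non-zero (a real bone)
    polar = math.acos(z / length)              # angle from the +z axis, in [0, pi]
    azimuth = math.atan2(y, x)                 # angle in the x-y plane, in [-pi, pi]
    return polar, azimuth


def from_two_dof(polar, azimuth, bone_length):
    """Recover the 3D offset from the two angles and the fixed bone length."""
    s = math.sin(polar)
    return (bone_length * s * math.cos(azimuth),
            bone_length * s * math.sin(azimuth),
            bone_length * math.cos(polar))


bone_length = 3.0          # hypothetical constant, stored once per bone
offset = (1.0, 2.0, 2.0)   # a joint offset with norm 3.0
angles = to_two_dof(offset)
restored = from_two_dof(*angles, bone_length)

# Two values per frame are enough to recover the three coordinates
# up to floating-point rounding.
assert all(abs(a - b) < 1e-9 for a, b in zip(offset, restored))
```

Per frame, each joint then needs two numbers instead of three, which is where the 3:2 ratio comes from; the bone lengths are a one-off cost shared by all frames.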

Affiliations: Chongqing Key Laboratory of Intelligence Perception and Blockchain Technology and Chongqing Engineering Laboratory of Detection Control and Integrated System, School of Computer Science and Information Engineering, Chongqing Technology and Business University, Chongqing, China; College of Computer Science, Chongqing University, Chongqing, China; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, Palaiseau, France; School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, Jiangsu, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract:
Palm-vein identification is a highly secure biometric technique that has become an active research area in recent years. Despite the recent progress in deep neural networks (DNNs) for vein identification, existing solutions for feature representation continue to lack robustness due to the limited training samples. To address this limitation, data augmentation approaches, including Generative Adversarial Networks (GANs), have been investigated, but these schemes suffer from the following issues. First, it is practically infeasible to use all the generated samples for classifier training due to the limited storage space and computation resources. Further, some of these generated samples may be non-representative or ineffective, seriously compromising models’ generalization capabilities. Second, the augmented dataset is fed to the target classifier repeatedly, resulting in overfitting after substantial training epochs. To tackle the above problems, we propose AdveinAU, an Adversarial vein AUtomatic AUgmentation approach that generates challenging samples to train a more robust vein classifier for palm-vein identification by alternately optimizing the vein classifier and a set of latent variables. First, we consider a conditional deep convolution generative adversarial net (cDCGAN) to learn the distribution of real data and the generated data, and then a latent variable from the latent variable space is mapped to the sample space. Second, we combine the trained generator with the vein classifier to constitute AdveinAU, where the input sets of the generator and the classifier are alternately updated by adversarial training. Specifically, a latent variable set is learned to increase the training loss of a target network through generating adversarial samples, while the classifier learns more robust features from harder examples to improve the generalization. To avoid collapsing inherent meanings of images, an exponential moving average (EMA) teacher and cosine similarity are employed for regularization to reduce the search space. Unlike previous works where GANs synthesize new realistic images, our model aims to search for a latent variable set, based on which the generator can produce challenging samples along with the training process to improve the classifier’s performance. Finally, we conduct extensive experiments on three public palm-vein datasets to evaluate the performance of AdveinAU, and the experimental results demonstrate that the proposed AdveinAU is capable of generating harder samples to improve the performance of the vein classifier.
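
The two regularization ingredients named above, an EMA teacher and cosine similarity, are standard building blocks and can be sketched in a few lines. How the paper combines them is not stated in the abstract, so this is only a generic illustration; the function names, the `decay` value, and the flat weight lists are assumptions.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical weights: the teacher trails the student as a running average.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.5)

# A possible regularizer: penalize generated samples whose teacher features
# drift away from those of the real sample (1 - cosine similarity).
real_feat = [1.0, 0.0]
generated_feat = [1.0, 1.0]
penalty = 1.0 - cosine_similarity(real_feat, generated_feat)
```

With a decay close to 1 the teacher changes slowly, which gives a stable reference against which the similarity of generated and real samples can be measured.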

Abstract:
Existing video text detection methods mostly track texts with appearance features only, and are thus easily influenced by changes of perspective and illumination. In this paper, we propose an end-to-end video text detector that tracks texts based on a robust feature representation fusing multiple descriptors. First, we introduce a character center segmentation branch to extract semantic features, which encode the category and position information of characters. To extract the topology feature of each text instance, we propose a relative position awareness branch to encode the relative position information among texts. Then, an adaptive feature fusion network is proposed to dynamically fuse multiple descriptors into a robust feature representation for more robust tracking. In addition, to promote research and evaluation in this field, we construct a large Bilingual Road scene Video Text dataset, named BiRViT-1K, which contains 1000 videos of Chinese and English texts. Experimental results show that the proposed semantic and topology features are beneficial to text detection and tracking performance, and that the proposed method achieves state-of-the-art performance on four public video text benchmarks (ICDAR 2015 Video, YVT, RT-1K, and BOVText) and two Chinese scene text benchmarks (CASIA10K and MSRA-TD500).

Abstract:
Accident detection in surveillance or dashcam videos is a common task in video-based traffic accident analysis. However, as accidents occur sparsely and randomly in the real world, accident records are scarcer than the training data for standard detection tasks such as object detection or instance detection. Moreover, the limited and diverse accident data make it more difficult to model accident patterns for fine-grained accident detection tasks, which analyze the accident in detail. Extra prior information should therefore be introduced into these tasks, such as common vision features, which offer relatively effective information for many vision tasks. A big model can generate common vision features by training on abundant data, at the cost of considerable computing time and resources. Even though accident video data is special, the big model can still extract common vision features from it. Thus, in this paper, we propose to apply knowledge distillation to fine-grained accident detection, which analyzes the spatial-temporal existence and severity of accidents, in order to reduce the computational cost (by distilling to a small model) while keeping good performance under limited accident data. Knowledge distillation offers extra general vision feature information from the pre-trained big model. Common knowledge distillation guides the student network to learn the same representations as the teacher network by logit mimicking or feature imitation. However, single-level distillation can focus on only one aspect, mimicking either classification logits or deep features. Fine-grained accident detection requires multiple tasks with different focuses, such as multiple accident classification, temporal-spatial accident region detection, and accident severity estimation. Thus, in this paper, multiple-level distillation is proposed for the different modules to generate a unified video feature that serves all the tasks in fine-grained accident detection analysis. The experimental results on a fine-grained accident detection dataset, which provides more detailed annotations of accidents, demonstrate that our method can effectively model the video feature for multiple tasks.

Abstract:
Numerous studies have employed prompt learning structures to enhance dense prediction tasks by integrating additional semantic or geometric information. While the inclusion of extra information has shown improvements in performance, it also poses challenges for applications that cannot provide extra input. To address this issue, this study evaluates the performance of different prompts and introduces an additional-input-free method, called self-prompting perceptual edge learning (SPPEL), which extracts edge-embedded semantic prompts directly from the image feature itself using trainable handcrafted edge operators within a plug-and-play module. To obtain the edge features, our approach incorporates an adversarial structure that compares the similarity between two edge features generated by the HOG and Kirsch operators, where the edge features are measured using multiplication, finetuned through a trainable all-one embedding, and enhanced with channel-to-channel attention. We conduct extensive evaluations of SPPEL on 7 tasks, utilizing 7 different backbones and applying 5 distinct methods. Our experimental results demonstrate that SPPEL achieves strong competitiveness in various settings with an average improvement of 1.7% across all 7 tasks, including ADE20K, COCO (Instance Segmentation), COCO (Object Detection), Pascal VOC2012, STARE, CHASE DB1, and HRF, while incurring a parameter increase of less than 3% (the detailed analysis of parameters and GFLOPs is shown in the experimental tables). Code will be released at: https://github.com/chenhao-zju/sppel

Abstract:
Most modern approaches in temporal action localization (TAL) mainly focus on time domain information, while neglecting the advantages of information from other domains. How to effectively utilize information from different domains and their interactions in a reasonable manner has been an attractive yet challenging issue in TAL. In this paper, we propose a novel cross time-frequency Transformer model (TFFormer) for TAL. A dual-branch network architecture is designed to capture the time and frequency features at multiple scales, using the multi-scale transformer in the time branch and the DB1 Discrete Wavelet Transform (DWT) in the frequency branch. To fuse these features from different domains, we propose a cross time-frequency attention mechanism that includes a time pathway and a frequency pathway, enhancing the interaction between the temporal and frequency features. Furthermore, a gated control mechanism is designed to aggregate features from different scales, characterizing the respective contributions of features at different scales. We also design a new regression loss function for locating the time boundaries. Extensive experiments were carried out on four challenging benchmark datasets, including two third-person datasets and two first-person datasets. The proposed method achieves impressive results on these datasets. Specifically, TFFormer achieves an average mAP of 23.2% on Ego4D and 25.6% on EPIC-Kitchens 100, outperforming previous state-of-the-art methods by a large margin. It also obtains competitive results on ActivityNet v1.3 and THUMOS14, with average mAPs of 36.2% and 67.8%, respectively. We also conducted extensive ablation studies to validate the effectiveness of each component in the proposed method.
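
For readers unfamiliar with the DB1 wavelet, the following is a minimal sketch of one level of the DB1 (Haar) DWT on a 1D sequence, such as one feature channel along the time axis. It illustrates the transform itself, not the TFFormer frequency branch; the function names are hypothetical, and sign conventions for the detail coefficients vary between libraries.

```python
import math


def haar_dwt(signal):
    """One level of the DB1 (Haar) DWT for an even-length 1D sequence."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass (approximation): scaled sums of neighbouring samples.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    # High-pass (detail): scaled differences of neighbouring samples.
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one level of the transform above."""
    scale = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * scale)
        out.append((a - d) * scale)
    return out


sequence = [4.0, 4.0, 2.0, 6.0]
approx, detail = haar_dwt(sequence)
reconstructed = haar_idwt(approx, detail)

# The transform halves the temporal length and is invertible up to rounding.
assert len(approx) == len(sequence) // 2
assert all(abs(a - b) < 1e-9 for a, b in zip(sequence, reconstructed))
```

The approximation coefficients summarize slow changes and the detail coefficients capture abrupt ones, and applying the transform again to the approximation yields coarser scales.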

Abstract:
In computer vision, fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks. However, deploying it in practice is quite challenging because it adopts parameter-inefficient global updates and relies heavily on high-quality downstream data. Recently, prompt-based learning, which adds a task-relevant prompt to adapt pre-trained models to downstream tasks, has drastically boosted the performance of many natural language downstream tasks. In this work, we extend this notable prompt-enabled transfer ability to vision models as an alternative to fine-tuning. To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt diverse frozen pre-trained models to a wide variety of downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images with the pre-trained model frozen. By training only a small number of additional parameters, Pro-tuning can generate compact and robust downstream models for both CNN-based and transformer-based network architectures. Comprehensive experiments show that the proposed Pro-tuning outperforms fine-tuning on a broad range of vision tasks and scenarios, including image classification (under generic objects, class imbalance, image corruption, natural adversarial examples, and out-of-distribution generalization) and dense prediction tasks such as object detection and semantic segmentation.

Abstract:
Cross-domain classification of hyperspectral data is a critical challenge in remote sensing, especially when labels are unavailable in the target domain. Deep learning-based domain adaptation (DA) methods have been widely used in recent years. However, current methods primarily focus on the global domain structure of the source and target domains when considering domain adaptation, neglecting the subdomain structure within each class. Additionally, current methods directly employ predicted outputs without further exploring the confidence level of the target domain samples. These limitations lead to confusion in domain adaptation and hinder effective feature selection in neural networks. In this paper, we propose the Pseudo-Label-Assisted Subdomain Adaptation (PASDA) method, which addresses these limitations by jointly considering the subdomain structure of the source and target domains and adopting a sample selection strategy. PASDA aligns the subdomains while learning domain-invariant features as a foundation. Furthermore, it selects high-quality pseudo-labeled samples from the target domain to enhance the learning of domain-invariant features. For generating pseudo-labels in the target domain, we employ the Reweighted Pruning Label Propagation (RPLPA) strategy to reweight the output of the predicted target domain. Finally, the high-confidence samples with pseudo-labels are selected to finetune the network. The entropy regularized dual classifier constraint is introduced to enhance the discriminative feature extraction ability for the target domain. Extensive experiments on three public HSI cross-domain datasets, Pavia, Houston, and HyRANK, using overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa) as the evaluation indicators of classification performance, demonstrate the superiority of our method. Compared with the existing state-of-the-art (SOTA) unsupervised domain adaptation (UDA) methods, our method improves OA by 2% and AA by 4%.

Abstract:
Existing approaches to image captioning tend to adopt Transformer-based architectures with grid features, which represent the state of the art. However, these strategies tend to process grid features at a fixed resolution, which often hampers the perception of entities at various scales. In addition, directly applying them may also result in the loss of spatial and fine-grained semantic information. To this end, we propose a simple yet effective method, named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively restore the spatial information disrupted when grid features are flattened. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla Transformer. In addition, our method is extended to the video captioning task, which further proves the practicability of the proposed method. Code is available at https://github.com/zchoi/SPT.

Abstract:
Unsupervised domain adaptation aims to transfer the knowledge learned from a labeled source domain to an unlabeled target domain with different data distributions. However, in practice, source samples are not always available due to privacy protection and storage resource limitations. To address this concern, Source-Free Domain Adaptation (SFDA) has recently attracted growing research attention, as it only needs a pre-trained source model without direct access to source data. In this paper, we propose a novel Adversarial SOurce GEneration (ASOGE) method for SFDA, which introduces an additional generative module to produce synthetic labeled source samples and uses them to facilitate cross-domain adaptation. Unlike early studies that train the generator independently and perform the adaptation only after the generator is finished, ASOGE integrates the generation and adaptation stages within a collaborative framework by making them play an adversarial game. In the generation stage, the labeled source samples are not produced blindly; instead, they are hard-to-align samples that provide knowledge more worth learning for the adaptation stage. To achieve a fine-grained domain alignment, a class-aware discrepancy between source and target domains is measured via contrastive learning. Extensive experiments on benchmark datasets demonstrate the effectiveness of ASOGE compared to the state-of-the-art methods.

Abstract:
Siamese network-based trackers have shown remarkable success in aerial tracking. Most previous works, however, usually perform template matching only between the initial template and the search region and thus fail to deal with rapidly changing targets that often appear in aerial tracking. As a remedy, this work presents Building Appearance Collection Tracking (BACTrack). This simple yet effective tracking framework builds a dynamic collection of target templates online and performs efficient multi-template matching to achieve robust tracking. Specifically, BACTrack mainly comprises a Mixed-Temporal Transformer (MTT) and an appearance discriminator. The former is responsible for efficiently building relationships between the search region and multiple target templates in parallel through a mixed-temporal attention mechanism. At the same time, the appearance discriminator employs an online adaptive template-update strategy to ensure that the collected multiple templates remain reliable and diverse, allowing them to closely follow rapid changes in the target’s appearance and suppress background interference during tracking. Extensive experiments show that our BACTrack achieves top performance on four challenging aerial tracking benchmarks while maintaining an impressive speed of over 87 FPS on a single GPU. Speed tests on embedded platforms also validate our potential suitability for deployment on UAV platforms.

Abstract:
Video super-resolution (VSR) is important in video processing for reconstructing high-definition image sequences from corresponding continuous and highly-related video frames. However, existing VSR methods have limitations in fusing spatial-temporal information. Some methods only fuse spatial-temporal information on a limited range of total input sequences, while others adopt a recurrent strategy that gradually attenuates the spatial information. While recent advances in VSR utilize Transformer-based methods to improve the quality of the upscaled videos, these methods require significant computational resources to model the long-range dependencies, which dramatically increases the model complexity. To address these issues, we propose a Collaborative Transformer for Video Super-Resolution (CTVSR). The proposed method integrates the strengths of Transformer-based and recurrent-based models by concurrently assimilating the spatial information derived from multi-scale receptive fields and the temporal information acquired from temporal trajectories. In particular, we propose a Spatial Enhanced Network (SEN) with two key components: Token Dropout Attention (TDA) and Deformable Multi-head Cross Attention (DMCA). TDA focuses on the key regions to extract more informative features, and DMCA employs deformable cross attention to gather information from adjacent frames. Moreover, we introduce a Temporal-trajectory Enhanced Network (TEN) that computes the similarity of a given token with temporally-related tokens in the temporal trajectory, which is different from previous methods that evaluate all tokens within the temporal dimension. With comprehensive quantitative and qualitative experiments on four widely-used VSR benchmarks, the proposed CTVSR achieves competitive performance with relatively low computational consumption and high forward speed.

Abstract:
Recently, deep learning-based methods have been successfully applied to the field of exposure correction. However, most of the existing methods treat different locations of an image in the same way, ignoring the inhomogeneous recovery difficulty and spatially-varying visual patterns in the image, which is sub-optimal and inefficient. In this paper, we propose a difficulty-aware dynamic network (DDNet) for lightweight exposure correction. Specifically, we propose a difficulty-aware strategy that determines the difficulty of feature patches according to a difficulty mask. Then, only the difficult patches are further refined instead of the whole features, which greatly reduces the overall computational complexity. Moreover, in order to achieve spatially-varying processing with a minimal computational burden, we design a spatial-aware dynamic convolution (SDConv), which is generated by predicting a set of basic kernels and a spatial-aware weight map. Benefiting from these designs, our method can strike a good trade-off between performance and complexity. Extensive experiments on several datasets demonstrate that our approach outperforms the state-of-the-art methods both qualitatively and quantitatively while requiring lower computational cost.

Abstract:
Cross-view geo-localization is an extremely challenging task due to drastic discrepancies in scene context and object scale between different views. Existing works normally concentrate on aligning the global appearance between two views but underestimate these two discrepancies. In practice, only a small region in the retrieved aerial image can be matched to the whole query ground image (i.e. scene context change). On the other hand, the retrieved aerial images are only able to describe the coarse-grained information but the query ground images can capture the fine-grained details (i.e. object scale change). In this paper, we propose a novel self-distillation framework called Patch Similarity Self-Knowledge Distillation (PaSS-KD), which provides the local and multi-scale knowledge as fine-grained location-related supervision to guide cross-view image feature extraction and representation in a self-enhanced manner. Specifically, we develop an auxiliary image-to-patch retrieval task to explore the scene context change and devise a multi-scale patch partition strategy to sense the object scale change across views. Additionally, our self-distilling framework can be removed to avoid additional computation cost at the inference stage. Extensive experiments show that our method not only achieves the state-of-the-art image retrieval performance on the CVUSA and CVACT benchmarks, but also significantly boosts the fine-grained localization accuracy on the VIGOR dataset. Remarkably, for 10 meter-level localization, we improve the relative accuracy by a factor of 0.8× and 1.6× on the VIGOR dataset under same-area and cross-area evaluation, respectively.

Abstract:
The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machines (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance instead of human visual quality. In the feature compression track of MPEG-VCM, multi-scale features extracted from images are subject to compression. Recent feature compression works have demonstrated that the versatile video coding (VVC) standard-based approach can achieve a BD-rate reduction of up to 96% against the MPEG-VCM feature anchor. However, it is still sub-optimal, as VVC was designed not for extracted features but for natural images. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is performed successively until the smallest-scale feature is fused, and the encoded latent at the final stage is then entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and requires 5× to 27× less encoding time for object detection. It is noteworthy that our model can attain near-lossless task performance with only 0.002-0.003% of the uncompressed feature data size.

Abstract:
Generalized zero-shot learning (GZSL) aims to recognize images from seen and unseen classes with side information, such as manually annotated attribute vectors. Traditional methods focus on mapping images and semantics into a common latent space, thus achieving visual-semantic alignment. Since the unseen classes are unavailable during training, there is a serious problem of recognition bias, in which models tend to recognize unseen classes as seen classes. To solve this problem, we propose a Domain-aware Prototype Network (DPN), which splits the GZSL problem into a seen-class recognition problem and an unseen-class recognition problem. For the seen classes, we design a domain-aware prototype learning branch with a dual attention feature encoder to capture the essential visual information, which aims to recognize the seen classes and discriminate the novel categories. To further recognize the fine-grained unseen classes, a visual-semantic embedding branch is designed, which aims to align the visual and semantic information for unseen-class recognition. Through the multi-task learning of the prototype learning branch and the visual-semantic embedding branch, our model can achieve excellent performance on three popular GZSL datasets.

Abstract:
When humans explain their reasoning, such as their classification decisions, they often break down an image into parts and highlight the evidence from those parts to support the concepts they have in mind. Drawing inspiration from this cognitive process, several self-explaining models have been proposed to explain predictions by part-level concepts. However, these models can be limited by their structure and difficulty in determining the effect of individual parts on the output category. To address these challenges, we introduce a self-explaining architecture that uses a plug-in transparent embedding space (TesNet) to connect high-level input patches (e.g. feature maps or tokens) with output categories. The transparent embedding space is spanned by basis concepts and constructed on the Grassmann manifold. The basis concepts are enforced to be category-aware, and within-category concepts are orthogonal to each other, ensuring the embedding space is disentangled. To reduce concept redundancy and restore the concept space structure, we introduce two concept pruning methods and a new re-training strategy to build a slimming transparent embedding space. We verify the scalability of TesNet through experiments on deep networks such as VGG, ResNet, DenseNet, and Vision Transformer. Additionally, we design several metrics for self-explaining models to quantify interpretability and compare them with state-of-the-art self-explaining methods. Our experiments demonstrate that TesNet is much more effective for classification tasks, providing better interpretability on predictions and improving final accuracy.

Abstract:
Video-based 3D human pose estimation has achieved great progress; however, it is still difficult to learn a precise 2D-3D projection in some hard cases. Multi-level human knowledge and motion information serve as two key elements in the field to overcome the challenges caused by various factors, where the former encodes various human structure information spatially and the latter captures the motion change temporally. Inspired by this, we propose a DualFormer (dual-path transformer) network, which encodes multiple human contexts and motion details to perform spatial-temporal modeling. Firstly, motion information, which depicts the movement change of the human body, is embedded to provide an explicit motion prior for the transformer module. Secondly, a dual-path transformer framework is proposed to model long-range dependencies of both the joint sequence and the limb sequence. Parallel context embedding is performed initially, and a cross transformer block is then appended to promote the interaction of the dual paths, which greatly improves feature robustness. Specifically, predictions of multiple levels can be acquired simultaneously. Lastly, we employ the weighted distillation technique to accelerate the convergence of the dual-path framework. We conduct extensive experiments on three different benchmarks, i.e., Human3.6M, MPI-INF-3DHP, and HumanEva-I. We mainly compute the MPJPE, P-MPJPE, PCK, and AUC to evaluate the effectiveness of the proposed approach, and our work achieves competitive results compared with state-of-the-art approaches. Specifically, the MPJPE is reduced to 42.8 mm on Human3.6M, which is 1.5 mm lower than PoseFormer and demonstrates the efficacy of the proposed approach.
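For readers unfamiliar with the headline metric, MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth 3D joints, averaged over joints and frames. The sketch below follows that standard definition and is not taken from the paper; the function name `mpjpe`, the array layout, and the toy data are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (frames, joints, 3), in the same unit
    (millimetres in the benchmarks above). Both are assumed to be
    expressed in the same root-relative coordinate frame.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Euclidean distance per joint, then the mean over joints and frames.
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Hypothetical data: 2 frames, 3 joints, every joint off by (3, 4, 0) mm.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] = 3.0
pred[..., 1] = 4.0
err = mpjpe(pred, gt)  # each joint is 5 mm away, so the mean is 5 mm
```

P-MPJPE uses the same distance after first rigidly aligning the prediction to the ground truth (Procrustes alignment), so it isolates pose-shape error from global rotation, translation, and scale.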

Abstract:
Camouflaged object detection (COD) is an important yet challenging task, with great application values in industrial defect detection, medical care, etc. The challenges mainly come from the high intrinsic similarities between target objects and background. In this paper, inspired by the biological studies that object detection consists of two steps, i.e., search and identification, we propose a novel framework, named DCNet, for accurate COD. DCNet explores candidate objects and extra object-related edges through two constraints (object area and boundary) and detects camouflaged objects in a coarse-to-fine manner. Specifically, we first exploit an area-boundary decoder (ABD) to obtain initial region cues and boundary cues simultaneously by fusing multi-level features of the backbone. Then, an area search module (ASM) is embedded into each level of the backbone to adaptively search coarse regions of objects with the assistance of region cues from the ABD. After the ASM, an area refinement module (ARM) is utilized to identify fine regions of objects by fusing adjacent-level features with the guidance of boundary cues. Through the deep supervision strategy, DCNet can finally localize the camouflaged objects precisely. Extensive experiments on three benchmark COD datasets demonstrate that our DCNet is superior to 12 state-of-the-art COD methods. In addition, DCNet shows promising results on two COD-related tasks, i.e., industrial defect detection and polyp segmentation.

Abstract:
Non-local self-similarity has been well exploited in the single image super-resolution task as an effective prior. However, due to the difficulty of modeling the 4D correspondence globally, the potential of the non-local prior has been less explored for light field (LF) super-resolution. Meanwhile, existing non-local models only utilize the global spatial correspondence, but largely neglect the global geometric correspondence. To address the aforementioned problems, we propose a Decoupled Selective Matching Network (DSMNet) for LF super-resolution, by designing a novel selective matching mechanism to flexibly extract non-local information from specific 4D positions in an LF. Such a mechanism matches the reference patch with several auxiliary patches dynamically searched from predefined windows, which promotes efficiency while improving performance compared to the existing non-local models. Specifically, our DSMNet decouples the whole LF into Sub-Aperture Images (SAIs) and Epipolar Plane Images (EPIs). For each SAI patch, we separately perform the selective matching inside the current SAI and across different SAIs to exploit the global spatial correspondence efficiently. For each EPI patch, we separately perform the selective matching in EPIs of different orientations to embed robust LF geometric information into features by enhancing EPI textures, which exploits the global geometric correspondence in an efficient manner. Comprehensive experiments validate that DSMNet outperforms state-of-the-art LF super-resolution methods both quantitatively and qualitatively. Code is available at https://github.com/Yutong2022/DSMNet.

Abstract:
Underwater images often suffer from serious color bias and blurred features because of the effect of the water bodies on the light. To enhance underwater images, we present SU-DDPM, a method of real-time underwater image enhancement (UIE) based on a denoising diffusion probabilistic model (DDPM). SU-DDPM outperforms other baseline and generative adversarial network models in underwater image enhancement, thus establishing a new state-of-the-art baseline. SU-DDPM processes images more rapidly than the diffusion model, which makes it competitive with other deep learning-based methods. We demonstrate that if conditional DDPM is used directly for the UIE task, the processing speed is slow, and the enhanced images are of poor quality and show color bias. The quality of the enhanced image is improved by combining the degraded image with the reference image in the diffusion stage to create a fusion–DDPM model. The specificity of the UIE task allows us to accelerate the inference process by changing the initial sampling distribution and reducing the number of iterations in the denoising stage of the model. We evaluate SU-DDPM on the UIE task using challenging real underwater image datasets and a synthetic image dataset and compare it to state-of-the-art models. SU-DDPM ensures increased enhancement quality, and enhancement processing speed is comparable to the speed of real-time enhancement models.

Abstract:
Blur artifacts can seriously degrade the visual quality of images, and numerous deblurring methods have been proposed for specific scenarios. However, in most real-world images, blur is caused by different factors, e.g., motion and defocus. In this paper, we address how different deblurring methods perform in the case of multiple types of blur. For in-depth performance evaluation, we construct a new large-scale multi-cause image deblurring dataset (MC-Blur), including real-world and synthesized blurry images with different blur factors. The images in the proposed MC-Blur dataset are collected using different techniques: averaging sharp images captured by a 1000-fps high-speed camera, convolving Ultra-High-Definition (UHD) sharp images with large-size kernels, adding defocus to images, and capturing real-world blurry images with various camera models. Based on the MC-Blur dataset, we conduct extensive benchmarking studies to compare SOTA methods in different scenarios, analyze their efficiency, and investigate the built dataset’s capacity. These benchmarking results provide a comprehensive overview of the advantages and limitations of current deblurring methods, revealing our dataset’s advances. The dataset is available to the public at https://github.com/HDCVLab/MC-Blur-Dataset.

Abstract:
High-resolution (HR) image harmonization is of great significance in real-world applications such as image synthesis and image editing. However, due to the high memory costs, existing dense pixel-to-pixel harmonization methods mainly focus on processing low-resolution (LR) images. Some recent works resort to combining them with color-to-color transformations but are either limited to certain resolutions or heavily dependent on hand-crafted image filters. In this work, we explore leveraging the implicit neural representation (INR) and propose a novel image Harmonization method based on Implicit neural Networks (HINet), which, to the best of our knowledge, is the first dense pixel-to-pixel method applicable to HR images without any hand-crafted filter design. Inspired by the Retinex theory, we decouple the MLPs into two parts to respectively capture the content and environment of composite images. A Low-Resolution Image Prior (LRIP) network is designed to alleviate the Boundary Inconsistency problem, and we also propose new designs for the training and inference process. Extensive experiments have demonstrated the effectiveness of our method compared with state-of-the-art methods. Furthermore, some interesting and practical applications of the proposed method are explored. Our code is available at https://github.com/WindVChen/INR-Harmonization.

Abstract:
Video deblurring is a challenging task because only input blurry sequences are available. To further constrain the optimization process, existing methods explore various additional information, e.g., events, depth, and sharpness priors. However, they incur large computational costs or generate unpleasant visual results due to insufficient exploitation of spatio-temporal information. In this work, we propose a novel spatio-temporal sharpness map learned implicitly by a prior-based generation network. The proposed generation network blends both spatial and temporal sharpness priors in a blurry sequence, while adding few extra parameters. We show that the proposed map has better spatial continuity and provides better guidance for video deblurring than the previous methods. Furthermore, different from the simple concatenation in previous work, we adapt the sharpness map for more effective video deblurring via a dual-stream network. Specifically, the network is decomposed into two branches, namely the inter-frame and intra-frame reconstructions. The inter-frame reconstruction obtains the sharp patches of consecutive frames from the sharpness map to restore textures well. Meanwhile, the intra-frame branch is responsible for recovering structures of the latent frame, where a novel histogram statistical method is developed to quantify and count textures in the features under the modulation of the sharpness map. Quantitative and qualitative experiments validate the effectiveness of our proposed method.

Abstract:
Video captioning evaluation aims at assessing the semantic consistency between a video and a candidate text, which should include measurement from two aspects: faithfulness (whether the information conveyed by the candidate is correct w.r.t. the video) and comprehensiveness (whether the main video content is covered by the candidate). However, previous approaches have difficulty in evaluating faithfulness and comprehensiveness due to heavy reliance on references or the heterogeneity of visual and textual data. In this paper, we propose a vision-involved evaluation metric based on a novel DuAl-Reconstruction Transformer, named DARTScore. DARTScore formulates the caption evaluation task as a dual-reconstruction problem to evaluate both faithfulness and comprehensiveness explicitly. Since a word in a candidate is usually related to several frames, DARTScore adaptively collects relevant frames to reconstruct the word and computes the reconstruction accuracy as faithfulness to inherently reflect whether the word information is contained in the video. Conversely, DARTScore reconstructs each frame with relevant words to evaluate comprehensiveness. By integrating fine-grained bidirectional reconstruction accuracies, DARTScore drills into each word in the candidate and each frame in the video to fully evaluate the semantic consistency. Furthermore, we collect and annotate two Chinese datasets with a large domain gap, named CRAETE-EVAL and VATEX-ZH-EVAL, to systematically evaluate existing metrics and fill the gap in Chinese video captioning evaluation. Experimental results show that DARTScore achieves higher correlation with human judgments, has lower reference reliance, and generalizes well to data from different domains.

Abstract:
In this paper, we present a simple, flexible and effective vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task. Traditional paradigms address the VL tracking task indirectly with sophisticated prior designs, making them over-specialized to the features of specific architectures or mechanisms. In contrast, our proposed framework serializes the language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict the spatial coordinates of the target in an auto-regressive manner. The design, free of other prior modules, avoids multiple sub-task learning and hand-designed loss functions, significantly reducing the complexity of VL tracking modeling and allowing our tracker to use a simple cross-entropy loss as the unified optimization objective for the VL tracking task. Extensive experiments on the TNL2K, LaSOT, LaSOT$_{\mathrm{ext}}$ and OTB99-Lang benchmarks show that our approach achieves promising results compared to other state-of-the-art methods.
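The idea of serializing a bounding box into discrete tokens can be sketched by quantizing normalized coordinates into a fixed number of bins. The abstract does not state how MMTrack quantizes coordinates, so the scheme below is only a common, generic choice: the bin count `num_bins=1000`, the `(x1, y1, x2, y2)` pixel format, and both function names are assumptions for illustration.

```python
def box_to_tokens(box, image_size, num_bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into integer tokens.

    Each coordinate is normalized by the image width or height and
    mapped to a bin index in [0, num_bins - 1].
    """
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        normalized = min(max(value / scale, 0.0), 1.0)  # clamp to [0, 1]
        tokens.append(min(int(normalized * num_bins), num_bins - 1))
    return tokens

def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map bin indices back to pixel coordinates at the bin centers."""
    width, height = image_size
    scales = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scales)]

# Hypothetical box in a 256x256 search image.
tokens = box_to_tokens((64, 32, 192, 160), image_size=(256, 256))
recovered = tokens_to_box(tokens, image_size=(256, 256))
```

Because each coordinate becomes a class index over the bins, a model predicting these tokens one at a time can be trained with a plain cross-entropy loss, which matches the unified objective described in the abstract.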

Abstract:
Decomposing a scene into its 3D geometry, surface material textures, and illumination is a challenging but important problem in computer vision and graphics. While recent neural implicit representation based works have shown tremendous advantages, existing methods are not applicable to images illuminated by a single dynamic point light. We propose an entirely self-supervised end-to-end neural implicit representation based reflectance decomposition algorithm for objects under a dynamic point light. Our method adopts a staged training framework to estimate the geometry, light source position, and surface material textures through volume rendering, self-shadow inverse rendering, and physical model based surface rendering respectively. This scheme allows accurate recovery of the surface material textures which are coupled to the dynamic light, improving the reflectance decomposition capability. For evaluation, we collect a new dataset of several synthetic and real world objects illuminated by a moving point light. Experiments show that our method achieves superior reflectance decomposition performance compared to state-of-the-art methods, and the recovered elements can be deployed in existing graphics pipelines to perform relighting, material editing, and scene composition.

Abstract:
Common mechanisms for achieving object camouflage include reducing differences and increasing distractions. Such camouflage mechanisms hinder object detectors from accurately distinguishing the camouflaged objects from their surroundings. Considering that, we reexamine the camouflaged object detection (COD) task from the perspective of camouflage mechanisms and make the first attempt to discover the target objects in a de-camouflaging manner. We argue that this process can not only lead to a better understanding of camouflage, but also provide a new perspective for detecting camouflaged objects. To this end, we first analyze some existing camouflage mechanisms together with the problems they induce. Afterwards, considering the inner relationships between salient object detection (SOD) and COD, we resort to the SOD task to synergistically achieve de-camouflaging for COD. Specifically, we incorporate the SOD task into the COD model and present a multi-task learning framework for COD, which models the intrinsic relationships between the two tasks from different perspectives, i.e., the task-conflicting attribute and the task-consistent attribute, to destroy the camouflage conditions and highlight the inconspicuous yet valuable cues of camouflaged objects. In more detail, modeling the task-conflicting attribute aims to identify camouflaged objects by alleviating interfering information from salient ones, and is achieved by a Gate Classification (GC) strategy and a Region Distraction Module (RDM). Modeling the task-consistent attribute, which is achieved by an adversarial learning (AL) scheme and a Boundary Injection Module (BIM), is intended to enhance the boundary differences between the camouflaged objects and their backgrounds so that the camouflaged objects are fully segmented. Extensive results demonstrate the superiority of our proposed model over existing ones in camouflaged object detection.

Abstract:
Despite the remarkable progress made in learning-based stereo-matching algorithms, stereo matching at disparity discontinuities and in textureless regions remains an open challenge. In this paper, we propose the deep Markov Random Field based cost aggregation network (DMCA-Net) for stereo matching, which is an end-to-end model-driven network architecture. This architecture introduces an efficient feature extraction network to extract richer textual and contextual feature information for stereo feature similarity representation at multiple stages and levels. Furthermore, with the aim of alleviating the edge-fattening phenomenon at disparity discontinuities and generating accurate disparities in textureless regions, we propose the differentiable Markov Random Field model for cost aggregation, where the model’s data term utilizes image detail information, such as boundary and contour features, to guide matching cost aggregation, and the model’s smoothness term penalizes the adjacency similarity of the cost between the four-nearest neighboring pixel pairs to predict the disparity in textureless regions. Detailed experiments demonstrate that DMCA-Net achieves competitive performance on the SceneFlow, KITTI 2012, KITTI 2015, and Middlebury 2014 datasets.

Abstract:
Advancements in computer vision and deep learning have made it difficult to distinguish Deepfake videos from real ones. In particular, forged audio is also generated to accompany fake videos and make them more realistic, which makes Deepfake detection more difficult. Existing Deepfake detection methods that use multimodal information ignore the representation gap between different modalities, resulting in limited performance. To address this problem, in this paper, a novel Deepfake detection method utilizing multimodal contrastive learning (MCL) is proposed to better explore intra-modal and cross-modal forgery clues. To reduce the cross-modal gap and explore multimodal forgery artifacts, a cross-modal contrastive learning strategy is designed to learn a compositional embedding from multimodal information, which facilitates pulling together representations across uni-modalities and multi-modalities. Moreover, to supplement the video network's ability to mine intra-frame forgery clues, the frame knowledge is distilled into the video network without additional computation. Specifically, to mine intra-modal clues, three modality features are first extracted from audio, frame, and video, respectively. Secondly, the audio and frame features are separately composed with the video feature to derive two cross-modal representations. Subsequently, these cross-modal features are contrasted with the intra-modal features to reduce the cross-modal gap. By jointly pulling together the unimodal and multimodal features through MCL, a more effective representation that contains intra-modal and cross-modal forgery artifacts can be learned. Finally, a noise-based feature augmentation (NFA) module is proposed to adaptively perturb the audio-visual feature and further improve generalization performance. Extensive experiments demonstrate that the proposed framework outperforms SOTA methods.

Abstract:
Accurate and efficient keypoint detection and description is a fundamental step in various computer vision tasks. In this paper, we extract robust descriptors and detect accurate keypoints by learning local Features with Domain adaptation (DomainFeat). Specifically, our DomainFeat includes image-level domain invariance supervision, pixel-level domain consistency supervision, Pixel-Adaptive keypoint Detection (PA-Det), and a cross-domain dataset with domain stable point supervision. First, we introduce the image-level domain invariance supervision to make the high-level feature distributions from different domains close by fusing domain-invariant representations in the decoder. Furthermore, to compensate for the inconsistency between descriptors corresponding to the keypoints at the pixel level, we propose the pixel-level domain consistency supervision. Then we present the Pixel-Adaptive keypoint Detection to efficiently detect accurate keypoints, which can improve accuracy by enhancing the local consistency of heatmaps. Finally, we propose an efficient approach to construct data and supervision labels in diverse domains, which can tackle complex application scenarios. With these novel modules and supervision methods, our DomainFeat can make feature detectors more accurate and descriptors more robust. Extensive experiments confirm that DomainFeat achieves state-of-the-art performance on benchmarks such as Aachen-Day-Night localization, HPatches image matching, and the challenging DNIM dataset.

Abstract:
Low-light image enhancement (LIE) is important for many high-level vision tasks as the poor visibility of underexposed images can severely degrade the performance of the subsequent image recognition, analysis, etc. Although recent deep-learning-based LIE methods exhibit promising performance, most of them require a large number of paired training images, thereby limiting their practicability in real scenarios. In this paper, we propose a pseudo-supervised LIE method with the integration of mutual learning. Specifically, for the given low-light image, we first use a quadratic curve to generate a pseudo-clear image, which serves as the auxiliary ground truth for supervision. The pseudo-paired images are then simultaneously input to two parallel homogeneous branches to learn the expected enhanced result through the knowledge distillation of the two branches via mutual learning. As both the generated image and the input low-light image underlie the desired solution, the mutual learning strategy enables the two branches to learn from each other and produce the final results. Extensive experiments demonstrate that the proposed method outperforms most existing unsupervised LIE methods in terms of both qualitative and quantitative evaluations, and also achieves competitive performance against many supervised and semi-supervised methods.

Abstract:
We address personalized image enhancement in this study, where we enhance input images for each user based on the user’s preferred images. Previous methods apply the same preferred style to all input images (i.e., only one style for each user); in contrast to these methods, we aim to achieve content-aware personalization by applying different styles to each image considering the contents. For content-aware personalization, we make two contributions. First, we propose a method named masked style modeling, which can predict a style for an input image considering the contents by using the framework of masked language modeling. Second, to allow this model to consider the contents of images, we propose a novel training scheme where we download images from Flickr and create pseudo input and retouched image pairs using a degrading model. We conduct quantitative evaluations and a user study, and our method trained using our training scheme successfully achieves content-aware personalization; moreover, our method outperforms other previous methods in this field. Our source code is available at https://github.com/satoshi-kosugi/masked-style-modeling.

Abstract:
Collecting a substantial number of labeled samples is infeasible in many real-world scenarios, thereby bringing out challenges for supervised classification. The research on Few-Shot Classification (FSC) aims to address this issue. Current FSC methods mainly leverage ideas such as meta-learning, self-supervised learning, and data augmentation. Among them, data augmentation appears to be an extremely efficient approach to alleviate the aforementioned data-deficiency problem. Here, we propose a novel data augmentation based FSC method termed Fourier-Augmentation based Data-Shunting (FADS). FADS mainly contains two operations, namely Fourier-based data augmentation (FDA) and data shunting. (i) Fourier transform has a desirable property for classification tasks: the image’s phase and amplitude components in the frequency domain correspond to its high-level structure (i.e., semantic) and low-level style (i.e., statistic) information, which do not interfere with each other. Inspired by this observation, we design the FDA operation, which changes the amplitude spectrum of the to-be-augmented images to obtain new images of the same category. (ii) Then we design the data shunting operation to cooperate with the FDA to accomplish FSC. Specifically, it splits the augmented data into different groups to get independent, weak decisions and then fuses them to obtain a unified, strong decision. We conduct experiments on four benchmark datasets. Results show that utilizing our method brings a performance gain of 0.3%-2% in terms of classification accuracy, compared with the classical methods.
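
The amplitude/phase property that motivates FDA can be illustrated with a minimal NumPy sketch. The abstract does not say how the amplitude spectrum is changed, so the linear blend with a second image, the function name `fourier_amplitude_mix`, and the parameter `alpha` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fourier_amplitude_mix(image, style_image, alpha=0.5):
    """Blend the amplitude spectrum of `image` with that of `style_image`.

    The phase of `image` is kept untouched. The linear blend controlled by
    `alpha` is an assumed mixing rule, not taken from the paper.
    """
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_sty = np.fft.fft2(style_image, axes=(0, 1))

    amp_img, phase_img = np.abs(f_img), np.angle(f_img)
    amp_sty = np.abs(f_sty)

    # Change only the amplitude (low-level style statistics).
    amp_new = (1.0 - alpha) * amp_img + alpha * amp_sty

    # Recombine the new amplitude with the original phase (high-level structure).
    f_new = amp_new * np.exp(1j * phase_img)

    # For real inputs the imaginary part of the inverse is numerical noise.
    return np.real(np.fft.ifft2(f_new, axes=(0, 1)))


# Toy data standing in for two single-channel images.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
style = rng.random((8, 8))

augmented = fourier_amplitude_mix(image, style, alpha=0.5)
unchanged = fourier_amplitude_mix(image, style, alpha=0.0)
```

With `alpha=0.0` the input is recovered up to floating-point error, which is a quick sanity check that only the amplitude path is being modified.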

Abstract:
Domain Generalization (DG) aims to develop models that can learn from data in source domains and generalize to unseen target domains. Recently, some domain generalization algorithms have emerged, but most of them were designed with complex modules. Among all the prior methods under DG settings, contrastive learning has become a promising solution for its simplicity and efficiency. However, existing contrastive learning neglects distribution shifts that cause severe domain confusion. In this paper, we propose an instance paradigm contrastive learning framework, introducing contrast between original features and novel paradigms to alleviate domain-specific distractions. We then explore hard-pair information, an essential factor in contrastive learning, based on domain label and feature similarity. Moreover, to produce domain-invariant instance paradigms, we generate multiple views of the original images and design a novel channel-wise attention mechanism to dynamically combine features from all the views. Furthermore, a test-time feature integration module is designed to mimic the paradigms during the training process to improve generalization ability. Extensive experiments show that our method achieves state-of-the-art performance. The proposed algorithm can also serve as a plug-and-play module which improves the performance of existing methods by a relatively large margin.

Abstract:
Object visual navigation aims to steer an agent toward a target object based on visual observations. It is highly desirable to reasonably perceive the environment and accurately control the agent. In the navigation task, we introduce an Agent-Centric Relation Graph (ACRG) for learning the visual representation based on the relationships in the environment. ACRG is a highly effective structure that consists of two relationships, i.e., the horizontal relationship among objects and the distance relationship between the agent and objects. On the one hand, we design the Object Horizontal Relationship Graph (OHRG) that stores the relative horizontal location among objects. On the other hand, we propose the Agent-Target Distance Relationship Graph (ATDRG) that enables the agent to perceive the distance between the target and objects. For ATDRG, we utilize image depth to obtain the target distance and imply the vertical location to capture the distance relationship among objects in the vertical direction. With the above graphs, the agent can perceive the environment and output navigation actions. Experimental results in the artificial environment AI2-THOR demonstrate that ACRG significantly outperforms other state-of-the-art methods in unseen testing environments.

Abstract:
Most of the existing bounding box-based trackers rely on a classification subnetwork and a regression subnetwork to predict the location and scale of the bounding box. They learn the classification subnetwork by processing each sample individually and applying the suggested classification confidence to produce the final prediction. They typically involve heuristic positive sample configurations, which inevitably introduce mislabelled training samples and therefore deteriorate their tracking performance. Moreover, the parallel prediction of the bounding box position and scale may lead to misalignment of classification and regression. To address these issues, we propose a simple yet effective soft constraint-based tracking framework without positive samples (named SoftCT). SoftCT adaptively senses the target’s pixel position through a soft constraint mechanism, which eliminates potential performance gaps caused by artificially marking the target’s pixel position. In addition, SoftCT computes the state of the bounding box by aggregating such positional information, thereby allowing the tracker to avoid misalignment in classification and regression due to uninformed communication. Specifically, SoftCT directly senses the position of the target pixel and fuses this information into the bounding box prediction, rather than requiring explicit annotation or regression of the target pixel. Extensive experiments on six tracking benchmarks including GOT-10k, TrackingNet, LaSOT, UAV123, LaSOText and TNL2K demonstrate that our tracker achieves state-of-the-art performance, confirming its effectiveness and efficiency.

Abstract:
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text contents, their multimodal outputs are not succinct enough at an extreme level to address the information overload issue. To this end, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO), for the scenario of TL;DW - Too Long; Didn’t Watch, akin to TL;DR. XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) consisting of three components: hierarchical multimodal encoder, hierarchical multimodal fusion decoder, and optimal transport solver. Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans. To facilitate the study on this task, we constructed a large-scale dataset, XMSMO-News, by harvesting 4,891 video-document pairs. The experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics. Our dataset and source code will be publicly available on GitHub.

Abstract:
Multi-person 3D motion prediction is an emerging task that involves predicting the future 3D motion of multiple individuals based on current observations. In contrast to motion prediction for a single person, this task requires a strong emphasis on learning the interacting dynamics among multiple individuals. Broadly speaking, current methods can be categorized into two groups. The first group involves the straightforward adaptation of models originally developed for single-person scenarios to multi-person scenarios, which is evidently suboptimal. The second group focuses on utilizing off-the-shelf tools like graph convolutional networks to model interactions. While this approach has shown improved results, the interactions primarily consider entire human identities rather than finer details. This motivates the introduction of our novel solution to address this limitation and enhance the task’s performance. In this work, we strive to craft a novel framework that can effectively address two key issues ignored in previous works, namely the multi-granularity interaction and time-varying inter-person dynamics. In line with these aims, the proposed model mainly comprises two modules: a person-level interaction module and a part-level interaction module. The former is designed to learn the holistic and dynamic interaction among multiple persons in a coarse-grained sense. Critically, a unique trait of the former module is that it learns temporal dynamics. For example, it recognizes that two individuals exhibit a strong correlation during handshaking but less correlation after parting ways. The latter part-level interaction module learns the interaction between the body joints of different persons. This module operates at a more fine-grained level, distinguishing it from existing approaches. By aggregating information from both granularities, our model enables accurate motion prediction.
To validate the effectiveness of the proposed model, we conducted comprehensive experiments on three benchmark datasets: 3DPW, CMU-Mocap, and MuPoTS-3D. The results of these evaluations unequivocally demonstrate the empirical superiority of our model compared to previous state-of-the-art methods.

Abstract:
Blind image super-resolution (BISR) aims to construct a high-resolution image from a low-resolution (LR) image that contains unknown degradation. Although previous methods have demonstrated impressive performance by introducing the degradation representation in the BISR task, two problems remain in most of them. First, they ignore the degradation characteristics of different image regions when generating the degradation representation. Second, they lack effective supervision on the generation of both the degradation representation and the super-resolution result. To solve these problems, we propose dual circle contrastive learning (DCCL) with high-efficiency modules to implement BISR. In our proposed method, we design the degradation extraction network to obtain the degradation representations from different texture regions of the LR image. Meanwhile, we propose DCCL coupled with the degrading network to guarantee that the obtained degradation representation contains as much of the degradation of the LR image as possible. The application of DCCL also makes the SR results contain as little degradation as possible. Additionally, we develop an information distillation module for our proposed BISR model to guarantee high-quality SR images. The experimental results demonstrate that our proposed method achieves state-of-the-art BISR performance.

Abstract:
Motion segmentation is an essential task in artificial intelligence and computer vision. However, scene motion in real-world intelligent systems usually integrates multiple types of models, so specifying only one type of basic model may lead to the failure of scene-motion segmentation tasks. In this paper, we propose a novel and efficient heterogeneous model-fitting-based motion segmentation method (HMFMS) to accurately segment moving objects. HMFMS includes a new co-attention-induced heterogeneous model construction algorithm (HMC), an adaptive heterogeneous model refinement algorithm (HMR), and a heterogeneous model segmentation algorithm (HMS). First, we propose HMC to generate high-quality accumulated correlation matrices, by evaluating the quality of heterogeneous model hypotheses, based on the density estimation technique. Next, we propose HMR to construct sparse affinity matrices from the accumulated correlation matrices by applying information theory, effectively suppressing the values of correlations between different objects. Finally, we fuse the sparse affinity matrices and perform motion segmentation by using HMS, to obtain more accurate segmentation results. Experimental results show that HMFMS obtains superior performance on four challenging datasets (i.e., Hopkins155, Hopkins12, MTPV62 and KT3DMoSeg), compared with several subspace-based and model-fitting-based motion segmentation methods. More remarkably, HMFMS outperforms the state-of-the-art MCMS method by 57.1% and 1.8 times in terms of accuracy and computational efficiency on the representative KT3DMoSeg, respectively.

Abstract:
Contour-based scene text detection methods have developed rapidly in recent years, but still suffer from inaccurate front-end contour initialization, multi-stage error accumulation, or deficient local information aggregation. To tackle these limitations, we propose a novel arbitrary-shaped scene text detection framework named CT-Net by progressive contour regression with contour transformers. Specifically, we first employ a contour initialization module that generates coarse text contours without any post-processing. Then, we adopt contour refinement modules to adaptively refine text contours in an iterative manner, which are beneficial for context information capturing and progressive global contour deformation. Besides, we propose an adaptive training strategy to enable the contour transformers to learn more potential deformation paths, and introduce a re-score mechanism that can effectively suppress false positives. Extensive experiments are conducted on four challenging datasets, which demonstrate the accuracy and efficiency of our CT-Net over state-of-the-art methods. In particular, CT-Net achieves an F-measure of 86.1 at 11.2 frames per second (FPS) and an F-measure of 87.8 at 10.1 FPS for the CTW1500 and Total-Text datasets, respectively.

Abstract:
Cross-modal hashing has gained considerable attention in cross-modal retrieval due to its low storage cost and prominent computational efficiency. However, preserving more semantic information in the compact hash codes to bridge the modality gap remains challenging. Most existing methods neglect the influence of modality-private information on semantic embedding discrimination, leading to unsatisfactory retrieval performance. In this paper, we propose a novel deep cross-modal hashing method, called Semantic Disentanglement Adversarial Hashing (SDAH), to tackle these challenges for cross-modal retrieval. Specifically, SDAH is designed to decouple the original features of each modality into modality-common features with semantic information and modality-private features with disturbing information. After the preliminary decoupling, the modality-private features are shuffled and treated as positive interactions to enhance the learning of modality-common features, which can significantly boost the discriminability and robustness of semantic embeddings. Moreover, the variational information bottleneck is introduced in the hash feature learning process, which can avoid the loss of a large amount of semantic information caused by the high-dimensional feature compression. Finally, the discriminative and compact hash codes can be computed directly from the hash features. A large number of comparative and ablation experiments show that SDAH achieves superior performance to other state-of-the-art methods.

Abstract:
Recent research has shown that architectures utilizing reinforcement learning (RL) are effective in cost-based image steganography. However, these architectures only learn embedding probabilities rather than costs, and are trained for a specific embedding payload, making it difficult to extend the trained model to serve other payloads. In this paper, we propose a payload-independent cost learning framework using RL called PICO-RL. This framework directly learns universal costs that can be applied to any payload. PICO-RL incorporates an optimal probability approximation (OPA) module that can calculate the required probability map for embedding simulation directly from a learned cost map for any payload, eliminating the need for time-consuming searches for a valid probability scaling parameter. Additionally, PICO-RL uses an advanced steganalysis environment network to provide more effective reward feedback for learning. During RL training, the learned cost maps of different payloads converge and eventually become similar under the OPA constraint, resulting in payload independence. Experimental results demonstrate that a well-trained PICO-RL model, which acts as a universal cost function, defines costs with superior security performance against steganalysis and has better coding compatibility when encoding with practical steganographic codes.

Abstract:
Existing semi-supervised crowd counting methods usually learn from unlabeled images through pseudo-labels and a spatial consistency regularization paradigm. However, due to extremely limited labeled data and noise in density map pseudo-labels, the counting performance of the model is greatly limited. Although multi-task learning can help the model improve its feature representation ability, it seriously ignores the importance of multi-task collaboration. Therefore, to overcome the above problems, we propose a multi-task pseudo-label self-correction (MTPS) framework for crowd counting, which combines different tasks to enhance the correlation between tasks and reduce training bias. For labeled data, multi-task collaboration enables the model to fully explore the potential information in limited samples; for unlabeled data, multi-task collaboration can obtain more accurate pseudo-labels than the density estimation task alone. In addition, to effectively suppress the problem of inaccurate pseudo-labels caused by noise, we propose a pseudo-label self-correction strategy based on multi-task collaboration. This strategy works from task to task to gradually reduce the interference of background noise and obtain higher-quality pseudo-labels. A large number of experiments on three public datasets show that the proposed MTPS achieves superior counting performance.

Abstract:
Band selection aims at selecting a subset of representative bands from original hyperspectral images (HSIs) to alleviate data redundancy. There are at least two issues in previous methods. First, most of them ignore either global or local structural information instead of considering both. Second, the high-order correlations among spectral bands are not explored during learning. In this paper, we propose a tensorial global-local graph self-representation (TGSR) method for hyperspectral band selection. Specifically, we segment the HSI into diverse superpixels to show the inherent spectral-spatial structures. Based on the generated superpixels, we learn the global and local graphs to explore complex structural information from global pixels and local regions. To alleviate the computational burden, a transformation is designed for easy graph convolution of the global graph and the pixel spectral matrix. With global and local knowledge, we formulate a global-local graph self-representation model to conduct band correlation learning in a self-weighted manner. To explore the high-order correlations among bands, we reorganize the self-representation coefficient matrices into a tensor with a low-rank constraint. We design an alternating optimization algorithm to solve the proposed model. The most representative band is selected from each band subset by performing spectral clustering on the constructed affinity matrix. Experiments on HSI datasets verify the effectiveness of our method over the state-of-the-art methods. The source code is released at https://github.com/ZhangYongshan/TGSR.

Abstract:
High-dimensional image representation is a challenging task since data has intrinsic low-dimensional and shift-invariant characteristics. Currently, popular methods, such as tensor-Singular Value Decomposition (t-SVD), have limited ability in expressing the shift-invariant subspace knowledge underlying data. To address this problem, we propose a high-dimensional image representation framework based on Tensor Convolution-like Low-Rank Dictionary (TCLRD), which considers the shift-invariant low-dimensional structure of tensor-valued data by convolution-like low-rank dictionary learning and coefficient coding, to promote the high-dimensional image representation ability. To be specific, we first define the TCLRD framework with a low-rank constraint for the dictionary and coefficient, in which tensor factorization and the tensor-tensor product over the frequency domain can be understood as a convolution-like operation when describing shift invariance. Then, the tensor Schatten-p norm is introduced to verify that TCLRD has a rational mathematical interpretation. We study the TCLRD minimization problem in tensor completion with the ADMM-based optimization algorithm. The efficient solving scheme with TCLRD is extendable to various low-rank models like tensor robust principal component analysis and subspace clustering, and we prove their theoretical guarantees based on generalization error. Extensive experimental results demonstrate that the proposed TCLRD methods outperform state-of-the-art methods in typical tasks, including image denoising, HSI completion and image clustering.
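
The link between a frequency-domain tensor-tensor product and convolution can be checked numerically. The sketch below uses the standard t-product (FFT along the third mode, slice-wise matrix products, inverse FFT) and compares it with an explicit circular convolution of frontal slices. Whether TCLRD uses exactly this product is an assumption; the function names are illustrative only.

```python
import numpy as np


def t_product(A, B):
    """Standard t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3), computed via FFT."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    # Ordinary matrix product on every frontal slice in the frequency domain.
    C_hat = np.einsum('ikt,kjt->ijt', A_hat, B_hat)
    # Real inputs give a real result; the imaginary part is numerical noise.
    return np.real(np.fft.ifft(C_hat, axis=2))


def t_product_circular(A, B):
    """Same product written as a circular convolution over the third mode."""
    n1, _, n3 = A.shape
    n4 = B.shape[1]
    C = np.zeros((n1, n4, n3))
    for t in range(n3):
        for s in range(n3):
            # Slice s of A meets slice (t - s) mod n3 of B.
            C[:, :, t] += A[:, :, s] @ B[:, :, (t - s) % n3]
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
B = rng.standard_normal((3, 2, 4))

C_fft = t_product(A, B)
C_conv = t_product_circular(A, B)
agree = np.allclose(C_fft, C_conv)
```

The two routines agree up to floating-point error, which is the sense in which a product taken in the frequency domain behaves like a convolution along the third mode.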

Abstract:
Change captioning aims to describe the semantic change between a pair of images with natural language while remaining immune to viewpoint change. Based on the encoder-decoder architecture, most existing methods primarily focus on encoding effective change representations for transmission to the decoder. However, they suffer from an insufficient understanding of visual semantics, inadequate single-pass feature comparison, and a confounding bias caused by imbalanced viewpoint change data. These impair change representations and hinder unbiased caption generation. In this paper, we analyze and identify the confounding bias from a causality perspective and propose a Relation-aware Multi-pass Comparison Deconfounded (RMCD) network for change captioning, which elevates the encoding of change representations and mitigates the bias. Specifically, in the encoding stage, to sufficiently understand visual semantics, a position-guided context aggregating module is presented to capture the positional and contextual relations among objects in the image. Then, to achieve comprehensive change representations, we present a multi-pass feature comparison module to recognize semantic differences at various feature levels and progressively integrate them. In the decoding stage, to generate de-biased captions, the causal intervention is employed to remove the confounding bias which introduces spurious correlations between encoded change representations and captions. The newly achieved state-of-the-art performance on four publicly available benchmark datasets and further visual analysis demonstrate the superiority of our method.

Abstract:
We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: 1) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and 2) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast’s 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast’s 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
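
The ray-plane intersection step can be sketched in a few lines of plain Python. This is only the geometric primitive: in practice the ray would be back-projected from a detection using camera calibration, which is not shown here. The function name, the coordinate convention, and the example numbers are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two 3D vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))


def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the 3D point where a camera ray meets a plane.

    Returns None if the ray is (nearly) parallel to the plane or if the
    plane lies behind the ray origin.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray parallel to the plane: no single intersection

    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None  # intersection would be behind the camera

    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup: the vertical plane y = 0 and a camera 10 units in front of it.
plane_point = (0.0, 0.0, 0.0)
plane_normal = (0.0, 1.0, 0.0)
camera_center = (0.0, -10.0, 2.0)

hit = ray_plane_intersection(camera_center, (0.0, 1.0, 0.0), plane_point, plane_normal)
miss = ray_plane_intersection(camera_center, (1.0, 0.0, 0.0), plane_point, plane_normal)
```

A single view is enough to produce a 3D candidate this way because the plane constraint supplies the depth that triangulation would otherwise need a second, non-opposing view to resolve.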

Abstract:
Depth perception plays an essential role in the viewer experience for immersive virtual reality (VR) visual environments. However, previous research investigations in the depth quality of 3D/stereoscopic images are rather limited, and in particular, are largely lacking for 3D viewing of 360-degree omnidirectional content. In this work, we make one of the first attempts to develop an objective quality assessment model named depth quality index (DQI) for efficient no-reference (NR) depth quality assessment of stereoscopic omnidirectional images. Motivated by the perceptual characteristics of the human visual system (HVS), the proposed DQI is built upon multi-color-channel, adaptive viewport selection, and interocular discrepancy features. Experimental results demonstrate that the proposed method outperforms state-of-the-art image quality assessment (IQA) and depth quality assessment (DQA) approaches in predicting the perceptual depth quality when tested using both single-viewport and omnidirectional stereoscopic image databases. Furthermore, we demonstrate that combining the proposed depth quality model with existing IQA methods significantly boosts the performance in predicting the overall quality of 3D omnidirectional images.

Abstract:
Large-scale vision-language pre-trained models like CLIP are extensively employed in few-shot tasks due to their robust generalization capabilities. Existing methods usually incorporate additional techniques to acquire knowledge for new tasks, building upon the general knowledge in CLIP. However, they overlook that the task-related knowledge might be implicitly embedded within the well-learned general knowledge. In this paper, we propose a novel framework to reallocate and evolve the general knowledge for specific few-shot tasks (REGK), mimicking the human “Attention Allocation” cognition mechanism. With a learnable mask-tuning selection, REGK focuses on selecting the task-related parameters of CLIP while learning specific few-shot knowledge without altering CLIP’s underlying framework. Specifically, we initially observe that inheriting the strong knowledge representation capability in CLIP is more advantageous for few-shot learning than its task-solving ability. Subsequently, a two-stage tuning framework is introduced to reallocate and control the mask-tuning on different tasks. It allows the model to automatically perform mask-tuning on different few-shot tasks with selective sparsity training. In this way, we achieve reliable transfer of task-related knowledge and effective exploration of new knowledge from limited data to enhance few-shot learning. Extensive experiments validate the superiority and potential of our model.

Abstract:
Streaming perception, a critical task in computer vision, involves the real-time prediction of object locations within video sequences based on prior frames. While current methods like StreamYOLO mainly rely on coordinate information, they often fall short of delivering precise predictions due to feature misalignment between input data and supervisory labels. In this paper, a novel method, Future Feature-based Supervised Contrastive Learning (FFSCL), is introduced to address this challenge by incorporating appearance features from future frames and leveraging supervised contrastive learning techniques. FFSCL establishes a robust correspondence between the appearance of an object in current and past frames and its location in the subsequent frame. This integrated method significantly improves the accuracy of object position prediction in streaming perception tasks. In addition, the FFSCL method includes a sample pair construction module (SPC) for the efficient creation of positive and negative samples based on future frame labels and a feature consistency loss (FCL) to enhance the effectiveness of supervised contrastive learning by linking appearance features from future frames with those from past frames. The efficacy of FFSCL is demonstrated through extensive experiments on two large-scale benchmark datasets, where FFSCL consistently outperforms state-of-the-art methods in streaming perception tasks. This study represents a significant advancement in the incorporation of supervised contrastive learning techniques and future frame information into the realm of streaming perception, paving the way for more accurate and efficient object prediction within video streams.

Abstract:
Structured pruning is an efficient compression technique that significantly reduces the inference latency and energy consumption of convolutional neural networks (CNNs) by eliminating redundant filters. However, existing works suffer from expensive algorithm costs in multi-hardware deployment scenarios involving several budgets across multiple hardware devices. To tackle this challenge, we propose a novel all-in-one hardware-oriented compression framework (AHC), which integrates structured pruning and data pruning to rapidly generate a large number of hardware-efficient models with ultra-low pruning and fine-tuning costs. Specifically, AHC develops a unified hardware-aware pruning (UHP) scheme, which rapidly generates numerous hardware-efficient models for several budgets across multiple hardware devices in a single pruning process, thereby reducing pruning costs in multi-hardware deployment scenarios. Moreover, AHC proposes a progressive data pruning (PDP) scheme, which gradually removes samples that have a negligible impact on enhancing the predictive ability of pruned models, thereby accelerating the fine-tuning process with negligible performance loss. Extensive experiments demonstrate the superiority of AHC over state-of-the-art (SOTA) structured pruning methods in terms of algorithm costs, latency, and accuracy. In particular, compared with the SOTA hardware-oriented pruning method, AHC achieves comparable performance while reducing pruning costs by 5.3× and fine-tuning costs by 2.7× in multi-hardware deployment scenarios. Code is available at https://github.com/HXuan-Wang/AHC.

Abstract:
Current multi-view subspace clustering methods typically consist of a within-view module, which explores inherent characteristics using the self-expressive coefficient matrix, and a cross-view module, which promotes consensus among all views toward similar strengths. However, the self-expressive coefficients are directly influenced by the characteristics and distributions of input features, and coefficient matrices with varying strengths may indicate the same clustering structure. Therefore, directly regularizing the coefficient matrices towards a common matrix is unnecessary and may even diminish the clustering performance. We find that it is the relative data relationship, rather than the absolute similarity, that plays a pivotal role in clustering. Building on this realization, we propose a relative comparison measure that enables a more contextual understanding of the data relationship. Subsequently, we develop a Relative Comparison-based Consensus Learning (RCCL) model for multi-view subspace clustering, which encourages the relative data similarities to be consistent across different views. Our RCCL model advances in identifying the underlying data relationship, avoiding unnecessary constraints on absolute consistency, and thereby delving into the fundamental nature of multi-view consensus. We introduce an elegant transformation operator for relative comparison and solve RCCL under the framework of alternating direction method of multipliers. Extensive experiments unequivocally demonstrated the superiority of RCCL.
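
The distinction between absolute similarity and relative data relationships can be made concrete with a small sketch. The measure below is an assumption introduced for illustration (it is not the RCCL operator): it counts the fraction of triplets (i, j, k) on which two similarity matrices agree about whether sample i is closer to j or to k.

```python
from itertools import combinations
from typing import List

Matrix = List[List[float]]


def relative_agreement(a: Matrix, b: Matrix) -> float:
    """Fraction of triplets on which two views order the neighbors identically."""
    n = len(a)
    agree = total = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j, k in combinations(others, 2):
            diff_a = a[i][j] - a[i][k]
            diff_b = b[i][j] - b[i][k]
            total += 1
            # Same sign of the difference means the same relative comparison.
            if (diff_a > 0) == (diff_b > 0) and (diff_a < 0) == (diff_b < 0):
                agree += 1
    return agree / total


view1 = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.2],
         [0.1, 0.2, 1.0]]
# Same structure at one tenth of the strength.
view2 = [[0.1 * x for x in row] for row in view1]
# A view whose neighbor ordering is actually different.
view3 = [[1.0, 0.1, 0.8],
         [0.1, 1.0, 0.2],
         [0.8, 0.2, 1.0]]

same_structure = relative_agreement(view1, view2)
different_structure = relative_agreement(view1, view3)
```

Here `view1` and `view2` differ greatly in absolute value yet agree on every relative comparison, so forcing them toward a common matrix would add a constraint without changing the clustering structure.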

Abstract:
Accurately identifying correct correspondences (inliers) in two-view images is a fundamental task in computer vision. Recent studies usually adopt Graph Neural Networks or stack local graphs into global ones to establish neighborhood relations. However, the smoothing property of the Graph Convolutional Network (GCN) causes the model to fall into local extrema, which makes inliers and outliers indistinguishable. In particular, when the initial correspondences contain a large number of incorrect correspondences (outliers), these studies suffer from severe performance degradation. To address the above issues and refocus perspective information on distinct features, we design a Consistency Guided ResFormer Network (CGR-Net) that uses consistent correspondences to guide model perspective focusing, thereby avoiding the negative impact of outliers. Specifically, we design an efficient Graph Score Calculation module, which aims to compute global graph scores by enhancing the representation of important features and comprehensively capturing the contextual relationships between correspondences. Then, we propose a Consistency Guided Correspondences Selection module to dynamically fuse global graph scores and consistency graphs and construct a novel consistency matrix to accurately recognize inliers. Extensive experiments on various challenging tasks demonstrate that our CGR-Net outperforms state-of-the-art methods. Our code is released at https://github.com/XiaojieLi11/CGR-Net.

Abstract:
Video deblurring is a challenging task as the blur is often spatially variant. Existing methods mainly engage in building the spatial-temporal correspondence among the frames. As one of the widely-used frameworks, the long-range temporal propagation usually suffers from the expensive computation cost and error accumulation caused by the numerous connections among temporal frames. Meanwhile, the exploration of spatial-variant information from the neighbor frames is often ignored in video deblurring. To tackle these issues, we tailor an efficient short-range multi-scale framework slimming the long-range propagation and exploiting the most relevant neighbor temporal knowledge. For capturing spatial knowledge, we further propose a spatial feature extractor, named the spatially variant adaptive block, to adaptively generate the location-wise kernel to cater to the spatially variant character of blur. For efficient temporal exploitation, a simple inter-frame shift as a motion compensation is developed to avoid expensive long temporal relevance modeling. Both quantitative and qualitative evaluation results on benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.

Abstract:
Facial expressions are an essential part of human emotional communication, and micro-expressions (MEs), as transient and imperceptible non-verbal signals, can potentially reveal real human emotions. However, subtle motion variations, limited and unbalanced samples make micro-expression recognition (MER) challenging. In this paper, we design a novel dual-branch learning framework of multi-level flow-driven attention for micro-expression recognition (MFDAN), which innovatively integrates optical flow prior to guide the attention learning in the image encoding branch, enabling the model to focus on the most discriminative facial regions for subtle motion patterns. Firstly, we extract optical flow information by an optical flow encoding module. Then, in the image coding module, we construct a Transformer structure containing an optical flow-driven attention mechanism, which can effectively locate the interest region of micro-expressions in the image according to the position information of optical flow to capture more sensitive and fine-grained micro-expressions. By interoperating prior knowledge with data learning, and introducing the Dropkey operation and Focal Loss, our method can handle subtle micro-expression features on small imbalanced datasets. Through extensive experiments on three independent datasets and a composite database, including SMIC-HS, SAMM, and CASME II, robust leave-one-subject-out (LOSO) evaluation results show that our method outperforms state-of-the-art methods especially on the composite database.
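
The Focal Loss mentioned above addresses class imbalance by down-weighting well-classified samples. A minimal sketch of the standard formulation, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), follows; the `gamma` and `alpha` defaults are common choices and are assumptions here, not the values used in the paper.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample given predicted class probabilities."""
    p_t = probs[target]
    # (1 - p_t) ** gamma shrinks the loss of confidently correct predictions.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)   # confident, correct prediction
hard = focal_loss([0.1, 0.9], target=0)   # confident, wrong prediction
plain_ce = focal_loss([0.9, 0.1], target=0, gamma=0.0)  # reduces to cross-entropy
```

With `gamma = 0` the expression reduces to ordinary cross-entropy; larger `gamma` shifts more of the training signal onto hard, often minority-class, samples.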

Abstract:
Visible-infrared person re-identification (Re-ID) plays a crucial role in matching people across camera views in darkness and normal lighting. To reduce annotation cost, it is advantageous to learn a Re-ID model from unlabeled visible-infrared image pairs. However, the large modality gap makes it difficult to discover the underlying cross-modality sample relations. Compared with cross-modality sample pairs in the target domain, it is easier to obtain more single-modality visible image samples from other domains. In this work, we study unsupervised transfer learning to extract modality-shared knowledge from auxiliary unlabeled visible images in a source domain and leverage this knowledge to learn cross-modality matching in the unlabeled target domain. Our framework consists of two stages: RGB-gray asymmetric mutual learning and unsupervised cross-modality self-training. In the first stage, to extract visible-infrared shared information from auxiliary unlabeled visible images, we regard RGB images and grayscale fake infrared images transformed from RGB images as two views to learn view-shared information and simultaneously preserve RGB-specific information. Based on information theoretic analysis, we learn an RGB-gray feature extractor and further introduce an auxiliary gray feature extractor to quantify RGB-gray shared knowledge. This knowledge is then transferred to the RGB-gray feature extractor without eliminating RGB-specific information. We call this process Cross-Modality Asymmetric Mutual Learning (CMAM). In the second stage, for unsupervised cross-modality self-training in the target domain, we fuse the complementary knowledge in two models by mutual learning and employ bipartite cross-modality pseudo labeling to alleviate the modality gap. For a more extensive evaluation, we collected a new public multi-modality dataset, SYSU-MM02, constructed from untrimmed videos. Our method achieves state-of-the-art performance on three benchmark datasets.
Project page: https://www.isee-ai.cn/project/sysumm02.html.

Abstract:
Despite recent advancements in masked skeleton modeling and visual-language pre-training, no method has yet been proposed to capture and utilize the rich semantic information embedded in both modalities for enhanced action recognition. To address this challenge, we propose a novel Motion-Aware Mask Feature Reconstruction (MMFR) method for the challenging task of skeleton-based action recognition. MMFR integrates masked skeleton feature reconstruction with a visual-language pre-trained model within a consolidated framework, aiming to leverage the synergistic potential of both domains. Specifically, it employs a visual-language model to infuse semantic understanding into the skeleton feature reconstruction process via probability distribution distillation. Moreover, we introduce a multi-granularity semantic contrast module that refines vision-text alignment precision and augments contextual information for accurate mask reconstruction. Extensive experiments demonstrate MMFR’s superiority in skeleton-based action recognition, as well as its efficacy in zero-shot scenarios.

Abstract:
Video anomaly detection is a challenging task due to the unpredictable nature of abnormal actions, sophisticated semantics, and a lack of training data. The visual representations of most existing approaches are limited to short-term sequences, which cannot provide the clues necessary for reasonable detections. In this paper, we propose to comprehensively represent the motion patterns in human actions by learning from long-term sequences. Firstly, a Stacked State Machine (SSM) model with distinctive basis functions is proposed to represent the temporal dependencies that are consistent across long-term observations. Secondly, the dependencies are leveraged to filter out problematic motion estimations that are influenced by short-term observation noise, so that plausible motion parameters are obtained. Finally, the SSM model predicts future states based on past ones; the divergence between the predictions, which carry the inherent normal patterns, and the observed states identifies anomalies that violate normal motion patterns. To address the challenges in drone-based surveillance, a dataset that is more diversified than existing ones is built. Extensive experiments are carried out to evaluate the proposed approach on this dataset and existing ones, and improvements over state-of-the-art methods are observed. The proposed dataset will be made publicly available. Code is available at https://github.com/AllenYLJiang/Anomaly-Detection-in-Sequences.
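
The final step, scoring anomalies by the divergence between predicted and observed states, can be sketched in a few lines. Note the substitution: the paper's SSM predictor is replaced here by a simple constant-velocity extrapolation, used purely as a stand-in, and the function name and one-dimensional track are assumptions.

```python
from typing import List, Sequence


def anomaly_scores(track: Sequence[float]) -> List[float]:
    """Score each state by its distance from a constant-velocity prediction."""
    scores = []
    for t in range(2, len(track)):
        # Stand-in predictor: extrapolate from the two previous states.
        predicted = 2 * track[t - 1] - track[t - 2]
        scores.append(abs(track[t] - predicted))
    return scores


# A 1-D position that moves steadily and then jumps.
scores = anomaly_scores([0, 1, 2, 3, 10, 11])
```

States that follow the learned pattern score near zero, while the jump and its aftermath produce large divergences that a threshold could flag as anomalous.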

Abstract:
Aerial tracking has received growing attention due to its broad practical applications. However, single-view aerial trackers are still limited by challenges such as severe appearance variations and occlusions. Existing multi-view trackers utilize cross-drone information to address these issues but struggle to overcome heterogeneous differences. In this paper, we propose a novel Transformer-based consistent representation mining (CRM) module to capture invariant target information and suppress the heterogeneous differences in cross-drone information. First, CRM divides the heterogeneous input into regions and measures semantic relevance by modeling the relations between these regions. Then, reliable target regions are roughly localized by selecting the top k most relevant regions. Next, global perception is performed on these reliable regions via multi-head sparse self-attention, further enhancing the understanding of the target and suppressing background regions. In particular, CRM, as a plug-and-play module, can be flexibly embedded into different tracking frameworks (CRM-Siam and CRM-DiMP). Besides, a multi-view correction strategy is designed to ensure timely correction of multi-view information and full utilization of each drone’s own information. Extensive experiments on the multi-drone dataset, MDOT, demonstrate that CRM-assisted trackers effectively improve the accuracy and robustness of the multi-drone tracking system, outperforming other outstanding trackers. The code and models are available at https://github.com/xyl-507/CRM.

Abstract:
In this paper, we address the problem of personalized gaze estimation. Due to the anatomical differences between individuals, current personalized gaze models often rely on fine-tuning or fully-supervised methods with labeled calibration samples, which may not be practical in real-world applications. To tackle this limitation, we propose an approach called Self-Supervised Test-Time Adaptation for Personalized Gaze Estimation (TTAGaze), which enables adaptation with a small amount of unlabeled data at test time. Our goal is to develop a gaze estimation model specifically adapted to a target person using only a few unlabeled images. We call this setting unsupervised few-shot personalized adaptation in gaze estimation, which is more aligned with real-world scenarios than existing approaches. Our approach leverages self-supervised learning and meta-learning. The model consists of the main task (gaze estimation) and a self-supervised auxiliary task. During training, the two tasks are trained jointly. At test time, adaptation to an unseen person is achieved by optimizing the self-supervised loss on a few unlabeled images. The model parameters are learned via model-agnostic meta-learning (MAML) to facilitate effective unsupervised few-shot personalized adaptation in gaze estimation. Experimental results demonstrate that the proposed method outperforms alternative approaches on several widely-used benchmark datasets.

Abstract:
Recently, learned image compression methods have achieved better rate-distortion performance than traditional non-learning image compression standards. Some previous image compression methods combine the local modeling capability of CNNs with the long-range attention of Transformers to generate the latent representation. However, previous methods ignore the fact that Transformers attend to low-frequency feature learning while CNNs focus on high-frequency feature learning, resulting in insufficient fusion of these two structures. In this paper, we propose a novel image compression method with a Frequency Decomposition Network (FDNet), which processes low-frequency and high-frequency components in different ways. More specifically, FDNet first implements a dynamic frequency filter to adaptively decompose the features into low-frequency and high-frequency components. As invertible neural networks do not lose any information during the feature transformation and can be implemented by CNN residual networks, the invertible neural network block (INNB) is used to extract high-frequency local information. Then FDNet uses a hybrid attention block (HAB), which is composed of window-based multi-head self-attention (W-MSA) and channel attention, to extract window-based and global spatial low-frequency information. Besides, previous channel entropy models adopt CNNs to remove high-frequency redundancy of the latent representation. However, there exists low-frequency redundancy between different channels of the latent representation. To solve this issue, FDNet further introduces the hybrid attention block to the channel entropy model. W-MSA and channel attention of the hybrid attention block can remove the window-based and global low-frequency redundancy, respectively. Extensive experiments demonstrate that FDNet achieves promising rate-distortion performance on the Kodak, CLIC and Tecnick datasets.
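
The decomposition step can be illustrated on a one-dimensional signal. The sketch below is an assumption-laden simplification: FDNet uses a learned dynamic frequency filter on feature maps, whereas this example uses a fixed moving-average (box) filter as the low-pass stage and takes the residual as the high-frequency part.

```python
from typing import List, Sequence, Tuple


def decompose(signal: Sequence[float],
              window: int = 3) -> Tuple[List[float], List[float]]:
    """Split a signal into a smooth low-frequency part and a residual."""
    half = window // 2
    low = []
    for i in range(len(signal)):
        # Box filter, truncated at the borders.
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        low.append(sum(signal[lo:hi]) / (hi - lo))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


signal = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]
low, high = decompose(signal)
reconstructed = [l + h for l, h in zip(low, high)]

flat_low, flat_high = decompose([2.0] * 5)
```

Because the high-frequency part is defined as the residual, the two components sum back to the input, so each branch can be processed differently without discarding information at the split.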

Abstract:
Deep learning-based automatic license plate recognition methods have made significant advancements and are now widely used in real-world applications. Currently, license plate character recognition primarily relies on classification and the Connectionist Temporal Classification approach. While these methods achieve high recognition accuracy, they face challenges in accurately estimating confidence in occlusion scenarios. To solve this problem, we propose a novel style reconstruction-based network that can transform input license plates into standardized images. It computes character prediction confidence using a lightweight matching module, which effectively reduces the confidence score for occluded sections. Our network ingeniously integrates deep learning with traditional character segmentation methods, offering a fresh perspective on license plate recognition. Besides, to address the class imbalance in existing license plate datasets, we propose a novel synthetic license plate generation method that exchanges styles between real and standardized license plates. In order to comprehensively evaluate license plate recognition models across different regions, we construct and release the Chinese Balanced License Plates (CBLP) dataset, which includes over 30,000 images from all provinces in mainland China. Experimental evaluations on multiple datasets demonstrate that our methods achieve state-of-the-art performance. The code and dataset are available at https://github.com/tj-cvrsg/lpsrnet.

Abstract:
Group activity recognition is a challenging task because it involves diverse individual actions and complex relations. Most existing methods enhance individual representation by introducing relation inference using appearance features. Some methods utilize extra knowledge, such as action labels, to enhance relation inference and refine the individual representation, but the knowledge they explored is simple and insufficient. In this paper, we propose a novel idea of knowledge concretization and further develop a Knowledge Augmented Relation Inference framework (KARI) for group activity recognition. Specifically, we first concretize knowledge from training data, and then represent them as Class-Class co-occurrence Map (C-C Map) and Class-Position distribution Map (C-P Map). On top of them, KARI explores concretized knowledge to integrate visual and semantic representation in a unified architecture for group activity recognition. Experimental results on two public datasets show that the proposed framework performs favorably compared with state-of-the-art approaches.
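
As a rough illustration of concretizing knowledge from training data, a class-class co-occurrence map can be estimated by counting which action labels appear together in the same scene. The normalization (conditional frequency of class b given class a), the function name, and the toy labels below are assumptions; the paper's exact definition of the C-C Map may differ.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Tuple


def cooccurrence_map(scenes: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate P(class b present | class a present) from per-scene labels."""
    pair_counts: Counter = Counter()
    class_counts: Counter = Counter()
    for labels in scenes:
        present = sorted(set(labels))
        class_counts.update(present)
        pair_counts.update(permutations(present, 2))
    return {(a, b): count / class_counts[a]
            for (a, b), count in pair_counts.items()}


# Hypothetical individual-action labels for three training scenes.
scenes = [
    ["spike", "block", "stand"],
    ["spike", "block"],
    ["set", "stand"],
]
cc_map = cooccurrence_map(scenes)
```

In this toy data, "block" always accompanies "spike", so the map assigns that pair a value of 1.0, a prior that relation inference could then exploit.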

Abstract:
Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions struggle to discriminate the foreground (FG) objects from the suspicious background (BG) pixels (a.k.a. co-occurring) and to learn the integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurring problems. We abandon using the class prototype or pixel-level features for BG representation. Instead, we develop a novel primitive, negative region of interest (NROI), to capture the fine-grained BG semantic information and conduct the pixel-to-NROI contrast to distinguish the confusing BG pixels. We also present an active sampling strategy to mine the FG negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning to activate the entire object region. Thanks to its simple design and ease of use, our proposed method can be seamlessly plugged into various models, yielding new state-of-the-art results under various WSSS settings across benchmarks. Leveraging solely image-level (I) labels as supervision, our method achieves 73.2 mIoU and 45.6 mIoU segmentation results on the PASCAL VOC and MS COCO test sets, respectively. Furthermore, by incorporating saliency maps as an additional supervision signal (I+S), we attain 74.9 mIoU on the PASCAL VOC test set. In addition, our FBR approach demonstrates meaningful performance gains in weakly-supervised instance segmentation (WSIS) tasks, showcasing its robustness and strong generalization capabilities across diverse domains.

Abstract:
Previous works condition aging patterns utilizing one-hot or artificial predefined distributions. Nevertheless, different age groups show different intraclass variations. This property made it challenging to express differences in apparent age across all age groups discriminately. Adaptive aging feature distribution by learning the target age group in training data is a promising solution. Unfortunately, existing datasets commonly suffer from diverse degrees of semantic-level attribute imbalance, which leads to the tendency for previous approaches to generate paradoxical appearances. To address the aforementioned issues, we propose a novel framework containing three modules: the Causal Aging (CA) module, the Shapley Value Quantization (SVQ) module, and the Differentiated Age Embedding Transformation (DAT) module. Specifically, to eliminate the effect of attribute imbalance on the adaptive distribution of learning target age groups, we design the CA module, which controls the effect of momentum on aging features by De-confound training. Meanwhile, the influence of the aging-independent attribute, which appears abundantly in training data, on the target aging feature is eliminated by counterfactual inference subtraction. Subsequently, the SVQ module quantifies the contribution of different attributes to age based on the results of the CA module. This operation allows us to obtain adaptive age distributions for different age groups. Eventually, the DAT module takes a target age vector, sampled from the target age distribution quantized by SVQ, and modulates the age representation of the generated image. Extensive experimental results on four face aging datasets show that our model achieves convincing performance compared to the current state-of-the-art methods.

Abstract:
Facial expression recognition (FER) remains a challenging task due to the ambiguity and subtlety of expressions. To address this challenge, current FER methods predominantly prioritize visual cues while inadvertently neglecting the potential insights that can be gleaned from other modalities. Recently, vision-language pre-training (VLP) models integrated textual cues as guidance, culminating in a powerful multi-modal solution that has proven effective for a range of computer vision tasks. In this paper, we propose a Cross-Modal Emotion-Aware Prompting (CEPrompt) framework for FER based on VLP models. To make VLP models sensitive to expression-relevant visual discrepancies, we devise an Emotion Conception-guided Visual Adapter (EVA) to capture the category-specific appearance representations with emotion conception guidance. Moreover, knowledge distillation is employed to prevent the model from forgetting the pre-trained category-invariant knowledge. In addition, we design a Conception-Appearance Tuner (CAT) to facilitate the interaction of multi-modal information via cooperatively tuning between emotion conception and appearance prompts. In this way, semantic information about emotion text conception is infused directly into facial appearance images, thereby enhancing a comprehensive and precise understanding of expression-related facial details. Quantitative and qualitative experiments show that our CEPrompt outperforms state-of-the-art approaches on three real-world FER datasets. The code is available at https://github.com/HaoliangZhou/CEPrompt.

Abstract:
Prompt tuning, an emerging parameter-efficient strategy, leverages the powerful knowledge of large-scale pre-trained image-text models (e.g., CLIP) to swiftly adapt to downstream tasks. Despite its effectiveness, adapting prompt tuning to text-video retrieval encounters two limitations: i) existing methods adopt two isolated prompt tokens to prompt two modal branches separately, making it challenging to learn a well-aligned unified representation, i.e., modality gap; ii) video encoders typically utilize a fixed pre-trained visual backbone, neglecting the incorporation of spatial-temporal information. To this end, we propose a simple yet effective method, named Unified Modality-aware Prompt Tuning (UMP), for text-video retrieval. Concretely, we first introduce a Unified Prompt Generation (UPG) module to dynamically produce modality-aware prompt tokens, enabling the perception of prior semantic information on both video and text inputs. These prompt tokens are simultaneously injected into two branches that can bridge the semantics gap between two modalities in a unified-adjusting manner. Then, we design a parameter-free Spatial-Temporal Shift (STS) module to facilitate both intra- and inter-communication among video tokens and prompt tokens in the spatial-temporal dimension. Notably, extensive experiments on four widely used benchmarks show that UMP achieves new state-of-the-art performance compared to existing prompt-tuning methods without bringing excessive parameters. Code is available at: https://github.com/zchoi/UMP_TVR.

Abstract:
Video-Text Retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. Generally, a video inherently contains multi-grained semantics and corresponds to several different texts, which makes the task challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features need to be adaptively aggregated from correlative word/frame features, which is very demanding. However, existing methods utilize simple intra-modal self-attention to generate phrase features without considering the following three aspects: cross-modality semantic correlation, phrase generation noise, and diversity. In this paper, we propose a novel Reliable Phrase Mining model (RPM) to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model enjoys several merits. Firstly, to guarantee the semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes as the joint query to aggregate the semantically related frame/word features into adaptive-grained phrase features. Secondly, to deal with the phrase generation noise, the proposed denoised decoder module is responsible for obtaining more reliable similarity between prototypes and frame/word features. Specifically, not only the correlation between frame/word features and prototypes, but also the correlation among prototypes, is taken into account when calculating the similarity. Furthermore, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is to make phrases produced by the same prototype more similar than those produced by different prototypes.
Extensive experiment results demonstrate that the proposed method performs favorably on three benchmark datasets, including MSR-VTT, MSVD, and LSMDC.

Abstract:
Transformer attention plays an important role in current top-performing trackers. However, it is bottom-up, driven by stimulus, and lacks intrinsic prior guidance. This bottom-up attention mechanism leads to an emphasis on all objects in the input images, rather than the task-related objects. As a result, the performance of bottom-up attention based trackers deteriorates in complicated scenes. To address this issue, we propose a robust tracker, named TBTrack, that combines bottom-up attention with top-down attention while remaining compatible with the existing ViT framework. TBTrack can not only utilize the existing bottom-up attention mechanisms to model the long-range relationship of input tokens, but also utilize a newly added top-down attention mechanism to pay more attention to the task-related object and further eliminate interference from similar objects and backgrounds. Specifically, we first design a top-down prior generation module using an adaptive learning parameter combined with the template inputs to obtain top-down task-guided signals. Then, we inject the prior signals into a bottom-up attention module to obtain a top-down and bottom-up attention combination block (TB-Block). Finally, we stack these TB-Blocks to construct our tracker (TBTrack) with top-down prior guidance capability, which focuses more on the task-related object. In extensive experiments, TBTrack achieves impressive performance on multiple tracking benchmarks, including GOT-10k, LaSOT, LaSOT_ext, TNL2K, TrackingNet, and UAV123. The code and trained models will be publicly available.

Abstract:
Intra-camera supervision (ICS) person re-identification (Re-ID) assumes that a person’s identity labels are independently annotated within each camera, lacking inter-camera association for person identities. Recently, several ICS methods have achieved significant results by using two stages for model training: intra-camera learning and inter-camera learning. However, in the intra-camera learning stage, these methods only focus on pedestrian features within each camera, which increases the variance of the same person across different cameras. In the inter-camera learning stage, due to lighting variations and background shifts, the pseudo-labels generated from feature similarity contain significant noise, and the unassociated outlier samples are not fully utilized. To address these issues, we propose a Contrastive Mean Teacher (CMT) framework combining the mean-teacher paradigm and contrastive learning. Specifically, by conducting both intra-camera and inter-camera learning simultaneously, we can fully leverage predefined intra-camera labels and inter-camera-associated labels. This method can effectively learn pedestrian features under various cameras. Moreover, the teacher model provides more stable predictions, which helps to establish a better inter-camera association and improves the model’s generalization capabilities. Finally, we design a background filtering module that employs attention mechanisms to guide instance normalization, further reducing variations in identity features caused by lighting and background changes. We validate our method on three large-scale person re-identification datasets, and the results show that our approach outperforms all existing ICS methods. Specifically, our approach achieves state-of-the-art accuracy of 88.9% mAP and 95.8% Rank-1 on the challenging Market1501 benchmark with ResNet-50, even surpassing the performance of state-of-the-art fully supervised methods.
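
In the mean-teacher paradigm, the teacher's weights are an exponential moving average (EMA) of the student's weights, which is why its predictions are more stable. A minimal sketch of that update follows; the momentum value and the flat list of weights are assumptions for illustration, not the configuration used in CMT.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.999) -> List[float]:
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


teacher = [0.0]
student = [1.0]
# Three training steps with a deliberately low momentum to make the drift visible.
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
```

The teacher trails the student smoothly (0.1, 0.19, 0.271 in this toy run), so noisy single-step changes in the student are averaged out before they influence the pseudo-labels.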

Abstract:
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning approach has been demonstrated to be more effective for handling few-shot tasks than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which considers skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. By utilizing a simple regression loss, the framework is able to transfer robust and high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and benefit from the prior knowledge obtained from a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
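
The abstract only says "a simple regression loss". In the original BYOL formulation this is the mean squared error between L2-normalized vectors, which equals 2 minus twice their cosine similarity. The sketch below assumes that form; the function name and the toy vectors are illustrative, and the paper may use a different variant.

```python
import math


def byol_regression_loss(pred, target):
    """BYOL-style loss: 2 - 2 * cosine_similarity(pred, target).

    Here `pred` would be the skeleton-branch prediction and `target` the
    video-branch representation (assumed roles, not confirmed by the abstract).
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)


aligned = byol_regression_loss([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = byol_regression_loss([1.0, 0.0], [0.0, 3.0])  # perpendicular
print(aligned, orthogonal)
```

The loss is close to zero when the two representations point in the same direction, regardless of their magnitudes, and reaches 2 for perpendicular vectors.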

Abstract:
Different from large-scale classification tasks, fine-grained visual classification is a challenging task due to two critical problems: 1) evident intra-class variances and subtle inter-class differences, and 2) overfitting owing to fewer training samples in datasets. Most existing methods extract key features to reduce intra-class variances, but pay no attention to subtle inter-class differences in fine-grained visual classification. To address this issue, we propose a loss function named exploration of class center, which consists of a multiple class-center constraint and a class-center label generation. This loss function fully utilizes the information of the class center from the perspective of features and labels. From the feature perspective, the multiple class-center constraint pulls samples closer to the target class center, and pushes samples away from the most similar nontarget class center. Thus, the constraint reduces intra-class variances and enlarges inter-class differences. From the label perspective, the class-center label generation utilizes class-center distributions to generate soft labels to alleviate overfitting. Our method can be easily integrated with existing fine-grained visual classification approaches as a loss function, further boosting their performance at only a slight training cost. Extensive experiments are conducted to demonstrate consistent improvements achieved by our method on four widely-used fine-grained visual classification datasets. In particular, our method achieves state-of-the-art performance on the FGVC-Aircraft and CUB-200-2011 datasets.
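
The pull/push behaviour and the soft-label idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's loss: the hinge form, the squared Euclidean distance, the `margin` and `temperature` parameters, and both function names are assumptions made for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_constraint(feature, centers, target, margin=1.0):
    """Hinge loss: pull toward the target center, push from the nearest other center."""
    d_target = sq_dist(feature, centers[target])
    d_other = min(sq_dist(feature, c) for k, c in enumerate(centers) if k != target)
    return max(0.0, d_target - d_other + margin)


def center_soft_label(centers, target, temperature=1.0):
    """Soft label: softmax over negative distances from the target center to all centers."""
    scores = [-sq_dist(centers[target], c) / temperature for c in centers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]  # toy class centers
loss = center_constraint([0.2, 0.0], centers, target=0)
soft = center_soft_label(centers, target=0)
print(loss, soft)
```

In this toy setup the soft label places most of its mass on the target class and a little on the nearby class, so visually similar classes share probability instead of being treated as equally wrong.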

Abstract:
Recently, self-supervised denoising methods have attracted significant attention due to the considerable challenge posed by constructing a large-scale real noise dataset for supervised training. The most representative self-supervised denoisers are based on blind-spot networks (BSNs), which exclude the central pixel of the receptive field. However, excluding any input pixel potentially leads to the loss of vital information required for accurate predictions, especially when the excluded pixel corresponds to the output position. In addition, a standard BSN struggles to effectively reduce real-world noise due to the spatial correlation of noise, although it achieves significant results on independently distributed synthetic noise. In this paper, we propose a novel self-supervised real-world image denoising framework called Complementary-BSN based on two reciprocal branches (Mask-Map branch and Enhanced-PD-BSN branch) with an efficient loss function to exploit the pixel information ignored by masked convolution and provide an additional optimization target for the self-supervised output. Specifically, we exploit a block-wise random-placing (BRP) scheme to further weaken the noise correlation, which avoids the illusion of image structure recovery caused by complex noise and makes Complementary-BSN more suitable for real noise. Additionally, we develop an efficient strategy (multi-stride PD (MPD)) to fuse multiple PD strides for inference, narrowing the restoration gap between textural and flat regions. Extensive experiments on real-world datasets demonstrate that our method achieves superior performance to other state-of-the-art (SOTA) self-supervised denoising methods. The code is available at https://github.com/cuijin7382/Complementary-BSN.

Abstract:
Recent advancements in blind image quality assessment (BIQA) are primarily propelled by deep learning technologies. While leveraging transformers can effectively capture long-range dependencies and contextual details in images, the significance of local information in image quality assessment can be undervalued. To address this challenging problem, we propose a novel feature enhancement framework tailored for BIQA. Specifically, we devise an Adaptive Graph Attention (AGA) module to simultaneously augment both local and contextual information. It not only refines the post-transformer features into an adaptive graph, facilitating local information enhancement, but also exploits interactions amongst diverse feature channels. The proposed technique can better reduce redundant information introduced during feature updates compared to traditional convolution layers, streamlining the self-updating process for feature maps. Experimental results show that our proposed model outperforms state-of-the-art BIQA models in predicting the perceived quality of images. The code is available at https://github.com/sky-whs/AGAIQA.

Abstract:
Real-time video services are gaining popularity in our daily life, yet limited network bandwidth can constrain the delivered video quality. Video Super Resolution (VSR) technology emerges as a key solution to enhance user experience by reconstructing high-resolution (HR) videos. The existing real-time VSR frameworks have primarily emphasized spatial quality metrics like PSNR and SSIM, which often lack consideration of temporal coherence, a critical factor for accurately reflecting the overall quality of super-resolved videos. Inspired by Video Quality Assessment (VQA) strategies, we propose a dual-frame training framework and a lightweight multi-branch network to address VSR processing in real time. Such designs thoroughly leverage the spatio-temporal correlations between consecutive frames so as to ensure efficient video restoration. Furthermore, we incorporate ST-RRED, a powerful VQA approach that separately measures spatial and temporal consistency aligning with human perception principles, into our loss functions. This guides us to synthesize quality-aware perceptual features across both space and time for realistic reconstruction. Our model demonstrates remarkable efficiency, achieving near real-time processing of 4K videos. Compared to the state-of-the-art lightweight model MRVSR, ours is more compact and faster, 60% smaller in size (0.483M vs. 1.21M parameters), and 106% quicker (96.44fps vs. 46.7fps on 1080p frames), with significantly improved perceptual quality.
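
The size and speed comparison with MRVSR can be checked directly from the figures quoted in the abstract. The short calculation below is only a worked check; the variable names are illustrative.

```python
# Figures quoted in the abstract.
params_ours, params_mrvsr = 0.483, 1.21  # millions of parameters
fps_ours, fps_mrvsr = 96.44, 46.7        # frames per second on 1080p frames

# Relative size reduction: 1 - ours / baseline (about 0.60, i.e. 60% smaller).
size_reduction = 1.0 - params_ours / params_mrvsr

# Relative speed gain: ours / baseline - 1 (about 1.065, reported as 106% quicker).
speed_gain = fps_ours / fps_mrvsr - 1.0

print(f"size reduction: {size_reduction:.1%}, speed gain: {speed_gain:.1%}")
```

Both quoted percentages are consistent with the raw numbers: the model has roughly 40% of the parameters and runs at a little over twice the frame rate.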

Abstract:
We introduce a novel learning method that can effectively perceive both the geometry structure and semantic labels of a 3D scene in real time. Existing real-time 3D scene reconstruction approaches often rely on volumetric schemes to regress a Truncated Signed Distance Function (TSDF) as the 3D representation. However, these volumetric approaches primarily focus on ensuring global coherence in the reconstructed scene, which often results in a lack of local geometric detail. To address this limitation, we propose a solution that leverages the latent geometric knowledge present in 2D image features by explicit depth prediction thereby creating anchored features, which are used to refine the learning of occupancy in the TSDF volume. Furthermore, we discover that this cross-dimensional feature refinement methodology can also be applied to the task of semantic segmentation by utilizing semantic priors. As a result, we propose an end-to-end cross-dimensional refinement neural network (CDRNet) that can extract both the 3D mesh and 3D semantic labeling of a scene in real time. Through experimental evaluation on multiple datasets, we demonstrate that our method achieves state-of-the-art 3D perception capability by boosting over 40% and 18% in 3D semantic segmentation and geometric reconstruction respectively over the prior art. These promising results indicate the significant potential of our approach for various industrial applications. Demo video and code can be found on the project page, https://hafred.github.io/cdrnet/.

Affiliations: Brain-Inspired Computing and Intelligent Control of Chongqing Key Laboratory, College of Artificial Intelligence, Southwest University, Chongqing, China; Brain-Inspired Computing and Intelligent Control of Chongqing Key Laboratory, National and Local Joint Engineering Laboratory of Intelligent Transmission and Control Technology, Chongqing Brain Science Collaborative Innovation Center, College of Artificial Intelligence, Southwest University, Chongqing, China; Department of Mathematics, Texas A&M University, Doha, Qatar

Abstract:
Low-light image enhancement aims to obtain a normal-light image by adjusting the illumination of a low-light image. The existing methods do not fully explore the prior information hidden in low-light images, which raises the problems of detail loss and color distortion. To alleviate these issues, we propose a multi-prior collaborative network (MPC-Net) with transformer for low-light image enhancement. It extracts the indispensable prior information to facilitate high-quality image enhancement. Specifically, a pre-trained high-level vision model is employed to extract coarse texture and structure, which is then refined through a proposed self-distillation module to obtain compact representation for texture and structure. Furthermore, we design a color branch consisting of negative residual blocks and a pyramid structure to solve for noise-free color prior, aiming to provide the enhancer with a modeling mechanism for color information. Finally, a transformer-based multi-prior fusion module is developed to aggregate the content and prior information. Extensive experiments show that the proposed MPC-Net achieves superior performance on three referenced datasets and four no-referenced datasets. Our code is available at: https://github.com/Shecyy/MPC-Net.

Abstract:
RGB (visible), near-infrared (NI), and thermal infrared (TI) imaging modalities are commonly combined for round-the-clock surveillance. We introduce a novel unsupervised multi-modality person re-identification (MM-ReID) task, which, based on an individual’s image in any one modality, seeks to identify matches in the other two modalities. Compared to prior MM-ReID problem formulations, unsupervised MM-ReID significantly reduces labeling cost and imaging constraints. To address the unsupervised MM-ReID task, we propose a novel inter-modality similarity learning (IMSL) framework consisting of four synergistic interconnected modules: modality mean clustering (MMC), multi-modality reliability estimation (MMRE), shape-based mutual reinforcement (SMR), and modality-aware invariant learning (MIL). MMC iterates with SMR and MIL in a mutually beneficial manner to provide pseudo-labels that are robust to modality gap. MMRE normalizes sample weights, mitigating the impact of noisy labels in the multi-modality setting. SMR emphasizes shape information to implicitly enhance the model’s robustness to the modality gap and is additionally guided by pseudo-labels provided by MMC to attend to identity-related details. MIL explicitly encourages learning of modality-invariant and identity-related features via contrastive feedback for the MMC module. Extensive experimental results on the multi-modality and cross-modality datasets demonstrate that IMSL provides substantial performance gains over existing methods. Code is made available at https://github.com/zqpang/IMSL.

Abstract:
The field of face sketch-to-photo synthesis involves generating photographic facial images with enhanced details and a heightened sense of style realism. In recent years, the advancement of deep learning techniques has significantly contributed to the development of methods for synthesizing photographic face images from sketches. Nevertheless, challenges remain in synthesizing facial photographs with richer details and more accurate structural representation. This paper introduces a novel architecture for face sketch-to-photo synthesis, using denoising diffusion probabilistic models (DDPM). Our approach simplifies the complex transformation process into sequential forward and backward denoising steps. We incorporate a pretrained coarse generator to effectively encode sketch information, integrating it into each backward step to guide the generative process toward accurate photo space representation. Furthermore, we design a detail diffusion branch to refine the coarse photo face generated from the coarse generator. By deeply fusing multiscale detail features from this branch with a sophisticated conditional noise predictor, our model effectively captures the correlation between detail and stylistic elements both in sketches and in photographic faces. Extensive experimental evaluations on three datasets show the effectiveness of our model, emphasizing its ability to synthesize facial photographs with remarkable realism and rich detail. The synthesized facial images consistently demonstrate superior face recognition accuracy, surpassing that of state-of-the-art methods.

Abstract:
Federated learning is widely used and researched as an effective method for solving the privacy problems faced by centralized learning. To address the communication limitations and heterogeneity among clients, many existing methods based on the mixup algorithm share data mixed with the client’s local dataset to improve the model accuracy. However, due to the heterogeneity of federated learning, there may be some clients who join the mixup process with insufficient data, which will violate the privacy-preserving assumption of the mixup. Because of this weakness, many methods based on the mixup approach will face a serious privacy problem while trying to improve federated learning over other parts, e.g., accuracy or communication efficiency. Therefore, we propose the FedMDO framework to solve the privacy problem faced by mixup-based methods. In FedMDO, we introduce the auxiliary client to hold the auxiliary dataset that is related to the federated learning task and to generate the mixup templates for clients to increase the amount of data in the mixup process. By introducing the auxiliary client, the decrease in model accuracy can be suppressed as much as possible while taking advantage of the privacy-preserving gain from the increase in data volume. Furthermore, we introduce differential privacy into FedMDO with an elaborate redesign to enhance privacy protection. The corresponding analysis shows that under FedMDO, differential privacy can achieve the same protection with less negative impact. Experiments show that with at least an approximately 10%+ improvement in model accuracy and an average of 5 times greater communication savings compared to the FedAvg and non-IID SOTAs with weak privacy protection design, our method can yield significant improvement in the privacy of the shared data.
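
To see why a client with insufficient data weakens the privacy of mixup, consider the simplest case where shared data is a uniform average of local samples. The sketch below is an assumption-laden illustration (uniform averaging, the name `mix_samples`, toy vectors) and does not reproduce FedMDO or any specific mixup-based method.

```python
def mix_samples(samples):
    """Element-wise mean of a list of equally sized feature vectors."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


# A client with two local samples shares a blend of both.
blended = mix_samples([[0.0, 2.0], [2.0, 4.0]])

# A client with a single sample ends up sharing that sample unchanged.
leaked = mix_samples([[1.0, 2.0]])

print(blended, leaked)
```

With only one sample the "mixed" output is the raw data itself, which is the kind of leak that adding extra mixup templates from an auxiliary dataset is meant to prevent.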

Abstract:
With the development of multimedia technology, events are usually presented in multimedia forms; thus, multimedia event extraction (MEE) has become increasingly important. Existing MEE works usually use simple strategies to align two modalities, making it difficult to precisely extract events and arguments in complex multimedia documents. To address this problem, we propose a novel Multi-grained Gradual Inference Model (MGIM) that focuses on inferring and interpreting events in complex multimedia structures in a coarse-to-fine manner. To efficiently integrate textual and visual modalities, we design a Coarse-grained Alignment (CA) module, which represents the two modalities in a graph structure and performs coarse-grained alignment. Based on the CA module, we further propose a Fine-grained Inference (FI) module that aligns text and image at a fine granularity by performing multiple rounds of gradual inference. MGIM provides a comprehensive interpretation of multimedia events at two information granularities (coarse and fine). Extensive experiments on the M2E2 dataset demonstrate the effectiveness of MGIM.

Abstract:
As vision sensor technology continues to evolve, the requirements for detecting targets of interest in the images captured by the sensors are increasing. Considering fast detection and high accuracy, the industry favors geometric key point-based solutions. However, there are a large number of small and fuzzy objects in the real world. Geometric key point detectors do not effectively utilize the contextual features of the region of interest, leading to excessive false positive and false negative results. In this work, a simple, effective, and interpretable tiny object detection method called Regional Cross Self-Attention Object Detection Network (RCSANet) is proposed. It adopts Region Proposal Networks and transformers to capture regional background relations and uses regional background relations to generate key point sequences. The regional cross self-attention mechanism is introduced to curtail computation redundancy and minimize the interference of redundant information with the target region. Additionally, a position coding called dynamic implicit position coding is proposed to cooperate with regional cross self-attention. Dynamic implicit position coding can encode arbitrarily long input sequences. The computational cost of RCSANet is significantly lower than that of state-of-the-art object detection solutions. Moreover, RCSANet improves the performance on four benchmark datasets (MSCOCO, Tinyperson, DOTA, and AI-TOD) by about 3.0% AP.

Abstract:
Semantic segmentation has recently achieved notable advances by exploiting “class-level” contextual information during learning, e.g., the Object Contextual Representation (OCR) and Context Prior (CPNet) approaches. However, these approaches simply concatenate class-level information to pixel features to boost pixel representation learning, which cannot fully utilize intra-class and inter-class contextual information. Moreover, these approaches learn soft class centers based on coarse mask prediction, which is prone to error accumulation. To better exploit class-level information, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Moreover, we design a dedicated decoder for CAR (named CARD), which consists of a novel spatial token mixer and an upsampling module, to maximize its gain for existing baselines while being highly efficient in terms of computational cost. Specifically, CAR consists of three novel loss functions. The first loss function encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class center in our approach is directly generated from ground truth instead of from the error-prone coarse prediction. CAR can be directly applied to most existing segmentation models during training, including OCR and CPNet, and can largely improve their accuracy at no additional inference overhead. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability. 
CARD outperforms state-of-the-art approaches on multiple benchmarks with a highly efficient architecture. The code will be available at https://github.com/edwardyehuang/CAR.

Abstract:
In typical unsupervised domain adaptive object detection, it is assumed that extensive unlabeled training data from the target domain can be easily obtained. However, in some access-constrained scenarios, massive target data cannot be guaranteed, while acquiring only a few target samples and annotating them may cost less. Therefore, inspired by the success of meta-learning in few-shot tasks, we propose an Instance-level Prototype learning Network (IPNet) for domain adaptive object detection under the supervised few-shot scenario. To compensate for the target domain data deficiency, we fuse cropped instances from labeled images in both domains to learn a representative prototype for each class, by enforcing features of the same class’s instances but from different domains to be as close as possible. These prototypes are further employed to discriminate various features’ salience in an image, and separate foreground and background regions for respective domain alignment. Extensive experiments are conducted on several cross-domain scenarios, and their results show the consistent accuracy gains of the IPNet over state-of-the-art methods, e.g., a 10.4% mAP increase on the Cityscapes-to-FoggyCityscapes setting and a 3.0% mAP increase on the Sim10k-to-Cityscapes setting.

Abstract:
Most existing RGB-D salient object detection (SOD) methods rely on high-quality depth images. However, their performance is limited when processing low-quality depth maps. This paper exploits more complementary image priors to guide the model to learn on variable depth maps, and a novel multi-prior driven network called MPDNet is proposed for RGB-D SOD. MPDNet utilizes four processing pipelines to process RGB images and other priors, which include an RGB image processing pipeline, a depth map processing pipeline, a fine-grained and gradient prior processing pipeline, and an edge learning pipeline. Specifically, fine-grained and gradient priors are input to the same processing pipeline. For the depth maps and the fine-grained and gradient priors, a prior channel attention module utilizes the channel attention mechanism to filter noise and highlight the salient cues. The RGB image processing pipeline uses a multi-feature progressive enhancement module to fuse and enhance features from depth maps, and a multi-feature prediction decoder decodes initial salient masks. In the edge learning pipeline, the edge prior serves as an edge label and is captured by an edge capture module. Finally, the clear salient masks are obtained by fusing the salient information from the four pipelines. The experimental results on six benchmarks indicate that the proposed method outperforms thirteen state-of-the-art methods in six evaluation metrics.

Abstract:
Stereo matching is a challenging task in 3D vision. Relying only on single-scale cost aggregation provides deficient matching information. Prior works thus try to adopt pyramid cost volume fusion to calculate the matching cost. However, the commonly used cost volume fusion process cannot fully exploit the benefits of these multi-scale cost volumes. Motivated by the cross-scale feature discrepancy, we propose an Unambiguous Pyramid cost volumes Fusion Network, termed UPFNet, to reduce the ambiguity between pyramid cost volumes at different scales and boost the cross-scale information flow in the stereo matching framework based on 3D convolution. First, we propose a pyramid-cost progressive fusion (PPF) module, which adds consistent supervision for pre-fusion cost volumes to reduce feature semantic inconsistency and facilitates cross-scale interactions to narrow the detail gap between different scales. The output disparity can be gradually refined in a coarse-to-fine manner. Furthermore, we design a residual disparity aggregation (RDA) module, introducing disparity dimension information to further exploit the local aggregation capability of 3D convolution by squeezing disparity and exciting channel response. Extensive experiments on the Scene Flow, KITTI and Middlebury benchmarks demonstrate the effectiveness of the proposed UPFNet. The results show that the proposed approach achieves state-of-the-art performance and ranked first on the KITTI 2015 leaderboard at the time of submission. Our code is available at: https://github.com/Baboom-l/UPFNet.

Abstract:
Deep contrastive clustering has recently gained significant attention due to its advantageous ability to leverage the contrastive learning paradigm for joint representation learning and clustering. However, previous deep contrastive clustering approaches mostly focus on instance discrimination or cluster discrimination, which often overlook the rich semantic information latent in the vast intermediate levels of granularity between instances and clusters. Moreover, they are typically prone to utilizing relationships only within the same level of granularity, e.g., instance-instance relationships and cluster-cluster relationships, but frequently neglect the interactions between different granularity-levels. To tackle these issues, this paper presents a novel end-to-end deep contrastive clustering approach termed Deep Clustering with Hybrid-Grained Contrastive and Discriminative Learning (DCHL). Particularly, the instance-level contrastive learning and cluster-level contrastive learning are first formulated, where the cluster-level contrastive learning is further split into fine-grained and coarse-grained branches. To capture global dependencies, the cluster-level contrastiveness is explored on the coarse-grained cluster branch. Meanwhile, to capture hybrid-grained relationships, the dual-level instance-group discrimination learning is enforced between the instance branch and the fine-grained cluster branch, where the self instance-group discrimination and the cross instance-group discrimination are simultaneously optimized for enhancing the deep clustering performance. Experiments on five challenging image datasets confirm the superiority of DCHL over the state-of-the-art. Code available: https://github.com/dengxiaozhi/DCHL.

Abstract:
The Geometry-based Point Cloud Compression (G-PCC) standard enables point cloud delivery over the internet through efficient compression. Limited by the transmission bandwidth, rate control is demanded in G-PCC for high-quality point cloud video streaming. This paper thus proposes a content-aware rate control solution for G-PCC. Given the target bitrate and distortion evaluation criteria, our method can predict the geometry and attribute quantizers for G-PCC while minimizing the overall distortion. Specifically, as the rate and distortion of both geometry and attribute are involved in G-PCC, we separately establish rate/distortion models for geometry and attribute. Moreover, recognizing the dependence of attribute compression on reconstructed geometry, we integrate the geometry quantizer into the attribute rate/distortion models to improve prediction accuracy. For dynamic coding scenarios, we leverage selective representative frames for efficient model parameter initialization. Additionally, we introduce a μ updating strategy that dynamically incorporates information from previous frames to update the existing models. Extensive experiments demonstrate the effectiveness of our proposed method. Under the G-PCC common test condition, our method achieves remarkable rate accuracy, with a 5.3% bitrate error for static coding and 0.3% for dynamic coding. Moreover, it achieves >15% BD-Rate gains over the G-PCC anchor. These results showcase its capabilities in delivering high-fidelity point cloud video streams within the bandwidth constraint.

Abstract:
With the advancement of deep learning, the task of image-text retrieval has received widespread attention for addressing the semantic heterogeneity in multimodal data. However, many existing methods ignore the uncertainty present in manually annotated datasets. It is crucial for models to learn the potential corresponding relationships between regions in images and words in sentences. To tackle these challenges, we introduce the Multi-layer Probabilistic Association Reasoning Network (MPARN). In MPARN, the region-word association reasoning module is developed to treat each visual and textual fragment as a unique probability distribution. This allows our model to imagine and capture the intricate one-to-many and many-to-many relationships between visual and textual objects. To effectively integrate the association distributions between visual and textual modalities, we propose the cross-modal association probability composer. This composer not only combines these distributions effectively but also preserves the intrinsic hierarchical structure of the elements involved. Furthermore, we introduce the semantic relationship reasoning module, which is designed to analyze the contextual semantic information within each modality. The multi-layer adaptive aggregate composer is employed to progressively explore semantic correlations within each modality and to dynamically synthesize outputs based on their relevance. Our extensive experiments on the Flickr30K and MSCOCO datasets demonstrate MPARN’s state-of-the-art retrieval performance when compared to other baselines. The qualitative results further validate the effectiveness of the probabilistic association distributions.

Abstract:
As a prevailing cross-modal reasoning task, Visual Question Answering (VQA) has achieved impressive progress in the last few years, where the language bias is widely studied to learn more robust VQA models. However, the visual bias, which also influences the robustness of VQA models, is seldom considered, resulting in weak inference ability. Therefore, how to balance the effect of language bias and visual bias has become essential in the current VQA task. In this paper, we devise a new reweighting strategy taking both the language bias and visual bias into account, and propose a Fair Attention Network for Robust Visual Question Answering (named FAN-VQA). It first constructs a question bias branch and a visual bias branch to estimate the bias information from two modalities, which are utilized to judge the importance of samples. Then, adaptive importance weights are learned from the bias information and assigned to the candidate answers to adjust the training losses, enabling the model to shift more attention to the difficult samples that need less-salient visual clues to infer the correct answer. In order to improve the robustness of the VQA model, we design a progressive strategy to balance the influence of the original training loss and the adjusted training loss. Extensive experiments on the VQA-CP v2, VQA v2, and VQA-CE datasets demonstrate the effectiveness of the proposed FAN-VQA method.

Abstract:
Contour-based instance segmentation has been actively studied, thanks to its flexibility and elegance in processing visual objects within complex backgrounds. In this work, we propose a novel deep network architecture, i.e., PolySnake, for generic contour-based instance segmentation. Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance with an iterative and progressive contour refinement strategy. Technically, PolySnake introduces a recurrent update operator to estimate the object contour iteratively. It maintains a single estimate of the contour that is progressively deformed toward the object boundary. At each iteration, PolySnake builds a semantic-rich representation for the current contour and feeds it to the recurrent operator for further contour adjustment. Through the iterative refinements, the contour progressively converges to a stable status that tightly encloses the object instance. Beyond the scope of general instance segmentation, extensive experiments are conducted to validate the effectiveness and generalizability of our PolySnake in two additional specific task scenarios, including scene text detection and lane detection. The results demonstrate that the proposed PolySnake outperforms the existing advanced methods on multiple prevalent benchmarks across the three tasks. The codes and pre-trained models are available at https://github.com/fh2019ustc/PolySnake.

Abstract:
Logit adjustment is an effective long-tailed visual recognition strategy to encourage a significant margin between rare and dominant labels. Existing methods typically employ the globally fixed label frequencies throughout the training to adjust margins. However, in practice, we observe that the local (in-batch) label frequencies change dynamically or even vanish for some classes (especially the tail classes) in batch-dependent training, which is inconsistent with the global ones. Furthermore, our analyses reveal that the intra-class collinear samples do not actually contribute to the gradient update, but substantially increase the corresponding local label frequencies. Such contributions are spurious because they over-count the label frequencies without contributing to the gradient. All of these cause serious interference in precisely estimating the local frequencies of the authentic contribution, leading to inauthentic margins. To simultaneously address the above issues, this paper proposes the Dynamic Learnable Logit Adjustment (DLLA) loss to learn the local label frequencies within dynamic mini-batches precisely. Specifically, DLLA comprises two complementary parts: 1) rank-metric, which eliminates spurious contributions from collinear samples by calculating the algebraic rank of the feature subspace in the mini-batch; and 2) class-supplement, which ensures all classes appear in every mini-batch by inserting the corresponding learnable class prototypes, for which we resort to neural collapse theory to make them align to the ideal regular simplex structure. Extensive experiments on standard benchmark datasets verify the effectiveness of our method.
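
The mismatch between global and in-batch label frequencies can be illustrated with a minimal sketch. It applies the standard form of logit adjustment (adding the log label prior to the logits before the softmax cross-entropy) and then counts labels in one toy mini-batch. It does not implement DLLA's rank-metric or class-supplement components; the function names, the temperature `tau`, and all numbers are illustrative assumptions.

```python
import math
from collections import Counter

def logit_adjusted_loss(logits, label, class_freq, tau=1.0):
    """Softmax cross-entropy on logits shifted by tau * log(label prior)."""
    total = sum(class_freq)
    adjusted = [z + tau * math.log(f / total) for z, f in zip(logits, class_freq)]
    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]

def batch_frequencies(batch_labels, num_classes):
    """Count how often each class appears in one mini-batch."""
    counts = Counter(batch_labels)
    return [counts.get(c, 0) for c in range(num_classes)]

# Hypothetical long-tailed dataset: class 2 is the tail class.
global_freq = [900, 90, 10]
uniform_freq = [1, 1, 1]

# With identical logits, the adjusted loss for a tail-class sample is larger
# than the unadjusted one, which pushes the model toward a larger tail margin.
tail_loss_adjusted = logit_adjusted_loss([0.0, 0.0, 0.0], 2, global_freq)
tail_loss_plain = logit_adjusted_loss([0.0, 0.0, 0.0], 2, uniform_freq)

# In a small mini-batch the tail class can vanish entirely, so the local
# prior is zero and log(prior) is undefined for that class.
batch_labels = [0, 0, 0, 1, 0, 0, 0, 0]
local_freq = batch_frequencies(batch_labels, num_classes=3)
```

In this toy batch the tail class has a local count of zero, which is the situation the class-supplement component is described as addressing.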

Abstract:
View change causes significant differences in the gait appearance. Consequently, recognizing gait in cross-view scenarios is highly challenging. Most recent approaches either convert the gait from the original view to the target view before recognition is carried out or extract the gait feature irrelevant to the camera view through either brute force learning or decouple learning. However, these approaches have many constraints, such as the difficulty of handling unknown camera views. This work treats the view-change issue as a domain-change issue and proposes to tackle this problem through adversarial domain adaptation. In this way, gait information from different views is regarded as data from different sub-domains. The proposed approach focuses on adapting the gait feature differences caused by such sub-domain change and, at the same time, maintaining sufficient discriminability across different people. For this purpose, a Hierarchical Feature Aggregation (HFA) strategy is proposed for discriminative feature extraction. By incorporating HFA, the feature extractor can effectively aggregate the spatial-temporal features across the various stages of the network and thereby obtain comprehensive gait features. Then, an Adversarial View-change Elimination (AVE) module equipped with a set of explicit models for recognizing the different gait viewpoints is proposed. Through the adversarial learning process, AVE is eventually unable to identify the gait viewpoint, given the gait features generated by the feature extractor. That is, the adversarial domain adaptation mitigates the view change factor, and discriminative gait features that are compatible with all sub-domains are effectively extracted. Extensive experiments on three of the most popular public datasets, CASIA-B, OULP, and OUMVLP, demonstrate the effectiveness of our approach.

Abstract:
Few-shot object detection (FSOD) has attracted increasing academic interest by recognizing previously unseen novel classes with very limited well-labeled samples. However, most existing methods identify novel classes via some object-specific characteristics in the few provided samples rather than intrinsic inter-class relations between base and novel classes, which heavily degrades the detection performance on novel classes. Moreover, they cannot learn discriminative proposal representations to distinguish base and novel classes, and thus misclassify novel objects as confusable base classes. To tackle the above challenges, we develop a novel Category-contextual Relation Encoding Network (CRE-Net), which is an early attempt to reason about inter-class context relationships for the FSOD task. To be specific, we propose a novel category-contextual relation encoding mechanism to capture intrinsic inter-class relations between base and novel classes via knowledge aggregation from global category-contextual descriptors. It utilizes intrinsic inter-class contextual relations to adaptively refine the convolution kernel, thus encoding the local semantic context of the query image with the category-contextual relation as guidance. Furthermore, to explore discriminative representations for base and novel classes, we develop a scarcity-compensatory contrastive proposal loss by incorporating data scarcity of novel classes and proposal semantic consistency with high confidence. This loss can compact object instances from the same category into a tighter cluster, and enhance the space separability of different classes. Extensive experiments on the Pascal VOC and COCO datasets verify the state-of-the-art detection performance of our CRE-Net model when compared with other baseline methods.

Abstract:
Video-based person re-identification (Re-ID) aims at retrieving the video clips of the same person across multiple cameras. Since video clips are captured at various spatial resolutions (scales), learning multi-scale person appearance features while constructing the cross-scale information interaction is pivotal for video-based person Re-ID. In this paper, we propose an efficient framework, Multi-Scale Aligned Spatial-Temporal Interaction (MS-STI), which not only exchanges the spatial-temporal information within a scale, but also mines implicitly related complementary knowledge across scales. MS-STI presents a hierarchical multi-branch architecture that designs the branches with fewer convolutional layers for lower spatial resolution inputs. In this way, the framework enables inter-scale feature size matching for exchanging information across multiple scale-specific branches. We share the parameters of the branched sub-networks to optimize the efficiency of person feature extraction. Furthermore, we propose two modules, Spatial Interaction (SI) and Multi-Scale Temporal Interaction (MSTI), which can realize spatial-temporal interaction across multiple branches. SI performs point-wise spatial information transfer within a frame, while MSTI focuses on inter-frame and inter-scale information interaction. Extensive experiments on three challenging benchmarks demonstrate the effectiveness and superiority of the proposed MS-STI.

Abstract:
Due to the flexible training requirement and the appealing generalization ability, unpaired image dehazing has received increasing attention in coping with real-world hazy images. However, most of the existing methods rely on the loose dehazing-hazing cycle constraint, which makes it hard to eliminate poor-quality dehazing results when using a powerful hazing network in the training process. To address this issue, this paper proposes a simple yet efficient Adversarial Deformation Constraint (ADC). More specifically, we sequentially perform two operations, i.e., dehazing and deformation, on a hazy image. In the training process, the dehazing branch is desired to be deformation-unaware, which requires that the output of these two operations remains constant regardless of their performing order. Adversarially, the deformation branch tends to maximize the difference in the outputs of these two operations when their performing orders are different. Through an additive image decomposition model, we verify that the ADC could regularize the solution space to push the dehazing error towards zero. Finally, by incorporating ADC into the common dehazing-hazing cycle constraint, we significantly improve the robustness of unpaired image dehazing. Experiments on multiple benchmark hazy image databases demonstrate the superiority of ADC over many state-of-the-art image dehazing methods. The source code of the proposed ADC-Net will be released on https://github.com/whrws/ADC-Net.
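
The order-invariance requirement described above can be illustrated with a toy sketch. It measures how much the result changes when dehazing and deformation are applied in the two possible orders. The adversarial training of the deformation branch is not shown; the horizontal flip and both "dehazers" are hypothetical stand-ins, not the networks used in ADC-Net.

```python
def hflip(img):
    """Toy deformation: flip each row of a 2-D image horizontally."""
    return [row[::-1] for row in img]

def order_gap(dehaze, deform, img):
    """Mean absolute difference between dehaze(deform(x)) and deform(dehaze(x))."""
    a = dehaze(deform(img))
    b = deform(dehaze(img))
    n = sum(len(row) for row in a)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def pointwise_dehaze(img):
    """Hypothetical dehazer that treats every pixel identically."""
    return [[min(1.0, 1.25 * v) for v in row] for row in img]

def position_biased_dehaze(img):
    """Hypothetical dehazer whose output depends on the pixel column."""
    return [[v + 0.1 * j for j, v in enumerate(row)] for row in img]

hazy = [[0.2, 0.4, 0.6],
        [0.1, 0.3, 0.5]]

gap_invariant = order_gap(pointwise_dehaze, hflip, hazy)    # order does not matter
gap_biased = order_gap(position_biased_dehaze, hflip, hazy)  # order matters
```

A dehazer with a zero gap is "deformation-unaware" in the sense of the abstract. In the setup the abstract describes, the dehazing branch would reduce such a gap while the deformation branch tries to enlarge it.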

Abstract:
Recent image manipulation detection approaches primarily rely on sophisticated Convolutional Neural Network (CNN)-based models for region localization, while they tend to ignore: 1) the feature correlations that exist between manipulated and non-manipulated regions; 2) significance of multi-scale representations in detecting manipulated regions of varying sizes, consequently hampering the overall performance of image manipulation detection. To address these limitations, we propose a novel approach, called Cascade Hierarchical Graph Convolutional Network (Cas-HGCN), which comprehensively learns the feature correlations between manipulated and non-manipulated regions at different scales using the Feature Correlations Modeling (FCM) module. Specifically, the FCM module treats the grids in the hierarchical image/feature maps as nodes, constructs a fully-connected graph by connecting each node, and leverages it to learn and refine feature correlations across different scales in a cascading manner. This process results in high discriminability for distinguishing manipulated and non-manipulated regions. Extensive experiments conducted on three public datasets, namely CASIA, NIST, and Coverage, demonstrate the promising detection accuracy achieved by Cas-HGCN without the need for pre-training on large datasets, surpassing the performance of existing state-of-the-art competitors.

Abstract:
Unsupervised domain adaptation (UDA) aims to estimate a transferable model for unlabeled target domains by exploiting labeled source data. Optimal Transport (OT) based methods have recently been proven to be a promising solution for UDA with a solid theoretical foundation and competitive performance. However, most of these methods solely focus on domain-level OT alignment by leveraging the geometry of domains for domain-invariant features based on the global embeddings of images. Such global representations of images may destroy image structure, leading to the loss of local details that offer category-discriminative information. This study proposes an end-to-end Deep Hierarchical Optimal Transport method (DeepHOT), which aims to learn both domain-invariant and category-discriminative representations by mining hierarchical structural relations among domains. The main idea is to incorporate a domain-level OT and an image-level OT into a unified OT framework, hierarchical optimal transport, to model the underlying geometry in both domain space and image space. In the DeepHOT framework, an image-level OT serves as the ground distance metric for the domain-level OT, leading to the hierarchical structural distance. Compared with the ground distance of the conventional domain-level OT, the image-level OT captures structural associations among local regions of images that are beneficial to classification. In this way, DeepHOT, a unified OT framework, not only aligns domains by domain-level OT, but also enhances the discriminative power through image-level OT. Moreover, to overcome the limitation of high computational complexity, we propose a robust and efficient implementation of DeepHOT by approximating the original OT with the sliced Wasserstein distance in the image-level OT and performing mini-batch unbalanced domain-level OT. Extensive experiments show the superiority of DeepHOT on several benchmark datasets. The code is available on GitHub (https://github.com/Innse/DeepHOT).
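
The sliced Wasserstein approximation mentioned above can be sketched in a few lines. The sketch projects two point sets onto random directions, solves each 1-D transport problem by sorting, and averages the results. It is a generic Monte Carlo estimator, not the DeepHOT implementation. The equal-size, uniformly weighted point sets (standing in for local region features of two images), the function name, and the default parameters are assumptions.

```python
import math
import random

def sliced_wasserstein(xs, ys, num_projections=64, p=2, seed=0):
    """Monte Carlo sliced p-Wasserstein distance between two equal-size point sets."""
    assert len(xs) == len(ys), "this sketch assumes equal-size, uniformly weighted sets"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random direction on the unit sphere (normalized Gaussian vector).
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Project both sets onto the direction; in 1-D, optimal transport
        # between equal-size sets simply matches the sorted values.
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / num_projections) ** (1.0 / p)

# Toy "local region features" of two images: the second set is the first
# one translated by the vector (3, 4).
regions_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions_b = [[x + 3.0, y + 4.0] for x, y in regions_a]

same = sliced_wasserstein(regions_a, regions_a)
shifted = sliced_wasserstein(regions_a, regions_b)
```

Each projection costs only a sort, which is why slicing is commonly used to avoid solving a full transport problem for every pair of images.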

Abstract:
Multi-view subspace clustering (MVSC) is a popular area of research that concentrates on partitioning data points from multiple views. It has gained wide attention in recent years due to its ability to handle complex data with diverse features across different views. However, the success of MVSC largely relies on the quality of the learned similarity matrix, and existing methods normally adopt separate two-step procedures of optimization and symmetrization, which cannot guarantee the symmetry and adaptive locality of the similarity matrix. To alleviate this issue, we propose a novel paradigm called Symmetric Multi-view Subspace Clustering with Automatic Neighbor Discovery (SMSC-AND), which aims at formulating the symmetrization and localization of the ideal similarity matrix in one unified framework. In particular, we theoretically and experimentally demonstrate that SMSC-AND can directly obtain the refined symmetric similarity matrix without the post-processing procedures used previously. Additionally, we propose an automatic neighbor discovery strategy that avoids previous rank constraints or fixed neighbor sizes, thereby eliminating the requirement for additional hyperparameters. Benefiting from the aforementioned merits, we can directly explore the local structure of the consensus similarity matrix of multi-view data without pre-searching hyperparameters. Comprehensive experimental results on various benchmark datasets demonstrate the superiority of the proposed algorithm when compared with other MVSC competitors.

Abstract:
The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, where the first-frame mask-level or language-level label is utilized to provide the target information, respectively. Due to the huge gap between different modalities, existing works have not proposed a unified framework for these two sub-tasks. In this work, such a unified framework is designed, where the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into same-size tokens, which are equally processed by a self-attention based segmentation model. Furthermore, to highlight the significant information and discard non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize these networks. Experiments show that new state-of-the-art performances are achieved by the proposed method: 52.8% of J&F on the Ref-YoutubeVOS dataset and 83.2% of J_S on the YoutubeVOS dataset, respectively. The code will be released.

Abstract:
Object tracking technology for aerial remote sensing images has made significant progress, but it remains a very challenging task. The related difficulties of object tracking include the accumulation of long-term tracking errors, similar object interference, partial or full occlusion, scale change, etc., which can lead to object tracking failure. In this paper, an aerial object tracker with ViT Spatio-Temporal Feature Fusion (STFF) for aerial remote sensing images is proposed, which can achieve accurate tracking of aviation objects. Firstly, we propose a spatial-temporal feature fusion strategy based on the characteristics of object tracking timing. In this strategy, the object information of the previous frames is applied to enhance both the real-time responsiveness of the model and the performance of the tracker. Secondly, the dynamic change information of objects in the space and time context is used for spatio-temporal feature information fusion, which can further enhance the appropriate correlation and promote the feature aggregation and information transmission of visual tracking. Finally, a dataset with real and virtual scenarios is collected and constructed to address the training data requirements for aviation object tracking. According to our experiments, STFF can achieve accurate tracking of aerial objects and achieves excellent performance on UAV123, DTB70, and our benchmarks.

Abstract:
Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant sample-specific information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to low subject identity fidelity and text prompt fidelity. To tackle the problems, we propose DisenDreamer, a sample-aware disentangled tuning framework for subject-driven text-to-image generation in this paper. Specifically, DisenDreamer finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise, DisenDreamer instead utilizes a common text embedding to capture the identity-relevant information and a sample-specific visual embedding to capture the identity-irrelevant information. To disentangle the two embeddings, we further design the novel weak common denoising, weak sample-aware denoising, and the contrastive embedding auxiliary tuning objectives. Extensive experiments show that our proposed DisenDreamer framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-relevant and the identity-irrelevant embedding, DisenDreamer demonstrates more generation flexibility and controllability.

Abstract:
The current high-fidelity generation and high-precision detection of DeepFake images are in an arms race. We believe that producing DeepFakes that are highly realistic and “detection evasive” can serve the ultimate goal of improving future generation DeepFake detection capabilities. In this paper, we propose a simple yet powerful pipeline to reduce the artifact patterns of fake images without hurting image quality by performing implicit spatial-domain notch filtering. We first demonstrate that frequency-domain notch filtering, although famously shown to be effective in removing periodic noise in the spatial domain, is infeasible for our task at hand due to the manual designs required for the notch filters. We, therefore, resort to a learning-based approach to reproduce the notch filtering effects, but solely in the spatial domain. We combine the addition of overwhelming spatial noise, which breaks the periodic noise pattern, with deep image filtering, which reconstructs the noise-free fake images, and we name our method DeepNotch. Deep image filtering provides a specialized filter for each pixel in the noisy image, producing filtered images with high fidelity compared to their DeepFake counterparts. Moreover, we also use the semantic information of the image to generate an adversarial guidance map to add noise intelligently. Our large-scale evaluation on 3 representative DeepFake detection methods (tested on 16 types of DeepFakes) has demonstrated that our technique significantly reduces the accuracy of these 3 fake image detection methods, by 36.79% on average and up to 97.02% in the best case.

Abstract:
Cross-modal compression (CMC) aims to compress highly redundant visual data into compact, common, and human-comprehensible domains, such as text, to preserve semantic fidelity. However, CMC is limited by a constant level of semantic fidelity and constrained semantic fidelity due to a single compression domain (plain text). To address these issues, we propose a new approach called Multiple-domains rate-distortion optimized CMC (M-CMC). Specifically, our method divides the image into two complementary representations: 1) a structure representation with an edge map, and 2) a texture representation with dense captions, which include numerous region-caption pairs instead of plain text. In this way, we expand the single domain to multiple domains, namely, edge maps, regions, and text. To achieve diverse levels of semantic fidelity, we suggest a rate-distortion reward function, where the distortion measures the semantic fidelity of reconstructed images and the rate measures the information content of the text. We also propose Multiple-stage Self-Critical Sequence Training (MSCST) to optimize the reward function. Extensive experimental results demonstrate that the proposed method achieves diverse levels of semantic translation more effectively than other CMC-based methods, achieves higher semantic compression performance compared to traditional block-based and learning-based image compression frameworks with 97,000-500 times compression ratio, and provides a simple yet effective way for image editing.

Abstract:
Recent methods for video question answering (VideoQA), aiming to generate answers based on given questions and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on the various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in helping get correct answers, especially in videos with real-life scenarios. Indeed, in some cases, both audio and visual contents are required and complement each other to answer questions, which is defined as audio-visual question answering (AVQA). In this paper, we focus on importing raw audio for AVQA and contribute in three ways. Firstly, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated and large-scale dataset involving multiple modalities. E-AVQA consists of 34,033 QA pairs on 33,340 clips of 18,786 videos from e-commerce scenarios. Secondly, we propose a multi-granularity relational attention method with contrastive constraints between audio and visual features after the interaction, named MGN, which captures local sequential representation by leveraging the pairwise potential attention mechanism and obtains global multi-modal representation via a novel ternary potential attention mechanism. Thirdly, our proposed MGN outperforms the baseline on the E-AVQA dataset, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, demonstrating its superiority with at least 1.02 improvement on WUPS@0.0 and about 10% on timing complexity over the baseline.

Abstract:
While existing lightweight visual trackers can run in real-time at edge devices, they face the difficulty of object appearance changes. An effective solution to this problem is to add an online updatable dynamic template for trackers to learn about changes in target appearance over time. However, existing dynamic template utilization methods are unsuitable for lightweight networks, resulting in limited accuracy improvement and a significant increase in computational workload. In this paper, we propose PromptVT, an efficient and accurate video tracking framework, which consists of two important designs: a plug-and-play dynamic template prompter (DTP) and a hierarchical multi-scale transformer (HMT). The DTP module guides networks to effectively learn changes between initial and dynamic templates through two prompts without additional computational workload. The HMT module combines spatial features of the search area and template at different scales and levels, enabling the tracker to learn a more comprehensive visual representation. Our proposed PromptVT outperforms state-of-the-art real-time trackers on eight benchmarks (VOT2020, LaSOT, GOT-10K, UAV123, AntiUAV, AntiUAV410, TrackingNet, OTB100) while running at 52 fps (PyTorch model) and 76 fps (ONNX model) on CPUs, with only 2.9G FLOPs and 3M parameters. Code and models are available at https://github.com/faicaiwawa/PromptVT.

Abstract:
RGB-T tracking has attracted increasing attention recently due to its all-weather and all-day working capability. However, most current RGB-T trackers assume that RGB data and thermal infrared (TIR) data are well spatially aligned, which is difficult to achieve in practice. Such spatial misalignment between RGB data and TIR data may lead to ineffective cross-modal information propagation during multi-modal feature fusion, thus reducing the tracking performance. In addition, due to the discrepancy in imaging characteristics of RGB images and TIR images, there also exist great differences between the information captured by the two modalities. These differences in the characteristics of RGB and TIR modalities in different local areas prevent a single fusion strategy from fully exploring the complementary information within multi-modal data. To this end, we propose an RGB-T tracker, referred to as AMNet, to specifically solve these two problems with two dedicated modules, i.e., a Mutual-interacted Spatial Alignment (MSA) module and an Information Matching Fusion (IMF) module. The former spatially aligns the two modalities through three essential parts, including interactions of multi-modal features, prediction of the cross-modal offset map, and enhancement of the aligned features. The latter first discriminates different types of local regions by employing several intra-modal attention modules and then uses a divide-and-conquer fusion strategy to exploit such discriminative information within RGB and TIR features of different cases for tracking. We validate the effectiveness of our AMNet with extensive experiments on three RGB-T benchmarks, on which it achieves new state-of-the-art performance.

Abstract:
Deep networks have made remarkable progress in Multi-View Stereo (MVS) task in recent years. However, the problem of finding accurate correspondences across different views under ill-posed matching situations remains unresolved and crucial. To address this issue, this paper proposes a Geometry-enhanced Attentive Multi-View Stereo (GA-MVS) network, which can access multi-view consistent feature representation and achieve accurate depth estimation in challenging situations. Specifically, we propose a geometry-enhanced feature extractor to explore illumination-invariant geometric features and incorporate them with common texture features to improve matching accuracy when dealing with view-dependent photometric effects, such as shadow and specularity. Then, we design a novel attentive learning framework to explore per-pixel adaptive supervision, effectively improving the depth estimation performance of textureless regions. The experimental results on the DTU and Tanks & Temples benchmarks demonstrate that our method achieves state-of-the-art results compared to other advanced MVS models.

Abstract:
Natural video capturing suffers from visual blurriness due to the high motion of cameras or objects. Until now, the video blurriness removal task has been extensively explored for both human vision and machine processing. However, its computational cost is still a critical issue and has not yet been fully addressed. In this paper, we propose a novel Lightweight Video Deblurring (LightViD) method that achieves top-tier performance with an extremely low parameter size. The proposed LightViD consists of a blur detector and a deblurring network. In particular, the blur detector effectively separates blurry regions, thus avoiding both unnecessary computation and over-enhancement on non-blurry regions. The deblurring network is designed as a lightweight model. It employs a Spatial Feature Fusion Block (SFFB) to extract hierarchical spatial features, which are further fused by ConvLSTM for effective spatial-temporal feature representation. Comprehensive experiments with quantitative and qualitative comparisons demonstrate the effectiveness of our LightViD method, which achieves competitive performance on the GoPro and DVD datasets, with reduced computational costs of 1.63M parameters and 96.8 GMACs. Trained model available: https://github.com/wgp/LightVid.

Affiliations: PCA Laboratory, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China; Department of Computer and Information Science, State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China

Abstract:
Medical image segmentation is an essential process to assist clinics with computer-aided diagnosis and treatment. Recently, a large number of convolutional neural network (CNN)-based methods have been rapidly developed and have achieved remarkable performance in several different medical image segmentation tasks. However, the same type of infected region or lesion often has a diversity of scales, making it challenging to achieve accurate medical image segmentation. In this paper, we present a novel Uncertainty-aware Hierarchical Aggregation Network, namely UHA-Net, for medical image segmentation, which can make full use of cross-level and multi-scale features to handle scale variations. Specifically, we propose a hierarchical feature fusion (HFF) module to aggregate high-level features, which is used to produce a global map for the coarse localization of the segmented target. Then, we propose an uncertainty-induced cross-level fusion (UCF) module to fully fuse features from the adjacent levels, which can learn knowledge guidance to capture the contextual information from adjacent resolutions. Further, a scale aggregation module (SAM) is presented to learn multi-scale features by using different convolution kernels, to effectively deal with scale variations. Finally, we formulate a unified framework to simultaneously fuse inter-layer convolutional features and learn the discriminability of multi-scale representations from the intra-layer features, leading to accurate segmentation results. We carry out experiments on three different medical image segmentation tasks, and the results demonstrate that our UHA-Net outperforms state-of-the-art segmentation methods. Our implementation code and segmentation maps will be made publicly available at https://github.com/taozh2017/UHANet.

Abstract:
Graph Contrastive Learning (GCL) has achieved great success in self-supervised representation learning through positive and negative pairs based on graph neural networks (GNNs). One critical issue lies in how to handle the false hard negatives, which share a large similarity with the anchor and belong to the same class as the anchor; this is critical for the message passing of GNNs to exploit the graph structure. However, existing methods either mistakenly identify or miss the false hard negatives, hence resulting in poor node representation. Building on this, there are several crucial bottlenecks: Where do false hard negatives exist relative to the anchor? How can false hard negatives be found effectively? Are more false hard negatives better? To answer these questions, in this paper, we propose a novel Locally Weighted Graph Contrastive Learning method, named LocWGCL, while revealing that false hard negatives are primarily distributed in the first-order and second-order neighborhoods of the anchor. Benefiting from the tightness between the first-order nodes and the anchor, representation similarity is calculated to select false hard negatives. For the second-order case, false hard negatives are identified as nodes that share similar passed messages with the anchor over the common first-order nodes, along with a large similarity. Following this search process, we devise a weighting strategy for false hard negatives to obtain better node representation. Empirical studies verify the advantages of LocWGCL over state-of-the-art methods on six benchmarks.

Abstract:
This paper proposes a unified and efficient entropy coding method for learned image compression (LIC) from the perspective of traditional signal processing. First, the consistency of structures and optimization objectives is used to interpret the existing split-coded-then-merge entropy coding strategies in LIC as a particular filter bank framework, with feature separation and feature aggregation representing the analysis filter bank and synthesis filter bank, respectively. Thus, we borrow the design from multirate filter banks and propose a Multirate Progressive Entropy Model (MPEM) to enhance the rate-distortion performance and decoding speed. In particular, we create an analysis filter bank that divides compact features into a few nonuniform subsets based on various spatial and channel sampling rates. Then multi-scale detail and mean coefficients within the current subset are used as prior representations to help generate the prediction parameters of the next subset, and the carefully designed synthesis filter bank performs a near-perfect reconstruction of the features. In addition, we propose a Multi-level Edge Attention Module (MEAM), which leverages the edge operator and structural reparameterization principles, to increase the contribution of edge and texture information and reduce the high-frequency information loss brought on by MPEM’s inherent multi-rate spatial sampling. The experimental results show that, in comparison to effective LIC methods and traditional codecs, the proposed MPEM can decode data at a cutting-edge speed while also offering comparable rate-distortion performance.

Abstract:
Image captioning (IC) takes an image as input and generates open-form descriptions in the domain of natural language. IC requires detecting objects, modeling the relations between them, assessing the semantics of the scene, and representing the extracted knowledge in a language space. Previous detector-based models suffer from limited semantic perception capability due to predefined object detection classes and the semantic inconsistency between visual region features and the numeric labels of the detector. Inspired by the fact that text prompts in pre-trained multi-modal models contain specific linguistic knowledge rather than discrete labels, and excel at open-form semantic understanding of visual inputs and their representation in natural language, we aim to distill and leverage the transferable language knowledge from the pre-trained RegionCLIP model to remedy the detector for generating rich image captions. In this paper, we propose a novel Cascade Semantic Prompt Alignment Network (CSA-Net) to produce an aligned fine-grained regional semantic-visual space in which rich and consistent textual semantic details are automatically incorporated into region features. Specifically, we first align the object semantic prompt and region features to produce semantically grounded object features. Then, we employ these object features and the relation semantic prompt to predict the relations between objects. Finally, these enhanced object and relation features are fed into the language decoder to generate rich descriptions. Extensive experiments conducted on the MSCOCO dataset show that our method achieves new state-of-the-art performance, with 145.2% (single model) and 147.0% (ensemble of 4 models) CIDEr scores on the ‘Karpathy’ split, and 141.6% (c5) and 144.1% (c40) CIDEr scores on the official online test server. Significantly, CSA-Net generates captions with higher quality and diversity, achieving a RefCLIP-S score of 83.2. Moreover, we extend the evaluation to another challenging captioning benchmark, the nocaps dataset, where CSA-Net demonstrates superior zero-shot capability. Source code is released at https://github.com/CrossmodalGroup/CSA-Net.

Abstract:
Replacing objects in images is a practical functionality of Photoshop, e.g., clothes changing. This task is defined as Unsupervised Deformable-Instances Image-to-Image Translation (UDIT), which maps multiple foreground instances of a source domain to a target domain, involving significant changes in shape. Although previous works incorporate instance masks of source domain for instance shape indication, their translation still fails in shape because of inadequate utilization of shape information in masks. To mitigate this issue, we introduce an effective two-stage pipeline for UDIT called Mask-Guided Deformable-instances GAN++ (MGD-GAN++), which generates target masks in the first stage named Mask Morph and utilizes the masks to guide the synthesis of corresponding instances in the second stage named Mask-Guided Image Generation. To further provide sufficient supervision with existing unpaired datasets, an overall set of training schemes is proposed for the two stages of MGD-GAN++, coined as Aligned Supervision and Inpainting Supervision, respectively. Extensive experiments on four datasets demonstrate the significant advantages of our MGD-GAN++ over existing methods both quantitatively and qualitatively. Furthermore, our training time consumption is hugely reduced compared to the state-of-the-art.

Abstract:
Despite recent progress, Video Object Segmentation (VOS) remains challenging in complex situations such as low light and dark scenes. In this paper, we tackle the visibility limitations by introducing thermal information as an auxiliary modality for VOS. Specifically, we construct a hybrid benchmark dataset for Visible-Thermal VOS, named VisT300, which contains 300 challenging videos with visible-light and thermal frames and corresponding object mask annotations. In addition, a Visible-Thermal integration Network, named VTiNet, is proposed to use both cross-modal and cross-frame propagation for accurate video object segmentation. It is advantageous in two aspects: 1) effective cross-modal feature fusion and propagation yield strong representations on the visible, thermal, and fused modalities; 2) an effective modality-sensitive memory bank preserves the most valuable historical contexts in each modality. Extensive experiments demonstrate that VTiNet outperforms state-of-the-art VOS methods by a large margin (over 5% higher than RGB state-of-the-art methods in mean $\mathcal{J}\&\mathcal{F}$). Our preliminary research clearly reveals that introducing complementary modalities can effectively strengthen models to achieve robust segmentation in challenging scenarios. Data and code are released at https://github.com/yjybuaa/vtinet, and we hope this work will promote the progress of visible-thermal VOS.

Abstract:
Video anomaly detection, which focuses on abnormal events contained within untrimmed videos, is attracting increasing interest among researchers. Among different video anomaly detection scenarios, weakly-supervised video anomaly detection poses a significant challenge because it lacks frame-wise labels during the training stage and relies only on video-level labels as coarse supervision. Previous methods have attempted either to learn discriminative features in an end-to-end manner or to employ a two-stage self-training strategy to generate snippet-level pseudo labels. However, both approaches have limitations: the former tends to overlook informative features at the snippet level, while the latter can be susceptible to noise. In this paper, we propose an Anomalous Attention mechanism for weakly-supervised anomaly detection to tackle these problems. Our approach takes into account snippet-level encoded features without the supervision of pseudo labels. Specifically, it first generates snippet-level anomalous attention and then feeds it, together with the original anomaly scores, into a Multi-branch Supervision Module. The module learns different areas of the video, including areas that are challenging to detect, and also assists the attention optimization. Experiments on the benchmark datasets XD-Violence and UCF-Crime verify the effectiveness of our method. Moreover, thanks to the proposed snippet-level attention, we obtain more precise anomaly localization.

Abstract:
Image-to-image (I2I) translation often requires establishing cycle consistency between the source and the translated images across different domains. However, cycle consistency requires redundant reconstruction, and is too restrictive to satisfy the bijection assumption between the two domains. In this paper, we propose SwinIT, a hierarchical Swin-transformer I2I Translation framework without using cycle consistency. Specifically, we carefully design symmetrical encoders for content and style flows, then explore newly proposed adaptive denormalization and normalization strategies. This framework can effectively capture and fuse content and style representations in a coarse-to-fine manner, ensuring our method achieves high performance without cycle consistency. Guided by element-wise feature adaptive denormalization, our model focuses on preserving semantic structure information. Due to the semantic mismatch between unpaired source and exemplar images, we introduce cross-attention adaptive instance normalization to help achieve better alignment. However, because the original optimization objective lacks direct supervision to preserve high-frequency information, rich edge details are lost during the translation. We propose a wavelet transformation matching loss to recover the details by converting the image into multi-frequency parts. We validate our proposed method in various I2I translation tasks, including arbitrary style transfer, multi-modal image synthesis, and semantic image synthesis, demonstrating its effectiveness in both qualitative and quantitative evaluations.

Abstract:
Interpretation of predictions made by Convolutional Neural Networks (CNNs) is a rapidly growing field of research. A common approach involves enhancing semantic segmentation predictions through the generation of heatmaps that illustrate the significance of individual pixels in the segmentation. Nevertheless, selecting beneficial features from these heatmaps remains a challenge, because the introduced information often contains interfering factors, such as features shared between different objects, background, and insufficient heatmap resolution, which diminish its effectiveness. To overcome these limitations, we introduce Refined Weak Slices (RWS). Our main idea is to identify low-attention regions in heatmaps, i.e., weak slices, in conjunction with segmentation accuracy, and to use them to select effective features across different DNN layers. We then integrate these selected features back into the CNN, thus refining and enhancing the semantic segmentation result. Through extensive experiments, we demonstrate that incorporating the RWS module into state-of-the-art methods improves the average mIoU by 2.84% on benchmark datasets (VOC 2012, COCOStuff, ADE20K, Cityscapes) for both ResNet-101 and ResNet-50 architectures. Furthermore, we achieve a maximum improvement of 5.8% with a single CNN. Overall, the combination of RWS and CNNs exhibits excellent performance in image segmentation tasks.

Abstract:
Learning-based point cloud registration has achieved great success in recent years but is still limited by its generalization. The performance of these methods declines when they are extended to unseen datasets that have inconsistent distributions with the training set. In this paper, we propose a novel random network-based method, which does not require training. Our approach utilizes multiple randomly initialized networks for feature extraction and correspondence building. Furthermore, we also introduce a co-ensemble strategy to prune the outliers in correspondences built upon random networks, which leverages spatial consistency. Through our co-ensemble pruning, a large proportion of outliers can be removed, thereby achieving robust registration in affordable RANSAC iterations. Extensive experiments on 3DMatch and KITTI demonstrate that our method outperforms not only the traditional methods but also the learning-based methods trained on datasets inconsistent with the test set. The code will be released at https://github.com/phdymz/RandPCR.
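
The spatial-consistency idea mentioned above can be illustrated with a minimal sketch. It is not the co-ensemble strategy of the paper, whose details the abstract does not give; it only shows the underlying geometric fact that a rigid motion preserves distances, so two correct correspondences must agree on the distance between their points. The function name, the tolerance `tau`, and the vote count `min_support` are assumptions made for illustration.

```python
import math


def prune_by_spatial_consistency(src, dst, tau=0.1, min_support=2):
    """Return indices of correspondences src[i] -> dst[i] that are
    length-consistent with at least `min_support` other correspondences.

    Hypothetical helper: `tau` (distance tolerance) and `min_support`
    (required number of agreeing pairs) are illustrative parameters.
    """
    n = len(src)
    keep = []
    for i in range(n):
        support = 0
        for j in range(n):
            if i == j:
                continue
            # A rigid transform preserves pairwise distances, so two inlier
            # correspondences must give (almost) equal lengths on both sides.
            d_src = math.dist(src[i], src[j])
            d_dst = math.dist(dst[i], dst[j])
            if abs(d_src - d_dst) < tau:
                support += 1
        if support >= min_support:
            keep.append(i)
    return keep


# Three correspondences follow a pure translation; the fourth is an outlier.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(1.0, 2.0, 3.0), (2.0, 2.0, 3.0), (1.0, 3.0, 3.0), (10.0, 10.0, 10.0)]
inliers = prune_by_spatial_consistency(src, dst)
```

Pruning of this kind shrinks the candidate set before RANSAC, which is why fewer iterations can suffice; the quadratic pairwise loop is kept for clarity rather than speed.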

Abstract:
Semantic segmentation based on 4D light field (LF) images exhibits superior performance by exploiting rich spatial and angular information. However, current methods focus only on narrow-baseline cases, ignoring the feasibility and capability of large-disparity scenes for segmentation. Motivated by this, we propose a novel network called LF-IENet++, suitable for both narrow-baseline and wide-baseline LF, which fully mines complementary information across views via implicit feature integration and explicit feature propagation. To concentrate on inconsistent context between view images during feature integration, we mask small-disparity regions, which appear as repeated content, to avoid redundant attention. In addition, a two-stage operation consisting of image-level warping and feature-level warping is introduced to mitigate propagation distortion. Since both feature integration and feature propagation require exact guidance from prior disparity, we design a semantic-aware disparity estimator that leverages semantic cues to optimize disparity generation while ensuring that our network performs semantic segmentation end to end. To validate the effectiveness of the proposed method, we present the first multi-scale baseline dataset for LF semantic segmentation. Compared with state-of-the-art methods, LF-IENet++ achieves outstanding performance and shows high robustness under different disparity conditions. Moreover, our method obtains higher accuracy on wide-baseline cases, demonstrating the significance of introducing large-disparity LF for semantic segmentation.

Abstract:
Recently, Few-Shot Object Detection (FSOD) has received considerable research attention as a strategy for reducing reliance on extensively labeled bounding boxes. However, current approaches encounter significant challenges due to the intrinsic issue of incomplete annotation while building the instance-level training benchmark. In such cases, the instances with missing annotations are regarded as background, resulting in erroneous training gradients back-propagated through the detector, thereby compromising the detection performance. To mitigate this challenge, we introduce a simple and highly efficient method that can be plugged into both meta-learning-based and transfer-learning-based methods. Our method incorporates two innovative components: Confusing Proposals Separation (CPS) and Affinity-Driven Gradient Relaxation (ADGR). Specifically, CPS effectively isolates confusing negatives while ensuring the contribution of hard negatives during model fine-tuning; ADGR then adjusts their gradients based on the affinity to different category prototypes. As a result, false-negative samples are assigned lower weights than other negatives, alleviating their harmful impacts on the few-shot detector without the requirement of additional learnable parameters. Extensive experiments conducted on the PASCAL VOC and MS-COCO datasets consistently demonstrate that our method significantly outperforms both the baseline and recent FSOD methods. Furthermore, its versatility and efficiency suggest the potential to become a stronger new baseline in the field of FSOD. Code is available at https://github.com/Ybowei/UNP.

Abstract:
Neural Architecture Search (NAS) is a powerful tool for automating the design of effective DNNs for image and video processing. Ranking architectures by accuracy has been advocated as a way to design an efficient performance predictor for NAS. The previous contrastive method solves the ranking problem by comparing pairs of architectures and predicting their relative performance. However, it focuses only on the ranking between the two architectures involved and neglects the overall quality distribution of the search space, which may lead to generalization issues. In contrast, we propose to let the performance predictor concentrate on the global quality level of a specific architecture and to learn the tier embeddings of the whole search space automatically with learnable queries. The proposed method, dubbed Neural Architecture Ranker with Query-to-Tier technique (NARQ2T), explores the quality tiers of the search space globally and classifies each architecture into the tier it belongs to. The predictor thus gains knowledge of the performance distribution of the search space, which helps it generalize its ranking ability to the datasets more easily. Thanks to the encoder-decoder design, our method can predict the latency of the searched model without degrading the performance prediction. Meanwhile, the global quality distribution facilitates the search phase by directly sampling candidates according to the statistics of quality tiers. This removes the need to train a search algorithm, e.g., Reinforcement Learning or an Evolutionary Algorithm, which simplifies the NAS pipeline and reduces the computational overhead. The proposed NARQ2T achieves state-of-the-art performance on two widely used datasets for NAS research. Moreover, extensive experiments have validated the efficacy of the designed method.

Abstract:
Super-Resolution (SR) algorithms aim to enhance the resolution of images. A large number of deep-learning-based SR techniques have emerged in recent years. With such techniques, a visually appealing output may contain additional details compared with its reference image. Accordingly, fully referenced Image Quality Assessment (IQA) cannot work well; however, reference information remains essential for evaluating the quality of SR images. This poses a challenge to SR-IQA: how should the referenced and no-reference scores be balanced to match user perception? In this paper, we propose a Perception-driven Similarity-Clarity Tradeoff (PSCT) model for SR-IQA. Specifically, we investigate this problem from both referenced and no-reference perspectives, and design two deep-learning-based modules to obtain referenced and no-reference scores. We present a theoretical analysis of their tradeoff based on Human Visual System (HVS) properties and also calculate adaptive weights for them. Experimental results indicate that our PSCT model is superior to state-of-the-art methods on SR-IQA. In addition, the proposed PSCT model is also capable of evaluating quality scores in other image enhancement scenarios, such as deraining, dehazing, and underwater image enhancement. The source code is available at https://github.com/kekezhang112/PSCT.

Abstract:
Video moment retrieval aims to locate the timestamps best matching the query description within an untrimmed video. However, existing video moment retrieval approaches typically suffer from two major limitations: (1) they use only negative moment-sentence pairs sampled from within the same video, which may overfit the bias of the dataset and yield a poor understanding of the video and query due to the dataset size and annotation biases; (2) they decouple the video and the query, perform unimodal learning separately, and then concatenate the results as multimodal fusion features. In this paper, we propose a novel approach named Momentum Contrastive Matching Network (MCMN). Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative sample interactions, which contributes to more precise and discriminative representations, and we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. In addition, we use an attention module to adaptively generate clip-specific word embeddings that achieve semantic alignment from a temporal perspective, which is considered more important for finding relevant video content with large boundary ambiguities. Experimental results on three major video moment retrieval benchmark datasets, TACoS, Charades-STA, and ActivityNet Captions, demonstrate that MCMN surpasses previous methods and reaches state-of-the-art performance with disparate visual features.

Abstract:
The objective of video re-localization (VRL) is to localize a successive sequence of frames, namely the target moment, in untrimmed reference videos that semantically corresponds to a given query video. During training, the weakly supervised setting of VRL provides only coarse-grained video-level annotations rather than frame-level ones. For the weakly supervised VRL (WS-VRL) task, two problems remain challenging: obtaining effective video feature representations that can be used to evaluate the relevance between videos, and localizing the accurate temporal boundaries of the target moment. In this paper, a novel multi-agent-reinforced switchable network (MARS) is proposed to address these challenges. MARS adaptively guides video feature encoding and moment localization using multiple learned agents. Specifically, an agent-controlled switchable encoder is used to obtain effective video feature representations, and an agent-reinforced boundary localizer is used to determine accurate localized moments through progressive refinement. Furthermore, a relevance-oriented reward generator is designed to estimate the relevance of the localized moment to the query video and to assign a reward to the multiple agents. The effectiveness of the proposed MARS model is verified through extensive experiments on the ActivityNet-VRL dataset.

Abstract:
The Versatile Video Coding (VVC) standard adopts a series of new coding tools in transform and quantization, including multiple transform selection, low-frequency non-separable transform, and trellis quantization. These new technologies, which bring significant coding gain, create daunting challenges to optimizing the VVC codec. In this work, we propose a new all-zero block (AZB) detection scheme tailored for VVC, with the collaboration of genuine all-zero block (GAZB) and pseudo all-zero block (PAZB) detection. First, to accommodate the multiple transform sizes in VVC, we develop a GAZB detection method that is apt for square and non-square residual blocks. Meanwhile, a theoretical upper bound is derived to locate the last significant coefficient and detect the potential frequency domain GAZB. Subsequently, a method tailored for trellis-coded quantization in VVC is devised for detecting PAZB. Finally, the GAZB and PAZB detection methods are collaboratively employed for AZB detection in VVC. The proposed method is implemented on the VVC codec Versatile Video Encoder (VVenC), and extensive experimental results show that the proposed method achieves promising time savings for test sequences of different resolutions with negligible rate-distortion performance loss.
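
To make the notion of an all-zero block concrete, the following minimal sketch checks whether every transform coefficient of a residual block quantizes to zero. It assumes a plain dead-zone uniform scalar quantizer, `level = floor(|c| / qstep + offset)`, not the trellis-coded quantization of VVC or the detection conditions derived in the paper; the function names and the `offset` value are illustrative assumptions.

```python
import math


def quantize(coeff, qstep, offset=1 / 3):
    """Dead-zone uniform scalar quantizer (an assumed, simplified model)."""
    level = math.floor(abs(coeff) / qstep + offset)
    return level if coeff >= 0 else -level


def is_all_zero_block(coeffs, qstep, offset=1 / 3):
    """True if every coefficient of the block quantizes to level 0.

    Under the quantizer above, a coefficient c maps to 0 exactly when
    |c| / qstep + offset < 1, i.e. |c| < (1 - offset) * qstep, so the block
    can be tested against one threshold without quantizing each value.
    `coeffs` is a list of rows and may be square or non-square.
    """
    threshold = (1.0 - offset) * qstep
    return all(abs(c) < threshold for row in coeffs for c in row)


small_block = [[3.0, -2.0, 0.5, 0.0], [1.5, 0.0, -0.2, 0.1]]  # 2x4 block
large_block = [[6.0, 0.0], [0.0, 0.0]]                        # 2x2 block
skip_small = is_all_zero_block(small_block, qstep=8.0)
skip_large = is_all_zero_block(large_block, qstep=8.0)
```

An encoder that can predict this outcome before the transform and quantization stages can skip them for the block, which is where the time savings of early all-zero block detection come from.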

Abstract:
Object Tracking in satellite videos is a challenging task due to the small target size, low spatial resolution, limited appearance and texture information, and the potential for background confusion. While current state-of-the-art tracking methods perform well on natural images, they often produce unsatisfactory results when applied to satellite videos. In this paper, we address these challenges by leveraging location prompts and refining the feature extractor and bounding box refinement module. Furthermore, we integrate motion features to effectively handle illumination variations that frequently arise in satellite videos, thereby enhancing the overall robustness of the tracker. Our proposed approach, abbreviated as SVLPNet, has been thoroughly evaluated through extensive experiments conducted on two authentic satellite video datasets. The obtained results unequivocally showcase the promising potential of SVLPNet in facilitating object tracking on satellite videos. The source code and raw results will be released at https://github.com/Wprofessor/SVLPNet.

Abstract:
Spatio-temporal resolution adaptive (STRA) coding has been repeatedly proven to be a promising way to improve coding efficiency and reduce coding complexity. The wide consensus is that the optimal subsampled resolution and frame rate should be governed by so-called generalized rate-distortion performance based on the ultimately perceived distortion. However, it is non-trivial to accurately predict the quality of reconstructed videos due to the fact that the distortion originates from both subsampling and compression. To address this issue, we propose a novel video quality assessment model that is fully aware of the information available in downsampled videos for compression, such as resolution and frame rate. More specifically, the proposed model relies on quality-aware spatial features that are extracted by an image quality fine-tuned backbone. Subsequently, the spatio-temporal quality is modeled based on the transformer encoder, which is adaptive to the downsampling spatial and temporal resolutions. This enables the transformer encoder to produce discriminative features that capture long-range temporal dependencies related to the current context. The quality score, which is the output of the transformer encoder, thus reflects both the influence of the subsampling and compression. We conduct extensive experiments that demonstrate the superiority of the proposed model over state-of-the-art methods on four subsampling and compression video quality datasets. Furthermore, we apply the proposed model to bitrate ladder optimization, leading to a perceptual-aware spatial and temporal downsampling strategy that yields promising bitrate savings. The source codes of the proposed model will be publicly available at https://github.com/h4nwei/STRA-VQA.

Abstract:
Previous work has built a close link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's potential ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, and consequently acts as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. The observed properties allow us to put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models to achieve the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of the proposed method. Additionally, we offer new interpretations that give deeper insight into the gap issue, i.e., better teacher, worse student, and into the success of multi-generation self-distillation. Code will be made available at https://github.com/swift1988.
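
As a small illustration of the quantity discussed above, the sketch below computes one common form of intra-class variance: the mean squared distance of each feature vector to the centroid of its class. The abstract does not state the exact definition used in the paper, so this formula and the function name are assumptions.

```python
def intra_class_variance(features, labels):
    """Mean squared Euclidean distance of each feature to its class centroid.

    Hypothetical helper: `features` is a list of equal-length numeric
    vectors and `labels` the class label of each vector.
    """
    by_class = {}
    for feature, label in zip(features, labels):
        by_class.setdefault(label, []).append(feature)

    total, count = 0.0, 0
    for feats in by_class.values():
        dim = len(feats[0])
        # Class centroid: per-dimension mean of the class's features.
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        for f in feats:
            total += sum((f[d] - centroid[d]) ** 2 for d in range(dim))
            count += 1
    return total / count


# Two classes whose members each lie at distance 1 from their centroid.
features = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0), (5.0, 7.0)]
labels = [0, 0, 1, 1]
spread = intra_class_variance(features, labels)
# Collapsing every class to a single point gives zero variance.
collapsed = intra_class_variance([(1.0, 1.0), (1.0, 1.0)], [0, 0])
```

Applied to the penultimate-layer features of two teachers, a measure of this kind would let the observation above be checked directly: the teacher whose classes are less tightly collapsed has the larger value.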

Abstract:
Few-shot 3D point cloud segmentation segments novel categories in point cloud scenes with only limited annotations. However, most current methods do not consider query content when exploring support prototypes, and thus suffer from intra-class variations between objects and incomplete representation of category information from annotated support samples. In this paper, we propose a novel Query-Guided support Prototype exploration Network (QGPNet) to tackle this challenge. Firstly, we present a point feature alignment module, which leverages geometry relationship between prototypes and query points, to tackle data misalignment caused by intra-class variations, and thus prevents incorrect label propagation from prototypes to query points. Secondly, we design a prototype feature mining strategy, which progressively harvests diverse support prototypes in the interaction with query features, to fully utilize the category information provided by annotated samples. Additionally, we introduce a semantic-aware data augmentation strategy for query samples in the training process, potentially improving the generalization ability of support prototypes on query samples. Extensive experiments on two indoor 3D datasets S3DIS and ScanNet demonstrate that QGPNet outperforms previous state-of-the-art methods by a large margin.

Abstract:
Much progress has been made in reconstructing garments from an image or a video. However, no existing work meets the expectation of digitizing high-quality animatable dynamic garments that can be adjusted to various unseen poses. In this paper, we propose the first method to recover high-quality animatable dynamic garments from monocular videos without depending on scanned data. To generate reasonable deformations for various unseen poses, we propose a learnable garment deformation network that formulates the garment reconstruction task as a pose-driven deformation problem. To alleviate the ambiguity of estimating 3D garments from monocular videos, we design a multi-hypothesis deformation module that learns spatial representations of multiple plausible deformations. Experimental results on several public datasets demonstrate that our method can reconstruct high-quality dynamic garments with coherent surface details, which can be easily animated under unseen poses. The code is available for research purposes at https://cic.tju.edu.cn/faculty/likun/projects/DGarment.

Abstract:
Point-by-point labeling of point clouds is a very costly task. Previous meta-learning-based few-shot methods predict categories by calculating the distance between unlabeled data (the query set) and prototypes computed from a small amount of labeled data (the support set), which reduces the dependence of point cloud segmentation algorithms on large amounts of labeled data. However, these methods ignore the category information gap caused by object diversity between the two types of data, and forcing information transfer across this gap is ineffective. To address this issue, we propose a co-occurrent object mining module that mines co-occurring object information from the support and query sets. Specifically, the captured co-occurrent information is used to activate the features that co-occur between the support and query sets in the high-dimensional feature space, so that the prototype generated by computing the mean of the support features is more similar to the query set. By reducing the object diversity within the same category, the information gap is gradually narrowed. In addition, we propose a point-attention module to refine the support set features before mining co-occurrent features; it can be widely embedded in point cloud backbone networks. Experimental results on two semantic segmentation datasets demonstrate that our method obtains an average lead of 19.43% over state-of-the-art methods in 4 different few-shot tasks, while inference is around 45 times faster.
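
The prototype-based prediction that this abstract takes as its starting point can be sketched in a few lines: each class prototype is the mean of the support features of that class, and each query point is assigned to the class of the nearest prototype. This covers only the baseline scheme, not the co-occurrent object mining or point-attention modules; the function names and the use of Euclidean distance are assumptions for illustration.

```python
import math


def class_prototypes(support_features, support_labels):
    """Prototype of each class = mean of its support feature vectors."""
    sums, counts = {}, {}
    for feature, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        for d, value in enumerate(feature):
            sums[label][d] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(query_features, prototypes):
    """Assign each query feature to the class of its nearest prototype."""
    return [
        min(prototypes, key=lambda label: math.dist(q, prototypes[label]))
        for q in query_features
    ]


support = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
support_labels = ["floor", "floor", "chair", "chair"]
prototypes = class_prototypes(support, support_labels)
predictions = classify([(1.0, 1.0), (9.0, 9.0)], prototypes)
```

Because the prototype is a plain mean, any mismatch between support and query objects of the same category shifts it away from the query features, which is the information gap the abstract describes.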

Abstract:
Fashion image generation is attracting increasing attention, with wide applications in fashion design, virtual try-on, the cosmetic industry, etc. Editing clues such as segmentation masks, keypoints, and sketches are usually used to guide the desired transformation of a reference image. However, spatial manipulation of the reference image remains a challenge, especially in the face of large-scale deformations and multiple editing requirements. In this paper, we propose a general model for multiple fashion editing tasks, such as facial editing, pose transformation, and clothes design, based on user-defined editing instructions like semantic segmentation masks, keypoints, and sketches. With diverse editing requirements and various deformation scales, it is hard to learn the correspondence between the editing clue and the reference image with a uniform framework. Accordingly, we design a feature flow estimation network, which adaptively adjusts the feature flow according to the editing clue and the reference image and generates a coarsely aligned image. We then propose an image generative network to enrich the texture details of the transformed reference image. Experiments on three tasks verify the effectiveness of the proposed method and its adaptability to multiple tasks. The code and pretrained models will be available at https://github.com/zengjianhao/Fashion-Image-Generation-Based-on-Editing-Clue.

Abstract:
Maintaining identity consistency and avoiding ID switches during tracking is one of the primary focuses of multiple object tracking (MOT). One-shot MOT methods, which jointly learn the detection and tracking models in a single network (hence the name one-shot), have achieved promising results in tracking accuracy and speed. However, their ability to maintain ID consistency is somewhat weakened. The reason for this weakened ID consistency is two-fold: 1) the ID features learned by one-shot methods are not discriminative enough due to their heatmap-based single-location representation; 2) severe occlusion in the MOT scene leads to feature ambiguity and frequent ID switches. In this paper, we propose a one-shot MOT system with strong ID consistency called PID-MOT (Preserved ID MOT). Specifically, we devise a visibility branch to predict the object occlusion level, and the predicted visibility map is used in both a Feature Refinement Model (FRM) and a visibility-guided two-stage association strategy (VGTAS). FRM is designed to strengthen the location-based features and enrich the identity information. VGTAS is proposed to handle objects with high and low visibility separately. In addition, we initialize the parameters of our model by training from scratch on the recently released, abundant synthetic MOTSynth dataset rather than on the commonly used COCO dataset. Finally, we evaluate our method on commonly used MOT datasets, and the experimental results demonstrate that the proposed PID-MOT achieves especially good performance in ID F1 score (IDF1) and ID-Switch (IDS) compared with other state-of-the-art one-shot trackers, with comparable overall HOTA/MOTA performance. The code is available at https://github.com/Kroery/PIDMOT.

Abstract:
Visible-infrared person re-identification (VI-ReID) has attracted increasing attention in night-time surveillance applications because visible cameras struggle to capture valid appearance information under poor illumination conditions. Existing works usually separate the modality-specific and modality-irrelevant information in visible and infrared features, or directly project the features of the two modalities into a unified embedding space, aiming to eliminate the huge modality discrepancies. However, these methods neglect the intra-modality and inter-modality correlations. We argue that the correlations can implicitly guide the network to discover the modality-irrelevant information, and are thus more beneficial for eliminating huge modality discrepancies and preserving individual differences. To this end, we propose a novel framework, termed correlation-guided semantic consistency network (CSC-Net), to explore and exploit the intra-modality and inter-modality correlations. Specifically, CSC-Net consists of a cross-modality semantic alignment (CSA) module, a cross-granularity discrepancy awareness (CDA) module, and a probability consistency constraint (PCC) module. CSA mines the inter-modality correlation by calculating the semantic similarity between modalities to explore modality-irrelevant features, and then transfers the learned features to the backbone network so that it can handle inputs consisting of single-modality images only. To preserve the individual differences, CDA fully utilizes the intra-modality correlation by exploring multi-granularity discriminative information. Finally, PCC constrains the network at the probability level, cooperating with CSA, which constrains it at the feature level, to further alleviate the modality discrepancy. Extensive experiments on two public VI-ReID datasets, SYSU-MM01 and RegDB, have verified the effectiveness of our approach.

Abstract:
Most existing person re-identification (Re-ID) methods rely on high-cost manual annotations. To address this applicability issue, we focus on a novel semi-supervised Re-ID setting without cross-camera annotations, which we call random camera supervised person Re-ID (RCS). It is beneficial to real-world applications, since quick and cheap annotation is conducive to rapid deployment of person Re-ID. However, only a small proportion of identities are annotated under a random camera, which makes cross-camera pairing extremely challenging, as there is neither cross-camera labeling nor a labeled image for each identity. Towards reliable cross-camera learning, we propose a random camera guided framework (RCG) that makes full use of the few labeled images and achieves promising performance. RCG has two components: 1) Unlike other, more complex methods for improving clustering accuracy, random camera guided clustering is adopted to mine cross-camera images of each identity, where the few labeled data help to guide a simple but effective cluster split and combination. 2) Network learning under RCS is conducted with cluster-wise and camera-wise contrastive learning, where we handle the camera variance at the subgroup level and further emphasize the importance of the labeled images. Extensive experiments on three large-scale Re-ID datasets show that our proposed approach not only outperforms state-of-the-art methods by a large margin, but also achieves better performance with less annotation under the more flexible RCS setting.

Abstract:
Image-guided depth completion (IGDC) is a multimodal computer vision task for acquiring high-precision dense depth maps. It reasonably predicts values around accurate sparse depth measurements by relying on the details of simultaneous dense RGB images. To achieve this goal, spatial propagation networks (SPNs) elaborate context-aware meta cells that connect pixels to their neighbors and fuse sparse and dense modalities by linear propagation. However, static affinity matrices and fixed neighborhood connections limit the representation of the networks. In this paper, our proposed dynamic SPN (DySPN) uses a nonlinear propagation model (NLPM), which processes the propagation more finely by adjusting the affinity weights, diffusion paths, and the number of neighbors. Specifically, we first generate adaptive weighting (AW) matrices by decoupling the neighborhood into parts with respect to different distances. Independent attention maps are recursively applied to refine the weight value. Furthermore, a dynamic path (DP) strategy is adopted to unfreeze the links of the neighborhood for learning variable connections. The solution space of the paths is also constrained by a propagation decay loss to keep the results stable. Finally, we introduce a diffusion suppression (DS) operation, which preserves the edge of dense depth maps by manipulating the AW and DP strategies to decelerate and terminate the propagation. In our experiment, the proposed method requires fewer iterations and neighbors than other SPNs while yielding better results. DySPN outperforms state-of-the-art (SoTA) methods on the KITTI DC, NYU Depth v2, and VOID datasets. Our code is available at: https://github.com/Kyakaka/DySPN.
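
To make the "linear propagation" baseline mentioned above concrete, the following is a minimal sketch of one propagation step with a static affinity kernel and a fixed 8-neighbourhood, i.e. the setting the abstract describes as limiting. It is not DySPN: the function names, the single kernel shared by all pixels, and the re-injection of sparse measurements after each step are assumptions made for illustration only.

```python
def uniform_kernel(weight):
    """Hypothetical static affinity: the same weight for all 8 neighbours."""
    return {(dy, dx): weight
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)}


def propagate_once(depth, neighbour_w, sparse):
    """One linear propagation step over a fixed 3x3 neighbourhood.

    depth       : H x W list of lists, current dense depth estimate
    neighbour_w : dict mapping (dy, dx) offsets to static affinity weights
    sparse      : H x W list of lists, a measurement or None per pixel
    """
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, wsum = 0.0, 0.0
            for (dy, dx), wt in neighbour_w.items():
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    acc += wt * depth[ny][nx]
                    wsum += wt
            # The centre weight is chosen so that all weights sum to one.
            out[y][x] = (1.0 - wsum) * depth[y][x] + acc
            # Assumption: trusted sparse measurements are kept unchanged.
            if sparse[y][x] is not None:
                out[y][x] = sparse[y][x]
    return out
```

Calling `propagate_once` repeatedly spreads the sparse values outward; because the kernel and neighbourhood never change, every pixel diffuses in the same way, which is the restriction that adaptive weights, dynamic paths, and diffusion suppression are meant to lift.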

Abstract:
Clear and high-resolution (HR) underwater images are indispensable in acquiring underwater information. However, existing underwater image enhancement and super-resolution (UIESR) networks achieve limited enhancement and super-resolution performance on real-world turbid low-resolution (LR) underwater images because 1) they assume that the resolution degradation is simple, known bicubic down-sampling, which generates unrealistic training data for the UIESR task; 2) they extract known priors from the underwater imaging model, which are insufficient to address complex UIESR problems caused by unknown mixed dual-degradation; and 3) they ignore the interaction between blurring and color casts in the RGB color space, leading to unsatisfactory correction of the two distortions. To address these issues, we propose a realistic UIESR network (RUIESR) consisting of three parts: a realistic LR image generation module (RLGM), a dual-degradation estimation module (DEM), and an enhancement and super-resolution module (ESRM). First, RLGM generates training LR images that obey the underwater LR image distribution by learning real LR properties from unpaired real LR-HR underwater images. Second, a contrast-driven learning strategy is proposed in the DEM to accurately estimate unknown dual-degradation priors that can aid the reconstruction task. Finally, ESRM is proposed to enhance textures and correct color casts; it includes a dual-branch structure to separate blurring and color cast distortions and utilizes specific priors for each distortion to assist reconstruction. Extensive experiments on real and synthetic underwater datasets show that the proposed RUIESR outperforms existing works in terms of visual quality and quantitative metrics.

Abstract:
Homography estimation aligns image pairs in cross-views, which is a crucial and fundamental computer vision problem. Existing methods only consider correspondences of texture features for homography estimation, leading to unpleasant artifacts and misalignments introduced by mismatches, especially for low-texture image pairs. In contrast to others, we introduce intuitive structural information as an additional clue that is more sensitive to human vision and low-texture scenarios. In this paper, we propose an edge-aware unsupervised progressive network that couples texture and edge correlation to comprehensively explore potential matching features for homography estimation. To explore robust edge and texture features, we employ a multiscale network to capture feature pyramids with different receptive fields. Then, we design an edge-aware correlation module tailored for homography regression, which plugs in multiscale features to capture accurate correlation maps. Specifically, the edge-aware correlation module leverages the feature-selecting strategy for edge features to capture discriminative matching edges and further guides the texture correlation unit to focus on correctly matched textures. Finally, we leverage multiscale edge-aware correlation maps to predict homography progressively from coarse to fine. Experimental results demonstrate that our proposed method improves PSNR by 11.09% on the real large parallax dataset and reduces matching error by 32.04% on the synthetic COCO dataset, yielding more accurate alignment results than previous state-of-the-art methods.
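
For readers unfamiliar with the quantity being estimated, a homography is a 3x3 matrix that maps points of one view to the other in homogeneous coordinates. The sketch below only shows how an already estimated matrix is applied to a point; the function name `warp_point` is illustrative and is not taken from the paper.

```python
def warp_point(h, x, y):
    """Map the point (x, y) through the 3x3 homography h (nested lists)."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        # The point is mapped to infinity and has no finite image.
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xh / w, yh / w
```

For example, `warp_point([[1, 0, 5], [0, 1, -2], [0, 0, 1]], 3, 4)` applies a pure translation. Alignment quality is then judged by how well the warped points or pixels of one image coincide with the other image.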

Abstract:
To achieve secure and efficient storage and transmission of medical images, we propose a joint lossless compression and encryption (JLCE) scheme. First, according to the degree of intra-block correlation, the original medical image is divided into two non-overlapping regions, one with strong correlation and one with weak correlation. Then, a linear prediction method is applied to the strong-correlation region to generate prediction errors, while the integer discrete Tchebichef transform (iDTT), which offers energy compaction and perfect image reconstruction, is exploited to compact the energy of the weak-correlation region and produce the transformed coefficients. Finally, a secure arithmetic encoding algorithm is presented to encode the prediction errors and transformed coefficients and output the encrypted and compressed bitstream. Without the secret keys, the encoded output can still be decoded by our scheme, but the decoded result does not disclose the content of the original medical image. Experimental results show that the proposed scheme has satisfactory format compatibility and security, and also achieves a better compression ratio and computational efficiency than some state-of-the-art schemes.

Abstract:
Image segmentation models based on continual learning exhibit a critical drop in performance, mainly due to catastrophic forgetting and background shift, as they are required to incorporate new classes continually. In this paper, we propose a simple yet effective Continual Image Segmentation method with incremental Dynamic Query (CiSDQ), which decouples the representation learning of old and new knowledge with lightweight query embeddings. CiSDQ mainly includes three contributions: 1) We define dynamic queries with an adaptive background class to exploit past knowledge and learn future classes naturally. 2) CiSDQ proposes a class/instance-aware Query Guided Knowledge Distillation strategy to overcome catastrophic forgetting by capturing the inter-class diversity and intra-class identity. 3) Apart from semantic segmentation, CiSDQ introduces continual learning for instance segmentation, in which instance-wise labeling and supervision are considered. Extensive experiments on three datasets for two tasks (i.e., continual semantic and instance segmentation) are conducted to demonstrate that CiSDQ achieves state-of-the-art performance; specifically, it obtains 4.4% and 2.9% mIoU improvements for the ADE 100-10 (6 steps) and ADE 100-5 (11 steps) settings, respectively.

Abstract:
Global weather forecasting is an important spatial-temporal prediction problem, which can provide numerous societal benefits such as extreme weather forewarning, traffic scheduling, and agricultural planning. Though many spatial-temporal prediction models have been proposed, they suffer from two drawbacks for global weather forecasting, namely 1) ignoring the physical mechanism and spherical characteristics and 2) not effectively exploiting the global and local correlations. To address these drawbacks, in this paper, we formalize global weather state dynamics as partial differential equations (PDEs) in spherical space and infer the state of the global weather system by solving these PDEs. Specifically, we use Green’s function method to solve the PDEs and find that the solution of the spherical PDEs can be obtained by spherical convolution. We further propose a novel Spherical Neural Operator, SNO, which consists of spherical convolution and vanilla convolution. The former is used to solve these PDEs and model the global correlations in spherical space, and the latter is used to capture the local correlations. Based on this operator, a global weather prediction model is developed. Extensive experimental results demonstrate the effectiveness and superiority of our method over state-of-the-art approaches.

Abstract:
Self-supervised multi-frame depth estimation outperforms single-frame approaches by utilizing not only appearance information but also geometric information. A common practice for multi-frame methods is to employ feature-metric bundle adjustment (FBA) to refine the depth map initialized from the single-frame prior. However, FBA cannot always provide effective residual updates due to unreliable matching costs, which are corrupted by thin texture, occlusion, and especially object motion. To tackle this problem, we propose a context-aware transformer (CAT) to refine the corrupted matching costs by leveraging spatial context information. Specifically, the CAT adaptively aggregates matching costs according to the spatial affinity inferred from local appearance context, and produces reliable contextual costs for FBA. Moreover, we design a motion-aware regularization loss to provide supervision for regions with moving objects, making CAT competent for dynamic scenes. Extensive experiments and analyses on the KITTI and Cityscapes datasets demonstrate the effectiveness and superior generalization capability of our approach.

Abstract:
Perceptual Lossless Compression (PLC) is a novel compression standard directed by the Audio Video coding Standard (AVS) work group. It defines a lightweight, low-latency, and visually lossless image compression framework, which offers an alternative mezzanine codec for most user-agnostic on-chip compression scenarios, alleviating the tension between growing transmission demands and expensive integration upgrades. In this paper, the technical designs in the development process of PLC are fully introduced. The balance between feature modeling and ASIC implementation costs is considered throughout. A high-throughput, low-hardware-complexity implementation, named HIM, is detailed and evaluated. We hope that the design of the PLC standard and the HIM framework will bring new inspiration for emerging low-latency interaction systems.

Abstract:
Recently, many efforts have been devoted to improving the retrieval performance of supervised cross-modal hashing; however, current methods are gradually reaching a performance bottleneck, especially when dealing with real-world multimedia data. This is mainly due to their use of coarse-grained semantics, non-robust hash functions, and inflexible workflows. Therefore, discovering refined semantics hidden in data, designing robust hash functions, and creating a non-interfering but facilitative learning workflow are of great importance. With this motivation, in this paper, we propose a novel supervised cross-modal hashing method, i.e., Multiple Information Embedded Hashing, MIEH for short. It consists of a three-step workflow that flexibly handles multiple information mining, hash code learning, and hash function learning. First, it explores the multimedia data from multiple perspectives such as modal-level consistency, class-level discriminability, and instance-level similarity to mine comprehensive semantic information, which not only contributes to the generation of discriminative hash codes, but also accelerates convergence. Subsequently, MIEH embeds the refined semantics into targeted hash codes with an efficient discrete optimization algorithm. Finally, it improves the learning ability of the linear hash function by noisy example erasing and deviation correcting. As a result, MIEH obtains a more robust hash function. Extensive experiments conducted on three popular benchmark datasets highlight the superiority of our MIEH on large-scale cross-modal retrieval tasks and demonstrate its competitive performance against state-of-the-art approaches. The source code is available at https://github.com/yxinwang/MIEH.

Abstract:
Eye-tracking technology is extensively utilized in affective computing research, enabling the investigation of emotional responses through the analysis of eye movements. Integration of eye-tracking with other modalities allows for the collection of multimodal data, leading to a more comprehensive understanding of emotions and their relationship with physiological responses. This paper presents a novel head-mounted eye-tracking system for multimodal data acquisition with a completely redesigned structure and improved performance. We propose a novel, efficient, and robust pupil-fitting method based on deep learning and RANSAC, which achieves better pupil segmentation when the pupil is partially occluded, and build a 3D model to obtain gaze points. Existing eye trackers for multimodal synchronous data collection either have limited device support or suffer from significant synchronization delays. Our proposed hard real-time synchronization mechanism achieves microsecond-level latency at low cost, which facilitates multimodal analysis for affective computing research. The uniquely designed exterior effectively reduces facial occlusion, making it more comfortable for the wearer while facilitating the capture of facial expressions.

Abstract:
The autoregressive model has been widely used in learning-based image compression due to its superior context modeling capability. However, its sequential processing nature also undermines the ability of decoding in parallel and hinders the deployment in real applications. In this paper, we propose a decoupled framework to resolve this issue. With the decoupled architecture, the entropy decoding process is independent of the latent sample reconstruction process. The entropy decoding process thus can be finished before the latent sample prediction process begins, which leads to significant decoding time savings by enabling the two processes to be conducted in parallel. To further reduce the decoding time, we introduce wavefront processing, where multiple rows can be processed simultaneously when reconstructing the latent samples. On top of that, we design a series of coding tools to improve the rate-distortion efficiency and reduce the decoding complexity. Device interoperability is also supported by the proposed solution, where the same bitstream can be successfully decoded on different CPU/GPU devices. Comprehensive experiments are conducted to validate the effectiveness of the proposed method. Using objective evaluation metrics required by JPEG AI Call for Proposals (CfP), the proposed method achieves a BD-rate change of −29.6% on average with 2.44 times faster decoding speed compared to VVC image coding. When compared to the commonly used benchmark learning-based methods, the proposed method achieves −30.5% BD-rate changes and 101 times faster decoding speed over cheng2020attn. The proposed solution has been proposed to JPEG AI and IEEE 1857.11 as a response to CfP and the core techniques have been adopted by both.
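
The idea of wavefront processing can be illustrated with a small scheduling sketch. It assumes, purely for illustration, that a latent sample at row r and column c needs its left neighbour and the samples of the row above up to column c + 1, so each row can start two steps after the row above it. This dependency pattern is borrowed from the way wavefront parallel processing is commonly described for block-based video codecs; the abstract does not state the actual pattern, and the names `wavefront_schedule` and `delay` are hypothetical.

```python
def wavefront_schedule(rows, cols, delay=2):
    """Group (row, col) positions by the earliest step at which they can run.

    Assumption: position (r, c) depends on (r, c - 1) and on row r - 1 up to
    column c + delay - 1, so row r lags row r - 1 by `delay` steps.
    """
    steps = {}
    for r in range(rows):
        for c in range(cols):
            steps.setdefault(c + delay * r, []).append((r, c))
    # Positions that share a step have no mutual dependency and can be
    # reconstructed in parallel.
    return [steps[t] for t in sorted(steps)]
```

With 3 rows and 4 columns, this schedule finishes in 8 parallel steps instead of the 12 steps of strictly sequential raster-order processing, and the saving grows with the number of rows.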

Abstract:
Quantizing a floating-point neural network to its fixed-point representation is crucial for Learned Image Compression (LIC) because it improves decoding consistency for interoperability and reduces space-time complexity for implementation. Existing solutions often have to retrain the network for model quantization, which is time-consuming and impractical to some extent. This work suggests using Post-Training Quantization (PTQ) to process pretrained, off-the-shelf LIC models. We theoretically prove that minimizing the quantization-induced mean square error (MSE) of model parameters (e.g., weight, bias, and activation) in PTQ is sub-optimal for compression tasks, and thus develop a novel Rate-Distortion (R-D) Optimized PTQ (RDO-PTQ) to best retain the compression performance. Given an LIC model, RDO-PTQ determines the quantization parameters layer by layer to transform the original floating-point parameters in 32-bit precision (FP32) to fixed-point ones at 8-bit precision (INT8), for which a tiny calibration image set is compressed during optimization to minimize the R-D loss. Experiments reveal the outstanding efficiency of the proposed method on different LICs, showing the closest coding performance to their floating-point counterparts. Our method is a lightweight, plug-and-play approach that does not retrain model parameters but only adjusts quantization parameters, which is attractive to practitioners. RDO-PTQ is a task-oriented PTQ scheme, which is then extended to quantize popular super-resolution and image classification models with negligible performance loss, further evidencing the generalization of our methodology. Related materials will be released at https://njuvision.github.io/RDO-PTQ.
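
As a rough illustration of what "adjusting quantization parameters without retraining" can look like, the sketch below quantizes a list of FP32 values to symmetric INT8 and searches a small grid of scales for the one that minimizes a caller-supplied loss. This is not RDO-PTQ: the grid search, the function names, and the symmetric scheme are assumptions. The `loss_fn` hook marks the place where a task loss such as an R-D loss measured on calibration images would be plugged in; plain MSE is used here only as a stand-in, and it is precisely the criterion the abstract argues is sub-optimal.

```python
def quantize(values, scale, bits=8):
    """Symmetric uniform quantization to signed integers with clipping."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [q * scale for q in codes]


def mse(original, reconstructed):
    """Stand-in loss; a task-oriented loss would replace this."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def search_scale(values, loss_fn, bits=8, num_candidates=50):
    """Return (scale, loss) for the candidate scale with the lowest loss."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        raise ValueError("all values are zero; any scale works")
    qmax = 2 ** (bits - 1) - 1
    best = None
    for k in range(1, num_candidates + 1):
        # Candidate clipping ranges from a small fraction up to max |value|.
        scale = (max_abs * k / num_candidates) / qmax
        recon = dequantize(quantize(values, scale, bits), scale)
        loss = loss_fn(values, recon)
        if best is None or loss < best[1]:
            best = (scale, loss)
    return best
```

Only the scale changes during the search; the stored parameters themselves are never updated, which is what keeps this kind of procedure cheap compared with quantization-aware retraining.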

Abstract:
Typical Siamese-based trackers focus on the target region and pay less attention to the background area. However, the background area can provide the tracker with prior knowledge about the target surroundings. Nonetheless, since the tracker can naturally utilize the target template for localization, importing additional background knowledge requires proper design so that the background prior knowledge can be fully explored. Furthermore, introducing the entire background region is redundant. Instead, the background distractors within the region are more meaningful for the discrimination of the tracker. In this work, we propose a tracker that fully explores background prior knowledge for robust tracking. First, we present a Transformer-based scheme that explicitly and fully utilizes the background by enabling the tracker to independently exploit the background for localization. Specifically, a target-distractor independent decoder explicitly utilizes the background knowledge by making the target and the distractors independently perform fusion with the search feature. Second, we design a simple yet efficient discriminative distractor mining module to refine the background prior knowledge by replacing the whole background region with the mined background distractors. Extensive experiments demonstrate that the proposed method performs favorably against state-of-the-art trackers on nine benchmarks.

Abstract:
Multi-label zero-shot learning (MLZSL), which aims to recognize multiple unseen classes in a single image, is a more realistic and challenging task than single-label zero-shot learning (SLZSL). To adapt generative models to the MLZSL task and better recognize multiple unseen object categories in an image, this paper proposes a Transferable Generative Framework (TGF), which consists of Multi-Label Semantic Embedding Autoencoders (SEAs), a Semantic-Related Multi-Label Feature Transformation Network (FTN), and Multi-Label Feature Generation Networks (FGNs). First, SEAs adaptively encode the class-level word vectors corresponding to each sample, which may contain a different number of classes, into sample-level semantic embeddings with the same dimension. Then, FTN transforms global features extracted by a CNN pre-trained on single-label images into features that are semantic-related and more suitable for multi-label classification. Finally, FGNs generate both global and local features to better recognize the dominant and minor object categories in a multi-label image, respectively. Extensive experiments on three benchmark datasets show that TGF significantly outperforms state-of-the-art methods. Specifically, compared with the previous best generative MLZSL method (i.e., Gen-MLZSL), TGF improves the mAP of the ZSL (GZSL) task by 5.4% (6.9%), 20.5% (27.9%), and 2.4% (3.9%) on the NUS-WIDE, Open Images, and MS-COCO datasets, respectively.

Abstract:
Knowledge distillation (KD) is a technique that transfers “dark knowledge” from a deep teacher network (teacher) to a shallow student network (student). Despite significant advances in KD, existing work has not adequately mined two crucial types of knowledge: 1) the knowledge of head categories, which represents the relationship between the target category and its similar categories; our findings reveal that this highly similar (complex) knowledge is essential for improving the student’s performance; and 2) the knowledge of tail categories, which needs to be utilized effectively; existing studies often treat the non-target categories collectively without sufficiently considering the effectiveness of knowledge from tail categories. To tackle these challenges, we reformulate classical KD (ReKD) into two components: Top-K Inter-class Similar Distillation (TISD) and Non-Top-K Inter-class Discriminability (NTID). First, TISD captures and imparts the knowledge of head categories to the student. Our experimental results have verified that TISD is particularly effective in transferring the knowledge of head categories, even in fine-grained dataset classification. Second, we theoretically show that the weighting coefficient of NTID increases with the probability of Top-K, leading to stronger suppression of knowledge transfer for tail categories. This observation explains why difficult samples are more informative than simple ones. To better utilize both types of knowledge, we optimize TISD and NTID using different weighting coefficients, thereby enhancing the student’s ability to learn this valuable knowledge from both head and tail categories. Furthermore, our extensive experimental results demonstrate that ReKD achieves state-of-the-art performance on various image classification datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet-1K, as well as on object detection and instance segmentation using the MS-COCO dataset.
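
For context, the classical KD objective that this abstract reformulates is the temperature-softened KL divergence between teacher and student predictions. The sketch below implements that classical loss only, not ReKD. The helper `top_k_split`, which separates categories by teacher probability, is a hypothetical illustration of what a head/tail partition could look like; how the paper actually defines and weights the two groups is not stated in the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Classical KD loss: T^2 * KL(teacher || student) on softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def top_k_split(probs, k):
    """Hypothetical helper: indices of the k most probable classes, and the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k], order[k:]
```

The loss is zero when the student reproduces the teacher's softened distribution exactly. Treating the two index groups returned by `top_k_split` with separate weights is the kind of decoupling the abstract describes at a high level.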

Abstract:
This paper focuses on fisheye image rectification. Existing learning-based solutions learn image representations that mix distortion features and content features. Since the distortion feature dominates the rectification process, we propose a novel distortion-aware representation learning framework, which decouples the distortion feature from the content feature, for fisheye image rectification. Specifically, we first pre-train a Vision Transformer with a supervised pre-text task, which regresses the distortion distribution map of a distorted image. The pre-training equips the Vision Transformer with the ability to capture distortion-related patterns. After that, the pre-trained model is fine-tuned to predict the pixel-wise flow map to rectify the fisheye images. Extensive experiments are conducted to evaluate our approach and verify our idea of feature decoupling. The experiment results demonstrate the state-of-the-art performance of our approach compared to existing algorithms, as well as its generality on real-world images. Our source code is publicly available at https://github.com/lzk9508/DaFIR.

Abstract:
Video question answering aims to provide correct answers given complex videos and related questions, posing high demands on comprehension ability in both video and language processing. Existing works frame this task as a multi-modal fusion process that aligns the video context with the whole question, ignoring the rich semantic details carried separately by nouns and verbs in the multi-modal reasoning process that derives the final answer. To fill this gap, in addition to the semantic alignment of the whole sentence, we propose to disentangle the semantic understanding of language and to reason over the corresponding frame-level and motion-level video features. We design a unified multi-granularity language module with a residual structure to adapt the semantic understanding at different granularities, e.g., word level and sentence level, with context exchange. To enhance the holistic question understanding for answer prediction, we also design a contrastive sampling approach that selects irrelevant questions as negative samples to break the intrinsic correlations between questions and answers within the dataset. Notably, our model is competent for both multiple-choice and open-ended video question answering. We further employ a pre-trained language model to retrieve relevant knowledge as candidate answer context to facilitate open-ended VideoQA. Extensive quantitative and qualitative experiments on four public datasets (NextQA, MSVD, MSRVTT, and TGIF-QA-R) demonstrate the effectiveness and superior performance of our proposed model. Our code will be released upon the paper’s acceptance.

Abstract:
Coded Aperture Snapshot Spectral Imaging (CASSI) reconstruction aims to recover the 3D spatial-spectral signal from 2D measurement. Existing methods for reconstructing Hyperspectral Image (HSI) typically involve learning mappings from a 2D compressed image to a predetermined set of discrete spectral bands. However, this approach overlooks the inherent continuity of the spectral information. In this study, we propose an innovative method called Spectral-wise Implicit Neural Representation (SINR) as a pioneering step toward addressing this limitation. SINR introduces a continuous spectral amplification process for HSI reconstruction, enabling spectral super-resolution with customizable magnification factors. To achieve this, we leverage the concept of implicit neural representation. Specifically, our approach introduces a spectral-wise attention mechanism that treats individual channels as distinct tokens, thereby capturing global spectral dependencies. Additionally, our approach incorporates two components, namely a Fourier coordinate encoder and a spectral scale factor module. The Fourier coordinate encoder enhances the SINR’s ability to emphasize high-frequency components, while the spectral scale factor module guides the SINR to adapt to the variable number of spectral channels. Notably, the SINR framework enhances the flexibility of CASSI reconstruction by accommodating an unlimited number of spectral bands in the desired output. Extensive experiments demonstrate that our SINR outperforms baseline methods. By enabling continuous reconstruction within the CASSI framework, we take the initial stride toward integrating implicit neural representation into the field.
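
A Fourier coordinate encoding, in its generic form, maps a continuous coordinate to sines and cosines at several frequencies so that a network can represent high-frequency variation. The sketch below shows that generic construction for a single spectral coordinate in [0, 1]. The doubling frequency schedule, the function names, and the band-centre coordinates are assumptions made for illustration; they are not taken from SINR.

```python
import math


def fourier_encode(coord, num_frequencies=4):
    """Encode a scalar coordinate as [sin, cos] pairs at doubling frequencies."""
    features = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * coord))
        features.append(math.cos(freq * coord))
    return features


def band_coordinates(num_bands):
    """Hypothetical helper: centres of num_bands equal bins on [0, 1]."""
    return [(i + 0.5) / num_bands for i in range(num_bands)]
```

Because the coordinate is continuous, the same encoder can be queried for any number of bands, for example `[fourier_encode(c) for c in band_coordinates(28)]` or with a larger band count, which is the property that makes continuous spectral reconstruction possible in principle.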

Abstract:
Endoscopic images captured in the low-light, enclosed intestinal environment usually have poor visibility (manifested as uneven illumination and noise), affecting the work efficiency of physicians and the accuracy of lesion detection. To improve the image quality, the literature has reported many low-light image enhancement (LIE) methods. However, most methods do not perform well on the low-light endoscopic image enhancement (LEIE) task, usually introducing additional artifacts or amplifying noise. In this paper, we propose a novel deep pyramid enhancement network (DPENet) to enhance endoscopic images from both global and local perspectives. Specifically, considering the uneven illumination of endoscopic images, DPENet utilizes an image pyramid framework with three parallel branches to explore and integrate both global and local features at different scales. To suppress noise, DPENet places multiple scale-space feature extraction blocks (SFEBs) in each branch. Each SFEB consists of a contextual feature extraction module (CFEM) and a spatial residual attention module (SRAM). CFEM mines contextual information to help the network understand semantic information while suppressing isolated noise. SRAM leverages the spatial attention mechanism to help the network adaptively focus on dim regions. Experimental results on a public dataset and our collected dataset show that DPENet is competent for the LEIE task with promising results, and outperforms 9 state-of-the-art LIE methods in both qualitative and quantitative aspects.

Abstract:
Triggered by the success of transformers in various visual tasks, the spatial self-attention mechanism has recently attracted more and more attention in the computer vision community. However, we empirically found that a typical vision transformer with the spatial self-attention mechanism could not learn accurate attention maps for distinguishing different categories of fine-grained images. To address this problem, motivated by the temporal attention mechanism in brains, we propose a hierarchical attention network for learning fine-grained feature representations, called HAN, where the features learnt by implementing a sequence of spatial self-attention operations corresponding to multiple moments are aggregated progressively. The proposed HAN consists of four modules: a self-attention backbone module for learning a sequence of features with self-attention operations, a spatial feature self-organizing module for facilitating the model training, a hierarchical aggregation module for aggregating the re-organized features via a Long Short-Term Memory network, and a context-aware module that is implemented as the forget block of the hierarchical aggregation module for preserving/forgetting the long-term memory by utilizing contextual information. Then, we propose a HAN-based method for open-set fine-grained recognition by integrating the proposed HAN network with a linear classifier, called HAN-OSFGR. Extensive experimental results on 3 fine-grained datasets and 2 coarse-grained datasets demonstrate that the proposed HAN-OSFGR outperforms 9 state-of-the-art open-set recognition methods significantly in most cases.

Abstract:
High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Network (DNN) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting problem as an image generation task that leverages LDR features as the diffusion model’s condition; the model consists of a feature condition generator and a noise predictor. The feature condition generator employs attention and a Domain Feature Alignment (DFA) layer to transform the intermediate features to avoid ghosting artifacts. With the learned features as conditions, the noise predictor leverages the stochastic iterative denoising process of diffusion models to generate an HDR image by steering the sampling process. Furthermore, to mitigate semantic confusion caused by the saturation problem of LDR images, we design a sliding window noise estimator to sample smooth noise in a patch-based manner. In addition, an image space loss is proposed to avoid color distortion in the estimated HDR results. We empirically evaluate our model on benchmark datasets for HDR imaging. The results demonstrate that our approach achieves state-of-the-art performance and generalizes well to real-world images.

Abstract:
Traffic accident detection and anticipation is a stubborn road safety problem to which painstaking efforts have been devoted. With the rapid growth of video data, Vision-based Traffic Accident Detection and Anticipation (named Vision-TAD and Vision-TAA) have become the last-mile problem for safe driving and surveillance safety. However, the long-tailed, unbalanced, highly dynamic, complex, and uncertain properties of traffic accidents give Vision-TAD and Vision-TAA an Out-of-Distribution (OOD) character. Current AI development may focus on these OOD but important problems. What has been done for Vision-TAD and Vision-TAA? What direction should we focus on in the future for this problem? A comprehensive survey is important. We present the first survey on Vision-TAD in the deep learning era and the first-ever survey for Vision-TAA. The pros and cons of each research prototype are discussed in detail during the investigation. In addition, we provide a critical review of 31 publicly available benchmarks and related evaluation metrics. Through this survey, we hope to spark new insights and open up possible trends for Vision-TAD and Vision-TAA tasks.

Abstract:
Multi-view representation learning aims to extract comprehensive information from multiple sources. It has achieved significant success in applications such as video understanding and 3D rendering. However, how to improve the robustness and generalization of multi-view representations from unsupervised and incomplete scenarios remains an open question in this field. In this study, we discovered a positive correlation between the semantic distance of multi-view representations and the tolerance for data corruption. Moreover, we found that the information ratio of consistency and complementarity significantly impacts the performance of discriminative and generative tasks related to multi-view representations. Based on these observations, we propose an end-to-end CLustering-guided cOntrastiVE fusioN (CLOVEN) method, which enhances the robustness and generalization of multi-view representations simultaneously. To balance consistency and complementarity, we design an asymmetric contrastive fusion module. The module first combines all view-specific representations into a comprehensive representation through a scaling fusion layer. Then, the information of the comprehensive representation and view-specific representations is aligned via a contrastive learning loss function, resulting in a view-common representation that includes both consistent and complementary information. We prevent the module from learning suboptimal solutions by not allowing information alignment between view-specific representations. We design a clustering-guided module that encourages the aggregation of semantically similar views. This action reduces the semantic distance of the view-common representation. We quantitatively and qualitatively evaluate CLOVEN on five datasets, demonstrating its superiority over 13 other competitive multi-view learning methods in terms of clustering and classification performance. In the data-corrupted scenario, our proposed method resists noise interference better than competitors. Additionally, the visualization demonstrates that CLOVEN succeeds in preserving the intrinsic structure of view-specific representations and improves the compactness of view-common representations. Our code can be found at https://github.com/guanzhou-ke/cloven.

Abstract:
Artistic style transfer aims to transfer the style of an artwork to a photograph while maintaining its original overall content. Many prior works focus on designing various transfer modules to transfer the style statistics to the content image. Although effective, they do not clearly disentangle the content features and the style features from the very beginning, and therefore have difficulty balancing content preservation and style transfer. To tackle this problem, we propose a novel information disentanglement method, named InfoStyler, to capture the minimal sufficient information for both content and style representations from the pre-trained encoding network. InfoStyler formulates disentangled representation learning as an information compression problem by eliminating style statistics from the content image and removing the content structure from the style image. Besides, to further facilitate disentanglement learning, a cross-domain Information Bottleneck (IB) learning strategy is proposed by reconstructing the content and style domains. Extensive experiments demonstrate that our InfoStyler can synthesize high-quality stylized images while balancing content structure preservation and style pattern richness.

Abstract:
Image matting is an active research area in computer vision, and various trimap-free methods have been proposed to improve its performance. However, these methods do not consider the gap between composited and real-world images, resulting in limited generalization ability. To address this issue, we propose a domain alignment (DA) module that consists of local region-wise alignment (LRA) and global harmonious alignment (GHA). The LRA aligns the most diverse pixels in the transparent regions of the foreground between composited and real images. The GHA, in turn, aligns the global image harmonization for both composited and real images, which helps the network choose the appropriate semantics for real harmonious images. Additionally, we design a transformer-based network with a dynamic attention pruning (DAP) mechanism to accurately locate domain-sensitive regions, allowing the DA module to work more effectively. Furthermore, we introduce a new dataset, the Real-world Matting Dataset (RM-1k), to advance the real-world matting task. Our proposed method is evaluated on two composited benchmarks (Composite-1k and Distinctions-646) and two real-world datasets (AIM-500 and RM-1k), and the results show that our method achieves robust performance on both composited and real-world images.

Abstract:
Event cameras are bio-inspired dynamic vision sensors that are superior to frame-based cameras in terms of low power consumption, high dynamic range, and high temporal resolution in computer vision tasks. Recent advances in voxel-based representation learning have successfully exploited the sparsity of events with low computational complexity, but face challenges in extracting spatio-temporal features within voxels and representative global dependencies between voxels, thus limiting their representation power. In this work, towards a better trade-off between accuracy and computation overhead, we propose a novel voxel-based multi-scale transformer network (VMST-Net) to process event streams. Specifically, VMST-Net projects events within voxels into multi-channel frames along the time axis, such that 2D convolutions could be leveraged to encode spatio-temporal features in voxels. Then, VMST-Net utilizes a novel multi-scale multi-head self-attention (MSMHSA) mechanism with a multi-scale fusion (MSF) module that allows different heads within each layer to attend different scale 3D neighborhoods to adaptively aggregate the coarse-to-fine voxel features with little computational costs and parameters. Moreover, to model effective global features while saving computations, we aggregate features in a local-to-global manner by enlarging the coverage of 3D neighborhoods as the network gets deeper. Extensive experimental results on benchmark datasets demonstrate that our model advances state-of-the-art accuracy with low model complexity and computational complexity in all three visual tasks, including object classification, action recognition, and human pose estimation.

Abstract:
Few-shot object detection (FSOD) aims to detect novel objects with limited annotated examples. Mainstream methods suffer from the data scarcity of novel classes with insufficient intra-class variations, which biases the trained model toward base classes. In fact, there are massive unlabeled novel instances in the base dataset, and using them adequately will enhance the model's discriminability on novel classes. This paper proposes a semi-supervised few-shot object detection method, which utilizes a teacher model and a pre-trained few-shot object detector to guide the learning of a student model through adaptive pseudo labeling. In particular, a class-adaptive threshold filtering (CATF) strategy is designed to deal with the class-imbalance problem of pseudo labels. For each novel class, the threshold used to select valuable pseudo labels is determined by quantile statistics of the confidence score distribution of pseudo labels. Furthermore, the pre-trained detector and the teacher model are associated with the preliminary CATF and in-depth CATF, respectively, and the pseudo labels from the two-stream CATF are then fused to provide supervision. In this way, the knowledge of these two models is exploited, which improves the quality of pseudo labels. Under this supervision, the student model is trained and the teacher model is correspondingly updated through parameter sharing, thus forming a positive feedback loop that improves the performance of both models. Besides, an attention module is integrated into the teacher and student models to enhance the feature representation of novel instances. The validations on PASCAL VOC and MS COCO show the effectiveness of the proposed method.
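
The class-adaptive thresholding described above can be sketched as follows. This is a minimal illustration of per-class quantile thresholds, not the paper's CATF implementation: the quantile level `q`, the helper names `quantile` and `class_adaptive_filter`, and the toy scores are hypothetical.

```python
from collections import defaultdict


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def class_adaptive_filter(pseudo_labels, q=0.5):
    """Keep pseudo labels whose confidence reaches their class's q-quantile.

    pseudo_labels: list of (class_name, confidence) pairs.
    q: hypothetical hyperparameter, not a value taken from the paper.
    """
    scores = defaultdict(list)
    for cls, conf in pseudo_labels:
        scores[cls].append(conf)
    thresholds = {cls: quantile(vals, q) for cls, vals in scores.items()}
    kept = [(cls, conf) for cls, conf in pseudo_labels if conf >= thresholds[cls]]
    return kept, thresholds


# Toy scores: "bus" detections are systematically less confident than "bird".
labels = [("bird", 0.9), ("bird", 0.7), ("bird", 0.5),
          ("bus", 0.3), ("bus", 0.2), ("bus", 0.1)]
kept, thresholds = class_adaptive_filter(labels, q=0.5)
```

With a single global threshold of 0.5, every "bus" pseudo label in this toy set would be discarded; the per-class quantile keeps the most confident half of each class instead, which is the class-imbalance effect the strategy targets.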

Abstract:
Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.

Abstract:
Recovering a dense depth map from a pair of indoor RGB and sparse depth images in an unsupervised manner is paramount in applications such as autonomous driving and 3D reconstruction. Most existing methods leverage sparse depth maps to directly estimate the dense depth map with pixel-wise regression constraints over the known input depth. However, such regression constraints compare per-pixel depth values independently, which ignores the important 3D structures hidden behind depth maps and results in severe structural distortion and poor robustness. In this paper, we propose a Structure-Preserving Encoding (SPE) module by reformulating depth completion as the process of 3D structure generation. The generated structure should recover the complete scene and also be consistent with the known partial structure, so that the depth features learned from this task encode rich structural information. In addition, SPE hierarchically interpolates and propagates the 3D structures into dense structure-aware positional encodings, which further boosts the information interactions between RGB and depth features via our transformer. Extensive experiments on VOID and NYUv2 demonstrate that SPTR outperforms the state-of-the-art methods by a large margin across various densities of input depths and shows strong generalization ability to other datasets.

Abstract:
We propose a minimal solution for sphere-based camera-projector pair (CPP) calibration. Previous works often treated camera and projector calibration as two independent problems; they exploit only intra-view information from the geometric properties of sphere dual image formation and hence require at least three spheres for CPP calibration. However, beyond intra-view information, we observe that inter-view information between camera and projector provides additional constraints. Combining these two kinds of information yields a minimal solution for CPP calibration, where only a single sphere is required. Extensive experiments have verified the effectiveness of the proposed minimal solver, which demonstrates higher flexibility than, and comparable accuracy to, the state-of-the-art methods. Moreover, the achieved flexibility allows high-quality 3D reconstruction with an uncalibrated CPP, given only a single sphere in the scene.

Abstract:
Images captured under low-light conditions suffer from low visibility and various imaging artifacts, e.g., real noise. Existing supervised algorithms for low-light image enhancement require a large set of pixel-aligned training image pairs, which are hard to prepare in practice. Though some recent unsupervised methods can alleviate such data challenges, many real-world artifacts inevitably get falsely amplified in the enhanced results due to the lack of corresponding supervision. In this paper, instead of using perfectly aligned images for training, we employ misaligned real-world images as guidance, which are considerably easier to collect. Specifically, we propose a Cross-Image Disentanglement Network (CIDN) with weakly supervised learning, to separately extract cross-image brightness and image-specific content features from low/normal-light images. Based on that, CIDN can simultaneously correct the brightness and suppress image artifacts in the feature domain, which largely increases the robustness to pixel shifts between training pairs. By considering real-world corruptions, we propose a new training dataset with misaligned and noisy image pairs and its corresponding evaluation dataset. Experimental results show that our model achieves state-of-the-art performance on both the newly proposed dataset and other popular low-light datasets. The code implementation is publicly available at: https://github.com/GuoLanqing/CIDN.

Abstract:
Underwater images suffer from quality degradation due to underwater light absorption and scattering. It remains challenging to enhance underwater images using deep learning-based methods because of the scarcity of real-world underwater images and their enhanced counterparts. Although existing works manually select well-enhanced images as reference images to train enhancement networks in an end-to-end manner, their performance tends to be inferior in some scenarios. We argue that the manually selected reference images cannot approximate their ground truth perfectly, leading to imbalanced learning and domain shift in enhancement networks. To address this issue, we analyse widely used underwater datasets from the perspective of color spectrum distribution and surprisingly find the sound color spectrum distribution of the enhanced reference images compared to in-air datasets. Based on this observation, instead of directly learning the enhancement mapping, we propose a novel methodology to learn color compensation for general purposes. Specifically, we present a probabilistic color compensation network that estimates the probabilistic distribution of colors by multi-scale volumetric fusion of texture and color features. We further propose a novel two-stage enhancement framework that first performs color compensation and then enhancement, and that can be flexibly integrated with an existing enhancement method without tuning. Extensive experiments on underwater image enhancement across various challenging scenarios show that our proposed approach consistently improves the results of popular conventional and learning-based methods by a significant margin. Moreover, our enhanced images achieve superior performance on underwater salient object detection and visual 3D reconstruction, demonstrating that our method can successfully break through the generalization bottleneck of existing learning-based enhancement models. Our implementation will be made available at https://github.com/Ray2OUC/P2CNet.

Abstract:
Deep learning-based image restoration methods trained on synthetic datasets have witnessed notable progress, but suffer from significant performance drops on real-world images due to huge domain shifts. To alleviate this issue, some recent methods strive to improve the generalization ability of models with unpaired training. However, these solutions typically handle each problem individually and ignore the shared physical properties of different harsh scenarios, i.e., heavy rain, hazy and low-light images degrade more densely with increasing scene depth. Such limitations make them generalize poorly to real-world images. In this paper, we propose a novel Physically Oriented Generative Adversarial Network (POGAN) for unpaired image restoration with depth-density priors. Specifically, our POGAN consists of two core designs: Physical Restoration Network (PRNet) and Degradation Rendering Network (DRNet). The former focuses on estimating the physical components related to the depth and density distribution for restoration, while the latter re-renders degradation effects guided by the estimated depth information. To further facilitate learning the above physical prior, we design a Spatial-Frequency Interaction Residual block (SFIR), which efficiently learns global frequency information and local spatial features in an interactive manner. Extensive experiments on synthetic and real-world datasets demonstrate the superiority of our method in heavy rain, haze, and low-light scenarios.

Abstract:
Maximum margin clustering (MMC) is a typical clustering method which aims to maximize the margin between different clusters. However, in practical applications, a data object may be represented by multiple feature sets (views), with each feature set representing different information of the underlying data. The traditional MMC methods can handle only the data from a single view and are unable to utilize the multi-view data to enhance the clustering model. In multi-view clustering, there are two basic principles: the consensus principle and the complementarity principle. Most multi-view clustering methods implement mainly the consensus principle, while the complementarity principle has not been sufficiently taken into account. Distinguished from the existing methods, M³CP introduces the idea of privileged information learning into multi-view clustering and implements both the consensus principle and the complementarity principle. Based on privileged information learning, M³CP embodies the complementarity principle by considering one view as the main learning information and the other views as the privileged information, so that multiple views can provide information to complement each other. The derived learning problem is then solved by applying the constrained concave–convex procedure and cutting plane techniques. By employing these techniques, the computational time of M³CP is able to scale linearly with respect to the dataset size. Numerical experiments on real-life multi-view datasets demonstrate that M³CP is able to achieve better clustering accuracy and meanwhile needs less computational time, compared to state-of-the-art multi-view clustering methods.

Abstract:
Most existing methods for over-exposure in image correction are developed based on sRGB images, which can result in complex and non-linear degradation due to the image signal processing pipeline. By contrast, data-driven approaches based on RAW image data offer natural advantages for image processing tasks. RAW images, characterized by their near-linear correlation with scene radiance and enriched information content due to higher bit depth, demonstrate superior performance compared to sRGB-based techniques. Further, the spectral sensitivity characteristics intrinsic to digital camera sensors indicate that the blue and red channels in a Bayer pattern RAW image typically encompass more contextual information than the green channels. This property renders them less susceptible to over-exposure, thereby making them more effective for data extraction in high dynamic range scenes. In this paper, we introduce a Channel-Guidance Network (CGNet) that leverages the benefits of RAW images for over-exposure correction. The CGNet estimates the properly-exposed sRGB image directly from the over-exposed RAW image in an end-to-end manner. Specifically, we introduce a RAW-based channel-guidance branch to the U-net-based backbone, which exploits the color channel intensity prior of RAW images to achieve superior over-exposure correction performance. To further facilitate research in over-exposure correction, we present synthetic and real-world over-exposure correction benchmark datasets. These datasets comprise a large set of paired RAW and sRGB images across a variety of scenarios. Experiments on our RAW-sRGB datasets validate the advantages of our RAW-based channel guidance strategy and proposed CGNet over state-of-the-art sRGB-based methods on over-exposure correction. Our code and dataset are publicly available at https://github.com/whiteknight-WJN/CGNet.

Abstract:
Video summarization, with the target to detect valuable segments given untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which might be insufficient to pinpoint important content due to limited valuable information that can be learned. To address this limitation, we elaborate on a Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning tasks. VSS-Net initially adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE). They are jointly exploited to establish the groundwork for further boosting the capability of content awareness. Specifically, CRG is constructed using an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from temporal order and visual relationship perspectives. Meanwhile, by learning visual similarity across features, VSE adaptively acquires an instructive video-level semantic representation of the input video from coarse to fine. Subsequently, the two streams converge in a Context-Semantics Interaction Layer (CSIL) to achieve sophisticated information exchange across frame-level temporal cues and video-level semantic representation, guaranteeing informative representations and boosting the sensitivity to important segments. Eventually, importance scores are predicted utilizing a prediction head, followed by key shot selection. We evaluate the proposed framework and demonstrate its effectiveness and superiority against state-of-the-art methods on the widely used benchmarks.

Abstract:
Optical-flow-based and kernel-based approaches have been extensively explored for temporal compensation in satellite Video Super-Resolution (VSR). However, these techniques are less generalized in large-scale or complex scenarios, especially in satellite videos. In this paper, we propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically modeled the short-term and long-term temporal discrepancies since we observe that these discrepancies offer distinct and mutually complementary properties. Specifically, we devise a Short-term Temporal Difference Module (S-TDM) to extract local motion representations from RGB difference maps between adjacent frames, which yields more clues for accurate texture representation. To explore the global dependency in the entire frame sequence, a Long-term Temporal Difference Module (L-TDM) is proposed, where the differences between forward and backward segments are incorporated and activated to guide the modulation of the temporal feature, leading to a holistic global compensation. Moreover, we further propose a Difference Compensation Unit (DCU) to enrich the interaction between the spatial distribution of the target frame and temporal compensated results, which helps maintain spatial consistency while refining the features to avoid misalignment. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches. Code will be available at https://github.com/XY-boy/LGTD.
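
To make the short-term temporal difference concrete, the sketch below computes per-pixel difference maps between adjacent frames. It is only an illustration of the input that a module such as S-TDM would consume: the frames here are single-channel nested lists rather than RGB tensors, and the function names are hypothetical.

```python
def frame_difference(frame_a, frame_b):
    """Signed per-pixel difference frame_b - frame_a for equally sized 2D frames."""
    return [
        [b - a for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def short_term_differences(frames):
    """Difference maps between every pair of adjacent frames in a sequence."""
    return [frame_difference(frames[t], frames[t + 1])
            for t in range(len(frames) - 1)]


# Toy 2x2 sequence: a bright pixel moves from top-right to bottom-left.
frames = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
]
diffs = short_term_differences(frames)
```

Non-zero entries in each map mark where content changed between neighboring frames, which is why such maps carry local motion cues without an explicit optical-flow estimate.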

Abstract:
Image-text matching is a fundamental task to bridge vision and language. The critical challenge lies in accurately learning the semantic similarity between these two heterogeneous modalities. For visual and textual features, existing methods typically default to a static dimensional correspondence mechanism, i.e., using a single dimension as the measure-unit to perform one-to-one correspondence, to examine semantic similarity, e.g., the cosine/Euclidean distance or the weighted similarity. In this paper, different from the single-dimensional correspondence with limited semantic expressive capability, we propose a novel enhanced semantic similarity learning (ESL), which generalizes both measure-units and their correspondences into a dynamic learnable framework to examine the multi-dimensional enhanced correspondence between visual and textual features. Specifically, we first devise the intra-modal multi-dimensional aggregators with iterative enhancing mechanism, which dynamically captures new measure-units integrated by hierarchical multi-dimensions, producing diverse semantic combinatorial expressive capabilities to provide richer and discriminative information for similarity examination. Then, we devise the inter-modal enhanced correspondence learning with sparse contribution degrees, which comprehensively and efficiently determines the cross-modal semantic similarity. Extensive experiments verify its superiority in achieving state-of-the-art performance. Codes will be released at https://github.com/CrossmodalGroup/ESL.

Abstract:
Supervised cross-modal hashing has received wide attention in recent years. However, existing methods primarily rely on sample-wise semantic relationships to evaluate the semantic similarity between samples, overlooking the impact of label distribution on enhancing retrieval performance. Moreover, the limited representation capability of traditional dense hash codes hinders the preservation of semantic relationships. To overcome these challenges, we propose a new method, Joint Semantic Preserving Sparse Hashing (JSPSH). Specifically, we introduce a new concept of cluster-wise semantic relationship, which leverages label distribution to indicate which samples are more suitable for clustering. Then, we jointly utilize sample-wise and cluster-wise semantic relationships to supervise the learning of hash codes. In this way, JSPSH preserves both kinds of semantic relationships to ensure that more samples with similar semantics are clustered together, thereby achieving better retrieval results. Furthermore, we utilize high-dimensional sparse hash codes, which offer stronger representation capability, to preserve these more complex semantics. Finally, an interaction term is introduced in the hash function learning stage to further narrow the gap between modalities. Experimental results on three large-scale datasets demonstrate the effectiveness of JSPSH in achieving superior retrieval performance.

Abstract:
Multispectral pedestrian detection is an important task due to its critical role in a wide spectrum of applications. Basically, the complementary information from color and thermal images could provide a more accurate and reliable pedestrian detection result. However, multimodal data usually suffer from the issue of dynamic change or corruption for some modalities. At the same time, as a safety-critical task, how to produce a stable and reliable detection result is also a key challenge. To address these challenges, we propose a stable multispectral pedestrian detection (SMPD) algorithm, providing a new paradigm for multispectral detection by dynamically integrating different modalities at an evidence level. Specifically, we introduce the Dirichlet distribution to characterize the distribution of the class probabilities, parameterized with evidence from different modalities. Then, multi-branch fusion, based on Dempster-Shafer theory, can integrate these pieces of evidence to obtain the detection result. In addition, a Plug-and-Play module, termed modal enhancement module, is introduced to enhance cross-modality interaction. This is an end-to-end framework, which can induce accurate detection and uncertainty estimation, and then endows the model with both reliability and robustness against noise or corruption. Extensive experimental results demonstrate the efficiency of our algorithm compared with state-of-the-art methods.
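
The evidence-level fusion described above can be sketched with a small example. The abstract does not give the exact combination rule, so the code assumes the reduced Dempster's rule commonly used in evidential multi-view learning, with Dirichlet parameters alpha_k = e_k + 1; the function names and the toy evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to belief masses and an uncertainty mass.

    Assumes Dirichlet parameters alpha_k = e_k + 1 with strength S = sum(alpha),
    so belief_k = e_k / S and uncertainty = K / S.
    """
    k = len(evidence)
    strength = sum(evidence) + k
    belief = [e / strength for e in evidence]
    uncertainty = k / strength
    return belief, uncertainty


def combine(op1, op2):
    """Reduced Dempster's rule for two opinions over the same set of classes."""
    b1, u1 = op1
    b2, u2 = op2
    n = len(b1)
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) for i in range(n)]
    uncertainty = scale * u1 * u2
    return belief, uncertainty


# Two classes (pedestrian, background); hypothetical evidence per modality.
color = opinion_from_evidence([8.0, 0.0])    # confident color branch
thermal = opinion_from_evidence([0.0, 0.0])  # corrupted branch: no evidence
fused = combine(color, thermal)
```

In this toy case the thermal branch carries no evidence, so its uncertainty mass is 1 and the fused opinion equals the color opinion. That is the behaviour that makes evidence-level fusion attractive when one modality is corrupted.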

Abstract:
Subtle variations in human physiological signals that are invisible to the naked eye can reflect important biological and health indicators. Although numerous computer vision methods have been proposed to recover and magnify these changes, most of them either focus only on identifying and recognizing explicit features such as shapes and textures, or are weak in long-term temporal modeling and spatiotemporal interactive perception of implicit biometrics. Therefore, it is difficult for them to robustly overcome the various disturbances that affect detection performance. To address these issues, this paper presents TranPhys, a novel remote photoplethysmography (rPPG) network for facial video-based heart rate estimation. Specifically, first, we argue that facial subregions vary over time due to their biological personalities. We therefore split the input face video into multiple spatiotemporal tubes, build a 3D vision transformer with encoders and decoders to adequately model the high-dimensional representations of the respective regulars in each subregion, and globally coordinate their feedback on the cardiac pulsing waveform. Second, we design temporal pooling attention to more finely mine the subtle changes hidden in the skin color over time and their long-term contextual rhythm cues. Third, we leverage the self-supervised masked autoencoding paradigm to overcome redundancy and enhance the robustness of our model, and construct targeted spatiotemporal sampling maps, instead of raw input sequences, as the pretraining constraint labels to fully exploit self-supervision. We train, validate, and test TranPhys on multiple public datasets to demonstrate that our method achieves competitive performance in remote heart rate estimation.

Abstract:
Recently, deep hashing-based cross-modal retrieval has attracted much attention from researchers due to its fast retrieval and low storage overhead. However, existing deep hashing-based cross-modal retrieval methods typically 1) inadequately capture the semantic relevance and coexistent information of cross-modal data, which may result in sub-optimal retrieval performance, 2) require a more comprehensive similarity measurement for cross-modal features to ensure high retrieval accuracy, and 3) lack scalability for lightweight deployment frameworks. To handle these issues, we propose CLIP-based knowledge distillation hashing (CKDH) for cross-modal retrieval, following the research trend of combining traditional methods with modern neural architectures to design lightweight networks based on large language models. Specifically, to help capture the semantic relevance and coexistent information, CLIP is fine-tuned to extract visual features, while a graph attention network is used to enhance the textual features extracted by a bag-of-words model in the teacher model. Then, to better supervise the training of the student model, a more comprehensive similarity measurement is introduced to represent the distilled knowledge by jointly preserving the log-likelihood and the intra- and inter-modality similarities. Finally, the student model extracts deep features with a lightweight network and generates the hash codes under the supervision of the similarity matrix produced by the teacher model. Experimental results on three widely used datasets demonstrate that CKDH outperforms several state-of-the-art methods, consistently delivering the best results.

Abstract:
Video-text cross-modal retrieval (VTR) is more natural and challenging than image-text retrieval, which has attracted increasing interest from researchers in recent years. To align VTR more closely with real-world scenarios, i.e., weak semantic text description as a query, we propose a multilevel semantic interaction alignment (MSIA) model. We develop a two-stream network, which decomposes video and text alignment into multiple dimensions. Specifically, in the video stream, to better align heterogeneity data, redundant video information is suppressed via the designed frame adaptation attention mechanism, and richer semantic interaction is achieved through a text-guided attention mechanism. Then, for text alignment in the video local region, we design a distinctive anchor frame strategy and a word selection method. Finally, a cross-granularity alignment approach is designed to learn more and finer semantic features. With the above schema, the alignment between video and weak semantic text descriptions is reinforced, further alleviating the issues of difficult alignment caused by weak semantic text descriptions. The experimental results on VTR benchmark datasets show the competitive performance of our approach in comparison to that of state-of-the-art methods. The code is available at: https://github.com/jiaranjintianchism/MSIA.

Abstract:
Online Continual Learning (OCL), as a core step towards achieving human-level intelligence, aims to incrementally learn and accumulate novel concepts from streaming data that can be seen only once, while alleviating catastrophic forgetting of previously acquired knowledge. In this setting, the model needs to learn new classes or tasks in an online manner, and the data distribution may change over time. Moreover, task boundaries and identities are not available during training and evaluation. To balance the stability and plasticity of networks, in this work we propose a replay-based framework for OCL, named Contrastive Correlation Preserving Replay (CCPR), which focuses not only on instances but also on correlations between multiple instances. Specifically, besides the previous raw samples, the corresponding representations are stored in the memory and used to construct correlations for the past and the current model. To better capture correlations and higher-order dependencies, we maximize the lower bound of the mutual information between the past correlation and the current correlation by leveraging contrastive objectives. Furthermore, to improve performance, we propose a new memory update strategy, which simultaneously encourages the balance and diversity of samples within the memory. With limited memory slots, it keeps less redundant and more representative samples for later replay. We conduct extensive evaluations on several popular CL datasets, and experiments show that our method consistently outperforms state-of-the-art methods and can effectively consolidate knowledge to alleviate forgetting.
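
For readers unfamiliar with contrastive bounds on mutual information, the widely used InfoNCE bound is one way such an objective can be written. This is offered only as an illustration: the abstract does not give the exact objective used by CCPR, so the symbols below are assumptions. Here $x$ stands for the current correlation, $y$ for the matching past correlation, $y_1, \dots, y_N$ for a candidate set containing $y$ and $N-1$ negatives, and $f$ for a positive scoring function.

```latex
I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{NCE}},
\qquad
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\!\left[ \log \frac{f(x, y)}{\sum_{j=1}^{N} f(x, y_j)} \right]
```

Minimizing the contrastive loss $\mathcal{L}_{\mathrm{NCE}}$ therefore raises a lower bound on the mutual information, and the bound becomes tighter as the number of candidates $N$ grows.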

Abstract:
Weakly-supervised temporal action localization (WTAL) aims to localize and classify action instances in untrimmed videos with only video-level labels available. Despite the remarkable success of existing methods, they commonly generate far more proposals than there are ground-truth action instances, so it still makes sense to improve the ranking accuracy of the generated proposals, since users in real-world scenarios usually prioritize the action proposals with the highest confidence scores. The inaccuracy of the proposal ranking mainly comes from two aspects. First, the traditional proposal generation manner relies entirely on snippet-level perception, resulting in a significant yet unnoticed gap with the target of proposal-level localization. Second, existing methods commonly employ a hand-crafted proposal generation manner, a post-process that does not participate in model optimization. To address these issues, we propose an end-to-end trained two-stage method, termed Learning Proposal-aware Re-ranking (LPR), for WTAL. In the first stage, we design a proposal-aware feature learning module to inject proposal-aware contextual information into each snippet, and the enhanced features are then used to predict initial proposals. In the second stage, to perform effective and efficient proposal re-ranking, we contrast the proposals with high confidence scores against our constructed multi-scale foreground/background prototypes for further optimization. Evaluated by both the vanilla and Top-k mAP metrics, results of extensive experiments on two popular benchmarks demonstrate the effectiveness of our proposed method.

Abstract:
Multi-label image classification is a fundamental yet challenging task, which aims to predict the labels associated with a given image. Most previous methods directly exploit the high-level features from the last layer of a convolutional neural network for classification. However, these methods cannot obtain global features due to the limited size of convolutional kernels, and they fail to extract multi-scale features to effectively recognize small-scale objects in the images. Recent studies exploit graph convolutional networks to model the label correlations and boost classification performance. Despite substantial progress, these methods rely on manually pre-defined graph structures. Besides, they ignore the associations between semantic labels and image regions, and do not fully explore the spatial context of images. To address the above issues, we propose a novel Dual Attention Transformer (DATran) model, which adopts a dual-stream architecture that simultaneously learns spatial and channel correlations from multi-label images. First, to address the difficulty current methods have in recognizing small objects, we develop a new multi-scale feature fusion (MSFF) module to generate multi-scale feature representations by jointly integrating high-level semantics and low-level details. Second, we design a prior-enhanced spatial attention (PSA) module to learn the long-range correlations between objects at different spatial positions in images to enhance model performance. Third, we devise a prior-enhanced channel attention (PCA) module to capture the inter-dependencies between different channel maps, thus effectively improving the correlation between semantic categories. It is worth noting that the PSA and PCA modules complement and promote each other to further augment the feature representations. Finally, the outputs of these two attention modules are fused to obtain the final features for classification. Performance evaluation experiments on the MS-COCO 2014, PASCAL VOC 2007 and VG-500 datasets demonstrate that the DATran model achieves better performance than current state-of-the-art models.

Abstract:
Pedestrian attribute recognition (PAR) has received increasing attention because of its wide application in video surveillance and pedestrian analysis. Extracting robust feature representation is one of the key challenges in this task. The existing methods primarily rely on convolutional neural networks (CNNs) as the backbone network for feature extraction. However, these methods mainly focus on small discriminative regions while ignoring the global perspective. To overcome these limitations, we propose PARFormer, a pure transformer-based multi-task PAR network consisting of four modules. In the feature extraction module, we build a transformer-based strong baseline for feature extraction, which achieves competitive results on several PAR benchmarks compared with the existing CNN-based baseline methods. Since the PAR task is vulnerable to environmental factors, we enhance feature robustness in the feature processing module and propose an effective data augmentation strategy named batch random mask (BRM) block to reinforce the attentive feature learning of random patches. Furthermore, we propose a multi-attribute center loss (MACL) to augment the inter-attribute discriminability of feature representations. As viewpoints can affect some specific attributes, in the viewpoint perception module, we propose a multi-view contrastive loss (MVCL) that enables the network to exploit the viewpoint information. In the attribute recognition module, we alleviate the negative-positive imbalance problem to generate the attribute predictions. These modules interact and jointly learn a highly discriminative feature space and supervise the generation of the final features. Extensive experimental results show that the proposed PARFormer network performs well compared to the state-of-the-art methods on several public datasets, including PETA, RAP, and PA100K. Code will be released at https://github.com/xwf199/PARFormer.

Abstract:
Neural networks can be successfully used for cross-component prediction in video coding. In particular, attention-based architectures are suitable for chroma intra prediction using luma information because of their capability to model relations between different channels. However, the complexity of such methods is still very high and should be further reduced, especially for decoding. In this paper, a cost-effective attention-based neural network is designed for chroma intra prediction. Moreover, with the goal of further improving coding performance, a novel approach is introduced to utilize more boundary information effectively. In addition to improving prediction, a simplification methodology is also proposed to reduce inference complexity by simplifying convolutions. The proposed schemes are integrated into the H.266/Versatile Video Coding (VVC) pipeline, and only one additional binary block-level syntax flag is introduced to indicate whether a given block makes use of the proposed method. Experimental results demonstrate that the proposed scheme achieves up to −0.46%/−2.29%/−2.17% BD-rate reduction on Y/Cb/Cr components, respectively, compared with the H.266/VVC anchor. Reductions in the encoding and decoding complexity of up to 22% and 61%, respectively, are achieved by the proposed scheme with respect to the previous attention-based chroma intra prediction method while maintaining coding performance.

Abstract:
Deep hashing has attracted broad interest in cross-modal retrieval because of its low cost and efficient retrieval. To capture the semantic information of raw samples and alleviate the semantic gap, supervised cross-modal hashing methods have been proposed that utilize label information to map raw samples from different modalities into a unified common space. Despite great progress, existing deep cross-modal hashing methods suffer from several problems: 1) in multi-label cross-modal retrieval, proxy-based methods ignore data-to-data relations and fail to thoroughly explore combinations of different categories, which can lead to samples without common categories being embedded close to each other; 2) for feature representation, image feature extractors built from multiple convolutional layers cannot fully capture the global information of images, which results in sub-optimal binary hash codes. In this paper, by extending the proxy-based mechanism to multi-label cross-modal retrieval, we propose a novel Deep Semantic-aware Proxy Hashing (DSPH) framework, which embeds multi-modal multi-label data into a uniform discrete space and captures fine-grained semantic relations between raw samples. Specifically, by jointly learning multi-modal multi-label proxy terms and multi-modal irrelevant terms, the semantic-aware proxy loss is designed to capture multi-label correlations and preserve the correct fine-grained similarity ranking among samples, alleviating inter-modal semantic gaps. In addition, for feature representation, two transformer encoders are used as backbone networks for images and text, respectively; the image transformer encoder obtains global information of the input image by modeling long-range visual dependencies. We have conducted extensive experiments on three baseline multi-label datasets, and the experimental results show that our DSPH framework achieves better performance than state-of-the-art cross-modal hashing methods. The code for the implementation of our DSPH framework is available at https://github.com/QinLab-WFU/DSPH.

Abstract:
In cross-modal retrieval, the hashing technique has sparked a great revolution because of its competitive query speed and minimal storage. However, existing approaches may have critical limitations: 1) Label Intrinsic Relations. They barely explore the category information inherent in labels and only treat labels as distinct entities, losing rich latent semantic information. 2) Modality-specific and Modality-coherence Semantics. They often construct a common subspace and an affinity matrix to learn modality-specific features and modality-coherence correlations, respectively. The former leads to considerable quantization errors because the subspaces should be approximately rather than exactly equal. The latter is not scalable due to its high computational cost. 3) Non-relaxation Optimization Strategy. To handle the constraints, some approaches relax the binary constraints to continuous ones, giving rise to significant quantization errors. To mitigate these problems, we propose Disperse Asymmetric Subspace Relation Hashing, termed DASRH. In particular, it first embeds modality-specific kernel features into dispersed latent spaces, which can effectively fuse heterogeneous patterns. Additionally, it exploits fine-grained categories from labels by reconstructing collective semantic representations, producing discriminative binary codes. Furthermore, it constructs an asymmetric consistent relation integration, preserving both inter-modal disparities and intra-class differences. For optimization, an effective alternating iterative scheme is established. Theoretical analysis and comprehensive experiments highlight the advantages of our DASRH over cutting-edge methods.

Abstract:
Low-light image enhancement aims to improve the visual quality of images captured under poor illumination and has caught much attention these years. However, existing low-light enhancement methods encounter many problems, e.g., they may not be robust to diverse low-light conditions or have to sacrifice computational efficiency for enhancement performance, which hinder their practical applications. To solve these problems, this paper proposes a novel enhancement method, called Pixel-Wise Gamma Correction Mapping (PWGCM), which combines our innovative pixel-wise Gamma Correction (GC) and deep learning. Compared with conventional GC, our pixel-wise GC is characterized by a set of gamma correction maps, which have the same size as the input image and are taken to replace the single global GC parameter of conventional GC. These gamma correction maps are generated from the low-light image input by a lightweight convolutional neural network at low computational cost. New no-reference loss functions are provided to train the network, ensuring reliable unsupervised learning. Furthermore, our PWGCM is enhanced by an iterative strategy, under which the low-light input image is iteratively enhanced based on the generated gamma correction maps and can yield visually pleasant results. Extensive experiments are done to compare our PWGCM with several state-of-the-art methods in terms of visual quality, efficiency, and auxiliary effects on high-level tasks. The comparison results confirm the superiority of our PWGCM.
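
The difference between conventional and pixel-wise gamma correction can be sketched in a few lines. This is a minimal illustration, not the PWGCM implementation: it assumes a single-channel image with values in [0, 1], uses hand-written gamma maps in place of the maps that the paper's lightweight network would predict, and assumes that the iterative strategy applies one map per iteration.

```python
from typing import List

Image = List[List[float]]  # single-channel image with values in [0, 1]


def pixelwise_gamma(image: Image, gamma_map: Image) -> Image:
    """Apply out[i][j] = image[i][j] ** gamma_map[i][j].

    Conventional gamma correction is the special case in which every entry
    of gamma_map holds the same global value.
    """
    return [
        [pixel ** gamma for pixel, gamma in zip(img_row, gamma_row)]
        for img_row, gamma_row in zip(image, gamma_map)
    ]


def enhance(image: Image, gamma_maps: List[Image]) -> Image:
    """Apply several gamma maps in sequence, one per iteration (an assumption)."""
    for gamma_map in gamma_maps:
        image = pixelwise_gamma(image, gamma_map)
    return image


low_light = [[0.04, 0.25], [0.81, 0.16]]
# Hypothetical maps: brighten dark pixels (gamma < 1), keep the bright one.
maps = [[[0.5, 0.5], [1.0, 0.5]]] * 2
enhanced = enhance(low_light, maps)
```

Because each pixel has its own exponent, dark regions can be brightened strongly while already well-exposed regions are left almost untouched, which a single global gamma cannot do.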

Abstract:
Image fusion techniques aim to generate more informative images by merging multiple images of different modalities with complementary information. Despite significant fusion performance improvements of recent learning-based approaches, most fusion algorithms have been developed based on convolutional neural networks (CNNs), which stack deep layers to obtain a large receptive field for feature extraction. However, important details and contexts of the source images may be lost through a series of convolution layers. In this work, we propose a cross-modal transformer-based fusion (CMTFusion) algorithm for infrared and visible image fusion that captures global interactions by faithfully extracting complementary information from source images. Specifically, we first extract the multiscale feature maps of infrared and visible images. Then, we develop cross-modal transformers (CMTs) to retain complementary information in the source images by removing redundancies in both the spatial and channel domains. To this end, we design a gated bottleneck that integrates cross-domain interaction to consider the characteristics of the source images. Finally, a fusion result is obtained by exploiting spatial-channel information in refined feature maps using a fusion block. Experimental results on multiple datasets demonstrate that the proposed algorithm provides better fusion performance than state-of-the-art infrared and visible image fusion algorithms, both quantitatively and qualitatively. Furthermore, we show that the proposed algorithm can be used to improve the performance of computer vision tasks, e.g., object detection and monocular depth estimation.

Abstract:
Binary neural networks leverage the Sign function to binarize weights and activations, which requires gradient estimators to overcome its non-differentiability and inevitably introduces gradient errors during backpropagation. Although many hand-designed soft functions have been proposed as gradient estimators to better approximate gradients, their mechanism is not clear and there are still huge performance gaps between binary models and their full-precision counterparts. To address these issues and reduce gradient error, we propose to tackle network binarization as a binary classification problem and use a multi-layer perceptron (MLP) as the classifier in the forward pass and as the gradient estimator in the backward pass. Benefiting from the MLP's theoretical capability to fit any continuous function, it can be adaptively learned to binarize networks and backpropagate gradients without any prior knowledge of soft functions. From this perspective, we further empirically show that even a simple linear function can outperform previous complex soft functions. Extensive experiments demonstrate that the proposed method yields surprising performance in both image classification and human pose estimation tasks. Specifically, we achieve 65.7% top-1 accuracy with ResNet-34 on the ImageNet dataset, an absolute improvement of 2.6%. Moreover, we take binarization as a lightweighting approach for pose estimation models and propose the well-designed binary pose estimation networks SBPN and BHRNet. When evaluated on the challenging Microsoft COCO keypoint dataset, the proposed method enables binary networks to achieve a mAP of up to 60.6 for the first time. Experiments conducted on real platforms demonstrate that the BNN achieves a better balance between performance and computational complexity, especially when computational resources are extremely limited.
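
To make the role of a gradient estimator concrete, the sketch below shows the standard baseline that such work builds on: Sign in the forward pass and a clipped straight-through estimator in the backward pass. It is not the MLP-based estimator proposed in the paper; the function names and input values are hypothetical.

```python
from typing import List


def sign_forward(x: List[float]) -> List[float]:
    """Binarize to {-1, +1}; zero is mapped to +1 by convention here."""
    return [1.0 if v >= 0 else -1.0 for v in x]


def ste_backward(x: List[float], grad_out: List[float]) -> List[float]:
    """Clipped straight-through estimator.

    The true derivative of Sign is zero almost everywhere, so the incoming
    gradient is passed through unchanged where |x| <= 1 and blocked elsewhere.
    """
    return [g if abs(v) <= 1.0 else 0.0 for v, g in zip(x, grad_out)]


latent = [-1.7, -0.3, 0.0, 0.4, 2.1]
binary = sign_forward(latent)
grads = ste_backward(latent, [0.5, -0.2, 0.1, 0.3, -0.4])
```

The mismatch between the forward function (a step) and the backward surrogate (a clipped identity) is the source of the gradient error the abstract refers to.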

Abstract:
Since Facial Action Unit (AU) annotations require domain expertise, common AU datasets only contain a limited number of subjects. As a result, a crucial challenge for AU detection is addressing identity overfitting. We find that AUs and facial expressions are highly associated, and existing facial expression datasets often contain a large number of identities. In this paper, we aim to utilize the expression datasets without AU labels to facilitate AU detection. Specifically, we develop a novel AU detection framework aided by the Global-Local facial Expressions Embedding, dubbed GLEE-Net. Our GLEE-Net consists of three branches to extract identity-independent expression features for AU detection. We introduce a global branch for modeling the overall facial expression while eliminating the impacts of identities. We also design a local branch focusing on specific local face regions. The combined output of global and local branches is firstly pre-trained on an expression dataset as an identity-independent expression embedding, and then finetuned on AU datasets. Therefore, we significantly alleviate the issue of limited identities. Furthermore, we introduce a 3D global branch that extracts expression coefficients through 3D face reconstruction to consolidate 2D AU descriptions. Finally, a Transformer-based multi-label classifier is employed to fuse all the representations for AU detection. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art on the widely-used DISFA, BP4D and BP4D+ datasets.

Abstract:
Since Vision Transformers (ViTs) were introduced into computer vision, they have developed rapidly across a variety of visual tasks. Recently, they have been gradually applied to visual tracking. The Transformer can adaptively capture global similarity comparisons between target objects and search regions, achieving competitive performance. However, Transformer architectures often require a large amount of training data and computing resources, and lack prior knowledge of the inductive biases inherent in images. Moreover, the advantages of convolutional neural networks (CNNs) in extracting local similarities are not fully exploited. To resolve these problems, we propose a lightweight tracking architecture that combines CNN and Transformer in the feature fusion stage. Specifically, the Local-Global Feature Interaction (LGFI) module and the Feature Cross-Fusion (FCF) module are the key components of our approach. The LGFI module includes a Transformer global information network and a Transformer-like CNN local information network, which simultaneously establish global dependencies and enhance local feature similarity, and then aggregates their features. The FCF module includes multi-head cross-attention and a convolutional feedforward network for fusing the features of templates and search regions. Finally, we use a classification and regression head to predict the exact location of the target. Extensive experiments demonstrate that our method achieves better tracking performance than the baseline method when both are trained with less data. Meanwhile, without any extra training data, the proposed method also obtains results comparable with other state-of-the-art trackers on six challenging benchmarks, including GOT-10k, LaSOT, TrackingNet, OTB100, UAV123, and NFS. Furthermore, our model is lightweight compared with the baseline method, with fewer parameters and lower FLOPs, while running at real-time speed.

Abstract:
Few-shot class-incremental learning (FSCIL) aims to continually learn new classes using a few samples while not forgetting the old classes. The scarcity of new training data will seriously destroy the model’s stability and plasticity. Continually Evolved Classifiers (CEC) (Zhang et al., 2021), a kind of framework, maintains the stability by freezing the encoder and achieves the plasticity by evolving the classifier along with a pseudo incremental learning scheme. However, the performance of CEC is limited due to 1) inequitable information gains between classifier weights and test features, and 2) inefficient learning task construction strategy. To address the first issue, we propose a Knowledge-guided Relation Refinement Module (KRRM) to update both the classifier weights and test features. The main function of KRRM is achieved through cross-attention to propagate the knowledge represented by old encoded data. To address the second issue, we design a Pseudo Incremental relation Refinement Learning (PIRL) that utilizes a novel hard concepts mining strategy to mine hard concept tasks globally and locally. By successfully addressing the two issues, our proposed method, named Improved Continually Evolved Classifiers (CEC+), extends the potential of CEC without introducing any additional parameters. More precisely, extensive experiments on CIFAR100, miniImageNet, and Caltech-UCSD Birds-200-2011, demonstrate that our proposed method surpasses prior state-of-the-art methods.

Abstract:
Previous multi-task dense prediction studies developed complex pipelines such as multi-modal distillations in multiple stages or searching for task relational contexts for each task. The core insight beyond these methods is to maximize the mutual effects of each task. Inspired by the recent query-based Transformers, we propose a simple pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate the reasoning among multiple tasks and simplify the cross-task interaction pipeline. Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning via multiple queries where each query encodes the task-related context. The MQTransformer is composed of three key components: shared encoder, cross-task query attention module and shared decoder. We first model each task with a task-relevant query. Then both the task-specific feature output by the feature extractor and the task-relevant query are fed into the shared encoder, thus encoding the task-relevant query from the task-specific feature. Secondly, we design a cross-task query attention module to reason the dependencies among multiple task-relevant queries; this enables the module to only focus on the query-level interaction. Finally, we use a shared decoder to gradually refine the image features with the reasoned query features from different tasks. Extensive experiment results on two dense prediction datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is an effective approach and achieves state-of-the-art results. Code and models are available at https://github.com/yangyangxu0/MQTransformer.

Abstract:
With the development of social networks, a large number of private JPEG images are stored on social cloud platforms. Correspondingly, the platform embeds user IDs or authentication labels to manage these private images, preventing them from being arbitrarily accessed or tampered with by unauthorized persons. However, data embedding in the JPEG domain inevitably produces irreversible modifications to DCT coefficients, resulting in obvious or even serious distortion of the host JPEG images. To address this problem, this paper proposes an efficient JPEG reversible data hiding (RDH) method that constructs progressive two-dimensional histogram mappings. We first design a distortion function to calculate the cost of each DCT frequency band, and then sort the bands to build a histogram mapping containing a series of coefficient pairs. Subsequently, a progressive mapping mechanism is introduced to keep most AC coefficients unchanged. According to the given capacity, this mechanism can adaptively generate an optimum two-dimensional histogram mapping to embed secret messages. Our scheme achieves an effective balance among embedding capacity, visual quality of the marked image and file size expansion, while keeping high cost-performance complexity. Extensive experiments demonstrate that our method outperforms existing JPEG RDH schemes in terms of visual quality and file size increment of the marked image, and provides an efficient solution to the confidentiality and secure access problems of sensitive private images in cloud environments.

Abstract:
Capturing the long-range spatial-temporal correlation among joints of dynamic skeletal data efficiently is very challenging in hand gesture recognition (HGR). The flexibility of Transformer in modeling global dependencies among elements of any sequence makes it a perfect solution for skeleton-based HGR. However, the existing Transformer-based approaches only capture the correlation of intra-frame and inter-frame joints, respectively, without considering the relationship among different joints in several successive frames. In this paper, a novel spatial-temporal synchronous transformer (STST) method is proposed for skeleton-based HGR. The spatial-temporal chunks encoding module is proposed to encode the hand gesture skeleton sequence (HGSS) into several chunks, in which each chunk contains several consecutive frames to encode the relationship among spatial-temporal joints. Then, the encoding feature is fed into a spatial-temporal chunks transformer module and a temporal integration transformer module to model the spatial-temporal correlation of HGSSs, simultaneously, so that a more comprehensive understanding of the global and local spatial-temporal information can be achieved. In this way, the spatial-temporal information among joints can be efficiently extracted and utilized to better understand the semantics of gesture actions and then yield a higher recognition accuracy. Extensive experiments on SHREC’17 Track dataset and DHG-14/28 dataset show that the proposed method achieves the state-of-the-art performance compared with other representative methods.
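
The chunk encoding step can be pictured as a simple regrouping of the skeleton sequence. The sketch below is only an illustration under stated assumptions: chunks are taken to be non-overlapping, trailing frames that do not fill a chunk are dropped, and the sequence size (8 frames, 3 joints) is hypothetical. The abstract does not specify these details, and the function names are not from the paper.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) coordinates of one joint
Frame = List[Joint]                 # all joints of one frame


def split_into_chunks(frames: Sequence[Frame], chunk_len: int) -> List[List[Frame]]:
    """Group consecutive frames into non-overlapping chunks.

    Trailing frames that do not fill a whole chunk are dropped (an assumption).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    usable = len(frames) - len(frames) % chunk_len
    return [list(frames[i:i + chunk_len]) for i in range(0, usable, chunk_len)]


def chunk_tokens(chunk: List[Frame]) -> List[Joint]:
    """Flatten a chunk so every joint of every frame becomes one token."""
    return [joint for frame in chunk for joint in frame]


# Hypothetical sequence: 8 frames, 3 joints per frame.
sequence = [[(float(t), float(j), 0.0) for j in range(3)] for t in range(8)]
chunks = split_into_chunks(sequence, chunk_len=4)
tokens = chunk_tokens(chunks[0])
```

Attention over the tokens of one chunk can then relate any joint in one frame to any joint in a neighboring frame, rather than handling the intra-frame and inter-frame relations separately.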

Abstract:
Person re-identification (Re-ID) is the task of retrieving a given person across multiple non-overlapping cameras. Due to viewpoint variation across cameras, the intra-class variance and inter-class similarity of human images are two critical factors that limit the accuracy of person re-identification. To resolve these issues simultaneously, we propose a multi-viewpoint aggregation model, which extends the query method of the Re-ID task from a single viewpoint to multiple viewpoints to deal with viewpoint variation and explores inherent gallery information for query optimization. To construct multiple auxiliary query images with complementary viewpoints, we design a novel identity-consistent pose transfer framework based on a pseudo-Siamese structure and trained by a specific Re-ID guided meta-learning pipeline. The goal is to keep the identity consistent between the initial query and the generated images by enhancing identity-related representations through feature learning and reducing the domain gap between generated and original images. Extensive experimental results show that our method achieves Rank-1/mAP performance of 96.62%/93.26% on Market-1501, 93.45%/87.89% on DukeMTMC-reid and 88.26%/87.97% on CUHK03-labeled, outperforming state-of-the-art single-viewpoint Re-ID methods.

Abstract:
The key to cross-view geolocalization is to match images of the same target from different viewpoints, e.g., images from drones and satellites. It is a challenging problem because the appearance of objects changes with the viewpoint. Most existing methods focus mainly on extracting global features or on segmenting feature maps, causing the loss of information contained in the images. To address these issues, we propose a new ConvNeXt-based method called MCCG, which stands for Multiple Classifier for Cross-view Geolocalization. The proposed method captures rich discriminative information by cross-dimension interaction and acquires multiple feature representations, realizing a comprehensive feature representation. Additionally, the robustness of the model is improved because the multiple feature representations exploit more contextual information despite position shifting or scale variations. Extensive experiments on the widely used public benchmarks University-1652 and SUES-200 demonstrate that the proposed method achieves state-of-the-art performance in both drone-view target localization and drone navigation applications, outperforming existing methods by over 3%. Our code and model are available at https://github.com/mode-str/crossview.

Abstract:
RGB-D semantic segmentation can be advanced with convolutional neural networks (CNNs) due to the availability of Depth data. Although objects cannot be easily discriminated by just the 2D appearance, with the local pixel difference and geometric patterns in Depth, they can be well separated in some cases. Because of their fixed grid kernel structure, CNNs lack the ability to capture detailed, fine-grained information and thus cannot achieve accurate pixel-level semantic segmentation. To solve this problem in the CNN structure, we propose a Pixel Difference Convolutional Network (PDCNet) to capture detailed intrinsic patterns by aggregating both intensity and gradient information in the local range for Depth data and in the global range for RGB data. Specifically, PDCNet consists of a Depth branch and an RGB branch. For the Depth branch, we propose a Pixel Difference Convolution (PDC) to consider local and detailed geometric information in Depth data by aggregating both intensity and gradient information. For the RGB branch, we contribute a lightweight Cascade Large Kernel (CLK) to extend PDC, namely CPDC, to exploit global contexts for RGB data and further boost performance. Consequently, the local and global pixel differences from both modalities are seamlessly incorporated into PDCNet during the information propagation process. Experiments on three challenging benchmark datasets, i.e., NYUDv2 (78.4 Pixel Acc., 53.5 mIoU), SUN RGB-D (83.3 Pixel Acc., 49.6 mIoU) and SID Dataset (83.1 Pixel Acc., 61.4 mIoU), reveal that our PDCNet achieves state-of-the-art performance for the semantic segmentation task.
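
One common way to aggregate intensity and gradient information in a single convolution is the central-difference formulation, sketched below for a single-channel depth map. This is an assumption about the general idea only; the exact PDC operator in the paper may differ, and the function name `pixel_difference_conv` and the mixing factor `theta` are hypothetical.

```python
import numpy as np

def pixel_difference_conv(depth, kernel, theta=0.7):
    """3x3 convolution mixing intensity and pixel-difference (gradient) terms.

    out(p) = sum_n w_n * x(p + n) - theta * x(p) * sum_n w_n
    theta = 0 gives a plain convolution (intensity only); theta = 1 uses only
    the differences between each neighbour and the centre pixel.
    """
    depth = np.asarray(depth, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    padded = np.pad(depth, 1, mode="edge")  # replicate borders
    height, width = depth.shape
    vanilla = np.zeros_like(depth)
    for dy in range(3):
        for dx in range(3):
            vanilla += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return vanilla - theta * depth * kernel.sum()

# A flat depth map has no local differences, so theta = 1 yields all zeros.
flat = np.full((5, 5), 3.0)
kernel = np.arange(9, dtype=float).reshape(3, 3)
print(np.allclose(pixel_difference_conv(flat, kernel, theta=1.0), 0.0))  # True
```

With `theta` between 0 and 1 the output blends the plain response with the difference response, which is one way to read "aggregating both intensity and gradient information".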

Abstract:
Video data pose a major challenge to semantic segmentation due to the large volume of data and strong inter-frame redundancy. In this paper, we propose a dual local and global correlation network tailored for efficient video semantic segmentation. It consists of three modules: 1) a local attention based module, which measures correlation and achieves feature aggregation in a local region between the key frame and a non-key frame; 2) a consistent constraint module, which considers long-range correlation among pixels from a global view to promote intra-frame semantic consistency of the non-key frame; and 3) a key frame decision module, which selects key frames adaptively based on the ability of feature transferring. Extensive experiments on the Cityscapes and CamVid video datasets demonstrate that our proposed method reduces inference time significantly while maintaining high accuracy. The implementation is available at https://github.com/An01168/DCNVSS.

Abstract:
Continuous Sign Language Recognition (CSLR) aims to generate gloss sequences based on untrimmed sign videos. Since discriminative visual features are essential for CSLR, current efforts mainly focus on strengthening the feature extractor. The feature extractor can be decomposed into a spatial representation module and a short-term temporal module for modeling spatial and visual features. However, existing methods always regard it as a monolithic block and rarely implement specific refinements for these two distinct modules, which makes it difficult to model spatial appearance information and temporal motion information effectively. To address these issues, we propose a spatial-temporal enhanced network which contains a spatial-visual alignment (SVA) module and a temporal feature difference (TFD) module. Specifically, the SVA module conducts an auxiliary task between the spatial features and target gloss sequences to enhance the extraction of hand and facial expressions. Meanwhile, the TFD module is constructed to exploit the underlying dynamics between consecutive frames and inject the aggregated motion information into spatial features to assist short-term temporal modeling. Extensive experimental results demonstrate the effectiveness of the proposed modules, and our network achieves state-of-the-art or competitive performance on four public CSLR datasets.
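
The idea of injecting frame-to-frame motion into spatial features can be sketched as follows. This is a deliberately simple illustration, assuming one feature vector per frame and plain additive injection; the function name `inject_temporal_difference` and the `weight` parameter are hypothetical, and the actual TFD module is likely more elaborate.

```python
import numpy as np

def inject_temporal_difference(features, weight=1.0):
    """Add frame-to-frame differences to per-frame spatial features.

    features: array of shape (T, D), one feature vector per frame.
    The first frame has no predecessor, so its difference is zero.
    """
    features = np.asarray(features, dtype=float)
    diff = np.zeros_like(features)
    diff[1:] = features[1:] - features[:-1]  # motion between consecutive frames
    return features + weight * diff

frames = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
print(inject_temporal_difference(frames).tolist())
# [[0.0, 0.0], [2.0, 4.0], [5.0, 2.0]]
```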

Abstract:
We propose a Generative Adversarial Network (GAN)-based architecture for achieving high-quality physically based rendering (PBR). Conventional PBR relies heavily on ray tracing, which is computationally expensive in complicated environments. Some recent deep learning-based methods can improve efficiency but cannot deal with illumination variation well. In this paper, we propose PBR-GAN, an end-to-end GAN-based network that solves these problems while generating natural photo-realistic images. Two encoders (the shading encoder and albedo encoder) and two decoders (the image decoder and light decoder) are introduced to achieve this goal. The two encoders and the image decoder constitute the generator that learns the mapping between the generated domain and the real domain. The light decoder produces light maps that pay more attention to the highlight and shadow regions. The discriminator aims to optimize the generator by distinguishing target images from the generated ones. Three novel loss terms, concentrating on domain translation, overall shading preservation, and light map estimation, are proposed to optimize the photo-realistic outputs. Furthermore, a real dataset is collected to provide realistic information for training the GAN architecture. Extensive experiments indicate that PBR-GAN can preserve the illumination variation and improve the image perceptual quality.

Abstract:
Recently, deep learning has been widely employed to improve the quality of low-light videos. However, most existing low-light video enhancement methods fail to effectively explore temporal dependence, and the enhanced videos may suffer from severe noise, loss of detailed texture, and temporal inconsistency. In this paper, we propose a novel SNR-prior Guided Trajectory-aware Transformer (SGTT) to enable effective video representation learning for low-light video enhancement. Specifically, signal-to-noise ratio prior and cosine similarity are introduced to build the trajectory-aware dual-attention for exploiting the dependence of long-range spatio-temporal information, which searches for sharper and highly correlated patches within the same trajectory to assist in enhancing the target frames. Moreover, to adaptively fuse spatio-temporal information of support frames propagated bidirectionally, an attention-guided spatio-temporal feature aggregation module is proposed to perceive and enhance the specific high-quality features. The evaluation of both dynamic and static videos shows the effectiveness of our network, which significantly outperforms the state-of-the-art methods.

Abstract:
Adverse weather conditions, such as rain, raindrops, snow and haze, consistently degrade images in an unpredictable manner, thereby rendering existing task-specific and task-aligned methods inadequate in addressing this formidable problem. To this end, we investigate the application of the Transformer in image restoration and introduce an efficient frequency-oriented method called AIRFormer, which is designed to restore weather-degraded images comprehensively and holistically. Specifically, we identify that the initial self-attention mechanism exhibits distinctive properties akin to a low-pass filter. Therefore, we construct a frequency-guided Transformer encoder by incorporating wavelet-based prior information to guide the extraction of image features. Additionally, considering the non-specific frequency characteristics of self-attention in the later stages, we develop a frequency-refined Transformer decoder that incorporates learnable task-specific queries across spatial dimensions, channel dimensions, and wavelet domains. To facilitate the training of our proposed method, we curate a comprehensive benchmark dataset named AIR40K that encompasses a wide range of challenging scenarios. Extensive experimental evaluations demonstrate the superiority of our AIRFormer over both task-aligned and all-in-one methods across 15 publicly available datasets. Notably, AIRFormer achieves the best trade-off between inference time and quality of the reconstructed image compared with existing methods such as TransWeather and Restormer. The source code, dataset and pre-trained models will be available at https://github.com/chdwyb/AIRFormer.

Abstract:
Unsupervised hashing offers label independence together with high storage and retrieval efficiency, which makes it suitable for scalable image retrieval. Most existing methods focus on enhancing the training process of the image hashing model at the offline stage. However, they pay little attention to query content analysis at the online retrieval stage. They therefore still suffer from a shortage of important query semantics, which limits online retrieval performance, the ultimate objective of an image retrieval system. In this paper, we propose Online Query Expansion Hashing (OQEH) for efficient image retrieval, which adaptively enhances the discriminative capability of query hash codes in an expansion manner at the online retrieval stage. Specifically, we first design a self-expansion network to learn semantically invariant feature representations from images and their visual augmentations. Then, we conduct neighborhood-expansion to search similar samples for each image from a query expansion set with the semantically invariant features and design a Transformer architecture to adaptively transfer the semantics of neighbor samples to their corresponding images. With the support of semantically invariant features, the query expansion set, and adaptive semantic transfer, the representation capability of query hash codes can be enhanced at the online retrieval stage. Experimental results demonstrate that the proposed OQEH method achieves superior retrieval accuracy and comparable retrieval efficiency compared with the state-of-the-art methods. In particular, on the MS COCO dataset, OQEH obtains about 6% performance improvement compared with the state-of-the-art results. The source code of our method is available at: https://github.com/christinecui/OQEH.
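
A minimal sketch of hash-based retrieval with a query expansion step is given below. It assumes sign binarisation and Hamming ranking, and it replaces the paper's Transformer-based semantic transfer with a plain average over the nearest expansion samples. All function names and the parameter `k` are hypothetical.

```python
import numpy as np

def to_hash(features):
    """Binarise real-valued features into {0, 1} hash codes by sign."""
    return (np.asarray(features) > 0).astype(np.uint8)

def hamming_rank(query_code, database_codes):
    """Return database indices sorted by Hamming distance, plus the distances."""
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    return np.argsort(distances, kind="stable"), distances

def expand_query(query_feature, expansion_features, k=2):
    """Average the query with its k most cosine-similar expansion samples.

    Stand-in for adaptive semantic transfer (assumption: simple mean).
    """
    q = query_feature / np.linalg.norm(query_feature)
    e = expansion_features / np.linalg.norm(expansion_features, axis=1, keepdims=True)
    nearest = np.argsort(-(e @ q), kind="stable")[:k]
    return (query_feature + expansion_features[nearest].sum(axis=0)) / (k + 1)

database = np.array([[1, 0, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]], dtype=np.uint8)
order, distances = hamming_rank(np.array([1, 0, 1, 1], dtype=np.uint8), database)
print(order.tolist(), distances.tolist())  # [0, 2, 1] [0, 3, 1]
```

In use, the query feature would be passed through `expand_query` and then `to_hash` before ranking, so the expansion only affects the online query code and leaves the stored database codes untouched.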

Abstract:
Previous works on human motion prediction follow the pattern of building an extrapolation mapping between the observed sequence and the one to be predicted. However, the inherent difficulty of time-series extrapolation and the complexity of human motion data still result in many failure cases. In this paper, we explore a longer sequence horizon that includes the poses following the prediction target, which removes the limitation of extrapolation problems that information on the other side of the predictive target is completely unknown. As these poses are unavailable at test time, we regard them as a privileged sequence, and propose a Two-stage Privileged Knowledge Distillation framework that incorporates privileged information in the forecasting process while avoiding direct use of it. Specifically, in the first stage, both the observed and privileged sequences are encoded for interpolation, with the Privileged-sequence-Encoder (Priv-Encoder) learning privileged knowledge (PK) simultaneously. Then, in the second stage, where the privileged sequence is not observable, a novel PK-Simulator distills PK by approximating the behavior of the Priv-Encoder while taking only the observed sequence as input, to enable a PK-aware prediction pattern. Moreover, we present a One-stage version of this framework, using a Shared Encoder that integrates the observation encoding in both the interpolation and prediction branches to realize parallel training, which helps produce the PK most conducive to the prediction pipeline. Experimental results show that our frameworks are model-agnostic, and can be applied to existing motion prediction models with an encoder-decoder architecture to achieve improved performance.

Abstract:
Recent advancements in video semantic segmentation have made substantial progress by exploiting temporal correlations. Nevertheless, persistent challenges, including redundant computation and the reliability of the feature propagation process, underscore the need for further innovation. In response, we present Deep Common Feature Mining (DCFM), a novel approach strategically designed to address these challenges by leveraging the concept of feature sharing. DCFM explicitly decomposes features into two complementary components. The common representation extracted from a key-frame furnishes essential high-level information to neighboring non-key frames, allowing for direct re-utilization without feature propagation. Simultaneously, the independent feature, derived from each video frame, captures rapidly changing information, providing frame-specific clues crucial for segmentation. To achieve such decomposition, we employ a symmetric training strategy tailored for sparsely annotated data, empowering the backbone to learn a robust high-level representation enriched with common information. Additionally, we incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency. Experimental evaluations on the VSPW and Cityscapes datasets demonstrate the effectiveness of our method, showing a superior balance between accuracy and efficiency. The implementation is available at https://github.com/BUAAHugeGun/DCFM.

Abstract:
Heatmap-based methods have dominated the face alignment task, yet the maximum response decoding scheme necessitates further reform. While some studies have attempted to compensate for prediction offsets using a post-processing module, the prediction errors induced by the maximum response decoding scheme remain challenging to rectify. In this paper, we argue that using the heatmap value to denote the ground-truth probability is not accurate enough. To address this problem, we propose DISPAL, a novel DIStribution-based Probability for fAcial Landmarks, which expresses the ground-truth probability as the similarity between the value distribution in a pixel’s neighbourhood and a Gaussian distribution. This probability enables us to pinpoint the keypoint location more robustly than previous methods that rely solely on the peak score. It also generalizes well to complex decoding methodologies. Furthermore, we propose supervising this probability as an additional task loss to help the model learn a better heatmap representation. Extensive empirical results on the WFLW, 300W, and COFW datasets demonstrate that our distribution-based probability mechanism significantly surpasses original value-based probability approaches.
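
The contrast between maximum response decoding and a distribution-based score can be sketched as follows. The similarity measure used here (cosine similarity between each pixel's neighbourhood and a Gaussian template) is an assumption chosen for illustration; the abstract does not specify the measure. The function names and the window parameters `size` and `sigma` are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Unnormalised 2D Gaussian template of shape (size, size)."""
    offsets = np.arange(size) - size // 2
    yy, xx = np.meshgrid(offsets, offsets, indexing="ij")
    return np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))

def decode_max_response(heatmap):
    """Classic decoding: the (row, col) of the largest heatmap value."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(row), int(col)

def decode_by_distribution(heatmap, size=5, sigma=1.0):
    """Pick the pixel whose neighbourhood best matches a Gaussian template."""
    template = gaussian_kernel(size, sigma).ravel()
    template = template / np.linalg.norm(template)
    half = size // 2
    padded = np.pad(heatmap, half, mode="constant")
    scores = np.zeros(heatmap.shape, dtype=float)
    for row in range(heatmap.shape[0]):
        for col in range(heatmap.shape[1]):
            patch = padded[row:row + size, col:col + size].ravel()
            norm = np.linalg.norm(patch)
            if norm > 0:
                scores[row, col] = patch @ template / norm  # cosine similarity
    return decode_max_response(scores)

# A Gaussian blob centred at (5, 7) plus a single-pixel spike at (1, 1).
rows, cols = np.mgrid[0:12, 0:12]
heatmap = np.exp(-((rows - 5) ** 2 + (cols - 7) ** 2) / 2.0)
heatmap[1, 1] = 1.5

print(decode_max_response(heatmap))     # (1, 1), fooled by the spike
print(decode_by_distribution(heatmap))  # (5, 7), the Gaussian-shaped peak
```

The spike has the highest value but an un-Gaussian neighbourhood, so a peak-value rule selects it while the neighbourhood-shape rule does not.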

Abstract:
Knowledge-based Scene Graph Generation (SGG) requires external commonsense knowledge beyond the visual scene to infer the relation between objects. Such knowledge can be obtained in a variety of forms, such as vision, text, and graph. However, there are two drawbacks as follows: 1) commonsense knowledge essentially has uncertainty, but current works usually represent knowledge in a deterministic manner, which is not well matched to its nature, 2) using commonsense knowledge without denoising will introduce irrelevant information. This can increase the burden on the relation classifier and only obtain marginal gains over a large amount of data. In this paper, we propose a novel Gaussian distribution-aware commonsense knowledge learning method for SGG. First, we associate each object pair with a Gaussian distribution, which parametrizes visual context and commonsense as mean and variance, respectively. We prove that Gaussian modeling can provide a probabilistic soft space to measure the uncertainty of external knowledge, which allows diverse predictions. Second, to reduce semantic noise in commonsense, we sample multiple variables from the Gaussian distribution and train multi-expert classifiers, which can be dynamically examined for the ensemble softmax classification. Extensive comparative experiments on two benchmarks confirm that our method can achieve competitive performance against the state-of-the-art. Ablation studies verify the essential roles of individual components. Moreover, the visualization of multi-expert classifiers confirms our ability to integrate commonsense for relation inference.
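
The sampling and multi-expert steps can be sketched as below, assuming one Gaussian per object pair, reparameterised sampling, one linear classifier per expert, and a uniform average of the experts' softmax outputs. The uniform average is a simplification of the dynamic ensemble described in the abstract, and the names `ensemble_predict`, `expert_weights`, and the dimensions are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_predict(mean, log_var, expert_weights, rng):
    """Sample one latent per expert and average the experts' softmax outputs.

    mean, log_var: (D,) Gaussian parameters for one object pair.
    expert_weights: (K, D, R), one linear classifier per expert over R relations.
    """
    num_experts = expert_weights.shape[0]
    std = np.exp(0.5 * log_var)
    noise = rng.standard_normal((num_experts, mean.shape[0]))
    samples = mean + std * noise  # reparameterised draws, shape (K, D)
    logits = np.einsum("kd,kdr->kr", samples, expert_weights)
    return softmax(logits).mean(axis=0)  # relation distribution, shape (R,)

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4, 5))  # 3 experts, 4-dim latent, 5 relations
probs = ensemble_predict(np.zeros(4), np.zeros(4), weights, rng)
print(probs.shape)  # (5,)
```

A larger variance spreads the samples further from the mean, so the experts see more diverse inputs and the averaged prediction becomes less peaked, which is one way uncertainty in the knowledge can translate into diverse predictions.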

Abstract:
Deep neural networks (DNNs) have demonstrated excellent performance across various domains. However, recent studies have shown that deep neural networks are vulnerable to adversarial examples, including DNN-based video action recognition models. While much of the existing research on adversarial attacks against video models focuses on perturbation-based attacks, there is limited research on patch-based black-box attacks. Existing patch-based attack algorithms suffer from the problem of a large search space of optimization algorithms and use patches with simple content, leading to suboptimal attack performance or requiring a large number of queries. To address these challenges, we propose the “Diffusion Patch Attack (DPA) with Spatial-Temporal Cross-Evolution (STCE) for Video Recognition,” a novel approach that integrates the excellent properties of the diffusion model into video black-box adversarial attacks for the first time. This integration significantly narrows the parameter search space while enhancing the adversarial content of patches. Moreover, we introduce the spatial-temporal cross-evolutionary algorithm to adapt to the narrowed search space. Specifically, we separate the spatial and temporal parameters and then employ an alternate evolutionary strategy for each parameter type. Extensive experiments conducted on three widely used video action recognition models (C3D, NL, and TPN) and two benchmark datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our approach compared to other state-of-the-art black-box patch attack algorithms.

Abstract:
The Cross-View Geo-Localization task aims to establish correspondences between images captured from different perspectives within the same geographical region. The major challenge lies in the significant appearance variations of the same scene in different views. Current methods predominantly rely on learning a representation of the coarse-grained information from images and then evaluating the similarity, while the fine-grained features are usually not well treated. In this paper, a novel method named DAC (Domain Alignment and scene Consistency) is proposed, which leverages contrastive learning to acquire the global information of images and simultaneously employs a domain space alignment module to align the fine-grained features. The comprehensive utilization of multi-grained visual information yields better feature representations. Additionally, a cross-batch scene consistency strategy is proposed in the network to establish global supervision of the positive samples based on scene correspondence, which improves the distinctiveness of the image representations. Our method outperforms state-of-the-art methods in drone-view target localization and drone navigation applications in comprehensive tests on the popular public datasets University-1652 and SUES-200. Our method also outperforms existing methods in cross-region localization, showing an average improvement of 5.6% in R@1. Our codes and models are available at https://github.com/SummerpanKing/DAC.
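
Contrastive learning across two views is often implemented with a symmetric InfoNCE loss, sketched below. The abstract does not name the loss, so InfoNCE is an assumption, as are the function name `info_nce` and the `temperature` value.

```python
import numpy as np

def info_nce(drone_feats, satellite_feats, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input depicts the same scene."""
    d = drone_feats / np.linalg.norm(drone_feats, axis=1, keepdims=True)
    s = satellite_feats / np.linalg.norm(satellite_feats, axis=1, keepdims=True)
    logits = d @ s.T / temperature  # (N, N) scaled cosine similarities

    def cross_entropy_diag(z):
        # Cross-entropy where the matching pair (the diagonal) is the target.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

feats = np.eye(4)
aligned = info_nce(feats, feats)
mismatched = info_nce(feats, feats[[1, 0, 3, 2]])
print(aligned < mismatched)  # True
```

The loss is small when each drone feature is closest to the satellite feature of the same scene and large when the pairs are mismatched, which is the behaviour a global matching objective needs.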

Abstract:
Compositional Zero-Shot Learning (CZSL) has been applied to various scenarios, including scene understanding, visual-language representation, and domain adaptation. Despite numerous endeavours and significant advancements, the crucial issues of fuzzy conceptualization of visual attributes and insufficient inter-class connectivity have not been adequately addressed. To address these issues, we propose Learning Visual Attributes Representation for Compositional Zero-Shot Learning (LVAR-CZSL), which can learn visual attributes and inter-class dependencies. LVAR-CZSL is mainly composed of two key components: the Visual Attribute Representation Module (VARM) and the Connected Learning Module (CLM). Specifically, VARM extracts detailed attribute and object visual features from global visual features, resolving the issue of fuzzy visual attribute concepts. Moreover, CLM endows LVAR-CZSL with the capability to perceive connectivity between different attributes and objects, effectively enhancing inter-class connectivity. To establish a close connection between VARM and CLM and minimize the gap between image and text features, we introduce the composition-attribute-object Joint Scoring Function (JSF). Additionally, we propose a Joint Loss Function (JLF) to optimize the learning process of VARM and CLM. Experimental results on four datasets show that LVAR-CZSL achieves state-of-the-art performance. The code is available at https://github.com/mxjmxj1/LVAR-CZSL.

Abstract:
As a secondary generation method, camera recording causes irreversible damage to the watermark within a video, which has always been a challenge in video forensics. Although many video watermarking methods are reported in the literature, they still cannot resist camera recording well. This motivates us to introduce a new video watermarking method that resists camera recording. In the proposed method, two watermarks, i.e., a copyright watermark and a synchronization watermark, are embedded into carefully selected frequency domain coefficients. The synchronization watermark is used to ensure that the copyright watermark can be successfully extracted at the decoder side. To extract the copyright watermark without manual assistance, a neural network based segmentation model is applied to identify the watermarked video-playing region in the camera-recorded video. Meanwhile, automatic perspective correction is performed on the watermarked video-playing region so that the watermark information can be extracted accurately. The experiments show that, with the proposed method, the watermark data can be embedded into the raw video successfully and extracted from the camera-recorded video accurately. Moreover, the proposed method significantly outperforms related works in terms of robustness in different scenarios, which verifies its superiority and applicability.

Abstract:
The extraction of distribution from images with diverse weather conditions is crucial for enhancing the robustness of visual algorithms. When addressing image degradation caused by different weather, accurately perceiving the data distribution of weather-informed degradation becomes a fundamental challenge. However, given its highly stochastic nature, modelling the weather distribution is a formidable task. In this paper, we propose a novel multi-Weather distribution difFUsion blind restoration model, named WeaFU. First, the model employs representation learning to map the image distribution into a latent space. Subsequently, WeaFU utilizes a diffusion-based approach, with the assistance of a Diffusion Distribution Generator (DDG), to perceive and extract the corresponding weather distribution. This strategy injects the data distribution into the recovery process, significantly enhancing the robustness of the model in diverse weather scenarios. Finally, a Conditional Distribution-Aware Transformer (CDAT) is constructed to align the distribution information with pixels, thereby obtaining clear images. Extensive experiments on real and synthetic datasets demonstrate that WeaFU achieves superior performance.

Abstract:
Predicting the class label from a partially observed activity sequence can be quite challenging due to the high degree of similarity among the early segments of different activities. In this paper, an innovative HARDness-Guided Discrimination Network (HARDer-Net) is proposed to evaluate the relationship between similar activity pairs that are extremely hard to discriminate. To train our HARDer-Net, an adversarial learning scheme has been designed, enabling our network to extract subtle discriminative information for 3D early activity prediction. Moreover, to enhance the efficacy of the adversarial learning scheme for 3D early action prediction, we construct a Hardness-Guided bank that dynamically records hard similar samples and conducts reward-guided selection of these recorded hard samples using a deep reinforcement learning scheme. The proposed method significantly enhances the capability of the model to discern fine-grained differences in early activity sequences. Several widely used activity datasets are used to evaluate our proposed HARDer-Net, and we achieve state-of-the-art performance across all the evaluated datasets.

Affiliations: Robotics and Autonomous Systems (ROAS) Thrust, The Hong Kong University of Science and Technology (Guangzhou), Nansha, Guangzhou, Guangdong, China; Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, Hong Kong; Department of Computer Science, Dalian University of Technology, Dalian, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Department of Computing, the School of Design, and the Research Institute for Sports Science and Technology, The Hong Kong Polytechnic University, Hung Hom, Hong Kong; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Informatics, Xiamen University, Xiamen, China; Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Robotics and Autonomous Systems (ROAS) Thrust and Data Science and Analytics (DSA) Thrust, The Hong Kong University of Science and Technology (Guangzhou), Nansha, Guangzhou, Guangdong, China

Abstract:
Natural images often contain multiple shadow regions, and existing video shadow detection methods tend to fail to identify all of them, since they mainly learn temporal features at a single scale and from a single memory. In this work, we develop a novel convolutional neural network (CNN) to learn motion-guided multi-scale memory features to obtain multi-scale temporal information based on multiple network memories for boosting video shadow detection. To do so, our network first constructs three memories (i.e., a global memory, a local memory, and a motion memory) to combine spatial context and object motion for detecting shadows. Based on these three memories, we then devise a multi-scale motion-guided long-short transformer (MMLT) module to learn multi-scale temporal and motion memory features for predicting a shadow detection map of the input video frame. Our MMLT module includes a dense-scale long transformer (DLT), a dense-scale short transformer (DST), and a dense-scale motion transformer (DMT) to read the three memories for learning multi-scale transformer features. Our DLT, DST, and DMT each consist of a set of memory-read pooling attention (MPA) blocks and densely connect the output features of multiple MPA blocks to learn multi-scale transformer features, since the scales of these output features vary. By doing so, we can more accurately identify multiple shadow regions of different sizes in the input video. Moreover, we devise a self-supervised pretext task to pre-train the feature encoder for enhancing the downstream video shadow detection. Experimental results on three benchmark datasets show that our video shadow detection network quantitatively and qualitatively outperforms 26 state-of-the-art methods.

Abstract:
Scene text spotting, a unified framework combining text detection and text recognition, has made great progress in recent years. Existing methods usually adopt a fully-supervised learning strategy, which relies on time-consuming location annotations, particularly for scene texts with arbitrary shapes. In this paper, we propose a weakly-supervised scene text spotting method that uses single-point location labels with the corresponding text transcriptions. Due to the weak location annotations for challenging scene texts, previous weakly-supervised methods adopting the convolutional neural network structure find it hard to model text feature representations at different scales in blurry or noisy scenarios. In addition, as the single-point location can only cover part of the text instance, it adds confusion to sequence-like scene text recognition. To address these issues, we present a novel sequential recurrence self-attention for granularity-aware single-point scene text spotting. Specifically, we first enhance the scene text feature representations at different scales by integrating the global intra-interaction of high-level features with the low-level local features. Then, based on the granularity-aware text features, we decode them into text transcriptions in the sequential recurrence self-attention manner to capture the sequence-dependent relation in character-level semantics and locations. Extensive experiments show that our proposed method outperforms existing state-of-the-art weakly-supervised scene text spotters by a large margin.

Abstract:
When observing a person’s body, humans can extract a structured representation of the body called a parse graph, which includes the hierarchical decompositions from the entire body to parts and primitives and the context relations expressed by horizontal links between the body parts. This ability helps humans better locate body structures at different levels. To give a model this ability for single-person pose estimation, we design a hierarchical network that uses convolutional neural networks to model the context relations and hierarchical structure in the parse graph of the body. It overcomes the problem that most methods ignore either the context relations or the hierarchical structure in the parse graph. Our network contains bottom-up and top-down stages. In the bottom-up stage, the structural features of the hierarchy are captured from primitives to parts and the entire body. Then in the top-down stage, from the entire body to parts and primitives, the structural features of the body parts are refined separately rather than together, using the context information of each body part. Experiments show that our model enhances the reasonableness of predictions and achieves superior results on the CrowdPose, COCO keypoint detection and MPII human pose datasets.

Abstract:
In autonomous driving, it is crucial to train a single segmentation model that can generalize well to various target environments. Due to the lack of pixel-level annotation and the large domain discrepancy between domain pairs, it is difficult to achieve good performance in multi-target domain adaptive semantic segmentation. To this end, we propose a novel Multi-level Collaborative Learning (MCL) framework that consists of two core components, namely Multi-level Self-Training (MST) and Hierarchical Knowledge Distillation (HKD). Specifically, MST focuses on individual, collaborative, and ensemble learning, while HKD aims to exploit the model’s ensemble capability. These designs enable the proposed MCL to fully exploit the multiple target data to train more powerful teachers and yield more accurate domain alignment. In addition, we integrate style transfer, self-training, and knowledge distillation into an end-to-end training scheme, making the proposed MCL more practical in applications. We conduct extensive experiments on multi-target benchmarks. The results show the effectiveness of our method, which achieves state-of-the-art performance. Codes are available at https://github.com/feifei-cv/MCL.

Abstract:
Few-shot object detection achieves rapid detection of novel-class objects by training detectors with a minimal number of novel-class annotated instances. Transfer learning-based few-shot object detection methods have shown better performance compared to other methods such as meta-learning. However, when training with base-class data, the model may gradually bias towards learning the characteristics of each category in the base-class data, which could result in a decrease in learning ability during fine-tuning on novel classes, and further overfitting due to data scarcity. In this paper, we first find that the generalization performance of the base-class model has a significant impact on novel class detection performance, and we propose a generalization feature extraction network framework to address this issue. This framework perturbs the base model during training to encourage it to learn generalization features and mitigates the impact of changes in object shape and size on overall detection performance, improving the generalization performance of the base model. Additionally, we propose a feature-level data augmentation method based on self-distillation to further enhance the overall generalization ability of the model. Our method achieves state-of-the-art results on both the COCO and PASCAL VOC datasets, with a 6.94% improvement on the PASCAL VOC 10-shot dataset.

Abstract:
Despite the extensive research on RGBT object tracking, there are still several challenges and issues in practical applications, such as modality differences, lighting variations and disappearance of the target, and changes in viewpoint. Existing methods mostly address these issues by fusing image features, while neglecting a significant amount of target label information. To address these challenges, this paper introduces text to drive the alignment of visible and infrared image features, transforming features from different modalities into the same feature space and fully using complementary features between different modalities. Furthermore, inspired by the success of prompt learning in various tasks, we utilize prior boxes and language as prompts to further guide the model in tracking the target. Extensive experiments demonstrate that the proposed VLCTrack tracker has excellent potential in RGBT object tracking. Compared to previous methods developed for this purpose, our approach achieves state-of-the-art performance on three benchmark datasets.

Abstract:
Face forgery detection receives widespread attention due to the great security threats arising from the development of face forgery technologies. Most existing works define it as a binary classification problem by modeling the spatial and temporal artifacts to distinguish real and fake videos. However, the detector tends to heavily rely on the binary labels and overfit method-specific forgery patterns of the training set, resulting in limited generalization ability. To mitigate this issue, we propose a Temporal Diversified Self-Contrastive Learning (TDSCL) framework, which guides the model to exploit generalized temporal inconsistencies for face forgery detection. Firstly, a Temporally Diversified Transformation (TDT) strategy is designed to create diverse training samples with multiple temporal scales. Subsequently, Short-term Self-contrastive Learning (STSC) and Long-term Self-contrastive Learning (LTSC) are proposed to perform temporal representations of the video at different temporal granularities to capture intrinsic and generalized forensics clues to expose fake videos, which can serve as auxiliary supervisions equipped with different backbones flexibly. Moreover, a Similarity-Guided Adaptive Fusion (SGAF) module is designed to adaptively reinforce the temporal inconsistencies for reliable classification. Extensive experiments verify that the proposed method achieves superior generalization ability over various state-of-the-art methods in different benchmark datasets.

Abstract:
State-of-the-art (SOTA) adversarial attacks expose vulnerabilities in object detectors, often resulting in erroneous predictions. However, existing adversarial attacks neglect the stealth and flexibility of adversarial examples, which are crucial for conducting contextually consistent and inconspicuous attacks. To address these issues, leveraging the observed phenomenon of predicted box offsets in real-world object detection scenarios, this paper presents a novel adversarial attack framework called ShiftAttack. It leverages the concept of dense detection in prevalent object detectors by boosting the confidence of low Intersection over Union (IoU) predictions within the positive samples (the set of predicted boxes responsible for localizing the same target), which leads to the erroneous exclusion of true positive predictions during the post-processing stage. Such a paradigm is highly stealthy, as the shifted predictions seem like natural detector mistakes rather than obvious manipulations. To enhance the flexibility of ShiftAttack, this paper proposes a generative approach called ShiftAttack Generator (SAG), which can not only shift predicted boxes for any target in arbitrary directions and distances but also facilitate adaptive feature exchange between pre- and post-shift regions to optimize the attack. Additionally, the proposed SAG incorporates the Dynamic Hinge Loss (DHL) to ensure the imperceptibility of perturbations, effectively mitigating the Patch-Pattern associated with the use of the $\mathcal{L}_2$ norm. Extensive experiments confirm that SAG surpasses other SOTA adversarial attacks in effectiveness, speed and stealthiness.
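
The post-processing effect described in this abstract can be illustrated with a minimal sketch. It is not the ShiftAttack or SAG implementation: it only assumes that the post-processing stage is greedy non-maximum suppression (NMS), and the boxes, scores, and the 0.5 threshold are made-up values chosen for illustration. Raising the score of an offset box above that of the accurate box makes NMS keep the offset box and discard the accurate one.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than `thresh`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical example: two predictions for the same target.
ground_truth = (10, 10, 50, 50)
accurate = (10, 10, 50, 50)  # well-localized prediction
shifted = (18, 10, 58, 50)   # same target, offset to the right

boxes = [accurate, shifted]
clean_keep = nms(boxes, scores=[0.90, 0.30])     # accurate box wins
attacked_keep = nms(boxes, scores=[0.90, 0.95])  # boosted offset box wins
shifted_iou = iou(shifted, ground_truth)         # localization quality after the shift
```

In the clean case NMS keeps index 0 (the accurate box); once the offset box has the higher score, NMS keeps index 1 instead, and the surviving prediction overlaps the ground truth less than the suppressed one did.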

Abstract:
In the field of invertible image decolorization, how to reduce artifacts in the smooth grayscale regions and prevent color distortion at the boundaries of the reconstructed color image is a crucial issue. In this paper, we propose an invertible deep learning network with extraction and hiding of color information. Our approach separates the original color image into the luminance and chromaticity planes by using orthogonal transformation, which enhances the independence and completeness of color and luminance information. Then, the color feature extraction module is developed to minimize color information distortion, while the color hiding module is adopted to hide color information invisibly. Compared with existing deep-learning-based methods, the proposed network can preserve more color information while ensuring the quality of grayscale images by processing color and grayscale information separately. Furthermore, we propose a reversible data hiding strategy that enhances the performance of the reconstructed color images. Our method outperforms learned invertible image decolorization methods, as demonstrated through experiments on the VOC2012, Kodak24, and NCD datasets.

Abstract:
Most existing text-driven face image generation and manipulation methods are based on StyleGAN2, which is inherently limited to aligned faces and therefore makes these methods fail to preserve the highly variable face placement. Additionally, these methods directly leverage a pairwise loss to learn the correspondence between the image and text, which cannot handle complex text descriptions, e.g., text with multiple captions describing multiple facial attributes. To address these issues, we explore the feasibility of applying the more advanced StyleGAN3 to generate and manipulate the face images in an Open-World setup, i.e., the target face image is not required to be aligned and the text description may contain multiple captions. To this end, we first design an improved iterative refinement strategy that adaptively predicts the generator weight offsets rather than residuals for the inverted latent code via a hypernetwork, which efficiently finds a desired generator with no image-specific optimization. We further analyze the disentanglement of different StyleGAN3 latent spaces and demonstrate that the $\mathcal{S}$ space learns a more semantically disentangled representation. To enable complex edits mentioned by the multi-caption text, we propose a cross-modal feature filtration module with a probability adaptation strategy to capture the image-text correspondences. Finally, we incorporate a channel-wise attention mechanism to obtain a global latent manipulation direction, which learns to assign importance weights to different channels. Extensive experiments demonstrate the superior performance of our proposed method compared against the state-of-the-art methods.

Abstract:
The dominant trackers generate a fixed-size rectangular region based on the previous prediction or initial bounding box as the model input, i.e., the search region. While this approach obtains promising tracking efficiency, a fixed-size search region lacks flexibility and is likely to fail in some cases, e.g., fast motion and distractor interference. Trackers tend to lose the target object when the search region is too small, or to suffer interference from distractors when it is too large. Drawing inspiration from the way humans track an object, we propose a novel tracking paradigm, called Search Region Regulation Tracking (SRRT), that applies a small eyereach when the target is captured and zooms out the search field when the target is about to be lost. SRRT applies a proposed search region regulator to estimate an optimal search region dynamically for each frame, by which the tracker can flexibly respond to transient changes in the location of object occurrences. To adapt to the object’s appearance variation during online tracking, we further propose a locking-state determined updating strategy for reference frame updating. The proposed SRRT is concise without bells and whistles, yet achieves evident improvements and competitive results with other state-of-the-art trackers on eight benchmarks. On the large-scale LaSOT benchmark, SRRT improves SiamRPN++ and TransT with absolute gains of 4.6% and 3.1% in terms of AUC. The code and models will be released.

Abstract:
As one of the most effective subspace clustering methods, the self-expression based sparsity method leverages the robust representational learning and non-linear transformation capacities of deep learning. This approach facilitates the mapping of data into a low-dimensional subspace, wherein the clustering operations are subsequently executed. However, most conventional self-expression methods do not handle the subspace clustering problem with sparse-labeled information. Considering the scarcity and value of labeled samples in various real-world applications, we propose a novel deep Few-Shot Subspace Clustering Learning (FS2CL) framework to improve the traditional self-expression-based techniques in the case of sparse label information, in which partial classes in the observation dataset have a scarcity of labeled samples and most other classes do not. We expect to obtain more discriminative low-rank representations that exhibit high cohesion among clusters. To overcome the limitation that the low-rank approximation is achieved by singular value decomposition, which is not differentiable and cannot be embedded in neural networks for gradient backpropagation, a Low-rank Representation Approximation (LRA) module is proposed to transform the non-differentiable singular value decomposition into a differentiable iterative process. This procedure produces a low-rank representation that maximizes the cohesion of features belonging to the same cluster. Subsequently, we propose a method for learning a low-dimensional learnable subspace bases matrix assisted by a small number of labeled samples, which captures the structure of each subspace. We then classify the data points belonging to the corresponding class by measuring the similarity between the instance and each subspace base. Due to the low dimension of the subspace bases matrix, it is possible to apply our method to large-scale datasets. Extensive comparison studies on six benchmark datasets (MNIST, Fashion-MNIST, REUTERS-10K, STL-10, CIFAR10, and CIFAR100) show that the proposed method outperforms state-of-the-art clustering approaches.

Abstract:
As an emerging and popular technique for boosting CNNs, structural reparameterization (SR) decouples the training and inference structures to alter the training dynamics and achieve cost-free improvement of a given network. Existing SR methods often prioritize network expressiveness enhancement but have yet to investigate approaches to mitigate significant bias and non-robustness of model prediction due to over-reliance on training data distribution and image noise. To this end, inspired by the effective strength of implicit regularization on the problem, this paper introduces an extra balanced implicit regularization mechanism into SR techniques to enhance the generalization of a given network for the first time. Specifically, we propose a novel SR module named DR-Block, which is used to complicate each convolutional layer of a given CNN during training. It draws on the advantages of deep matrix factorization with the regularization effect and further improves singular value dynamics by introducing batch normalization and dense connections to alleviate network degradation. At inference time, DR-Block can be equivalently reparameterized back into a single convolution for deployment. Furthermore, we empirically demonstrate the role of each design in DR-Block and explicitly reveal its inherent mechanism, which lies in enhancing the movement of large singular values while countering the attenuation of small ones. This helps enhance the interpretability of SR techniques. Experiments illustrate that DR-Block is an impressive alternative for a regular convolution layer of any structure and outperforms the existing SR methods in improving mainstream network architectures on various visual tasks. The code is available at https://github.com/qyan0131/DRBlock.
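
The idea of training with a factorized structure and deploying a single merged operator can be illustrated with a minimal linear sketch. It is not DR-Block itself: batch normalization, dense connections, and the convolutional form are all omitted, and the three small matrices are arbitrary values chosen for illustration. The sketch only shows that a chain of linear factors with no nonlinearity in between is exactly equivalent to one merged matrix.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply(w, x):
    """Apply matrix w to a vector x given as a flat list."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# Training-time structure: three stacked linear factors (illustrative values).
w1 = [[1.0, 2.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [1.0, 1.0]]
w3 = [[2.0, 1.0], [0.0, 3.0]]

# Inference-time structure: one merged matrix W = w3 @ w2 @ w1.
merged = matmul(w3, matmul(w2, w1))

x = [3.0, -1.0]
train_out = apply(w3, apply(w2, apply(w1, x)))  # output of the factorized chain
infer_out = apply(merged, x)                    # output of the merged operator
```

Both paths give the same output for any input, so the extra factors add no inference cost; any benefit comes only from how the factorization changes the training dynamics.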

Abstract:
Weakly Supervised Incremental Semantic Segmentation (WISS) aims to enable deep neural networks to incrementally learn new classes using only image-level labels without catastrophic forgetting. Although WISS eliminates the need for costly and time-consuming pixel-by-pixel annotations, the image-level labels cannot provide details about the location of new classes, resulting in inferior performance. To address this issue, we take inspiration from zero-shot learning to model the inter-class semantic relation utilizing class names as text prompts, thereby facilitating knowledge transfer between classes. However, some class names of the segmentation datasets are polysemous. Thus, we design a new prompt template to better capture the semantic relation by appending synonyms and definitions of the corresponding classes. Guided by this semantic relation, we propose semantic relation weighted distillation to transfer the knowledge from old to new classes, significantly improving plasticity while reducing forgetting. Additionally, we introduce a novel superclass-level distillation aimed at preserving shared global knowledge within the superclass, further alleviating catastrophic forgetting. We extensively evaluate our method by integrating it into state-of-the-art WISS approaches on Pascal VOC and COCO datasets. We observe consistent gains in performance across diverse experimental scenarios. Code is available at https://github.com/Magic-Nova77/PGSD.

Abstract:
Robust model fitting plays a critical role in artificial intelligence and computer vision, with its performance depending primarily on the sampling algorithm used. However, existing sampling algorithms become less effective when initial correspondences between two images are corrupted by a large number of outliers, especially in the presence of multi-structure data. In this paper, we propose a novel sampling algorithm (called SPGSC) for robust model fitting, where minimal subsets are sampled with the guidance of the second-order proximity measure, which involves global geometric relationships instead of local consistency relationships. Specifically, we first propose a second-order proximity measure to facilitate graph construction, which helps detect a potential inlier from input data as the first datum (i.e., the seed datum) of a minimal subset. After that, we propose a second-order proximity based initial minimal subset generation strategy, which is able to choose a certain number of minimal subsets by the seed data for efficiently producing significant model hypotheses. Furthermore, to achieve better fitting performance, we propose a maximum spanning tree based refinement (MSTR) strategy, which is used to refine the previously sampled minimal subsets and improve the effectiveness and efficiency of the sampling process. Experimental results on three vision tasks (i.e., two-view based motion segmentation, affine matrix based segmentation, and 3D motion segmentation) show the superiority of the proposed SPGSC in comparison with other state-of-the-art algorithms.

Abstract:
Few-shot point cloud semantic segmentation is a challenging problem that aims to use only a limited number of labeled point clouds to recognize novel samples, significantly reducing the cost of manual annotation of point clouds. An effective solution is based on deep metric learning, by projecting points into an embedding space and constructing prototypes to calculate distance or similarity to query features. Owing to the intricate geometric attributes of point clouds, there is a substantial discrepancy between the distribution of query and support features, making the model incapable of distinguishing foreground and background points under similar geometric features. In this paper, we propose an efficient query-support feature alignment method for few-shot point cloud semantic segmentation. We incorporate the support (query) features containing interactive relationships into the query (support) features, effectively shortening the spatial distance between query and support samples. To further utilize the limited relationships and avoid introducing too much computation, we adopt the query point cloud prediction results as supervised information to reversely predict the category of the support prototypes. Extensive experiments on the S3DIS and ScanNet datasets show that our proposed methods significantly surpass the state-of-the-art methods under several N-way K-shot settings. The project is available at https://github.com/miniflash/SQFI.

Abstract:
Image deraining is a hot research topic, which aims to remove various rain streaks (raindrops) from rainy images and restore the backgrounds. Though image deraining has been extensively studied in recent years, few methods are able to effectively and efficiently derain real-world high-resolution rainy images. In general, existing image deraining methods are restricted by two main factors while processing high-resolution images. First, the computational complexity and memory usage of existing deep learning-based methods are high when deraining high-resolution images. Second, as the image resolution increases, it is difficult to simultaneously extract and aggregate both global and local features for clean rain removal. In this paper, we propose a novel network, called Global-Local Grafting Fusion Network (GLGFN), for deraining real-world high-resolution images. Our GLGFN utilizes a staggered connection structure to achieve deeper sampling depth while maintaining low computational cost. It adopts the Transformer and CNN based encoders (backbones) to extract global and local features, respectively, and then grafts global features into local features to guide the extraction of rain streaks. In addition, to fuse global and local features well, we also propose a Grafting Fusion Module (GFM), which adopts Cross Sparse Attention (CSA) and Selective Kernel Fusion (SK Fusion) to efficiently aggregate global and local features. Extensive experiments conducted on several high-resolution real rainy datasets have demonstrated the effectiveness and efficiency of our proposed GLGFN. We will release our code and dataset.

Abstract:
The risk of malicious exploitation of advanced image steganography necessitates the removal of hidden information from images. However, it is crucial to preserve the visual quality of the images being processed. This paper proposes a geometrical attack in frequency domain (GAF) to address this challenge. GAF employs a thin plate spline (TPS) to slightly geometrically perturb the frequency components of the stego image. It incorporates a channel weight estimator and a frequency jammer. The channel weight estimator assigns a perturbation strength to each DCT channel, while the frequency jammer performs the TPS transform on the DCT channels using the assigned perturbation strengths. Experimental results demonstrate that the proposed approach effectively hinders secret image recovery with little distortion to the stego images. Furthermore, it preserves the visual quality of clean images that do not contain secret information.

Abstract:
Temporal sentence grounding in video has witnessed significant advancements, but suffers from substantial dataset bias, which undermines its generalization ability. Existing debias approaches primarily concentrate on well-known distribution and linguistic biases, while overlooking the relationship among different biases, limiting their debias capability. In this work, we delve into the existence of visual bias and combinatorial bias in the widely used datasets, and introduce a collaborative debias structure that can be seamlessly integrated into present methods. It encompasses four low-capacity models, a re-label module, and a main model. Each biased model deliberately leverages bias as shortcut information to accurately perform grounding, achieved by customizing the appropriate model structure and input data format to align with the bias characteristics. During the training phase, the gradient descent direction for optimizing the main model should align with the negative gradient descent direction of the biased model that is optimized by utilizing ground truth labels. Subsequently, the re-label module introduces a gradient aggregation function, consolidating the gradient descent direction from these biased models and constructing new labels to compel the main model to effectively capture multi-modality alignment features instead of relying on shortcut contents for grounding. Finally, we design two debias structures, P-Debias and C-Debias, to exploit the independence and inclusion relationships between different types of biases. Extensive experiments on multiple span-based models over Charades-CD and ActivityNet-CD demonstrate the exceptional debias capability of our strategy (https://github.com/qzhb/CDS).

Abstract:
Weakly-supervised image segmentation (WSIS) is a fundamental task in the domain of computer vision that relies on image-level class labels. While multi-stage training procedures have been widely used in existing WSIS methods to obtain high-quality pseudo-masks as ground-truth, resulting in significant progress, single-stage WSIS methods have recently gained attention due to their potential for simplifying the training procedure. However, single-stage methods suffer from low-quality pseudo-masks that limit their practical applications. To address this problem, this paper proposes a novel single-stage WSIS method that utilizes a siamese network with contrastive learning to improve the quality of class activation maps (CAMs) and achieve a self-refinement result. The proposed method employs a cross-representation refinement method that expands reliable object regions by utilizing different feature representations from the backbone. Besides, a cross-transform regularization module is introduced that learns robust class prototypes for contrastive learning and captures global context information to feed back rough CAMs, thereby improving the quality of CAMs. The final high-quality CAMs are used as pseudo-masks to supervise the segmentation result. Experimental results on the PASCAL VOC 2012 and COCO datasets demonstrate that the proposed method significantly outperforms other state-of-the-art methods, achieving 72.38% and 72.95% mIoU on the PASCAL VOC 2012 val and test sets, respectively, and 42.51% mIoU on the COCO val set. Furthermore, the proposed method has been extended to weakly supervised object localization, and experimental results demonstrate that it continues to achieve very competitive results. The source code has been released at https://github.com/ChunyanWang1/RTC.

Abstract:
Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.

Abstract:
Deep learning systems typically suffer from catastrophic forgetting of old knowledge when learning from new data continually. Recently, various class incremental learning (CIL) methods have been proposed to address this issue, and some approaches achieve promising performance by rehearsing the training data of previous tasks. However, storing data from previous tasks raises data privacy and memory issues in real-world applications. In this paper, we propose a statistical sampling adaptation method for efficient Exemplar-Free Class-Incremental Learning (EFCIL). Here, instead of preserving the images or features of previous tasks and classes, we store image feature statistics from previous classes to maintain the decision boundary, which is memory-efficient and more semantically representative. When utilizing the old-class feature statistics, we build a statistical feature adaptation network (SFAN) with a manifold consistency regularization and then train it in a transductive learning paradigm, which can map the outdated statistics onto the current feature space to facilitate subsequent training of a compatible and balanced classifier. In this way, the final classifier can be jointly optimized with all the old-class features projected by SFAN and current new-class features, thus alleviating the classification bias problem in EFCIL. Experimental results demonstrate the effectiveness of the proposed method, which achieves performance superior to state-of-the-art approaches. Our source code is released at https://github.com/yxzhcv/ESSA-EFCIL.
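
The idea of keeping feature statistics in place of exemplars can be sketched as follows. This is only an illustration under stated assumptions, not the method of the paper: the abstract does not say which statistics are stored, so the per-dimension mean and standard deviation and the axis-aligned Gaussian sampling are assumptions, the function names are hypothetical, and the SFAN adaptation step is not modeled.

```python
import random
import statistics


def class_statistics(features):
    """Per-dimension mean and population standard deviation of the
    feature vectors of one class (a list of equal-length lists)."""
    dims = list(zip(*features))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds


def sample_features(means, stds, n, rng):
    """Draw n pseudo-features from an axis-aligned Gaussian
    (an assumed distribution, chosen for illustration)."""
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n)]


# Toy old-class features; only the statistics are kept afterwards.
old_class_features = [[1.0, 4.0], [3.0, 8.0]]
means, stds = class_statistics(old_class_features)

# Later, pseudo-features are sampled to train the classifier
# alongside real features of the new classes.
rng = random.Random(0)
replayed = sample_features(means, stds, n=5, rng=rng)
```

Storage per class is then two vectors of the feature dimension, independent of how many images the class contained.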

Abstract:
Targeted Attacks on Object Detection (TAOD) aim to deceive the victim detector into recognizing a specific instance as the predefined target category while minimizing the changes to the predicted bounding box of that instance. Yet, this kind of flexible attack paradigm, which is capable of manipulating the decision outcome of the victim detector, has received limited attention, especially in the context of attacking object detection in optical remote sensing images, where relevant research is lacking. To fill this gap, this paper concentrates on TAOD in optical remote sensing images, and addresses a fundamental question: how to deploy TAOD via the raw predictions (the predictions before non-maximum suppression) of a victim detector. In this regard, we depart from the widely adopted task-independent importance measurements and hard-weighted ensemble optimization schemes present in existing methods. Instead, we first define the task-specific importance score, which considers both the qualities and the attack costs of predictions. Further, we propose the Task-Specific Importance-Aware Candidate Predictions Selection Scheme (TSIA-CPSS) alongside the Soft-Weighted Ensemble Optimization Scheme (SW-EOS). A total of eleven detectors on DIOR and DOTA, two commonly employed benchmarks, are included to comprehensively evaluate our approach. Furthermore, we show that our approach is not only effective for vanilla TAOD, but also generalizes to extended scenarios, which encompass random TAOD, TAOD on oriented object detection, and targeted patch attacks, highlighting the noteworthy potential of our approach. Our code will be released on GitHub.

Abstract:
Cross-view geo-localization seeks to match geographic locations using images from varied sources, including drones and satellites. Interpreting images captured by drones poses significant challenges due to the varying positions and scales resulting from the camera’s aerial perspective. Traditional approaches have primarily focused on harnessing contextual cues, which may lead to overfitting. Therefore, it is crucial to find an optimal balance between leveraging contextual details and identifying relevant features. To address this, we introduce a novel method for cross-view geo-localization that employs counterfactual causal reasoning (CCR). This method aims to refine the model’s focus, ensuring a balanced emphasis on both the intricate details of the target structure and its broader contextual environment. Our method incorporates an Adaptive Dimension Interaction Block (ADIB), which effectively discerns feature interactions across multiple dimensions, enhanced by counterfactual causal reasoning to improve recognition of target structures and their contexts. In tasks of image-based drone-view target localization and drone navigation, our method achieves superior performance on the University-1652 and SUES-200 benchmark datasets. The code and model files will be made available at https://github.com/Cyberpunk1998/CCR.

Abstract:
Image-level weakly supervised semantic segmentation (WSSS) has received substantial attention due to its cost-effective annotation process. In WSSS, Class Activation Maps (CAMs) generated via classifier weights tend to focus on the most discriminative region, while the CAMs derived from class prototypes are significantly enhanced to cover more complete regions. However, the prototype CAMs still exhibit limitations such as incomplete localization maps on target objects and the presence of background noise. In this paper, we propose a novel WSSS framework called Classifier-Prototype Mutual Calibration (CPMC) that leverages the characteristics of both classifier and prototype CAMs to address the above issues. Specifically, an iterative refinement strategy based on context feature dependency is applied to refine the original classifier CAMs, which helps to generate improved prototype CAMs. Subsequently, local prototypes are constructed based on the false negative regions and false positive regions extracted from the previous two CAMs, which contribute to completing missing parts of the target object and suppressing background noise respectively. Therefore, CPMC can alleviate the aforementioned issues. Extensive experimental results on standard WSSS benchmarks (PASCAL VOC and MS COCO) show that our method significantly improves the quality of CAMs and achieves state-of-the-art performance. Our source code will be released.

Abstract:
Question-Driven Sign Language Translation (QSLT) addresses the challenge of translating sign language using pertinent questions in question-answering contexts. However, the pronounced modality complexity between question text and sign video poses a predicament: the model tends to overly depend on questions to generate translations, thereby neglecting the value of visual cues. To tackle this issue, the paper presents a Gloss-Bridged Translator (GBT), which introduces sign gloss as an intermediary conduit to establish semantic connections between questions and videos. By leveraging gloss, visual features are transformed into textual counterparts, mitigating the modality imbalance between these representations. Moreover, a cross-modal contrastive learning strategy is implemented, bolstering the global contextual relevance and local semantic alignment between questions and sign language. The proposed methodology is validated through extensive experiments on the proposed QSL dataset and other public sign language datasets. The results show the efficacy of integrating questions into sign language translation. The GBT yields remarkable improvements over prevailing SLT methods, attesting to its effectiveness and rationale. Our code and dataset are available at https://github.com/glq-1992/QSL.

Abstract:
Images captured in low-light conditions usually suffer from degradation problems. Recently, numerous deep learning-based methods have been proposed for low-light image enhancement. They either focus on performance improvement while neglecting computational complexity, or are extremely computationally efficient networks with poor performance. In this work, we seek a solution that strikes a balance between computational cost and performance. Moreover, we observe that different regions of an image contain different amounts of information, and that a region with less information is easier to restore than one with more information. Hence, we propose to crop a low-light image into patches and classify these patches into “simple”, “medium” and “hard” categories based on the information they contain. We then enhance the different patch categories with networks of different complexity; accordingly, a Category-specific Processing Network (CSPN) is proposed to balance computational complexity and performance. The patch classification is implemented by the proposed Grey-Level Co-occurrence Matrix (GLCM) entropy-based algorithm, which measures the content complexity of an image by analyzing the statistics of the difference between pixels. As the frequency domain contains exclusive feature information that is beneficial for improving image quality, the wavelet transform is introduced during the enhancement. Extensive experimental results demonstrate the superiority of our proposed CSPN over other state-of-the-art methods on various datasets with the least amount of computational cost.
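
The patch classification step described above can be illustrated with a minimal sketch. It computes the textbook Shannon entropy of a normalized GLCM and buckets patches by two thresholds. The number of grey levels, the pixel offset, the threshold values, and the names `glcm_entropy` and `classify_patch` are all assumptions made for illustration; the abstract does not specify them, and the authors' algorithm may differ.

```python
import math
from collections import Counter


def glcm_entropy(patch, levels=8, offset=(0, 1)):
    """Shannon entropy (in bits) of a normalized grey-level co-occurrence matrix.

    patch:  2D list of integer intensities in [0, 255].
    levels: number of quantized grey levels (assumed value).
    offset: (dy, dx) displacement between paired pixels (assumed: right neighbour).
    """
    dy, dx = offset
    h, w = len(patch), len(patch[0])
    # Quantize intensities so the co-occurrence matrix stays small.
    q = [[v * levels // 256 for v in row] for row in patch]
    counts = Counter()
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[(q[y][x], q[y2][x2])] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify_patch(patch, t_simple=0.5, t_hard=2.0):
    """Bucket a patch by GLCM entropy; both thresholds are hypothetical."""
    e = glcm_entropy(patch)
    if e < t_simple:
        return "simple"
    if e < t_hard:
        return "medium"
    return "hard"
```

Under these assumptions, a flat patch has zero entropy and would be routed to the lightest branch, while a patch with many distinct neighbouring grey-level pairs would be routed to the heaviest one.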

Abstract:
Mirror detection aims to discover mirror regions in images to avoid misidentifying reflected objects. Existing methods mainly mine clues from the spatial domain. We observe that the frequencies inside and outside the mirror region are distinctive. Besides, the low-frequency components, which represent feature semantics, can help to locate the mirror region, and the high-frequency components, which represent details, can refine it. Motivated by this, we introduce frequency guidance and propose the dual domain perception progressive refinement network (DPRNet) to mine dual-domain information. Specifically, we first decouple the images into high-frequency and low-frequency components using a Laplacian pyramid and a vision Transformer, respectively, and design the frequency interaction alignment (FIA) module to integrate frequency features to initially localize the mirror region. To handle scale variations, we propose the multi-order feature perception (MOFP) module to adaptively aggregate adjacent features with progressive and gating mechanisms. We further propose the separation-based difference fusion (SDF) module to establish associations between entities and their mirror images and discover the correct boundary to mine the complete mirror region. Extensive experiments show that DPRNet outperforms the state-of-the-art method by an average of 3% with only about one-fifth of the parameters and FLOPs on four datasets. Our DPRNet also achieves promising performance on remote sensing and camouflage scenarios, validating its generalization. The code is available at https://github.com/winter-flow/DPRNet.

Abstract:
Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
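
The hybrid formed by the soft mask can be sketched as follows. This is an assumed formulation, not taken from the abstract: the signal passed to the conditional codec is taken to be x_t - m * x_c, where x_c stands for a prediction of the current frame and m is a per-pixel mask in [0, 1]. With m = 0 the codec sees the frame itself (conditional coding), and with m = 1 it sees the residual (conditional residual coding). The function names are hypothetical.

```python
def masked_residual(x_t, x_c, mask):
    """Per-pixel hybrid input x_t - m * x_c (assumed formulation).

    x_t:  current frame, 2D list of floats.
    x_c:  prediction of the current frame, same shape (assumed available).
    mask: soft mask with values in [0, 1], same shape.
    """
    return [
        [xt - m * xc for xt, xc, m in zip(row_t, row_c, row_m)]
        for row_t, row_c, row_m in zip(x_t, x_c, mask)
    ]


def reconstruct(r_hat, x_c, mask):
    """Invert masked_residual on the decoder side: x_hat = r_hat + m * x_c."""
    return [
        [r + m * xc for r, xc, m in zip(row_r, row_c, row_m)]
        for row_r, row_c, row_m in zip(r_hat, x_c, mask)
    ]
```

In this sketch a mask value of 0 at a pixel (for example, in a dis-occluded region where the prediction is unreliable) leaves the pixel untouched, while a value of 1 subtracts the prediction entirely.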

Abstract:
The use of social media networks and mobile devices has experienced tremendous growth in recent years. This has led to a surge in the number of videos recorded and uploaded to social media platforms like TikTok and YouTube. However, this increase has also resulted in the rise of illegal duplicate videos, which are essentially the same as the original videos but with minor editing effects and variations in coding. In addition, the large number of duplicate videos is a major storage and communication efficiency issue. The task of finding duplicate videos from a large repository is referred to as video deduplication. Video deduplication is a crucial task for applications like saving storage space and detecting copyright infringement. This work proposes a fast and robust location-aware video deduplication system capable of retrieving duplicate videos from a large repository extremely quickly. In addition, the proposed system has the ability to find the precise location of the query video in the retrieved videos. To identify and localize short video clips against large video repositories, we utilize robust image-level features from keypoint aggregation and deep learning along with an efficient KNN search of query frames with a multiple k-d tree setup, giving us a set of candidate video clips. Then, a fast temporal consistency pruning algorithm efficiently re-ranks the clip-level candidates and identifies the matching clip along with its temporal location in a sequence. The system was tested on 1 million frame/145 hour and 4.5 million frame/636 hour repositories generated via the large-scale FIVR-200K and VCSL datasets, respectively. The proposed system achieves a recall of 98.8% and 94.1% for the FIVR-200K and VCSL datasets, respectively. A query frame can be searched in as little as 83.96 ms and 462.59 ms from a 1 million frame/145 hour and a 4.5 million frame/636 hour repository, respectively. These experimental results demonstrate that our system is highly accurate and that the time consumption is extremely low for retrieving video along with its timestamp information from large-scale repositories.
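
The idea of re-ranking frame-level candidates by temporal consistency can be sketched with a simple offset-voting scheme. This is a stand-in technique chosen for illustration, not the authors' pruning algorithm. It assumes that the per-frame nearest neighbours have already been retrieved (for instance by a k-d tree search), that the query and repository share a frame rate, and that `localize_clip` and its input layout are hypothetical.

```python
from collections import Counter


def localize_clip(matches):
    """Pick the (video, start offset) that the most query frames agree on.

    matches[i] is a list of (video_id, frame_index) nearest neighbours
    for query frame i. Returns (video_id, start_frame, support), where
    support is the fraction of query frames voting for the winner,
    or None if there are no candidates at all.
    """
    votes = Counter()
    for i, neighbours in enumerate(matches):
        # A true match at repository frame f for query frame i implies the
        # clip starts at f - i; count each hypothesis once per query frame.
        for hypothesis in {(video_id, frame_idx - i) for video_id, frame_idx in neighbours}:
            votes[hypothesis] += 1
    if not votes:
        return None
    (video_id, start), score = max(votes.items(), key=lambda kv: kv[1])
    return video_id, start, score / len(matches)
```

Spurious neighbours tend to scatter across many offsets, whereas frames of a genuine duplicate pile up on a single (video, offset) pair, which also gives the temporal location of the clip.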

Abstract:
Recently, the asymmetric cost-based steganographic method using generative adversarial networks has achieved significant success. This highlights the substantial potential of deep learning-based asymmetric cost generation methods over traditional methods reliant on cost enhancement. However, the current frameworks for asymmetric cost learning ignore the correlation between positive and negative embedding costs, resulting in imbalanced asymmetric embedding costs. This can cause scattered modified pixels or even anomalous modified pixels in the stego image, thereby reducing steganographic security. In this paper, we propose a novel asymmetric steganographic cost learning framework, termed Steganographic embedding Cost Generation and Modulation (SCGM), to ensure a balance between asymmetric embedding costs by maintaining the correlation and therefore improve steganographic security. In our framework, we initially train a policy network to produce symmetric costs and subsequently use an adaptive modulation module we designed to achieve asymmetry. The modulation module facilitates the adaptive transformation of learned symmetric costs into asymmetric costs by autonomously learning modulation proportions during adversarial training with steganalysis. Moreover, we develop distinct adversarial loss functions for both the symmetric cost generation and the asymmetric cost modulation phases to further enhance steganographic security. Extensive experimental results have demonstrated that SCGM attains state-of-the-art performance in steganographic security, with an average error rate across steganalyzers that exceeds the existing best asymmetric cost-based steganography method by 2.77%.

Abstract:
Despite its recent advancements, multi-object tracking (MOT), one of the major research areas in video technology, still faces various challenges, including severe occlusion and diversity of tracking targets. In this paper, we introduce a novel strategy, Temporal Feature Mix (TFM), that can improve the overall robustness of multi-object trackers in diverse scenarios. More specifically, our approach simulates new and challenging scenes that can train networks to better localize the targets by blending high-level features from temporally adjacent frames, based on the insights that high-level features are mainly activated on salient targets and that targets in adjacent frames are located close to each other. Therefore, our TFM can offer novel and diversified training experiences to the networks, achieved through the intensive augmentation of the high-level features of each target. As a result, our approach demonstrates notable performance improvement on three major MOT benchmarks and a newly constructed corruption dataset for MOT, underscoring its potential to enhance the robustness of MOT systems in real-world scenarios. All related source code is released at https://github.com/kamkyu94/Temporal_Feature_Mix.
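
The blending of features from adjacent frames can be pictured with a minimal sketch. It uses a plain convex combination of two feature vectors, which is an assumption made for illustration; the abstract does not give the mixing rule, and the released code may blend differently. The function name, the flattened-list representation, and the sampling range for the mixing weight are all hypothetical.

```python
import random


def temporal_feature_mix(feat_t, feat_adj, lam=None, rng=random):
    """Convexly blend features of frame t with those of an adjacent frame.

    feat_t, feat_adj: flattened high-level features (lists of floats).
    lam: weight on the current frame; if None, it is sampled from an
         assumed range that keeps the current frame dominant.
    """
    if len(feat_t) != len(feat_adj):
        raise ValueError("feature vectors must have the same length")
    if lam is None:
        lam = rng.uniform(0.5, 1.0)  # hypothetical sampling range
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(feat_t, feat_adj)]
```

Because targets move little between adjacent frames, a blend of this kind would perturb the activations around each target rather than introduce unrelated content, which is the intuition the abstract gives for the augmentation.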

Abstract:
Long-term tracking is a commonly overlooked yet practical scenario in multi-object tracking. Handling occlusion and re-identifying long-lost targets are the main challenges for effective long-term tracking. In occlusion scenarios, both appearance and motion features can be unreliable, leading to association failure. For long-lost targets, predicting their long-term motion suffers from severe error accumulation, making the target re-identification challenging. In this paper, we propose a multi-object tracker called LTTrack for long-term tracking. For occlusion handling, we develop the Position-Based Association (PBA) module, which encodes relative and absolute positions as interaction and motion features for association. With interaction features, PBA can handle occlusion scenes where appearance and motion features are unreliable. For long-lost target re-identification, the Long-Term Motion (LTM) model is devised. By encoding long-term motion trends of targets for long-term motion prediction, LTM alleviates the error accumulation problem. Moreover, to prevent the erroneous deletion of long-lost tracks, we propose the Zombie Track Re-Match (ZTRM) strategy to re-identify long-lost targets so that they will neither be prematurely deleted nor disrupt the association of other tracks. Extensive experiments conducted on MOT17, MOT20, and DanceTrack demonstrate that LTTrack achieves performance comparable to state-of-the-art methods. The code and models are available at https://github.com/Lin-Jiaping/LTTrack.

Abstract:
Recently, Point-MAE has extended Masked Autoencoders (MAE) to point clouds for 3D self-supervised learning. However, it faces two problems: (1) the shape similarity between the masked point cloud and the original point cloud is high, and (2) the pretext task of reconstructing the original point cloud is straightforward and fails to compel the network to learn deep representative features. In this paper, we tackle these problems by proposing a PatchMixing strategy and a teacher-student training framework. First, with PatchMixing, we mix selected point patches of multiple point clouds and attempt to infer the object information from the resulting mixed point cloud. Due to the interference of other objects, the task is challenging but facilitates representation learning. Second, rather than directly restoring the original point cloud, we propose a novel pretext task that involves a two-branch teacher model and a student model. These models process the multiple input point clouds in different ways (no mixing, mixing + unmixing, mixing + masking), but are expected to output similar features, thereby compelling the network to extract essential features from the input. Extensive experiments show that our well-designed PatchMixing strategy and effective teacher-student learning architecture yield impressive results. Specifically, our model achieves a remarkable 92.9% classification accuracy in the Linear SVM task on the ModelNet40 dataset. Through pre-training and fine-tuning on downstream tasks, our method achieves an 89.8% classification accuracy on the most challenging split of ScanObjectNN and an outstanding 94.0% on ModelNet40.

Abstract:
Cross-resolution person re-identification (ReID) is a challenging task that addresses the issue of matching individuals across different resolution conditions. Traditional person ReID methods often assume that images have sufficiently high resolution and overlook the practical scenarios involving low-resolution or blurry images. Existing cross-resolution ReID approaches either utilize image super-resolution techniques to improve the quality of low-resolution images or extract and learn resolution-invariant features for person representation. Although multi-task learning has been applied in ReID to integrate auxiliary tasks including attribute recognition, image super-resolution, and so on, how to incorporate the vital resolution learning task into cross-resolution ReID has rarely been explored. Therefore, we propose a novel multi-task resolution learning based ReID network named MRLReID. Our approach treats cross-resolution person ReID as the primary task and resolution estimation as an auxiliary task. Our network simultaneously learns the resolution information and person identity information of images, aiming to improve cross-resolution person ReID performance. Considering that existing simulated cross-resolution datasets are too simple to mimic unconstrained scenarios, we further employ an image degradation technique to simulate more realistic cross-resolution ReID datasets. We evaluate our method on two real-world cross-resolution datasets and two newly simulated cross-resolution datasets, and both intra-dataset and cross-dataset evaluations demonstrate the effectiveness and superiority of our method in cross-resolution person ReID. The codes and datasets are available at https://github.com/amateurbo/MRLReID.

Abstract:
Camouflaged object detection has been considered a challenging task due to its inherent similarity and interference from background noise. It requires accurate identification of targets that blend seamlessly with the environment at the pixel level. Although existing methods have achieved considerable success, they still face two key problems. The first is the difficulty in removing texture noise interference and thus obtaining accurate edge and frequency domain information, leading to poor performance when dealing with complex camouflage strategies. The second is that the fusion of multiple information obtained from auxiliary subtasks is often insufficient, leading to the introduction of new noise. To solve the first problem, we propose a frequency domain reconstruction module based on contrast learning, through which we can obtain high-confidence frequency domain components, thus enhancing the model’s ability to discriminate target objects. To solve the second problem, we design a frequency domain representation decoupling module to align and fuse features from the RGB domain and the reconstructed frequency domain. This allows us to obtain accurate edge information while resisting noise interference. Experimental results show that our method outperforms 12 state-of-the-art methods on three benchmark camouflaged object detection datasets. In addition, our method shows excellent performance in other downstream tasks such as polyp segmentation, surface defect detection, and transparent object detection.

Abstract:
Memory-based methods have substantially enhanced the precision of video object segmentation (VOS) by storing features in an expanding memory bank. However, this comes at the cost of increased computational demands and storage overhead. While recent methods have sought to alleviate this issue via compression or selection strategies, their reliance solely on history cues and simple memory structures results in precision degradation and intrinsic limitations, such as error accumulation and poor robustness. In this paper, we introduce HFVOS, an efficient yet effective framework to bolster VOS performance in both speed and precision by meticulously considering the memory design with low redundancy, high accuracy, and adaptability. First, we construct a novel hierarchical memory update pipeline with the proposed Buffered Memory Mechanism, which incorporates both future and history cues to reduce redundancy and improve the utility of memory. Second, we propose an Adaptive Dual-stream Selection Network (ADSN) to carry out the adaptive selection and drop operations of the memory update, and integrate an ADSN-based long-term memory to enhance the robustness, especially for long videos. Furthermore, to boost HFVOS, a progressive selection loss is designed to help ADSN gradually adapt to fewer features while preserving high precision. Experiments show that HFVOS achieves state-of-the-art segmentation precision and speed on both short-term datasets (DAVIS-17 val: 86.8% J&F and 33.0 FPS, DAVIS-16 val: 92.0% J&F and 42.0 FPS) and long-term datasets (LVOS val: 58.0% J&F and 37.4 FPS).

Abstract:
DeepFakes have raised serious societal concerns, leading to a great surge in detection-based forensics methods in recent years. Face forgery recognition is a standard detection method that usually follows a two-phase pipeline, i.e., it extracts the face first and then determines its authenticity by classification. While those methods perform well in ideal experimental environments, they face challenges when dealing with DeepFakes in the wild involving complex backgrounds and multiple faces of varying sizes. Moreover, most face forgery recognition methods can only process one face at a time. One straightforward way to address this issue is to process multiple faces simultaneously by integrating face extraction and forgery detection in an end-to-end fashion by adapting advanced object detection architectures. However, as these object detection architectures are designed to capture the discriminative features of different object categories rather than the subtle forgery traces among the faces, the direct adaptation suffers from limited representation ability. In this paper, we propose Contrastive Multi-FaceForensics (COMICS), an end-to-end framework for multi-face forgery detection. COMICS integrates face extraction and forgery detection in a seamless manner and adapts to the advanced object detection architectures. The core of the proposed framework is a bi-grained contrastive learning approach that explores face forgery traces at both the coarse- and fine-grained levels. Specifically, coarse-grained level contrastive learning captures the discriminative features among positive and negative proposal pairs at multiple layers produced by the proposal generator, and the fine-grained level contrastive learning captures the pixel-wise discrepancy between the forged and original areas of the same face and the pixel-wise content inconsistency among different faces. Extensive experiments on the OpenForensics and FFIW datasets demonstrate that our method outperforms other counterparts and shows great potential for being integrated into various architectures. Codes are available at https://github.com/zhangconghhh/COMICS.

Abstract:
Solving the complex challenges of sophisticated terrain and multi-scale targets in remote sensing (RS) images requires a synergistic combination of Transformer and convolutional neural network (CNN). However, crafting effective CNN architectures remains a major challenge. To address these difficulties, this study introduces the knowledge guided evolutionary Transformer for RS scene classification (Evo RSFormer). It combines adaptive evolutionary CNN (Evo CNN) with Transformers in a hybrid strategy, which pairs the fine-grained local feature extraction of CNNs with the long-range contextual dependency modeling of Transformers. Furthermore, for the development of Evo CNN blocks, this paper presents a knowledge-guided adaptive efficient multi-objective evolutionary neural architecture search (MOE2-NAS) strategy. This approach markedly reduces the manual effort associated with traditional CNN design, striking a balance between accuracy and compactness. Additionally, by transferring domain knowledge from natural scene analysis to the RS field, MOE2-NAS improves the efficiency of classical NAS. It utilizes a priori knowledge to generate promising initial solutions and constructs a surrogate model for efficient search. The effectiveness of the proposed Evo RSFormer has been rigorously tested on various benchmark RS datasets, including UC Merced, NWPU45, and AID. Empirical results strongly support the superiority of Evo RSFormer over existing methods. Furthermore, experiments on MOE2-NAS have been conducted to confirm the important role of knowledge guidance in improving the efficiency of NAS.

Abstract:
Image sequence interpolation is a critical research area in computer vision with broad applications in video frame interpolation and medical image interlayer interpolation. Traditional deep learning-based methods in this domain predominantly rely on deep convolutional neural networks (CNNs), which, despite their effectiveness, are limited by the inherent constraints of CNN architecture, impacting their interpolation accuracy. To address these limitations, we introduce the Pre-ISIformer, a parallel multi-channel adaptive image sequence interpolation network founded on pre-trained transformers. This network is composed of three integral modules: 1) a global feature extraction module, which extracts primary features from the input images using a pre-trained Swin-transformer model, ensuring comprehensive global feature coverage; 2) a feature sequence construction module, which adaptively decomposes the object’s motion path across different frames, facilitating a detailed analysis of motion dynamics; and 3) an intermediate image reconstruction module, which is responsible for accurately capturing target displacements. Furthermore, we incorporate distinct metrics for pixel loss and gradient loss to meticulously reconstruct the texture and contours of the intermediate images. Our network has been rigorously tested on various datasets for two primary applications: video frame interpolation and interlayer interpolation in medical imaging. The results from these experiments showcase the superior performance and effectiveness of the Pre-ISIformer, establishing it as a significant advancement in the field of image sequence interpolation.

Abstract:
With the astonishing development of 3D sensors, point cloud based 3D object detection is attracting increasing attention from both industry and academia, and is widely applied in various fields, such as robotics and autonomous driving. However, how to balance 3D object detection accuracy and speed is still a challenging problem. In this paper, we study this issue and propose a novel and effective 3D point cloud object detection network based on hierarchical cascaded point-voxel fusion, called HCPVF. Firstly, a novel bird’s-eye-view (BEV) attention mechanism with linear complexity is developed to improve the point cloud feature backbone network. It can be implemented easily with two cascaded linear layers and two normalization layers to mine point-to-point similarity in the BEV. This operation captures long-range dependencies and reduces the uneven sampling of sparse BEV features, making the extracted point cloud features more discriminative. Secondly, the proposed HCPVF module is equipped with a dual-level hierarchical cascaded detection head, consisting of a voxel level followed by a point level. The voxel level is composed of coarse Region of Interest (RoI) pooling and fine RoI pooling, which cooperate to aggregate voxel features from different grid divisions and predict relatively coarse detection boxes. The subsequent point level is based on a Key Points Transformer. It first encodes the spatial context information between the original points and the voxel-level boxes. Then, a novel dual-weighted decoder is developed to enhance the context interaction by weighting the channel and spatial dimensions to obtain more accurate detection results. This design utilizes the voxel-based method with high computational efficiency and the point-based method with more complete spatial information, fusing low-level voxel features and high-level point features through a hierarchical cascaded strategy. Extensive experiments demonstrate that the proposed HCPVF achieves state-of-the-art 3D detection performance while maintaining computational efficiency on both the Waymo Open Dataset and the highly competitive KITTI benchmark.

Abstract:
Fine-grained Zero-shot Learning on the large-scale dataset ImageNet21K is an important task that has promising perspectives in many real-world scenarios. One typical solution is to explicitly model the knowledge passing using a Knowledge Graph (KG) to transfer knowledge from seen to unseen instances. By analyzing the hierarchical structure and the word descriptions on ImageNet21K, we find that the noisy semantic information, the sparseness of seen classes, and the lack of supervision of unseen classes make the knowledge passing insufficient, which limits the KG-based fine-grained ZSL. To resolve this problem, in this paper, we enhance the knowledge passing from three aspects. First, we use more powerful models such as the Large Language Model and Vision-Language Model to get more reliable semantic embeddings. Then we propose a strategy that globally enhances the knowledge graph based on the convex combination relationship of the semantic embeddings. It effectively connects the edges between the non-kinship seen and unseen classes that have strong correlations while assigning an importance score to each edge. Based on the enhanced knowledge graph, we further present a novel regularizer that locally enhances the knowledge passing during training. We extensively conducted comparative evaluations to demonstrate the advantages of our method over state-of-the-art approaches.

Abstract:
Action Quality Assessment (AQA) is a task that tries to answer how well an action is carried out. While remarkable progress has been achieved, existing works on AQA assume that all the training data are visible for training at one time, but do not enable continual learning on assessing new technical actions. In this work, we address such a Continual Learning problem in AQA (Continual-AQA), which urges a unified model to learn AQA tasks sequentially without forgetting. Our idea for modeling Continual-AQA is to sequentially learn a task-consistent score-discriminative feature distribution, in which the latent features express a strong correlation with the score labels regardless of the task or action types. From this perspective, we aim to mitigate the forgetting in Continual-AQA from two aspects. Firstly, to fuse the features of new and previous data into a score-discriminative distribution, a novel Feature-Score Correlation-Aware Rehearsal is proposed to store and reuse data from previous tasks with limited memory size. Secondly, an Action General-Specific Graph is developed to learn and decouple the action-general and action-specific knowledge so that the task-consistent score-discriminative features can be better extracted across various tasks. Extensive experiments are conducted to evaluate the contributions of proposed components. The comparisons with the existing continual learning methods additionally verify the effectiveness and versatility of our approach. Data and code are available at https://github.com/iSEE-Laboratory/Continual-AQA.

Abstract:
Most existing category-level object pose estimation methods are devoted to learning the object category information from the point cloud modality. However, the scale of 3D datasets is limited due to the high cost of 3D data collection and annotation. Consequently, the category features extracted from these limited point cloud samples may not be comprehensive. This motivates us to investigate whether we can draw on knowledge of other modalities to obtain category information. Inspired by this motivation, we propose CLIPose, a novel 6D pose framework that employs a pre-trained vision-language model to better learn object category information, fully leveraging the abundant semantic knowledge in the image and text modalities. To make the 3D encoder learn category-specific features more efficiently, we align representations of the three modalities in feature space via multi-modal contrastive learning. In addition to exploiting the pre-trained knowledge of the CLIP model, we also expect it to be more sensitive to pose parameters. Therefore, we introduce a prompt tuning approach to fine-tune the image encoder while incorporating rotation and translation information in the text descriptions. CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).

Abstract:
Neural Architecture Search (NAS) aims to automatically find effective architectures within a predefined search space. However, the search space is often extremely large. As a result, directly searching in such a large search space is non-trivial and also very time-consuming. To address the above issues, in each search step, we seek to limit the search space to a small but effective subspace to boost both the search performance and search efficiency. To this end, we propose a novel Neural Architecture Search method via Dominative Subspace Mining (DSM-NAS) that finds promising architectures in automatically mined subspaces. Specifically, we first perform a global search, i.e., dominative subspace mining, to find a good subspace from a set of candidates. Then, we perform a local search within the mined subspace to find effective architectures. More critically, we further boost search performance by taking well-designed/searched architectures to initialize candidate subspaces. Experimental results demonstrate that DSM-NAS not only reduces the search cost but also discovers better architectures than state-of-the-art methods in various benchmark search spaces.

Abstract:
Linear regression, a widely-used method in representation learning, initially faced limitations in incorporating structural information within the regression space. Existing models designed to extract structural insights often prioritize the proximity of data points in feature space, while overlooking crucial interdependencies and co-occurrences among them. In response to the challenges posed by the inherent limitations, we introduce a novel representation learning model based on linear regression. This model seamlessly integrates three essential modules: flexible regression learning, graph embedding learning, and embedded block-diagonal self-representation learning. The collaborative functioning of these modules establishes a closed optimization loop. The self-representation matrix directly captures the latent graph structure across the entire data domain, without the need for setting additional parameters such as the neighborhood scale of the graph. Concurrently, it facilitates flexible regression learning by uncovering latent structural patterns. Experimental results on multiple benchmark datasets demonstrate the superiority of our approach over state-of-the-art methods, providing a more comprehensive solution for representation learning.

Abstract:
By mapping iterative optimization algorithms into neural networks (NNs), deep unfolding networks (DUNs) exhibit well-defined and interpretable structures and achieve remarkable success in the field of compressive sensing (CS). However, most existing DUNs solely rely on the image-domain unfolding, which restricts the information transmission capacity and reconstruction flexibility, leading to their loss of image details and unsatisfactory performance. To overcome these limitations, this paper develops a dual-domain optimization framework that combines the priors of (1) image- and (2) convolutional-coding-domains and offers generality to CS and other inverse imaging tasks. By converting this optimization framework into deep NN structures, we present a Dual-Domain Deep Convolutional Coding Network (D3C2-Net), which enjoys the ability to efficiently transmit high-capacity self-adaptive convolutional features across all its unfolded stages. Our theoretical analyses and experiments on simulated and real captured data, covering 2D and 3D natural, medical, and scientific signals, demonstrate the effectiveness, practicality, superior performance, and generalization ability of our method over other competing approaches and its significant potential in achieving a balance among accuracy, complexity, and interpretability. Code is available at https://github.com/lwq20020127/D3C2-Net.

Abstract:
A challenging task in embodied artificial intelligence is enabling the robot to carry out a navigational task following natural language instruction. In the task, the navigator needs to understand objects, directions, as well as room types, which serve as landmarks for navigation. Although it is easy to encode objects and directions with an external encoder like an object detector, current navigators struggle to encode room type information properly due to the low accuracy offered by existing classifiers. This inadequacy poses confusion that navigators find difficult to overcome. Even humans may sometimes fail to determine the exact type of a room since multiple room types may exist in one panorama. To mitigate this problem, we propose to encode room type information in a prompt manner. Specifically, we first establish multi-modal, learnable prompt pools containing knowledge of room types. By querying the prompt pools, the navigator can obtain room-type prompts of the current view, and incorporate them into the navigator using a prompt-based learning method. Experimental results on the REVERIE, R2R and SOON datasets demonstrate the effectiveness of our approach.

Abstract:
Multi-scale features are crucial in encoding objects with varying scales in vision tasks. The classic top-down and bottom-up feature pyramid networks are a common strategy for multi-scale feature extraction. However, these approaches suffer from the loss or degradation of feature information, which impairs the fusion effect of non-adjacent levels. In this paper, we propose an Asymptotic Feature Pyramid Network (AFPN) that supports direct interaction between non-adjacent levels. AFPN starts by fusing two adjacent low-level features and asymptotically incorporates higher-level features into the fusion process. This fusion strategy avoids the significant semantic gap between non-adjacent levels. An adaptive spatial fusion operation is further used to mitigate potential multi-object information conflicts during feature fusion at each spatial location. To reduce parameters, computational requirements, and inference time, we propose a Lightweight Asymptotic Feature Pyramid Network (LightAFPN) that uses the concept of reparametrization. We evaluate the proposed method on the MS-COCO 2017, PASCAL VOC and Cityscapes datasets in both object detection and semantic segmentation frameworks. Experimental evaluation shows that our method achieves more competitive results than other state-of-the-art feature pyramid networks. The code is available at https://github.com/gyyang23/AFPN.
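
The adaptive spatial fusion step mentioned above can be illustrated with a minimal sketch. It assumes that each pyramid level contributes one value per spatial location and that fusion weights come from a softmax over per-level logits at that location. In a real network these logits would be predicted by learned layers; here they are passed in directly, and the names `softmax` and `adaptive_spatial_fusion` are hypothetical helpers, not taken from the AFPN code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_spatial_fusion(features, logits):
    """Fuse per-level feature grids with location-wise softmax weights.

    features, logits: lists (one entry per level) of H x W nested lists.
    Assumption: all levels were already resized to the same H x W.
    """
    n_levels = len(features)
    h, w = len(features[0]), len(features[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # One weight per level at this location; the weights sum to 1.
            weights = softmax([logits[l][i][j] for l in range(n_levels)])
            fused[i][j] = sum(
                weights[l] * features[l][i][j] for l in range(n_levels)
            )
    return fused
```

With equal logits the result is a plain average of the levels; a much larger logit for one level lets that level dominate at that location, which is how conflicting information from other levels can be suppressed.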

Abstract:
Gaze estimation can be applied in various scenarios, seeking to comprehend human visual attention through camera images. Contemporary research predominantly employs deep learning to directly output gaze from facial or ocular images. However, most methods concentrate solely on estimating gaze direction, overlooking gaze point. We propose two multitask learning frameworks for estimating gaze point and gaze direction, with the objective of achieving unsupervised learning of gaze point and supervised gaze estimation via gaze intersection. Two attention layers are proposed to guide the generation of facial features, addressing the challenge posed by unlabeled gaze point. The focus attention layer employs the eyes to guide facial features, connecting both features and utilizing similarity to enhance eye information. Another approach utilizes only the full face image, employing self-attention to enhance pertinent information. Four loss functions are employed to constrain networks in 2D and 3D spaces. The combination of eye position constraints and attention layers ensures the accuracy of gaze point prediction. Gaze intersection can be used to obtain gaze depth, thereby solving the problem of depth-overlapping. The advantages of the proposed method in gaze tracking are verified through comprehensive experiments.

Abstract:
Few-shot font generation (FFG) aims to preserve the underlying global structure of the original character while generating target fonts by referring to a few samples. It has been applied to font library creation, a personalized signature, and other scenarios. Existing FFG methods explicitly disentangle content and style of reference glyphs universally or component-wisely. However, they ignore the difference between glyphs in different styles and the similarity of glyphs in the same style, which results in artifacts such as local distortions and style inconsistency. To address this issue, we propose a novel font generation approach by learning the Difference between different styles and the Similarity of the same style (DS-Font). We introduce contrastive learning to consider the positive and negative relationship between styles. Specifically, we propose a multi-layer style projector (MSP) for style encoding and realize a distinctive style representation via our proposed Cluster-level Contrastive Style (CCS) loss. The MSP module is employed to assist the generator during training to enhance the style consistency between the generated glyph and the reference glyphs. In addition, we design a glyph-independent patch discriminator, which comprehensively considers different areas of the image and ensures that each style can be distinguished independently. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results than state-of-the-art methods.

Affiliations: Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiao Tong University, Shanghai, China; Orthodontics Department, Ninth People’s Hospital, Shanghai Jiao Tong University, Shanghai, China; Institute of Image Processing and Pattern Recognition, the Department of Automation, Ningbo Artificial Intelligence Institute, Shanghai Jiao Tong University, Shanghai, China; School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Orthodontics, Ninth People’s Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China

Abstract:
Achieving accurate face reconstruction with geometry details from a single-view image is an important task for orthodontics. Although the 3D Morphable Model (3DMM) based methods provide an effective framework, the low-dimensional linear space is insufficient to cover geometric details. In this paper, we propose a hybrid shape deformation representation with multi-branch supervision for detail prediction. In orthodontic scenarios, shape deformation can be considered as the aggregation of an intuitive appearance component and an ambiguous geometry component, which are involved in frontal shape and depth correction, respectively. Hence, orthogonal decomposition is employed to decompose the shape deformation into a frontal-plane position offset and a depth offset. The frontal-plane position offset is represented in an explicit-local-dependent manner based on grid deformation, while the depth offset is represented in an implicit-local-dependent manner based on dense prediction. To facilitate orthodontic-based evaluation, we construct an orthodontic-specific dataset and design a novel metric to involve the relative position dependency between regions of interest. Experimentally, we demonstrate outstanding performance of face reconstruction on FaceScape, MICC Florence and the orthodontic-specific dataset with both quantitative and qualitative evaluation.

Abstract:
Remote sensing (RS) scene classification based on deep neural networks (DNNs) has recently drawn remarkable attention. However, DNNs contain a great number of parameters and incur high computational costs, which makes them hard to deploy on edge devices such as onboard embedded systems. To address this issue, in this paper, we propose a target-aware knowledge distillation (TAKD) method for RS scene classification. By considering the characteristics of the target and background regions of the RS images, the TAKD can adaptively distill the knowledge from the teacher model to create a lightweight student model. Specifically, we first introduce a target extraction module that utilizes heatmaps to highlight target regions on the teacher’s feature maps. Next, we propose an adaptive fusion module that aggregates these heatmaps to capture objects with varying scales. Finally, we design a target-aware loss that enables the transfer of knowledge in the target regions from the teacher model to the student model, greatly reducing background disturbance. Our distillation scheme, which requires no extra learnable parameters, is both simple and effective, significantly improving the accuracy of the student model without any additional computational or resource costs. Our experiments on three benchmark datasets demonstrate that our proposed TAKD outperforms the existing state-of-the-art distillation methods.
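
The idea of a target-aware loss can be illustrated with a small sketch. It is not the TAKD loss itself, whose exact form is not given in the abstract; it assumes a heatmap-weighted mean squared error between teacher and student feature maps, so that locations with zero heatmap weight (background) do not contribute. The function name `target_aware_mse` and the normalization by the heatmap sum are assumptions made for illustration.

```python
def target_aware_mse(teacher, student, heatmap):
    """Heatmap-weighted MSE between two H x W feature grids.

    teacher, student, heatmap: nested lists of floats with equal shape.
    Assumption: heatmap values are non-negative, larger on target regions.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for t_row, s_row, h_row in zip(teacher, student, heatmap):
        for t, s, h in zip(t_row, s_row, h_row):
            weighted_error += h * (t - s) ** 2
            total_weight += h
    # With an all-zero heatmap there is no target region to distill from.
    return weighted_error / total_weight if total_weight > 0 else 0.0
```

Because the heatmap is computed from existing features rather than learned, a loss of this shape adds no trainable parameters, in line with the claim above.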

Abstract:
Anomaly detection is an area of video analysis and plays an increasing role in ensuring safety, preventing risks, and guaranteeing quick response in intelligent surveillance systems. It has become a popular research topic and has piqued the interest of researchers in different communities, such as computer vision, machine learning, remote sensing, and data mining, in recent years. This promotes novel mobile systems where drones are equipped with cameras to help people find better and more efficient solutions to automatically detect anomalies (e.g., car accidents, traffic congestion, street fighting) in traffic surveillance videos. However, anomaly detection methods are still rarely studied and developed in the remote sensing community because anomalous events rarely occur in real life, and because of the high similarities between small objects of interest, multi-scale objects, complex backgrounds with great variations, and high overlap between objects. Therefore, in order to fully exploit the spatio-temporal information for anomaly detection in traffic surveillance circumstances, we propose a future frame prediction network based on transformer architectures to detect abnormal events from drone videography in an unsupervised way. Our model takes consecutive video frames from an input clip and feeds their features to a transformer encoder to capture spatial and temporal representations from the sequence. Then, it leverages a decoder to predict the next frame. Furthermore, an event with high reconstruction error is identified as an anomaly in the test phase. Thorough empirical studies demonstrate that our method achieves superior performance on the UIT-ADrone dataset and largely outperforms the state-of-the-art anomaly methods on the Drone-Anomaly dataset in aerial surveillance. The source code is available online at https://github.com/Tungufm/ASTT.
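
The test-phase rule above (high error between the predicted and the actual next frame indicates an anomaly) can be sketched as follows. The sketch assumes mean squared error as the error measure, min-max normalization of the errors within one video, and a fixed threshold of 0.5. These are common choices in frame-prediction anomaly detection but are assumptions here, not details taken from the paper; the function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two H x W frames given as nested lists."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def anomaly_scores(errors):
    """Min-max normalize per-frame errors to [0, 1] within one video."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All frames are equally well predicted; nothing stands out.
        return [0.0] * len(errors)
    return [(e - lo) / (hi - lo) for e in errors]


def flag_anomalies(errors, threshold=0.5):
    """Mark frames whose normalized error exceeds the (assumed) threshold."""
    return [score > threshold for score in anomaly_scores(errors)]
```

In use, `mse` would be applied to each predicted frame and its ground-truth counterpart, and the resulting list of errors passed to `flag_anomalies`.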

Abstract:
Existing wavelet pooling methods discard the high-frequency sub-bands, which can improve the noise-robustness of convolutional neural networks (CNNs) but loses essential detailed features. Besides, most of them depend on different wavelets, which is not adaptive. In this paper, a novel efficient lifting-based wavelet pooling (LWPooling) is proposed to alleviate the problems above. Firstly, wavelet pooling is rethought based on the equivalence of 2D discrete wavelet transform (DWT) and standard average pooling (SAP), which suggests the lack of detailed information in traditional wavelet pooling. Secondly, the efficient LWPooling module is proposed to adaptively capture and preserve the critical high-frequency features via lifting-based wavelets. It can constrain the features’ linear independence, which efficiently makes important features salient. Thirdly, the lifting-based wavelet collaborative network (LWCNet) is constructed for classification and segmentation tasks based on the efficient LWPooling module. Experiments are conducted on the CIFAR-10, CIFAR-100, and ADE20K datasets. The results suggest that the efficient LWPooling can enhance CNN’s representation and achieve a particular performance advantage compared to average, maximum, and original wavelet pooling. Besides, the proposed LWCNet shows the potential for scene parsing. The code implementation will be available at https://github.com/yutinyang/LWCNet.
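
The stated equivalence between the 2D DWT and average pooling can be checked with a short sketch. It assumes a one-level orthonormal Haar wavelet, which the abstract does not specify: in that case the low-frequency (LL) sub-band of each 2x2 block is (a + b + c + d) / 2, exactly twice the 2x2 average. Keeping only LL therefore carries the same information as average pooling, and the detail is lost with the discarded high-frequency sub-bands. The function names are illustrative.

```python
def haar_ll(x):
    """LL sub-band of a one-level orthonormal 2D Haar DWT.

    x: H x W nested list with even H and W.
    """
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 2.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def avg_pool_2x2(x):
    """Standard 2x2 average pooling with stride 2."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def haar_diagonal_detail(x):
    """Diagonal high-frequency sub-band, discarded by plain wavelet pooling."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] - x[i][j + 1] - x[i + 1][j] + x[i + 1][j + 1]) / 2.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]
```

For any input with even height and width, `haar_ll(x)` equals `avg_pool_2x2(x)` scaled by 2, while a checkerboard-like block such as `[[1, 0], [0, 1]]` has a non-zero diagonal detail coefficient that neither output retains.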

Abstract:
Blind face restoration is an important task in computer vision and has gained significant attention due to its wide-range applications. Previous works mainly exploit facial priors to restore face images and have demonstrated high-quality results. However, generating faithful facial details remains a challenging problem due to the limited prior knowledge obtained from finite data. In this work, we delve into the potential of leveraging the pretrained Stable Diffusion for blind face restoration. We propose BFRffusion which is thoughtfully designed to effectively extract features from low-quality face images and could restore realistic and faithful facial details with the generative prior of the pretrained Stable Diffusion. In addition, we build a privacy-preserving face dataset called PFHQ with balanced attributes like race, gender, and age. This dataset can serve as a viable alternative for training blind face restoration networks, effectively addressing privacy and bias concerns usually associated with the real face datasets. Through an extensive series of experiments, we demonstrate that our BFRffusion achieves state-of-the-art performance on both synthetic and real-world public testing datasets for blind face restoration and our PFHQ dataset is an available resource for training blind face restoration networks. The codes, pretrained models, and dataset are released at https://github.com/chenxx89/BFRffusion.

Abstract:
Digital images have become the main source of human information acquisition and exchange and are widely used in the aerospace, biomedical and military fields. Therefore, to ensure the secure transmission of digital images, this paper proposes a secure spatio-temporal chaotic pseudorandom generator for image encryption. Firstly, we consider the potential impact of precision loss in digital circuits on the degradation of chaotic systems. Therefore, we employ the unscented Kalman filter (UKF) to assess accuracy loss in the Logistic, Sine and Chebyshev maps, which is compensated for by introducing perturbations into the spatio-temporal chaotic system. Secondly, we design new Sine maps and Chebyshev maps with time-varying delays to perturb the time dimension of the non-adjacent coupled lattice and improve the complexity and security of the chaotic system. Finally, we use the newly designed spatio-temporal chaotic system as a pseudo-random generator to design a new image encryption scheme. In this paper, we present a security proof for the newly proposed spatio-temporal chaotic system and image encryption scheme. Furthermore, security experiments demonstrate that the spatio-temporal chaotic system and image encryption scheme presented in this paper exhibit improved uniform distribution and an absence of chaos degradation or predictability issues, while offering randomness suitable for engineering applications.

Abstract:
Air-writing is a challenging task that combines the fields of computer vision and natural language processing, offering an intuitive and natural approach for human-computer interaction. However, current air-writing solutions face two primary challenges: (1) their dependency on complex sensors (e.g., Radar, EEGs and others) for capturing precise handwritten trajectories, and (2) the absence of a video-based air-writing dataset that covers a comprehensive vocabulary range. These limitations impede their practicality in various real-world scenarios, including the use on devices like iPhones and laptops. To tackle these challenges, we present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024), serving as a pioneering benchmark for video-based air-writing. This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras, eliminating the need for complex sensors. AWCV-100K-UCAS2024 includes 8.8 million video frames, encompassing the complete set of 3,755 characters from the GB2312-80 level-1 set (GB1). Furthermore, we introduce our baseline approach, the video-based character recognizer (VCRec). VCRec adeptly extracts fingertip features from sparse visual cues and employs a spatio-temporal sequence module for analysis. Experimental results showcase the superior performance of VCRec compared to existing models in recognizing air-written characters, both quantitatively and qualitatively. This breakthrough paves the way for enhanced human-computer interaction in real-world contexts. Moreover, our approach leverages affordable RGB cameras, enabling its applicability in a diverse range of scenarios. The code and data examples will be made public at https://github.com/wmeiqi/AWCV.

Abstract:
Existing audio-visual cross-modal matching methods focus on mitigating cross-modal heterogeneity but ignore the impact of intra-class discrepancy of the same identity in different scenarios, which might greatly limit the matching performance. To simultaneously handle both problems of intra-class discrepancy and cross-modal heterogeneity, we propose a novel public-private attributes-based variational adversarial network (P^2VANet), which captures the consistency within and between classes, for audio-visual cross-modal matching. In particular, P^2VANet first uses a variational auto-encoder, which captures the inherent global information in diverse scenarios from the hidden variable through reconstruction, to reduce the intra-class discrepancy. Then it integrates a public attributes guidance module to capture the consistency of audio and visual by supervision of the common high-level semantic information to mitigate cross-modal heterogeneity. In addition, P^2VANet designs a private attributes embedding module to enhance the discriminative features inherent in each class to decrease inter-class similarity. Extensive experiments on audio-visual cross-modal matching demonstrate the effectiveness of the proposed approach compared with the state-of-the-art methods.

Abstract:
The emergence of digital avatars has prompted an exponential increase in the demand for human point clouds with realistic and intricate details. The compression of such data becomes challenging due to massive amounts of data comprising millions of points. Herein, we leverage the human geometric prior in the geometry redundancy removal of point clouds to greatly promote compression performance. More specifically, the prior provides topological constraints as geometry initialization, allowing adaptive adjustments with a compact parameter set that can be represented with only a few bits. Therefore, we propose representing high-resolution human point clouds as a combination of a geometric prior and structural deviations. The prior is first derived with an aligned point cloud. Subsequently, the difference in features is compressed into a compact latent code. The proposed framework can operate in a plug-and-play fashion with existing learning-based point cloud compression methods. Extensive experimental results show that our approach significantly improves the compression performance without deteriorating the quality, demonstrating its promise in serving a variety of applications.

Abstract:
Text patterns typically exhibit distinct boundaries and sparse color histograms. However, in current hybrid codec frameworks, the positions of coding units are often misaligned with the text patterns, resulting in prediction and color mapping tools consuming a large number of bits to indicate these patterns. Nowadays, some text detection and recognition methods have been proposed to accurately locate and analyze the text regions in screen images. Combined with these techniques, we propose a character position-aware compression framework for screen text image. On the encoder side, a low-complexity detection method is adopted to locate the text characters. Then it copies the detected characters to the position aligned with the coding unit (CU) grid to form a text layer. This text-layer representation can further increase the efficiency of existing screen content coding tools such as Intra Block Copy (IBC). Moreover, we design several compression tools based on this representation. We extend the two Motion Vector (MV) prediction modes: Adaptive Motion Vector Prediction (AMVP) and Merge. We modify the MV encoding syntax according to the layout characteristics of the text layer. We present a Gradient-guided In-loop Filter (GIF) to sharpen the text lines using a convolutional network. Experiments conducted on VVC reference software VTM all_intra configuration show that the proposed framework can achieve an average bitrate savings of 4.6% and 3.6% under the w/ GIF and w/o GIF versions, with a corresponding increase in CPU encoding complexity of 72% and 10%.

Abstract:
With the extensive use of multi-view data in practice, multi-view spectral clustering has received a lot of attention. In this work, we focus on the following two challenges, namely, how to deal with the partially contradictory graph information among different views and how to conduct clustering without the parameter selection. To this end, we establish a novel graph learning framework, which avoids the linear combination of the partially contradictory graph information among different views and learns a unified graph for clustering without the parameter selection. Specifically, we introduce a flexible graph degeneration with a structured graph constraint to address the aforementioned challenging issues. Besides, our method can be employed to deal with large-scale data by using the bipartite graph. Experimental results show the effectiveness and competitiveness of our method, compared to several state-of-the-art methods.

Abstract:
With the rapid growth of Internet technology, security concerns have risen, particularly with the prevalence of Deepfakes, a popular visual forgery technique. Therefore, it is necessary to research more powerful methods to detect Deepfakes. However, many Convolutional Neural Network-based detection methods struggle with cross-database performance, often overfitting to specific color textures. We observe that image noise can weaken the influence of color textures and expose the forgery traces in the noise domain. This is because tampering techniques, when altering face images, disrupt the consistency of feature distribution in the noise space. Moreover, the forgery traces in the noise space are complementary to the tampering artifacts present in the image space. Therefore, we propose a novel face forgery detection network that combines the spatial domain and the noise domain. Our Dual Feature Fusion Module and Local Enhancement Attention Module contribute to more comprehensive feature representations, enhancing our method’s discriminative ability. Experimental results demonstrate superior performance compared to existing methods on mainstream datasets. Code is available at https://github.com/jhchen1998/DeepfakeDetection.

Abstract:
Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images that contain pixel-level object annotations. Existing methods have demonstrated that the domain agent-based attention mechanism is effective in FSVOS by learning the correlation between support images and query frames. However, the agent frame contains redundant pixel information and background noise, resulting in inferior segmentation performance. Moreover, existing methods tend to ignore inter-frame correlations in query videos. To alleviate the above dilemma, we propose a holistic prototype attention network (HPAN) for advancing FSVOS. Specifically, HPAN introduces a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), transferring informative knowledge from seen to unseen classes. PGAM generates local prototypes from all foreground features and then utilizes their internal correlations to enhance the representation of the holistic prototypes. BPAM exploits the holistic information from support images and video frames by fusing co-attention and self-attention to achieve support-query semantic consistency and inner-frame temporal consistency. Extensive experiments on YouTube-FSVOS have been provided to demonstrate the effectiveness and superiority of our proposed HPAN method. Our source code and models are available anonymously at https://github.com/NUST-Machine-Intelligence-Laboratory/HPAN.

Abstract:
With the rapid advancement of text-to-image generative models, AI-generated images (AGIs) have been widely applied to entertainment, education, social media, etc. However, considering the large quality variance among different AGIs, there is an urgent need for quality models that are consistent with human subjective ratings. To address this issue, we extensively consider various popular AGI models, generate AGIs through different prompts and model parameters, and collect subjective scores on perceptual quality and text-to-image alignment, thus building the most comprehensive AGI subjective quality database so far, AGIQA-3K. Furthermore, we conduct a benchmark experiment on this database to evaluate the consistency between current Image Quality Assessment (IQA) models and human perception, while proposing StairReward, which significantly improves the assessment performance of subjective text-to-image alignment. We believe that the fine-grained subjective scores in AGIQA-3K will inspire subsequent AGI quality models to fit human subjective perception mechanisms at both perception and alignment levels and to optimize the generation result of future AGI models. The database is released on https://github.com/lcysyzxdxc/AGIQA-3k-Database.

Abstract:
Instruction-tuned large language models are making rapid advances in the field of artificial intelligence, where GPT-4 models have exhibited impressive multi-modal perception capabilities. Such models have been used as the core assistant for many tasks, including art generation. However, high-quality art generation relies heavily on human prompt engineering, which is in general uncontrollable. To address these issues, we propose a multi-task AI generated content (AIGC) system for art generation. Specifically, a dense representation manager is designed to process multi-modal user queries and generate dense and applicable prompts for GPT. To enhance the artistic sophistication of the whole system, we fine-tune the GPT model on a meticulously collected prompt-art dataset. Furthermore, we introduce artistic benchmarks for evaluating the system based on professional knowledge. Experiments demonstrate the advantages of our proposed MtArtGPT system.

Abstract:
Automatic generation of painting images is an interesting and difficult task, especially for regional traditional paintings with unique cultural styles while lacking large-scale training sets. In this paper, a hierarchical painting generation method is proposed, which can disentangle the generation of content and style. By mimicking the human painting process, the proposed method introduces multiple content blocks first and gradually generates image contents. In each block, a spatial self-modulation module is proposed to inject local details while preserving the global layout. After the preliminary generation of contents, a series of style blocks are presented to gradually adjust the artistic style. In the style block, an edge-oriented style-modulation module is proposed, which focuses on the lines and edges. In addition, edge adversarial training is used to further improve the quality of generated lines. To train and evaluate the proposed method, we construct datasets for five types of Chinese folk paintings. Experimental results demonstrate that the proposed method can generate high-quality and diverse painting images. More importantly, it can disentangle content and style sufficiently, so that the generation of specific contents or styles can be controlled freely. The datasets and source code are available at https://github.com/Ritsu-mio/HPGN.

Abstract:
Deepfake techniques can forge the visual or audio signals in the video, which leads to inconsistencies between visual and audio (VA) signals. Therefore, multimodal detection methods expose deepfake videos by extracting VA inconsistencies. Recently, deepfake technology has started VA collaborative forgery to obtain more realistic deepfake videos, which poses new challenges for extracting VA inconsistencies. Recent multimodal detection methods propose to first extract natural VA correspondences in real videos in a self-supervised manner, and then use the learned real correspondences as targets to guide the extraction of VA inconsistencies in the subsequent deepfake detection stage. However, the inherent VA relations are difficult to extract due to the modality gap, which leads to the limited auxiliary performance of the aforementioned self-supervised methods. In this paper, we propose Predictive Visual-audio Alignment Self-supervision for Multimodal Deepfake Detection (PVASS-MDD), which consists of PVASS auxiliary and MDD stages. In the PVASS auxiliary stage, which operates on real videos, we first devise a three-stream network to associate two augmented visual views with corresponding audio clues, so as to explore common VA correspondences based on cross-view learning. Secondly, we introduce a novel cross-modal predictive align module for eliminating VA gaps to provide inherent VA correspondences. In the MDD stage, we propose an auxiliary loss that utilizes the frozen PVASS network to align VA features of real videos, to better assist the multimodal deepfake detector in capturing subtle VA inconsistencies. We conduct extensive experiments on existing widely used and latest multimodal deepfake datasets. Our method obtains a significant performance improvement compared to state-of-the-art methods.

Abstract:
Image compression at extremely low bit-rates has always been a challenging task in bandwidth-limited scenarios, such as aerospace and deep-sea explorations. Recent years have seen great success of deep learning in image compression; however, few methods are specially designed for extremely low bit-rate conditions. To solve this issue, in this paper, we propose a novel invertible image generation based framework for extremely low bit-rate image compression. The proposed framework is composed of three modules: an invertible image generation (IIG) module, a generated image compression (GIC) module and a compressed image adjustment (CIA) module. The role of the IIG module is to generate a compression-friendly image from the original image. In the IIG module, image generation and restoration are modelled as two mutually reversible processes to avoid information loss. After the IIG module, the GIC module is employed to compress the generated images to save coding bit-rate. After that, the CIA module is used to shrink the quality gap between the compressed generated image and the uncompressed image. Finally, the image from the CIA module is sent back to the IIG module to restore the original image. The experimental results on three different datasets show that the proposed framework achieves state-of-the-art performance in image compression with extremely low bit-rates. We also extend the proposed framework to feature compression towards object detection, which saves 90% of the bit-rate compared with the VVC standard at the same detection accuracy.

Affiliations: School of Software, Northwestern Polytechnical University, Xi’an, China; School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, China; Guangxi Key Laboratory of Multisource Information Mining and the Security College of Computer Science & Engineering, Guangxi Normal University, Guilin, China; School of Data Science, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China

Abstract:
Popular methods usually use a degradation model in a supervised way to learn a watermark removal model. However, reference images are difficult to obtain in the real world, and images collected by cameras suffer from noise. To overcome these drawbacks, we propose a perceptive self-supervised learning network for noisy image watermark removal (PSLNet). PSLNet relies on a parallel network to remove noise and watermarks. The upper network uses the idea of task decomposition to remove noise and watermarks in sequence. The lower network uses the idea of a degradation model to remove noise and watermarks simultaneously. Specifically, the paired watermark images are obtained in a self-supervised way, and the paired noisy images (i.e., noisy and reference images) are obtained in a supervised way. To enhance the clarity of the obtained images, the two sub-networks interact and their clean outputs are fused, which improves image watermark removal in terms of structural information and pixel enhancement. To take texture information into account, a mixed loss uses the obtained images and features to achieve a robust model for noisy image watermark removal. Comprehensive experiments show that our proposed method is very effective in comparison with popular convolutional neural networks (CNNs) for noisy image watermark removal. Codes can be obtained at https://github.com/hellloxiaotian/PSLNet.

Abstract:
The person re-identification task aims to retrieve the same identity across different cameras. The main difficulties of the task lie in collecting a large amount of annotated data and in the diversity of pedestrians. Therefore, learning a robust and discriminative feature representation from unlabeled data is the key to this task. Pseudo-label-based methods have shown significant effectiveness in the field by generating pseudo labels from unlabeled data in place of ground-truth labels. However, existing studies typically suffer from two limitations: 1) the extracted features are insufficient to reflect subtle local semantics; 2) the pseudo labels generated by clustering methods inevitably introduce noise, which seriously affects the discriminative power of the learned features. To address these problems, we propose a Distribution-Guided Hierarchical Calibration Contrastive Network (DHCCN) to better exploit local clues and hierarchical representations; it considers cross-granularity consistency and reduces the noise of pseudo labels through the calibrated feature distribution. A Hierarchical Feature Extractor captures the multi-granularity response of each image and fuses the global salience and local subtle texture information of a pedestrian to generate the hierarchical feature. In addition, to reduce the error of the pseudo labels, we introduce a Feature Distribution Corrector to calibrate the noisy features of low-confidence samples, as evaluated by a Gaussian Mixture Model. Finally, we integrate a cross-granularity consistency constraint based on the difference between the global and local features, which helps generate more accurate feature embeddings and improves the robustness of the model. As a result, by narrowing the gap between pseudo and ground-truth labels, our method achieves performance close to that of supervised person re-identification. Experiments on four standard benchmarks demonstrate the effectiveness of our method against state-of-the-art unsupervised re-identification methods. The code is available at https://github.com/Li-Yongxi/2023-DHCCN.

Abstract:
Graph convolutional networks (GCNs) have attracted considerable interest in skeleton-based action recognition. Existing GCN-based models have proposed methods to learn dynamic graph topologies generated from the feature information of vertices to capture inherent relationships. However, these models have two main limitations. Firstly, they struggle to effectively utilize high-dimensional or structural information, which limits their capacity for feature representation and consequently hinders performance improvement. Secondly, among these models, the multi-scale methods that aggregate information at different scales often over-capture unnecessary relationships between vertices. This leads to an over-smoothing problem where smoothed features are extracted, making it difficult to distinguish the features of each vertex. To address these limitations, we propose the multi-scale structural graph convolutional network (MSS-GCN) for skeleton-based action recognition. Within the MSS-GCN framework, the common intersection graph convolution (CI-GC) leverages the overlapped neighbor information, indicating the overlap between neighboring vertices for a given pair of root vertices. The graph topology of CI-GC is designed to compute the structural correlation between neighboring vertices corresponding to each hop, thereby enriching the context of inter-vertex relationships. Then, our proposed multi-scale spatio-temporal modeling aggregates local-global features to provide a comprehensive representation. In addition, we propose a Graph Weight Annealing (GWA) method, which is a graph scheduling method to mitigate the over-smoothing caused by multi-scale aggregation. By varying the importance between a vertex and its neighbors, we demonstrate that the over-smoothing problem can be effectively mitigated. Moreover, our proposed GWA method can easily be adapted to different GCN models to enhance performance. Combining the MSS-GCN model and the GWA method, we propose a powerful feature extractor that effectively classifies actions for skeleton-based action recognition in various datasets. We evaluate our approach on three benchmark datasets: NTU RGB+D, NTU RGB+D 120, and NW-UCLA. The proposed MSS-GCN achieves state-of-the-art performance on all three datasets, further validating the effectiveness of our approach.

Abstract:
Multi-view representation learning, aimed at uncovering the inherent structure within multi-view data, has developed rapidly in recent years. In practice, due to temporal and spatial desynchronization, it is common that only part of the data is aligned between views, which leads to the Partial View Alignment (PVA) problem. To address the challenge of representation learning on partially view-aligned multi-view data, we propose a new cross-view graph contrastive learning network, which integrates multi-view information to align data and learn latent representations. First, view-specific autoencoders are used to construct an end-to-end multi-view representation learning framework for learning specific view representations. Furthermore, to achieve cluster-level alignment, we introduce a cross-view graph contrastive learning module to guide the learning of discriminative representations. Compared to the existing methods, the proposed cluster-level alignment method successfully extends the view alignment to more than two views. Meanwhile, the results of clustering and classification experiments on several popular multi-view datasets can also illustrate the effectiveness and superiority of the proposed method.

Abstract:
Visual tracking is a task of localizing a target unceasingly in a video with an initial target state at the first frame. The limited target information makes this problem an extremely challenging task. Existing tracking methods either perform matching based similarity learning or optimization based discrimination reasoning. However, these two types of tracking methods suffer from the problem of ineffectiveness for distinguishing target objects from background distractors and the problem of insufficiency in maintaining spatio-temporal consistency among successive frames, respectively. In this paper, we design a joint spatio-temporal similarity and discrimination learning (STSDL) framework for accurate and robust tracking. The designed framework is composed of two complementary branches: a similarity learning branch and a discrimination learning branch. The similarity learning branch uses an effective transformer encoder-decoder to gather rich spatio-temporal context information to generate a similarity map. In parallel, the discrimination learning branch exploits an efficient model predictor to train a target model to produce a discriminative map. Finally, the similarity map and the discriminative map are adaptively fused for accurate and robust target localization. Experimental results on six prevalent datasets demonstrate that the proposed STSDL can obtain satisfactory results, while it retains a real-time tracking speed of 50 FPS on a single GPU.

Abstract:
Talking head animation transforms a source anime image to a target pose, where the transformation includes the change of facial expression and head movement. In contrast to existing approaches that operate on low-resolution images (256×256), we study this task at a higher resolution, e.g., 512×512. High-resolution talking head animation, however, raises two major challenges: i) how to achieve smooth global transformation while maintaining rich details of anime characters under large-displacement pose variations; ii) how to address the shortage of data, because no related dataset is publicly available. In this paper, we present a Hierarchical Feature Warping and Blending (HFWB) model, which tackles talking head animation hierarchically. Specifically, we use low-level features to control global transformation and high-level features to determine the details of anime characters, under the guidance of feature flow fields. These features are then blended by selective fusion units, outputting transformed anime images. In addition, we construct an anime pose dataset, AniTalk-2K, aiming to alleviate the shortage of data. It contains around 2000 anime characters with thousands of different face/head poses at a resolution of 512×512. Extensive experiments on AniTalk-2K demonstrate the superiority of our approach in generating high-quality anime talking heads over state-of-the-art methods.

Abstract:
Subspace learning has been widely applied for joint feature extraction and dimensionality reduction, demonstrating significant efficacy. Numerous subspace learning methods with diverse assumptions regarding the criteria for the target subspaces have been developed to obtain compact and interpretable data representations. However, when applied to image data, existing methods fail to fully exploit the inherent correlations within the image set. This paper proposes a Robust Discriminative t-Linear Subspace Learning model (RDtSL) to tackle this issue using t-product. The model mainly has four strengths: 1) Taking advantage of t-product, RDtSL learns the projection basis directly from the image set while fully exploiting its internal correlations; 2) Based on its energy preservation module, RDtSL retains the primary energy of samples in the learned subspace, maintaining satisfactory performance even with low subspace dimensions; 3) Class-distinctive features are effectively preserved in the learned representations due to the incorporation of the classification module; 4) Relying on its graph embedding module, RDtSL learns an affinity graph of samples adaptively to enrich the data representations with locality and similarity information. The harmonious balance maintained between the three proposed modules helps RDtSL learn discriminative and informative data representations. We also develop an iterative algorithm to solve RDtSL. Extensive experiments on benchmark databases demonstrate the superiority of the proposed model.

Abstract:
Multi-modal salient object detection (MSOD) aims to boost saliency detection performance by integrating visible sources with depth or thermal infrared ones. Existing methods generally design different fusion schemes to handle certain issues or challenges. Although these fusion schemes are effective at addressing specific issues or challenges, they may struggle to handle multiple complex challenges simultaneously. To solve this problem, we propose a novel adaptive fusion bank that makes full use of the complementary benefits of a set of basic fusion schemes to handle different challenges simultaneously for robust MSOD. We focus on handling five major challenges in MSOD, namely center bias, scale variation, image clutter, low illumination, and thermal crossover or depth ambiguity. The proposed fusion bank consists of five representative fusion schemes, each designed around the characteristics of one challenge. The bank is scalable, and more fusion schemes could be incorporated into it to cover more challenges. To adaptively select the appropriate fusion scheme for a multi-modal input, we introduce an adaptive ensemble module that forms the adaptive fusion bank, which is embedded into hierarchical layers for sufficient fusion of different source data. Moreover, we design an indirect interactive guidance module to accurately detect salient hollow objects via the skip integration of high-level semantic information and low-level spatial details. Extensive experiments on three RGBT datasets and seven RGBD datasets demonstrate that the proposed method achieves outstanding performance compared to state-of-the-art methods.

Abstract:
Recently, semi-supervised semantic segmentation methods based on weak-to-strong consistency learning have achieved the most advanced performance. The key to such a technique lies in strong perturbations and multi-objective co-training. However, CutMix, the most commonly used data augmentation in this field, limits the strength of perturbations because it focuses only on a single random local context. Besides, complex optimization targets also reduce computational efficiency. In this work, we propose an efficient consistency learning based framework. Specifically, a novel unsupervised data augmentation strategy, EntropyMix, is presented for semi-supervised semantic segmentation. Patches of unlabeled data from multi-view augmentations are combined into new training samples based on their prediction entropy, which provides more informative and powerful perturbations for consistency regularization and impels the model to focus on cross-view local context. On this basis, we further propose Self Pseudo Entropy knowledgE Distillation (SPEED) to learn global pixel relations from multi- and cross-view perturbations by optimizing a linear combination of feature- and logit-level distillation losses, enhancing model performance without additional auxiliary segmentation heads or a complex pre-trained teacher model. The combination of the two ideas above is a plug-and-play technique that requires no additional modification. Extensive experimental results on the PASCAL VOC and Cityscapes datasets under various training settings demonstrate the superiority of the proposed data augmentation strategy and self-distillation loss, achieving new state-of-the-art performance. Remarkably, our method reaches an mIoU of 75.16% using only 0.87% labeled data on PASCAL VOC and an mIoU of 76.98% using only 6.25% labeled data on Cityscapes. The code is available at https://github.com/xiaoqiang-lu/SPEED.
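
The patch-combination idea behind EntropyMix can be illustrated with a minimal NumPy sketch. The abstract does not state the selection rule, so the rule used here (for each patch, keep the view whose prediction has the higher mean entropy) is an assumption, as are the function names `pixel_entropy` and `entropy_mix` and the `patch` parameter; this is not the authors' implementation.

```python
import numpy as np


def pixel_entropy(probs):
    """Per-pixel Shannon entropy of class probabilities with shape (C, H, W)."""
    eps = 1e-12  # avoids log(0) for fully confident predictions
    return -np.sum(probs * np.log(probs + eps), axis=0)


def entropy_mix(view_a, view_b, probs_a, probs_b, patch=4):
    """Mix two augmented views of one unlabeled image, patch by patch.

    Assumed rule (hypothetical): a patch is taken from view_b only when the
    model's prediction on view_b is more uncertain there than on view_a.
    """
    ent_a, ent_b = pixel_entropy(probs_a), pixel_entropy(probs_b)
    mixed = view_a.copy()
    height, width = ent_a.shape
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = slice(top, top + patch)
            cols = slice(left, left + patch)
            if ent_b[rows, cols].mean() > ent_a[rows, cols].mean():
                mixed[..., rows, cols] = view_b[..., rows, cols]
    return mixed
```

In this sketch the views are `(channels, H, W)` arrays and the probabilities are softmax outputs of the segmentation model on each view; the pseudo labels of the mixed sample would have to be combined with the same patch mask.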

Abstract:
In two-view correspondence learning, prevalent multi-layer perceptron (MLP)-based methods struggle to capture context. To remedy this issue, recent advances stack convolutional neural network (CNN)-based Resblocks sequentially, showing an inherent proficiency in local context extraction. Yet such designs are not tailored to the task and inherit the CNN’s difficulty in aggregating global context, leading to performance bottlenecks. To address this problem, this study further explores the potential of the CNN-based framework and proposes MC-Net, a top-performing network that integrates both local and global context. Specifically, considering that sparse motion vectors and a dense motion field can be converted into each other through interpolation and sampling, we first transform unordered matches into image-structured data by estimating the dense motion field implicitly. Then, we design a hierarchical rectifying module to rectify the error of each ordered motion vector with a CNN at multiple levels, enabling MC-Net to perceive global context from coarse-level features and local context from fine-level features simultaneously, which helps handle discontinuities of the motion field under large scene disparity. Finally, we reconstruct comprehensive context-embedded features from the rectified motion fields at all levels. In addition, previous studies reject outliers using the residuals between rectified and pre-rectified motion vectors at the same layer, which seriously affects the inlier prediction accuracy. We instead use the difference between the motion vectors obtained from each layer’s reconstruction and those from the first layer before transformation, which yields purer residuals and enhances the matching performance without extra computational burden. Extensive experiments show that MC-Net outperforms state-of-the-art methods on multiple domains and datasets.

Abstract:
Recent years have witnessed significant advancements in face image generation using generative adversarial networks (GANs), leading to a high demand for GAN-generated face image quality assessment (GFIQA). However, the intrinsic distortion caused by the generation process poses a significant challenge for existing image quality assessment (IQA) models, which are typically designed for natural images. In addition, the image distortion usually varies across GAN models, so a GFIQA model must possess a high generalization capability. To account for this, we first establish a large GFIQA database by collecting various GAN-generated face images (GFIs) from existing popular GAN models. Subsequently, we propose a causal representation learning (CRL) scheme for a generalized GFIQA model (CRL-GFIQA), under the assumption that the causal knowledge of human quality assessment is shareable across different scenarios. In particular, we disentangle the learned features into causal and non-causal components with an invertible neural network, giving the proposed CRL-GFIQA model high generalization on unseen domains. Extensive experimental results demonstrate the effectiveness of our CRL-GFIQA model. The codes and the constructed dataset will be publicly available.

Abstract:
Real-time video streaming is becoming indispensable in people’s daily lives, and it places heavy loads and stringent performance requirements on the network. For Internet Service Providers (ISPs), ensuring high-quality real-time video communication is a major concern. However, inferring the quality of real-time video streaming from passively collected network traffic is a great challenge because of the limited information in the User Datagram Protocol (UDP) header and the encryption of the application-level protocol. In this paper, we propose IReaV-T to Infer Real-time Video streaming quality with a generalized Transformer, which understands the intrinsic state of the network and predicts the future real-time video quality. By applying novel embedding methods, IReaV-T makes full use of observed traffic features and distinguishes different real-time video applications. Extensive comparative experiments demonstrate the effectiveness of IReaV-T, showing that it can predict future real-time video quality with a mean squared Video Multimethod Assessment Fusion (VMAF) score error of less than 6.

Abstract:
Referring Expression Comprehension (REC) is a fundamental task in the vision and language domain, which aims to locate an image region according to a natural language expression. REC requires the models to capture key clues in the text and perform accurate cross-modal reasoning. A recent trend employs transformer-based methods to address this problem. However, most of these methods typically treat image and text equally. They usually perform cross-modal reasoning in a crude way, and utilize textual features as a whole without detailed considerations (e.g., spatial information). This insufficient utilization of textual features will lead to sub-optimal results. In this paper, we propose a Language Guided Reasoning Network (LGR-NET) to fully utilize the guidance of the referring expression. To localize the referred object, we set a prediction token to capture cross-modal features. Furthermore, to sufficiently utilize the textual features, we extend them by our Textual Feature Extender (TFE) from three aspects. First, we design a novel coordinate embedding based on textual features. The coordinate embedding is incorporated to the prediction token to promote its capture of language-related visual features. Second, we employ the extracted textual features for Text-guided Cross-modal Alignment (TCA) and Fusion (TCF), alternately. Third, we devise a novel cross-modal loss to enhance cross-modal alignment between the referring expression and the learnable prediction token. We conduct extensive experiments on five benchmark datasets, and the experimental results show that our LGR-NET achieves a new state-of-the-art. Source code is available at https://github.com/lmc8133/LGR-NET.

Abstract:
Deep learning based object detection methods have made significant progress in recent years. However, these methods often suffer from a substantial performance drop when domain shifts occur, making it difficult to generalize a source domain trained object detector to a new target domain. To address this problem, we propose an Online Meta Learning Framework (OMLF) for unsupervised domain adaptive object detection. In our proposed framework, we adopt the Polar Harmonic Fourier Moment (PHFM) to generate target-like intermediate data. The purpose is to construct a two-pair framework that learns meta knowledge (i.e. model initial parameters) from the pair of “source-to-intermediate” to assist another pair of “intermediate-to-target”. Moreover, the optimizing process requires a heavy computational load due to triggering higher-order gradients. To alleviate this problem, we introduce a shortest-path update strategy that accelerates optimization. When evaluated on several benchmark adaptation scenarios (i.e. normal-to-foggy weather, cross cameras, synthetic-to-real, and real-to-artistic), our OMLF achieves state-of-the-art results, demonstrating its effectiveness.

Abstract:
Compared with natural image segmentation, small sample image segmentation tasks, such as medical image segmentation and defect detection, have been less studied. Recent studies made efforts on bringing together Convolutional Neural Networks (CNNs) and Transformers in a serial or interleaved architecture in order to incorporate long-range dependencies into the features extracted using CNNs. In this study, we argue that these architectures limit the capability of the combination of CNNs and Transformers. To this end, we propose a dual-stream small sample image segmentation network, namely, the Interactive Coupling of Convolutions and Transformers Based UNet (ICCT-UNet, code and models are available at: https://indtlab.github.io/projects/ICCTUNet), motivated by the success achieved using the UNet in the scenario of small sample image segmentation. Within this network, a CNN stream is paralleled with a Transformer stream while maintaining feature exchange inside each block through the proposed Window-Based Multi-head Cross-Attention (W-MHCA) mechanism. To derive an overall segmentation, the features learned by both the streams are further fused using a Residual Fusion Module (RFM). Experimental results show that the ICCT-UNet outperforms, or at least performs comparably to, its counterparts on eight sets of medical and defective images. These promising results should be attributed to the effective combination of the local and global features fulfilled by the proposed interactive coupling method.

Abstract:
Although deep neural networks have made outstanding achievements in many static tasks, they suffer from catastrophic forgetting when faced with a continuous stream of data, since the previous data is usually inaccessible. Stored data or a generative model is commonly used to maintain model performance, but this raises memory-utilization and privacy issues. Prototype-based methods address these issues by keeping only one prototype for each class, but they are limited in their ability to trade off model stability and plasticity. In this paper, we propose a novel exemplar-free class-incremental learning method that greatly improves the stability of both the representation learning and the decision boundary. First, based on the results of our exploration into the impact of the batch normalization (BN) layer on representation learning, we propose to remove the BN layer (RBNL) in the incremental training phase to improve the stability of model representation learning. Then, to further maintain the feature space, we design prototype mixing (PM), which expands the deep features by randomly and linearly combining prototypes of the old classes to generate hybrid prototypes with composite labels for fine-tuning the fully connected layer. Experimental results on three benchmark datasets, CIFAR-100, TinyImageNet, and ImageNet, show that our proposed method effectively balances the stability and plasticity of the model and outperforms state-of-the-art methods.
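
The prototype mixing step described above (random linear combinations of old-class prototypes with composite labels) can be sketched as follows. The choice of mixing exactly two prototypes, the uniform sampling of the mixing weight, and the names `mix_prototypes`, `num_classes`, and `n_samples` are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def mix_prototypes(prototypes, num_classes, n_samples, rng):
    """Generate hybrid prototypes and composite (soft) labels.

    prototypes: (K, D) array, one stored mean feature per old class 0..K-1.
    Returns features of shape (n_samples, D) and labels of shape
    (n_samples, num_classes) whose rows sum to 1.
    """
    num_old, dim = prototypes.shape
    feats = np.empty((n_samples, dim))
    labels = np.zeros((n_samples, num_classes))
    for n in range(n_samples):
        # Assumption: mix two distinct old classes with a uniform weight.
        i, j = rng.choice(num_old, size=2, replace=False)
        lam = rng.uniform(0.0, 1.0)
        feats[n] = lam * prototypes[i] + (1.0 - lam) * prototypes[j]
        labels[n, i] = lam
        labels[n, j] = 1.0 - lam
    return feats, labels
```

The resulting `(feats, labels)` pairs could then be used, together with new-class features, to fine-tune the fully connected classifier with a soft-label cross-entropy loss.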

Abstract:
Reducing cumulative registration error is critical to accurate 3D multi-view registration. Meta-shape based methods optimize rigid transformations of point clouds by iteratively registering each point cloud with a meta-shape, which remain popular solutions to 3D multi-view registration. However, the merits and demerits of existing meta-shape based methods remain unclear. Moreover, we argue that simpler meta-shape based solutions can achieve even better performance. To this end, we evaluate seven representative meta-shape based methods in this work, including four existing ones and three modified ones, in order to investigate the problem of defining a good meta-shape. In particular, we first abstract the main steps of considered methods. Then, experiments on both object and scene datasets with real and synthetic cumulative registration errors are deployed for an in-depth evaluation. Finally, based on the experimental outcomes, we give a discussion on the advantages and limitations of meta-shape based methods. We demonstrate prior works have used unnecessarily complicated techniques for cumulative error elimination and our slightly modified simpler solutions can achieve competitive performance on experimental datasets.

Abstract:
Because of the varied appearance of polyps and the low contrast between the polyp area and its surrounding background, accurate polyp segmentation is a challenging task. To tackle this issue, we introduce a boundary-enhanced framework for polyp segmentation, called the Focused on Boundary Segmentation (FoBS) framework, that leverages multi-level collaboration among sample, feature, and optimization. It places greater emphasis on the polyp boundary to improve the accuracy of segmentation. Firstly, a boundary-aware mixup method is designed to improve the model’s awareness of the boundary. More importantly, we propose deformable Laplacian-based feature refining to explicitly strengthen the representation ability of the boundary features. It employs a deformable Laplacian refinement function to capture discriminative information from a deformable perceptual field, thereby improving its ability to adapt to boundary variations. In addition, we introduce self-adjusting refinement coefficient learning, which enables adaptive control over the refinement strength at each location. Furthermore, we develop a location-sensitive compensation criterion that assigns more importance to the degraded feature after feature refinement during optimization. Extensive quantitative and qualitative experiments on four polyp benchmarks demonstrate the effectiveness of our method for automatic polyp segmentation. Our code is available at https://github.com/TFboys-lzz/FoBS.

Abstract:
Nuclear norm maximization has empirically been shown to enhance the transferability of unsupervised domain adaptation (UDA) models. In this paper, we identify a new property termed equity, which indicates how balanced the predicted classes are, to theoretically explain the efficacy of nuclear norm maximization for UDA. With this in mind, we offer a new discriminability-and-equity maximization paradigm built on the squares loss, such that predictions are equalized explicitly. To verify its feasibility and flexibility, we propose two new losses, Class Weighted Squares Maximization (CWSM) and Normalized Squares Maximization (NSM), which maximize both predictive discriminability and equity at the class level and the sample level, respectively. Importantly, we theoretically relate these two losses to equity maximization under mild conditions, and provide empirical evidence for the importance of predictive equity in UDA. Moreover, the equity constraints in both losses are very efficient to realize. Experiments on cross-domain image classification on three popular benchmark datasets show that both CWSM and NSM outperform the corresponding counterparts.
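
A small numerical illustration may help show why class balance matters here. The sketch below compares the plain sum of squared probabilities with the nuclear norm on two toy prediction matrices. It does not implement CWSM or NSM, whose definitions are not given in the abstract; the helper names and the toy matrices are assumptions chosen for illustration.

```python
import numpy as np


def squares(probs):
    """Sum of squared probabilities (squared Frobenius norm of the batch)."""
    return float(np.sum(probs ** 2))


def nuclear_norm(probs):
    """Sum of singular values of the (batch, classes) prediction matrix."""
    return float(np.linalg.norm(probs, ord="nuc"))


def class_balance(probs):
    """Mean predicted probability per class; uniform means balanced classes."""
    return probs.mean(axis=0)


# Both batches are fully confident, but only the first uses both classes.
balanced = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
collapsed = np.array([[1.0, 0.0]] * 4)
```

For these toy batches the unweighted squares value is the same (4.0), because it only measures how confident each row is, whereas the nuclear norm is larger for the balanced batch (2√2 versus 2). This is consistent with the abstract's point that a balance-related property contributes to the effect of nuclear norm maximization, and that a squares-based loss needs an explicit equalizing term.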

Abstract:
Few-shot semantic segmentation (FSS) aims to segment novel classes with only a few annotated samples. Existing FSS methods generally combine the annotated mask with the corresponding support image to generate a class-specific representation, and segment the query image by matching its features to these representations. However, segmentation performance can be fragile because these methods use query features inappropriately and neglect the correlation between features in the support and query images. In this work, we propose a novel Disentanglement and Recombination Network (DRNet) to alleviate this problem. Concretely, we first apply self-attention to both the support foreground features and the query foreground features. Then, the foreground features of the support and query branches are recombined using cross-attention after the self-attention computation, which encourages foreground feature alignment between branches. Finally, the prototypes are generated from the recombined foreground features and the support background features, and are used to guide the segmentation of given images. Considering the sensitivity of prototypes to the subtle differences among objects from different classes and from the same class, we further introduce a joint learning strategy to derive accurate segmentation of both seen and unseen objects in the support image and the query image, respectively. Extensive experiments on the PASCAL-5^i and COCO-20^i datasets demonstrate the superiority of our DRNet compared with recent popular methods. The code is released at https://github.com/GS-Chang-Hn/DRNet-fss.
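
The final prototype-guided matching step can be sketched in a few lines. The sketch assumes masked average pooling for building prototypes and cosine similarity for matching, both common choices in few-shot segmentation that the abstract does not confirm for DRNet, and it omits the self-attention and cross-attention recombination entirely. All function names are hypothetical.

```python
import numpy as np


def masked_average_pool(features, mask):
    """Average the (C, H, W) features over locations where mask (H, W) is 1."""
    weights = mask.astype(features.dtype)
    return (features * weights).sum(axis=(1, 2)) / (weights.sum() + 1e-8)


def cosine_map(features, prototype):
    """Cosine similarity between each spatial location and a (C,) prototype."""
    num = np.tensordot(prototype, features, axes=([0], [0]))
    den = np.linalg.norm(features, axis=0) * np.linalg.norm(prototype) + 1e-8
    return num / den


def segment(query_features, fg_proto, bg_proto):
    """Mark a location as foreground (1) when it is closer to fg_proto."""
    fg_score = cosine_map(query_features, fg_proto)
    bg_score = cosine_map(query_features, bg_proto)
    return (fg_score > bg_score).astype(np.uint8)
```

In a full model, `features` would be backbone feature maps, the foreground prototype would come from the recombined support and query features, and the background prototype from the support background region.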

Abstract:
Aggregating information from multiple views is essential to accurately identifying similar objects. Nevertheless, existing datasets have limitations that hinder the development of practical multi-view object classification methods for real-world scenarios. The limitations include synthetic and coarse-grained objects in the datasets and the absence of a validation split to enable standard hyperparameter tuning. This paper proposes a new dataset, MVP-N (Multi-View, Retail Products, Label Noise), which contains 16k real captured views and 9k multi-view sets collected from 44 retail products. In MVP-N, each view is annotated with a human-perceived information quantity (HPIQ) for analyzing how views are utilized in information aggregation. Moreover, the fine-grained categorization of objects provides the inter-class view similarity and intra-class view variance, enabling the research on learning from noisy labels of the multi-view images. Finally, a new soft label scheme, HS-HPIQ, is proposed considering the hidden stratification phenomenon in the multi-view images and achieves superior performance. To assess the effectiveness of MVP-N and the proposed HS-HPIQ, this study overviews 50 recent multi-view-based methods regarding their practicality in real-world scenarios. Six feature aggregation methods and twelve soft label methods are benchmarked on MVP-N with a deep analysis. The dataset and code are publicly available at https://github.com/SMNUResearch/MVP-N.

Abstract:
Effectively measuring and modeling the reliability of a trained model is essential to the real-world deployment of monocular depth estimation (MDE) models. However, the intrinsic ill-posedness and ordinal-sensitive nature of MDE pose major challenges to estimating the uncertainty of trained models. On the one hand, current uncertainty modeling methods may increase memory consumption and usually take more time. On the other hand, measuring the uncertainty based on model accuracy can also be problematic, because uncertainty reliability and prediction accuracy are not well decoupled. In this paper, we propose to model the uncertainty of MDE models from the perspective of the inherent probability distributions originating from the depth probability volume and its extensions, and to assess it more fairly with more comprehensive metrics. By simply introducing additional training regularization terms, our model, with a surprisingly simple formulation and without requiring extra modules or multiple inferences, can provide uncertainty estimations with state-of-the-art reliability, and can be further improved when combined with ensemble or sampling methods. A series of experiments demonstrates the effectiveness of our methods. Code and results are available at https://github.com/npucvr/MDEUncertainty.

Abstract:
Multi-modal Emotion Recognition (MER) aims to identify various human emotions from heterogeneous modalities. With the development of emotional theories, there are more and more novel and fine-grained concepts to describe human emotional feelings. Real-world recognition systems often encounter unseen emotion labels. To address this challenge, we propose a versatile zero-shot MER framework to refine emotion label embeddings for capturing inter-label relationships and improving discrimination between labels. We integrate prior knowledge into a novel affective graph space that generates tailored label embeddings capturing inter-label relationships. To obtain multimodal representations, we disentangle the features of each modality into egocentric and altruistic components using adversarial learning. These components are then hierarchically fused using a hybrid co-attention mechanism. Furthermore, an emotion-guided decoder exploits label-modal dependencies to generate adaptive multimodal representations guided by emotion embeddings. We conduct extensive experiments with different multimodal combinations, including visual-acoustic and visual-textual inputs, on four datasets in both single-label and multi-label zero-shot settings. Results demonstrate the superiority of our proposed framework over state-of-the-art methods.

Abstract:
Feature matching is an essential computer vision task that requires the establishment of high-quality correspondences between two images. Constructing sparse dynamic graphs and extracting contextual information by searching for neighbors in feature space is a prevalent strategy in numerous previous works. Nonetheless, these works often neglect the potential connections between dynamic graphs from different layers, leading to underutilization of available information. To tackle this issue, we introduce a Sparse Dynamic Graph Interaction block for feature matching. This innovation facilitates the implicit establishment of dependencies by enabling interaction and aggregation among dynamic graphs across various layers. In addition, we design a novel Multiple Sparse Transformer to enhance the capture of the global context from the sparse graph. This block selectively mines significant global contextual information along spatial and channel dimensions, respectively. Ultimately, we present the Multi-layer Sparse Graph Attention Network (MSGA-Net), a framework designed to predict probabilities of correspondences as inliers and to recover camera poses. Experimental results demonstrate that our proposed MSGA-Net surpasses state-of-the-art methods on challenging indoor and outdoor datasets. Code will be available at https://github.com/gongzhepeng/MSGA-Net.

Abstract:
Modern Siamese trackers mainly rely on classifying and regressing pre-defined anchor boxes or per-pixel points, which are assigned as positive and negative samples based on box intersection-over-union (IoU) or point distance with the corresponding ground truth for training. However, this rigid configuration potentially involves some noisy and ambiguous positive samples, leading to an inconsistency problem between classification and regression, which limits the tracking performance. In this paper, we propose a novel probabilistic assignment approach that dynamically determines positive/negative samples for each instance. To be specific, we first customize the confidence scores of positive candidates by comprehensively exploring the outputs from both classification and regression heads, and fit these scores as a probability distribution. Therefore, it is intuitive to conduct adaptive label assignment according to their probabilities. Then, we also consider a dynamic re-weighting factor for each positive sample, jointly optimizing the classification and regression losses in a synchronized manner. Moreover, we introduce a decoupled IoU prediction branch to bridge the gap between the training and inference objectives for accurate tracking. Thanks to well-aligned procedures, our method significantly improves the performance of both CNN-based and Transformer-based trackers. Extensive experiments conducted on several tracking benchmarks, including LaSOT and GOT-10k, demonstrate the effectiveness and efficiency of the proposed probabilistic assignment tracker.
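
For context, the fixed IoU-based assignment that the abstract calls a rigid configuration can be sketched as follows. This illustrates only the conventional baseline, not the proposed probabilistic assignment. The thresholds `pos_thr` and `neg_thr`, the box format `(x1, y1, x2, y2)`, and the function names are assumptions made for the example.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(anchors, gt_box, pos_thr=0.6, neg_thr=0.3):
    """Fixed-threshold assignment: 1 = positive, 0 = negative, -1 = ignored.

    The threshold values are hypothetical; real trackers tune them.
    """
    labels = []
    for anchor in anchors:
        iou = box_iou(anchor, gt_box)
        if iou >= pos_thr:
            labels.append(1)
        elif iou <= neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


gt = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
labels = assign_labels(anchors, gt)
```

Because the thresholds are fixed in advance, an anchor with IoU just above `pos_thr` is treated the same as a perfectly aligned one, which is the kind of ambiguity the abstract describes.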

Abstract:
Mainstream methods of multi-person pose estimation are not end-to-end. Recently, some methods build an end-to-end framework based on the DETR framework, aiming to eliminate the need for hand-crafted modules like heuristic grouping and NMS post-processing. However, these DETR-based methods suffer from a heavy memory burden of processing the high-resolution backbone feature maps with transformers. In this paper, we propose an end-to-end multi-person pose estimation method with a fully convolutional network, termed EFCPose. Different from DETR-based methods, it directly predicts instance-aware poses in a pixel-wise manner with lightweight convolutional heads, avoiding the heavy memory burden. Overall, our method adopts the center-offset formulation and a one-to-one label assignment strategy to achieve the multi-person pose estimation in an end-to-end manner. The main contribution of our fully convolutional heads includes two aspects. On the one hand, we propose an unaligned center-offset representation to learn more reliable semantic centers to replace the inconsistent geometric centers, improving the performance of instance detection. On the other hand, we propose a novel regression strategy named limb-aware adaptive regression, which leverages separate adaptive points to convert challenging long-range offsets into simplified short-range offsets and incorporates limb constraints to elevate the regression quality of joint offsets. Compared with current DETR-based end-to-end methods, EFCPose avoids high computational complexity and achieves higher accuracy. Extensive experiments on COCO Keypoint and CrowdPose benchmarks show that EFCPose outperforms other state-of-the-art bottom-up and single-stage methods without flipping augmentation.
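
A center-offset formulation, in its generic form, decodes each pose as a center location plus per-joint offsets. The sketch below shows that generic decoding step only. It assumes a dense center heatmap and per-pixel offset lists, uses a hypothetical score threshold `score_thr`, and does not implement the unaligned center representation or limb-aware adaptive regression described in the abstract.

```python
def decode_poses(center_heatmap, offsets, score_thr=0.5):
    """Decode poses from a center heatmap and per-pixel joint offsets.

    center_heatmap: H x W nested list of center scores.
    offsets:        H x W nested list; each entry is a list of (dx, dy)
                    offsets, one per joint, relative to that pixel.
    """
    poses = []
    for y, row in enumerate(center_heatmap):
        for x, score in enumerate(row):
            if score >= score_thr:
                # Each joint is the center position shifted by its offset.
                joints = [(x + dx, y + dy) for dx, dy in offsets[y][x]]
                poses.append({"center": (x, y), "score": score, "joints": joints})
    return poses


heatmap = [[0.1, 0.9], [0.2, 0.0]]
zero = [(0.0, 0.0), (0.0, 0.0)]
offsets = [[zero, [(-1.0, 0.0), (0.5, 1.5)]], [zero, zero]]
poses = decode_poses(heatmap, offsets)
```

With one-to-one label assignment during training, each person is expected to fire at a single center, so no NMS step is included in this sketch.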

Abstract:
Collecting pixel-aligned hazy/haze-free image pairs in the real world is arduous for fully supervised image dehazing. Alternatively, methods employing unpaired hazy/clear images have been developed, yet their ability to learn the content information of the hazy images is easily disturbed by content-independent clear images, causing artifacts, particularly for thick hazy images. To address the above issues, we propose a new reference-based image dehazing paradigm with hazy/reference images, where the reference image is clear and taken at the same scene as the hazy image. Therefore, how to maximize the reference value from the hazy/reference images with similar content but unaligned pixels becomes a key issue. Here, we construct a reference-based contrastive learning framework to realize the effective utilization of hazy/reference image pairs. Specifically, internal contrastive learning is designed to preserve the local content invariance between the dehazed images and hazy images in a patch-wise contrastive manner, while external contrastive learning learns the global content consistency between the dehazed images and reference images in an overall contrastive manner. Additionally, we design a style consistency loss committee consisting of a regular adversarial loss and a style loss. The former aims to ensure that each dehazed image is consistent with the overall style distribution of the entire reference set, while the latter is intended to make each dehazed image match the specific style of its corresponding reference image. Extensive experiments corroborate that the reference-based dehazing paradigm is advisable and reliable, and the proposed method performs favorably against other state-of-the-art methods.

Abstract:
In recent years, learning-based methods have achieved significant advancements in multi-exposure image fusion. However, two major stumbling blocks hinder the development, including pixel misalignment and inefficient inference. Reliance on aligned image pairs in existing methods causes susceptibility to artifacts due to device motion. Additionally, existing techniques often rely on handcrafted architectures with huge network engineering, resulting in redundant parameters, adversely impacting inference efficiency and flexibility. To mitigate these limitations, this study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion. Specifically, targeting the extreme discrepancy of exposure, we propose the self-alignment module, leveraging scene relighting to constrain the illumination degree for following alignment and feature extraction. Detail repletion is proposed to enhance the texture details of scenes. Additionally, incorporating a hardware-sensitive constraint, we present the fusion-oriented architecture search to explore compact and efficient networks for fusion. The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios. Moreover, it significantly reduces inference time by 69.1%. The code will be available at https://github.com/LiuZhu-CV/CRMEF.

Abstract:
Single image dehazing has been widely explored. However, as far as we know, no comprehensive study has been conducted on the robustness of well-trained dehazing models. Therefore, there is no evidence that dehazing networks can resist malicious attacks. In this paper, we focus on designing a group of attack methods based on first-order gradients to verify the robustness of existing dehazing algorithms. By analyzing the general purpose of the image dehazing task, four attack methods are proposed: the predicted dehazed image attack, the hazy layer mask attack, the haze-free image attack, and the haze-preserved attack. The corresponding experiments are conducted on six datasets with different scales. Further, a defense strategy based on adversarial training is adopted to reduce the negative effects caused by malicious attacks. In summary, this paper defines a new challenging problem for the image dehazing area, which we call adversarial attack on dehazing networks (AADN). Code is available at https://github.com/Xiaofeng-life/AADN_Dehazing.

Abstract:
Point cloud registration is a critical task in various 3D applications. Supervised approaches are restricted by the difficulty and cost of acquiring ground-truth annotations. Thus, unsupervised point cloud registration has emerged as a promising alternative. However, existing unsupervised methods often overlook the importance of feature interactions, leading to feature matching ambiguity. To address these challenges, we propose an unsupervised point cloud registration framework termed Global Topology-aware Interactions Network (GTINet), which contains a global structural relations (GSR) module and a contextual topological interactions (CTI) module. The GSR module transforms local features into global features through global graph convolutions. Based on the obtained global features, the CTI module learns geometric feature similarities and relative positional knowledge for both the source and target point clouds. The CTI module further learns contextual feature interactions through topology-aware attention layers. By improving the discriminativeness of features, our GTINet reduces the feature matching ambiguity caused by local structural similarity. Extensive experiments demonstrate that our method achieves state-of-the-art unsupervised registration performance on the ModelNet40, 7Scene, and KITTI datasets. Our work provides a novel perspective for conducting unsupervised point cloud registration. We will release our code for future research.

Abstract:
In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.

Abstract:
Palmprint recognition has seen significant advancements and garnered considerable attention recently. However, deep learning methods have yet to effectively incorporate insights from traditional approaches to extract palmprint-specific features. Moreover, intra-class spatial variation problems, which degrade the recognition performance, have not been adequately addressed. To tackle these limitations, this study proposes an Aligned Multilevel Gabor Convolution Network (AMGNet) to identify the informative and salient aspects of palmprints. The network unifies a multilevel Gabor feature fusion branch with a spatial alignment branch, enabling the joint mining of aligned multilevel features specific to palmprints. Within the feature fusion branch, we incorporate two specialized Gabor convolution modules: one targets the principal lines of the palm, while the other focuses on the wrinkles, augmenting the discriminative power of the acquired features. To enhance the model’s robustness against within-class variations, we design a spatial alignment branch that rectifies the spatial positions of palmprints. In conjunction with this, we introduce a novel direction-based CosAngle loss function to facilitate geometric alignment among samples from the same palms while spatially distancing those from different palms. Furthermore, we construct a palmprint database consisting of 3,000 palms from 1,500 individuals to explore large-scale population potential. Extensive experimental results on six benchmark datasets demonstrate that our proposed method outperforms other popular approaches in palmprint recognition tasks.

Abstract:
Current methods for detecting deep fakes concentrate on specific patterns of forgery such as noise characteristics, local textures, or frequency statistics. These approaches assume that training and test sets exhibit similar data distributions, which causes severe performance drops and limits broader applications when migrating to unseen domains. Existing works show that reconstruction learning is effective in capturing unseen forgery clues. However, 2D reconstruction is insufficient and cannot handle non-frontal face reconstruction, while 3D reconstruction provides more critical details of facial structure and finds accurate forgery regions. In this paper, we propose a bi-source reconstruction based classification network (BRCNet) that incorporates 2D and 3D reconstruction as supervision to learn the optimal feature representation. In detail, we employ an encoder-decoder architecture to facilitate reconstruction learning, enhancing the learned representations to detect unknown forgery patterns. To further capture forgery evidence across multiple scales, instead of using encoder features from the reconstruction network only, we build a feature improvement network to combine feature details from encoder and decoder features in a multi-scale fashion. In addition, we use the reconstruction difference to supervise the feature aggregation, which enables detecting the subtle and trivial discrepancies between fake and real video frames. Extensive experiments are conducted to validate the performance of our proposed method on several deep fake benchmarks. The results demonstrate the efficacy of our approach, offering promising results and showcasing its potential for practical applications. The source code is available at https://github.com/cccvl/BRCNet.

Abstract:
The existing image-to-video translation methods generally follow a frame-by-frame generative paradigm, while extracting the temporal information from a reference video or an audio stream. Inspired by the recent success in text-guided image generation, we explore a more challenging but promising task, Text-guided Image-to-Video (TI2V) translation. Given an image and a brief text description as input, TI2V aims to generate a facial expression video following the image and text. To this end, we first propose an automatic video captioning pipeline to generate dense textual descriptions for facial video datasets, using both expression labels and action units. These dense textual descriptions provide precise semantic guidance for TI2V learning. Then we design and train an efficient framework, FaceCLIP, on these datasets to deal with the TI2V translation task. FaceCLIP adopts a video autoencoder to model the temporal information of training videos, and a pretrained CLIP model to embed the video frames and the text description. We design a reconstruction loss and an embedding alignment loss to train the autoencoder to obtain the text-guided video generative ability. Recognizing that expressions are closely tied to facial landmark motions, the reconstruction loss is applied to facial landmarks rather than each video frame, significantly enhancing training efficiency. We compare FaceCLIP with several potential baseline methods, and extensively evaluate the performance using multiple metrics. Both qualitative and quantitative results validate the superiority of FaceCLIP in terms of both visual quality and expression-text consistency. Moreover, the unique ability of FaceCLIP to generate videos based on abstract texts demonstrates its stronger generalization capability.

Abstract:
Source-free unsupervised domain adaptation (SFUDA) aims to conduct prediction on the target domain by leveraging knowledge from the well-trained source model. Due to the absence of source data in the SFUDA setting, existing methods mainly build the target classifier by fine-tuning the source model with empirical adaptation losses. Although these methods have achieved somewhat promising results, nearly all of them suffer from the closed-fitting dilemma: because labeled source data are absent, their models are affected more by easy-to-distinguish instances than by hard-to-distinguish ones. To address this issue, we propose Dipolar Confidence Learning (DCL) for SFUDA. Specifically, we conduct positive confidence learning on the samples with standard outputs to avoid overfitting of the model to these samples. In contrast, we perform negative confidence learning for the samples with abnormal outputs to optimize the complementary label, which forces the network to pay more attention to these confusing samples. Furthermore, to achieve more generalized domain alignment, confidence-based fuzzy mixup and rotation-based self-supervised learning are both constructed to boost the representation ability of the target model. Finally, extensive experiments are conducted to demonstrate the effectiveness and performance superiority of the proposed method.

Abstract:
Owing to the inherent complementarity among LiDAR, camera, and IMU, growing effort has been devoted to laser-visual-inertial SLAM recently. The existing approaches, however, are limited in two aspects. First, at the front-end, they usually employ a discrete-time representation that requires high-precision hardware/software synchronization and are based on geometric laser features, leading to low robustness and scalability. Second, at the backend, visual loop constraints suffer from scale ambiguity, and the sparseness of the point cloud deteriorates scan-to-scan loop detection. To solve these problems, for the front-end, we propose a continuous-time laser-visual-inertial odometry which formulates the carrier trajectory in continuous time, organizes point clouds in probabilistic submaps, and jointly optimizes the loss terms of laser anchors, visual reprojections, and IMU readings, achieving accurate pose estimation even with fast motion or in unstructured scenes where it is difficult to extract meaningful geometric features. At the backend, we propose building 5-DoF laser constraints by matching projected 2D submaps and 6-DoF visual constraints via laser-aided visual relocalization, ensuring mapping consistency in large-scale scenes. Results show that our framework achieves high-precision estimation and is more robust than its counterparts when the carrier works in large scenes or with fast motion. The relevant code and data are open-sourced at https://cslinzhang.github.io/Ct-LVI/Ct-LVI.html.

Affiliations: Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China; National-Regional Key Technology Engineering Laboratory for Medical Ultrasound and the Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University, Shenzhen, China; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; MoE Key Laboratory of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Hyperspectral imaging (HSI) unlocks huge potential for a wide variety of applications relying on high-precision pathology image segmentation, such as computational pathology. It can acquire biochemical properties from histological specimens, even those invisible to the naked eye. Since 1) spectra contain discriminative and continuous patterns for differentiating tissues/cells, and 2) the discriminability of spectra relies on both fine-grained relations in the high-resolution spectrum and coarse relations in the low-resolution spectrum, the key to achieving high-precision hyperspectral pathology image segmentation is to appropriately model the intra- and inter-scale context, especially for spectra. In this paper, we propose a spectral transformer (SpecTr) for hyperspectral pathology image segmentation, which first captures global context for intra-scale spectral features, and subsequently extracts coarse and fine-grained discriminative spectral information from inter-scale features, respectively. To learn intra-scale spectral context, we propose a Spectral Attentive Module (SAM). Unlike the existing Transformer model that is designed for modalities such as natural images, our proposed SAM is efficient in capturing sparse and pivotal spectral context while avoiding the heterogeneous underlying distributions and noise of different bands. Besides, to reduce the computational complexity of the HSI segmentation model, we further propose a global-local attention module to effectively learn a condensed spectral feature. Experiments show that HSIs can become a more powerful image modality for understanding microscopic pathology images than RGB images, and the proposed SpecTr outperforms other competing methods for hyperspectral pathology image segmentation, with an improvement of 3% compared with the popular 3D-nnUNet and other transformer-based methods. Our code is available at https://github.com/DeepMed-Lab-ECNU/SpecTr.

Abstract:
Unsupervised anomaly detection is required to detect/segment anomalous samples/regions that deviate from the normal pattern while learning only through the normal sample category. Towards this end, this paper proposes a novel framework for anomaly detection by introducing normal images as guidance called Normal Image Guided Segmentation Framework (NIGSF). It consists of a Normal Guided Network (NGN) and a Saliency Augmentation Module (SAM). NGN constructs the contrast set, which is a candidate set for extracting normal sample features. Then, a normal feature extractor is developed to extract detailed and complete features containing normal semantic information as guidance features. Meanwhile, the guidance feature fusion module is introduced to realize normal semantic guidance in the feature space, and then the segmentation module discriminates the features that are different from the normal guidance features as anomalies. SAM aims to generate forged anomaly samples utilizing available normal samples. It introduces saliency maps and random Perlin noise to generate saliency Perlin noise maps and then to generate diverse forged anomaly samples. Extensive experiments are conducted to evaluate the performance of NIGSF on three anomaly detection benchmark datasets. The results demonstrate the effectiveness of each proposed module and the superiority of the proposed method. Specifically, NIGSF outperforms the runner-up by 5.4% in terms of anomaly segmentation AP metric.

Abstract:
In recent years, the proliferation of smartphones has led to an upsurge in the digitization of document files via these portable devices. However, images captured by smartphones often suffer from distortions, thereby negatively affecting digital preservation and downstream applications. To address this issue, we introduce DRNet, a novel deep network for document image rectification. Our approach is based on three key designs. Firstly, we exploit the intrinsic geometric consistency inherent in document images to guide the learning process of distortion rectification. Secondly, we design a coarse-to-fine rectification network to leverage the representations derived from the distorted document image, thereby enhancing the rectification result. Thirdly, we propose a unique perspective for supervising the learning of rectification networks, where undistorted document images are employed for supervision, which is free of warping mesh as ground truth in existing methods. Technically, both low-level pixel alignment and high-level semantic alignment jointly contribute to the learning of the mapping relationship between deformed document images and distortion-free ones. We evaluate our method on the challenging DocUNet Benchmark dataset, where it sets a series of state-of-the-art records, demonstrating the superiority of our approach compared to existing learning-based solutions. Additionally, we conduct a comprehensive series of ablation experiments to further validate the effectiveness and merits of our method.

Abstract:
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances as well as identify the corresponding event categories with only video-level category labels for training. Most previous efforts have been devoted to refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning. None of them has noticed the imbalanced feature learning between different modalities in the task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, where a novel and effective metric function is designed to measure the imbalanced feature learning between audio and visual modalities. Furthermore, by examining in depth the principle of traditional WS-AVVP pipelines, two additional challenges are identified: confusing multimodal calculation hampers the precise measurement of audio-visual imbalanced feature learning, and the global supervision provided by video-level labels cannot provide explicit guidance for robust semantic feature learning in each action subspace. To cope with the above issues, the modality-separated decision unit (MSDU) and semantic-aware feature extractor (SAFE) are designed for precise measurement of imbalanced feature learning and unambiguous semantic-aware feature extraction, respectively. Comprehensive experiments are conducted on public benchmarks and the corresponding experimental results demonstrate the effectiveness of our proposed method.

Affiliations: Fujian Key Laboratory for Intelligent Processing and Wireless Transmission of Media Information, College of Physics and Information Engineering, Fuzhou University, Fuzhou, China; Fujian Key Laboratory of Network Computing and Intelligent Information Processing, College of Computer and Data Science, Fuzhou University, Fuzhou, China; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China; Fujian Key Laboratory for Intelligent Processing and Wireless Transmission of Media Information, College of Physics and Information Engineering, and the Fujian Science and Technology Innovation Laboratory for Optoelectronic Information, Fuzhou University, Fuzhou, China; Department of Electrical Engineering, Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract:
Image co-segmentation and co-localization exploit inter-image information to identify and extract foreground objects with a batch mode. However, they remain challenging when confronted with large object variations or complex backgrounds. This paper proposes a multi-view graph embedding (MV-Gem) learning scheme which integrates diversity, robustness and discernibility of object features to alleviate this phenomenon. To encourage the diversity, the deep co-information containing both low-layer general representations and high-layer semantic information is generated to form a multi-view feature pool for comprehensive co-object description. To enhance the robustness, a multi-view adaptive weighted learning is formulated to fuse the deep co-information for feature complementation. To ensure the discernibility, the graph embedding and sparse constraint are embedded into the fusion formulation for feature selection. The former aims to inherit important structures from multiple views, and the latter further selects important features to restrain irrelevant backgrounds. With these techniques, MV-Gem gradually recovers all co-objects through optimization iterations. Extensive experimental results on real-world datasets demonstrate that MV-Gem is capable of locating and delineating co-objects in an image group.

Abstract:
Video-based person re-identification (re-ID) aims to match the same pedestrian across video sequences captured by non-overlapping cameras. Video re-ID methods generally adopt frame-level feature extraction for different video frames, but they still lack effective spatio-temporal interaction, which easily leads to the multi-frame misalignment problem. In this paper, we propose a Hierarchical Attention-aware Spatio-temporal Interaction (HASI) network, including an Attention-aware Temporal Interaction (ATI) module and a Hierarchical Local-spatial Enhancement (HLE) module for video-based person re-ID. In order to avoid the spatial misalignment between video frames, the ATI module employs multiple Frame-to-Frame Temporal Interaction (2FTI) blocks with the Multi-head Inter-frame Alignment Attention (MIAA) to make the current frame iteratively interact with each remaining frame of a video in a positive single-cycle manner, rather than only interacting with the adjacent frame or directly building the relationship of all frames at once. This module can not only obtain the long-range non-adjacent temporal information, but also learn the pairwise frame-to-frame relationships. Moreover, the HLE module is designed to enhance the local fine-grained features from multiple Transformer layers, whilst delivering low-level information to further enrich middle-level and high-level semantic knowledge. Thus, our method can learn multi-perspective pedestrian information, including inter-frame long-range interaction information and intra-frame multi-layer global and local information. Extensive experiments demonstrate the superiority of the proposed HASI method compared with the state-of-the-art methods on the three challenging video-based re-ID datasets, i.e., MARS, iLIDS-VID, and PRID-2011.

Abstract:
The spike camera is a bio-inspired sensor with ultra-high temporal resolution and low energy consumption. It captures visual signals using an “integrate-and-fire” mechanism and outputs a continuous stream of binary spikes. Reconstructing image sequences from spike streams is critical for spike cameras. Several reconstruction methods have been proposed in recent years. However, the computational cost of these methods is relatively high. Inspired by the fact that spiking neural networks (SNNs) are energy efficient and inherently support time-series signal processing, we propose a lightweight SNN for spike camera image reconstruction (abbreviated to SSIR). Experimental results show that SSIR achieves performance comparable to state-of-the-art (SOTA) methods at much lower computation and energy cost.
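
The “integrate-and-fire” mechanism mentioned above can be illustrated with a minimal sketch for a single pixel. Everything here is an assumption made for illustration: the function names, the unit threshold, the subtractive reset, and the window-average estimate are not taken from the paper, and the estimate is a naive baseline rather than SSIR.

```python
def integrate_and_fire(intensities, threshold=1.0):
    """Turn one pixel's per-step light intensities into a binary spike train."""
    accumulator = 0.0
    spikes = []
    for intensity in intensities:
        accumulator += intensity          # integrate incoming light
        if accumulator >= threshold:      # fire once the threshold is reached
            spikes.append(1)
            accumulator -= threshold      # assumption: subtractive reset
        else:
            spikes.append(0)
    return spikes


def firing_rate_estimate(spikes, threshold=1.0):
    """Naive intensity estimate: threshold times the spike rate in the window."""
    return threshold * sum(spikes) / len(spikes)


# A constant intensity of 0.25 per step fires once every four steps.
spikes = integrate_and_fire([0.25] * 8)
estimate = firing_rate_estimate(spikes)
```

Brighter pixels reach the threshold sooner and therefore fire more often, which is why the spike rate over a window carries intensity information.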

Abstract:
Compressed video action recognition offers the advantage of reducing decoding and inference time compared to the RGB domain. However, the compressed domain poses unique challenges with different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but the redundant information may interfere with the recognition task. There are two modalities in P-frames, residual (R) and motion vector (MV). Although they carry less information, they can reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine different modal variability in P-frames and complement the orthogonal feature vector. Besides, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluated our proposed DSDMT on three public benchmarks with different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.

Abstract:
Temporal action localization (TAL) is a prevailing task due to its great application potential. Existing works in this field mainly suffer from two weaknesses: (1) They often neglect the multi-label case and only focus on temporal modeling. (2) They ignore the semantic information in class labels and only use the visual information. To solve these problems, we propose a novel Co-Occurrence Relation Module (CORM) that explicitly models the co-occurrence relationship between actions. Besides the visual information, it further utilizes the semantic embeddings of class labels to model the co-occurrence relationship. The CORM works in a plug-and-play manner and can be easily incorporated into existing sequence models. By considering both visual and semantic co-occurrence, our method achieves high multi-label relationship modeling capacity. Meanwhile, existing datasets in TAL always focus on low-semantic atomic actions. Thus we construct a challenging multi-label dataset UCF-Crime-TAL that focuses on high-semantic actions by annotating the UCF-Crime dataset at frame level and considering the semantic overlap of different events. Extensive experiments on two commonly used TAL datasets, i.e., MultiTHUMOS and TSU, and our newly proposed UCF-Crime-TAL demonstrate the effectiveness of the proposed CORM, which achieves state-of-the-art performance on these datasets.

Abstract:
Stylized image captioning (SIC) aims to generate captions with a target style for images. The biggest challenge is that the collection and annotation of stylized data are difficult and time-consuming. Most existing methods learn from massive factual captions or an additional stylized book corpus independently to assist in generating stylized captions, which ignores the core relationships within existing image-fact-style trident data. In this paper, we propose a novel image-fact-style trident semantic framework, TridentCap, for stylized image captioning, which includes an image-fact semantic fusion encoder (SFE) and a trident stylization decoder (TSD). Unlike existing methods, we directly mine the core relationship in image-fact-style trident data and use factual semantics and images to build a cross-modal semantic feature space, achieving coherence between image and text. Specifically, SFE aims to learn the image-related prior language knowledge from factual text and leverage fine-grained region-level semantic correlations of image and factual text to achieve cross-modal semantic information alignment and integration. TSD is designed to decouple the dual-source fused semantic feature based on the target style to achieve stylized caption generation. In addition, we design a pseudo labels filter (PLF) to obtain and expand massive image-fact-style trident data by building pseudo stylized annotations for all image-fact data in traditional caption datasets, which can further strengthen stylized caption learning. It is a generic algorithm for the problem of insufficient data and can be applied to any existing stylized caption model. We conduct extensive experiments on the SentiCap and FlickrStyle datasets and achieve consistent improvement on almost all metrics. Our code will be released at: https://github.com/WangLanxiao/TridentCap_Code.

Abstract:
Image deblurring based only on the blurry image is challenging as motion information is lost while imaging. Event cameras capture the texture of moving objects in high temporal resolution with asynchronous events. In this paper, we extract motion features from events and fuse them with background features from the image for event-based image deblurring. Spiking neural network (SNN), a widely recognized event feature extractor, is well suited for motion feature extraction due to its high temporal resolution. However, extracting motion information from events exclusively with SNN is challenging. We propose a novel Temporal-local-Spatio Spiking Transformer (TSST) to extract motion intensity and motion attention regions in the spatio-temporal domain. Motion intensity extracted from spiking features is represented as a high temporal resolution motion attention map to guide the fusion of the two networks. In the temporal domain, motion intensity maps spiking features to CNN features as motion features to avoid blurring. In the spatial domain, the motion intensity shows the motion regions and gives the weight of the motion feature during fusion. Moreover, a hybrid feature extraction encoder (HFEE) is introduced, which fully fuses the motion and background features for deblurring. The gradient is back-propagated from CNN to SNN, and the hybrid deblurring network is jointly optimized. We evaluated the performance of our model on the public dataset GoPro and a real event dataset we captured. Codes and pretrained models are available at https://github.com/XDULzx/MotionSNN.

Abstract:
The Segment Anything Model (SAM) has achieved great success in the field of natural image segmentation. Nevertheless, SAM tends to consider shadows as background and therefore does not perform segmentation on them. In this paper, we propose ShadowSAM, a simple yet effective framework for fine-tuning SAM to detect shadows. Besides, by combining it with a long short-term attention mechanism, we extend its capability for efficient video shadow detection. Specifically, we first fine-tune SAM on the ViSha training set by utilizing the bounding boxes obtained from the ground truth shadow mask. Then during the inference stage, we simulate user interaction by providing bounding boxes to detect a specific frame (e.g., the first frame). Subsequently, using the detected shadow mask as a prior, we employ a long short-term network to learn spatial correlations between distant frames and temporal consistency between adjacent frames, thereby achieving precise shadow information propagation across video frames. Extensive experimental results demonstrate the effectiveness of our method, with a notable margin over the state-of-the-art approaches in terms of MAE and IoU metrics. Moreover, our method exhibits accelerated inference speed compared to previous video shadow detection approaches, validating the effectiveness and efficiency of our method. The source code is now publicly available at https://github.com/harrytea/Detect-AnyShadow.

Abstract:
A growing number of earth observation satellites are able to simultaneously gather multimodal images of the same area due to the expanding availability and resolution of satellite remote sensing data. This paper proposes a novel multimodal balanced self-learning interaction network (MBSI-Net) for the classification task. It involves a dual-branch teacher-student network that enables knowledge interaction and transfer between the modalities. Firstly, in order to introduce statistical information in addition to local and global structural information, a texture feature equalization module (TFE-Module) is proposed. This can enhance the texture information of features through histogram equalization and further improve the representation ability of features. Secondly, to enable the student network to provide timely feedback, the paper proposes a feature fusion module (F2-Module) that models and enhances teacher features through the student network. This helps to improve classification accuracy by incorporating information from multimodal images. Finally, the paper proposes a loss function based on structural similarity analysis to ensure balanced self-learning between the student and the teacher networks. Taking the multispectral (MS) and the panchromatic (PAN) images of the same scene as examples, experiments show that the proposed method achieves good results on multiple datasets compared with other methods. Therefore, it offers an effective method for classifying and fusing multimodal data.
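
Histogram equalization, which the TFE-Module is said to build on, can be sketched in its classical form on a flat list of integer gray levels. This is the textbook operation only; how the paper applies it to feature maps is not described in the abstract, and the function name and the small `levels` value in the example are illustrative assumptions.

```python
def equalize_histogram(pixels, levels=256):
    """Classical histogram equalization for a non-empty list of gray levels."""
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)   # CDF at the darkest level present
    if total == cdf_min:                      # constant image: nothing to spread
        return list(pixels)

    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[value] - cdf_min) * scale) for value in pixels]


# Two neighbouring gray levels are stretched to the ends of the range.
stretched = equalize_histogram([1, 1, 1, 2], levels=4)
```

The mapping spreads the occupied gray levels over the full range, which increases contrast where the original histogram was concentrated.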

Abstract:
Group detection, especially for large-scale scenes, has many potential applications for public safety and smart cities. Existing methods fail to cope with frequent occlusions in large-scale scenes with multiple people, and struggle to effectively utilize spatio-temporal information. In this paper, we propose an end-to-end framework, GroupTransformer, for group detection in large-scale scenes. To deal with the frequent occlusions caused by multiple people, we design an occlusion encoder to detect and suppress severely occluded person crops. To explore the potential spatio-temporal relationship, we propose spatio-temporal transformers to simultaneously extract trajectory information and fuse inter-person features in a hierarchical manner. Experimental results on both large-scale and small-scale scenes demonstrate that our method achieves better performance compared with state-of-the-art methods. On large-scale scenes, our method boosts precision and F1 score by more than 10%. On small-scale scenes, our method still improves the F1 score by more than 5%. We will release the code for research purposes.

Abstract:
Perceptual image encryption degrades image quality by selectively encrypting some key information of the plain images. The encrypted images are partially perceptible according to the security or quality requirements. Although several types of attacks have tried to infer privacy information from the encrypted images, they can only either extract statistical information or enhance image sketch. In this paper, we take one step further and fully recover the plain images from perceptually encrypted counterparts by designing a non-local attack network (NL-ANet). NL-ANet is composed of densely cascaded multiscale non-local modules (MSNL) and a hierarchical attention fusion module (HAFM). In particular, to better reconstruct encryption distortion, we introduce MSNL to capture powerful hierarchical features from different scales, and propose HAFM to adaptively aggregate and enhance informative hierarchical features for reconstruction. We also propose a new instantiation of the multi-head non-local block with channel attention (MHCA) to explore the long-range dependencies of global contextual information. Extensive experiments show that NL-ANet is encryption-agnostic and superior on different perceptual encryption schemes under different encryption strengths. NL-ANet also achieves better performance than state-of-the-art image restoration methods.

Abstract:
Recent years have witnessed strong demands for video composition in online video communications, enabling a series of new functionalities for video conferencing including virtual conference rooms, virtual reunions, and virtual backgrounds. In video composition, typically the foreground videos including the human bodies and faces are subject to compression due to the constrained bandwidth, whereas the virtual background is uncompressed and in pristine quality. The disharmony caused by the incoherent quality of foreground and background, which may worsen the quality of experience, has not been extensively studied. In this paper, we focus on this particular problem and present an image quality harmonization framework. Our principle is to align the quality of the background with that of the foreground such that they share similar levels of distortion. This is achieved by inferring the quantization parameter for background compression based on the foreground information. In particular, we aim to learn the quality and compression parameters in a self-supervised manner without laborious human annotation. Furthermore, a large dataset is constructed to provide sufficient training samples and testing scenarios for validation. The composite videos show superior harmonized quality in both quantitative and qualitative comparisons, demonstrating the effectiveness of the proposed framework.

Abstract:
As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information. Second, in the presence of more information, feature extraction has to be better connected with a given question. Third, features from different modalities need to be sufficiently correlated and fused. To address this situation, this work proposes a novel framework for the multimodal question answering task. It characterises an audiovisual scene at both global and local levels, and within each level, the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding through focusing on both general and specific representations as well as aggregating multimodalities by prioritizing question-related information. As experimentally demonstrated, our method significantly improves the existing audio-visual question answering performance, with averaged absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, the ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.

Abstract:
Incomplete multi-view clustering (IMVC), excavating diversity and consistency from multiple incomplete views, has aroused widespread research enthusiasm. Nevertheless, most existing methods still encounter the following issues: 1) they generally concentrate on pair-wise instance correlation, which incurs at least quadratic complexity and precludes them from being applied at large scales; 2) they only concentrate on pair-wise instance relevance, while ignoring the discriminative correlation hidden across views. To overcome these drawbacks, we propose the Self-Completed Bipartite Graph Learning (SCBGL) method for fast IMVC, which adaptively learns a self-completed consensus bipartite graph with the guidance of global information. Specifically, SCBGL learns the consensus anchor matrix shared among diverse views and further constructs a consensus intra-view bipartite graph with missing instances to explore the diversity and complementarity underlying different views. Meanwhile, we concatenate all the multiple features with projection learning to learn global anchors that are then employed to construct an inter-view bipartite graph. Furthermore, SCBGL utilizes the abundant inter-view information to guide the self-completion of the consensus intra-view bipartite graph. By devising an alternating iterative strategy, we present an efficient algorithm with linear time complexity to solve the proposed SCBGL model. Numerous experiments conducted on large-scale datasets substantiate the superior performance of SCBGL over state-of-the-art methods.

Abstract:
Forecasting human trajectories is an essential technology in intelligent surveillance systems, robot navigation systems, autonomous driving systems, etc. Most of the trajectory prediction models based on RNNs and Transformers use autoregressive methods to generate future trajectories, which may accumulate displacement errors and are inefficient for training and testing. To address these problems, we propose a novel decoder named MRG decoder, which introduces a Mapping-Refinement-Generation structure to generate trajectories in a non-autoregressive manner. Furthermore, we design the MRGTraj trajectory prediction model based on the proposed MRG decoder. Firstly, we employ a Transformer as an encoder to extract encoded features from the past trajectory. Secondly, we introduce an interaction-aware latent code generator to learn a Gaussian distribution from the social context among pedestrians for latent code sampling. Finally, we feed the encoded features to the MRG decoder and sample the latent code multiple times from the learned Gaussian distribution, providing additional inputs to the MRG decoder to generate multiple socially acceptable future trajectories. Experimental results on two public datasets, ETH and UCY, validate the effectiveness of the MRGTraj model. Besides, the MRGTraj model achieves superior prediction performance, with an improvement of 13.21% on the FDE metric and a 71.29% speed-up compared to state-of-the-art models. The code is available at https://github.com/wisionpeng/MRGTraj.
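
The FDE (final displacement error) metric reported above is commonly defined as the Euclidean distance between the predicted and ground-truth positions at the last time step. The sketch below shows that common definition, together with a best-of-K variant often used when several trajectories are sampled. The abstract does not state the evaluation protocol, so the best-of-K choice and all names here are assumptions.

```python
import math


def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the last predicted and last true position."""
    (px, py), (gx, gy) = predicted[-1], ground_truth[-1]
    return math.hypot(px - gx, py - gy)


def best_of_k_fde(samples, ground_truth):
    """Smallest FDE among K sampled trajectories (assumed protocol)."""
    return min(final_displacement_error(s, ground_truth) for s in samples)


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 0.0), (1.0, 1.0), (5.0, 4.0)],  # ends 5 units from the true endpoint
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],  # ends 1 unit from the true endpoint
]
best = best_of_k_fde(samples, truth)
```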

Abstract:
Existing salient object detection (SOD) methods mainly rely on U-shaped convolutional neural networks (CNNs) with skip connections to combine the global contexts and local spatial details that are crucial for locating salient objects and refining object details, respectively. Despite great successes, the ability of CNNs in learning global contexts is limited. Recently, the vision transformer has achieved revolutionary progress in computer vision owing to its powerful modeling of global dependencies. However, directly applying the transformer to SOD is suboptimal because the transformer lacks the ability to learn local spatial representations. To this end, this paper explores the combination of transformers and CNNs to learn both global and local representations for SOD. We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively. The asymmetric bilateral decoder also consists of two paths to process features from the transformer and CNN encoder paths, with communication at each decoder stage for decoding coarse salient object locations and fine-grained object details, respectively. Such communication between the two encoder/decoder paths enables ABiU-Net to learn complementary global and local representations, taking advantage of the natural merits of transformers and CNNs, respectively. Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive experiments demonstrate that ABiU-Net performs favorably against previous state-of-the-art SOD methods. The code is available at https://github.com/yuqiuyuqiu/ABiU-Net.

Affiliations: School of Information Engineering, and the Institute of Computer Applications, Henan Institute of Science and Technology, Xinxiang, China; School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China; School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; School of Computer Science, Nankai University, Tianjin, China

Abstract:
Underwater images typically suffer from various quality degradation issues due to the scattering and absorption of light, and these degraded images are of limited use for analysis and applications. To effectively solve these quality degradation issues, an underwater image enhancement method via weighted wavelet visual perception fusion is introduced, called WWPF. Concretely, we first present an attenuation-map-guided color correction strategy to correct the color distortion of an underwater image. Subsequently, we apply the maximum information entropy optimized global contrast strategy to the color-corrected image to obtain a global contrast-enhanced image. Meanwhile, we apply a fast integration optimized local contrast strategy to the color-corrected image to get a local contrast-enhanced image. To exploit the complementarity of the global contrast-enhanced image and the local contrast-enhanced image, we introduce a weighted wavelet visual perception fusion strategy to obtain a high-quality underwater image by fusing the high-frequency and low-frequency components of images at different scales. Our extensive experiments on three benchmarks validate that our WWPF outperforms the state-of-the-art methods both qualitatively and quantitatively. Besides, the underwater images processed by our WWPF also benefit practical underwater applications. The code is available at https://github.com/Li-Chongyi/WWPF_code.

Abstract:
Acquiring high-resolution 3D surface structures is a crucial task in computer vision as it provides more detailed surface textures and clearer structures. Photometric stereo can measure per-pixel surface normals of a 3D object using various shading cues. However, obtaining high-resolution images in a linear response photometric stereo imaging system can be challenging. Additionally, photometric stereo, as a per-pixel reconstruction method, requires higher-resolution surface normal maps to accurately depict complex surface structures, particularly in regions that demand more attention and precise reconstruction. Therefore, measuring high-resolution surface normals via low-resolution photometric stereo images is of great importance. Motivated by these observations, we propose a Super-resolution Photometric Stereo Network, namely SR-PSN. In order to address the issues of measuring the high-resolution surface normals from low-resolution photometric images, we mainly (1) apply a dual-position threshold normalization pre-processing scheme to effectively handle the spatially-varying reflectance of non-Lambertian surfaces, (2) adopt a local affinity feature module to learn the rich structural representation by explicitly revealing the neighbor relationships, (3) employ a parallel multi-scale feature extractor, which preserves high-resolution representations and deep feature extraction, and (4) propose a shared-weight regressor to handle the multi-scale features, to prevent the model from collapsing into learning non-important features related to a certain fixed scale. Extensive ablation experiments validate the effectiveness of our proposed modules. Furthermore, quantitative experiments conducted on public benchmarks demonstrate that SR-PSN outperforms state-of-the-art calibrated photometric stereo methods. Notably, SR-PSN achieves superior results while utilizing photometric stereo images with only half the resolution of other methods. It effectively restores the structure of complex surfaces, producing a high-resolution normal map.

Abstract:
Due to their inherent interactivity, time-sync comments on videos have attracted increasing attention and have been widely adopted in online video platforms. In addition to enhancing user engagement, time-sync comments provide abundant semantic information that can greatly enhance video understanding, which, however, is largely overlooked in mainstream video recommender systems. To address this issue, we propose a Hierarchical Multi-modal Attention Network (HMAN) to effectively utilize time-sync comments for recommendation. Specifically, we design a Multi-level Text Condense (MTC) Module to capture the accurate semantics of time-sync comments via text-level and vision-level condense operations. Then we propose a Range Convolution Block (RCB) to capture both visual and textual information from variable-length event segments by leveraging a variable receptive field. After that, we design a Hierarchical Multi-modal Branch Fusion (HMBF) Module to obtain a comprehensive multi-modal representation of the video with time-sync comments. Finally, with the obtained video representation, recommendation scores are obtained through its inner product with the user embedding. Extensive experiments demonstrate the effectiveness of the proposed HMAN, and ablation studies on different variants of HMAN further validate the utility of each component and the necessity of the hierarchical multi-modal branch fusion method.
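
The final scoring step described above, an inner product between a video representation and a user embedding, can be sketched as follows. The embeddings, video names, and helper functions are made-up placeholders for illustration; in HMAN the video representation would come from the network itself.

```python
def recommendation_score(user_embedding, video_embedding):
    """Inner product of a user embedding and a video representation."""
    return sum(u * v for u, v in zip(user_embedding, video_embedding))


def rank_videos(user_embedding, video_embeddings):
    """Return video names sorted from highest to lowest score."""
    return sorted(
        video_embeddings,
        key=lambda name: recommendation_score(user_embedding, video_embeddings[name]),
        reverse=True,
    )


# Hypothetical 3-dimensional embeddings, chosen only to make the arithmetic easy.
user = [1.0, 0.0, 2.0]
videos = {
    "video_a": [1.0, 1.0, 0.0],   # score 1.0
    "video_b": [0.0, 0.0, 1.0],   # score 2.0
    "video_c": [0.5, 0.0, 0.5],   # score 1.5
}
ranking = rank_videos(user, videos)
```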

Abstract:
Anchor-based incomplete multiview clustering has attracted growing interest recently because of its great success in effectively partitioning multimodal data. However, due to the absence of label information, the constructed anchors could be mismatched. Such an Anchor Mismatching Problem (AMP) will cause the structure of the generated bipartite graph to be chaotic, degrading the clustering performance. To tackle this issue, we design an algorithm termed Constructing Corresponding Anchors for Incomplete Multiview Clustering (CCA-IMC). Specifically, we first devise a permutation strategy to transform anchors on each view. Subsequently, we directly generate the consensus bipartite graph, which is shared by all incomplete views, from the transformed anchors rather than by fusing each view-specific bipartite graph. Afterwards, all anchors and permutation matrices as well as the consensus bipartite graph are jointly optimized in one common framework so as to promote each other. In this way, anchors are rearranged towards the correct matching relationship according to the consensus graph structure. In addition, our CCA-IMC is proven to have linear time and memory overheads, which allows it to scale to large-scale tasks. Extensive experiments on ten popular datasets give evidence of its superiority over current strong IMC competitors.

Abstract:
Low-light image enhancement aims to improve the perceptual quality of images captured in conditions of insufficient illumination. However, such images are often characterized by low visibility and noise, making the task challenging. Recently, significant progress has been made using deep learning-based approaches. Nonetheless, existing methods encounter difficulties in balancing global and local illumination enhancement and may fail to suppress noise in complex lighting conditions. To address these issues, we first propose a multi-scale illumination adjustment network to balance both global illumination and local contrast. Furthermore, to effectively suppress noise potentially amplified by the illumination adjustment, we introduce a wavelet-based attention network that efficiently perceives and removes noise in the frequency domain. We additionally incorporate a discrete wavelet transform loss to supervise the training process. In particular, the proposed wavelet-based attention network has been shown to enhance the performance of existing low-light image enhancement methods. This observation indicates that the proposed wavelet-based attention network can be flexibly adapted to current approaches to yield superior enhancement results. Furthermore, extensive experiments conducted on benchmark datasets and a downstream object detection task demonstrate that our proposed method achieves state-of-the-art performance and generalization ability.
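
To make the idea of a discrete wavelet transform loss concrete, here is a minimal sketch that compares one-level Haar coefficients of a prediction and a target with a mean absolute error. The abstract does not specify the wavelet, the number of levels, or the distance used, so the unnormalized 1-D Haar transform, the L1 distance, and the function names are all assumptions.

```python
def haar_dwt(signal):
    """One level of an unnormalized Haar transform on an even-length list."""
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]   # low-frequency part
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]   # high-frequency part
    return approx, detail


def wavelet_l1_loss(prediction, target):
    """Mean absolute difference between the Haar coefficients of two signals."""
    pred_approx, pred_detail = haar_dwt(prediction)
    tgt_approx, tgt_detail = haar_dwt(target)
    diffs = [
        abs(p - t)
        for p, t in zip(pred_approx + pred_detail, tgt_approx + tgt_detail)
    ]
    return sum(diffs) / len(diffs)


loss = wavelet_l1_loss([4, 2, 6, 8], [2, 2, 6, 6])
```

Penalizing the detail coefficients separately from the approximation coefficients is what lets such a loss target high-frequency content, where noise tends to concentrate.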

Abstract:
Over the past few years, learning-based video compression has become an active research area. However, most works focus on P-frame coding. Learned B-frame coding is under-explored and more challenging. This work introduces a novel B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. B-CANF additionally features two novel elements: frame-type adaptive coding and B*-frames. Our frame-type adaptive coding learns better bit allocation for hierarchical B-frame coding by dynamically adapting the feature distributions according to the B-frame type. Our B*-frames allow greater flexibility in specifying the group-of-pictures (GOP) structure by reusing the B-frame codec to mimic P-frame coding, without the need for an additional, separate P-frame codec. On commonly used datasets, B-CANF achieves state-of-the-art compression performance compared to other learned B-frame codecs and shows comparable BD-rate results to HM-16.23 under the random access configuration in terms of PSNR. When evaluated on different GOP structures, our B*-frames achieve similar performance to the additional use of a separate P-frame codec.

Abstract:
Biomedical videos require tremendous storage space and transmission bandwidth, so efficient coding methods are urgently required. Existing methods can be roughly divided into motion-based methods and wavelet-based methods. Motion-based methods use motion estimation designed for natural videos and independently optimize prediction, transform, and entropy coding modules. Wavelet-based methods treat the more redundant time dimension exactly the same as the spatial dimensions. Neither approach can completely remove the redundant spatial-temporal information in biomedical videos. In this paper, to address these problems, we build an end-to-end framework named DBVC with 3-D motion estimation, MV coding, 3-D motion compensation, and residual coding networks for efficient 3-D biomedical video coding. First, we propose a simple yet efficient 3-D motion estimation network to extract motion information. Specifically, we obtain the region with the most intense motion by a segmentation network and then perform unsupervised motion estimation exclusively on this region. After that, to encode and decode the estimated motion vectors, we apply a 3-D autoencoder-based MV coding network. Moreover, we use a lossless learnable wavelet transform for residual coding, which makes lossless coding possible. To the best of our knowledge, this is the first end-to-end video coding framework that supports both lossy and lossless coding, thus meeting the requirements of 3-D biomedical video coding. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on both 3-D biological videos and 3-D medical videos.

Abstract:
Image-text matching is a fundamental task in bridging the semantics between vision and language. The key challenge lies in establishing accurate alignment between two heterogeneous modalities. Existing cross-modal fine-grained matching methods normally include two alignment directions, “word to region” and “region to word”, and the overall image-text similarity is calculated from the alignments. However, the two directions are typically aligned independently, so consistency between them cannot be guaranteed; the resulting inconsistent alignments can lead to inaccurate image-text matching results. In this paper, we propose novel Bidirectional cOnsistency netwOrks for cross-Modal alignment (BOOM), which achieve more accurate cross-modal semantic alignments by imposing explicit consistency constraints in both directions. Specifically, according to three aspects reflected by alignment consistency, i.e., significance, wholeness, and alignment orderliness, we design systematic multi-granularity consistency constraints: point-wise consistency, which enforces consistency of the most significant single word item in bidirectional alignments; set-wise consistency, which keeps the entire set of bidirectional alignment values consistent, more comprehensively and accurately; and order-wise consistency, which ensures order consistency of bidirectional alignment results. Bidirectional cross-modal alignment between words and regions is thus corrected from three different perspectives: maximum, distribution, and order. Extensive experiments on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our BOOM achieves state-of-the-art performance.
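
To make the notion of inconsistent bidirectional alignment concrete, the sketch below computes both alignment directions from one word-region similarity matrix and reports the words whose best region points back to them. It is an illustration only: the mutual-argmax check and the function names are assumptions, not the point-wise constraint actually used by BOOM.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_alignments(sim):
    """sim[i][j] is the similarity between word i and region j.

    Returns (word_to_region, region_to_word) attention weights, each
    normalized independently, as in the two-direction setting described above.
    """
    n_words, n_regions = len(sim), len(sim[0])
    word_to_region = [softmax(row) for row in sim]
    region_to_word = [
        softmax([sim[i][j] for i in range(n_words)]) for j in range(n_regions)
    ]
    return word_to_region, region_to_word


def pointwise_consistent_words(sim):
    """Indices of words whose top region also selects that word as its top word."""
    word_to_region, region_to_word = bidirectional_alignments(sim)
    consistent = []
    for i, weights in enumerate(word_to_region):
        best_region = max(range(len(weights)), key=weights.__getitem__)
        back = region_to_word[best_region]
        best_word = max(range(len(back)), key=back.__getitem__)
        if best_word == i:
            consistent.append(i)
    return consistent
```

For the matrix `[[0.9, 0.1], [0.8, 0.2]]`, both words prefer region 0, but region 0 prefers word 0, so only word 0 is consistent in both directions.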

Abstract:
Modeling sequences with spatial-temporal graph convolutional networks has become a mainstream paradigm in skeleton-based action recognition. However, many existing methods adopt redundant or cluttered structures to mine the key action features, thus making it difficult to achieve a balanced or leading performance in accuracy and efficiency. In this paper, we propose a novel framework, referred to as Motion Complement and Temporal Multifocusing Network (MCTM-Net), to capture the relationships within skeleton sequences by means of an efficient decomposition of the spatiotemporal graph model. Specifically, for spatial modeling, we introduce a motion-related relational descriptor that extends the channel dimension so as to enhance the modeling of motion salient regions as a complement to the conventional physical adjacency relationships. An improved parameterized physical relationship model is also proposed to better fit the data characteristics. As for temporal modeling, we propose an efficient multi-focus temporal information acquisition strategy that aggregates the information from multiple temporal spans and adjacent regions. We conduct extensive experiments on multiple representative datasets, including NTU-RGB+D (60&120), Northwestern-UCLA, and UWA3D Multiview Activity II, to validate our innovations. The experimental results show the effectiveness of our method. The code will be available at https://github.com/cong-wu/MCMT-Net.

Abstract:
The Scene Graph Generation (SGG) task aims to detect all the objects and their pairwise visual relationships in a given image. Although SGG has achieved remarkable progress over the last few years, almost all existing SGG models follow the same training paradigm: they treat both object and predicate classification in SGG as a single-label classification problem, and the ground-truths are one-hot target labels. However, this prevalent training paradigm has overlooked two characteristics of current SGG datasets: 1) For positive samples, some specific subject-object instances may have multiple reasonable predicates. 2) For negative samples, there are numerous missing annotations. When these two characteristics are disregarded, SGG models are easily confused and make wrong predictions. To this end, we propose a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) for unbiased SGG. Specifically, LS-KD dynamically generates a “soft” label for each subject-object instance by fusing a predicted Label Semantic Distribution (LSD) with its original one-hot target label. LSD reflects the correlations between this instance and multiple predicate categories. Meanwhile, we propose two different strategies to predict LSD: iterative self-KD and synchronous self-KD. Extensive ablations and results on three SGG tasks have attested to the superiority and generality of our proposed LS-KD, which can consistently achieve a decent performance trade-off between different predicate categories.
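
The abstract states that the soft label is obtained by fusing the predicted LSD with the one-hot target, but not how. A minimal sketch, assuming a fixed convex combination (the weight `alpha` and both function names are hypothetical; the paper describes the fusion as dynamic), might be:

```python
import math


def fuse_soft_label(one_hot, lsd, alpha=0.5):
    """Blend a one-hot target with a predicted label semantic distribution.

    alpha is a hypothetical fusion weight: 1.0 keeps the one-hot label,
    0.0 uses the predicted distribution only.
    """
    if len(one_hot) != len(lsd):
        raise ValueError("label vectors must have the same length")
    fused = [alpha * y + (1.0 - alpha) * q for y, q in zip(one_hot, lsd)]
    total = sum(fused)
    return [v / total for v in fused]  # renormalize so the label sums to 1


def soft_cross_entropy(probs, soft_label, eps=1e-12):
    """Cross-entropy of predicted probabilities against a soft target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, soft_label))
```

With a soft target, a predicate that is reasonable but not annotated (for example “standing on” when the label is “on”) is penalized less than under a one-hot target.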

Abstract:
For 3D action recognition, the main challenge is to extract long-range semantic information in both temporal and spatial dimensions. In this paper, in order to better excavate long-range semantic information from a large number of unlabelled skeleton sequences, we propose Self-supervised Spatial-temporal Representation Learning (SSRL), a contrastive learning framework to learn skeleton representation. SSRL consists of two novel inference tasks that enable the network to learn global semantic information in the temporal and spatial dimensions, respectively. The temporal inference task learns the temporal persistence of human actions through temporally incomplete skeleton sequences. The spatial inference task learns the spatially coordinated nature of human actions through spatially partial skeleton sequences. We design two transformation modules to efficiently realize these two tasks while fitting the encoder network. To avoid the difficulty of constructing and maintaining high-quality negative samples, our proposed framework learns by maintaining consistency among positive samples without the need for any negative samples. Experiments demonstrate that our proposed method achieves better results than state-of-the-art methods under a variety of evaluation protocols on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets.
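
Learning from positive pairs alone is commonly implemented as a similarity objective between the embeddings of two views of the same sequence. The sketch below assumes a cosine-based consistency loss; the abstract does not name the loss, so this choice and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def consistency_loss(view_a, view_b):
    """Loss in [0, 2] that is 0 when the two view embeddings are aligned.

    No negative samples appear anywhere in this objective.
    """
    return 1.0 - cosine_similarity(view_a, view_b)
```

Here `view_a` could be the embedding of a full skeleton sequence and `view_b` the embedding of its temporally incomplete or spatially partial counterpart.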

Abstract:
In this paper, we present a weakly-supervised RGB-D salient object detection model via scribble supervision. Specifically, as a multimodal learning task, we focus on effective multimodal representation learning via inter-modal mutual information regularization. In particular, following the principle of disentangled representation learning, we introduce a mutual information upper bound with a mutual information minimization regularizer to encourage the disentangled representation of each modality for salient object detection. Based on our multimodal representation learning framework, we introduce an asymmetric feature extractor for our multimodal data, which proves more effective than the conventional symmetric backbone setting. We also introduce a multimodal variational auto-encoder as a stochastic prediction refinement technique, which takes pseudo labels from the first training stage as supervision and generates refined predictions. Experimental results on benchmark RGB-D salient object detection datasets verify the effectiveness of both our explicit multimodal disentangled representation learning method and the stochastic prediction refinement strategy, achieving comparable performance with the state-of-the-art fully supervised models. Our code and data are available at: https://npucvr.github.io/MIRV/.

Abstract:
Dynamic hand gesture is an emerging and promising biometric trait containing both physiological and behavioral characteristics. Possessing both kinds of characteristics gives dynamic hand gestures more identity information, which in theory enables more accurate and secure authentication, but it also poses the challenge of efficient fine-grained spatiotemporal feature extraction. This challenge involves a seemingly paradoxical problem: high-frame-rate videos are required for behavioral characteristic analysis, but they also introduce high computational costs. To mitigate this issue, we propose a Frequency Spatiotemporal Attention Network (FSTA-Net) with a focus on satisfying the high-performance and low-computation requirements of authentication systems. The FSTA-Net is established with a two-stage identity characteristic analysis paradigm for short- and long-term modeling. Specifically, considering that models prefer to analyze physiological characteristics, which are relatively straightforward to understand, we first design a Behavior Enhanced (BE) module to emphasize hand motions and reduce redundant information to facilitate local identity feature distillation in the first stage. We then present a Frequency Spatiotemporal Attention (FSTA) module to summarize global identity features with decent FLOPs and GPU memory occupation in the second stage. Incorporating the BE and FSTA modules enables them to complement each other’s strengths, resulting in a clear-cut improvement in equal error rate and running speed. Extensive experiments on the SCUT-DHGA dataset demonstrate the superiority of the FSTA-Net. The code is available at https://github.com/SCUT-BIP-Lab/FSTA-Net.

Abstract:
Cross-modality person re-identification is a challenging task that aims to recognize images of the same identity across different modalities. To alleviate the cross-modality discrepancies between images, existing approaches mainly guide models to mine modality-invariant features. Although those approaches are effective, they lose the modality-specific features that include important information beneficial to visible-infrared person re-identification (VI-ReID). Therefore, some approaches use generative adversarial networks to compensate for modality information. However, the quality of images generated by these methods is usually poor, and most of them focus only on the learning of modality-sharable features. To solve these problems, this paper proposes a generative-based cross-modality image fusion strategy (GC-IFS), which can generate high-quality cross-modality paired images and fuse the information of the two modalities. First, considering the importance of the identity discriminative information of the generated image, we propose a contrastive-learning image generation (CLIG) network to generate cross-modality paired images. Meanwhile, to fully integrate and utilize the information of the two modalities and eliminate the influence of cross-modality discrepancies, we design a part-based dual multi-modality feature fusion (P-DMFF) module to extract the unified feature representation. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that our strategy outperforms the state-of-the-art methods for the VI-ReID task.

Abstract:
Recently, the performance of salient object detection (SOD) has been significantly improved by utilizing edge information for auxiliary training. However, the extraction and utilization of edge cues and multi-level feature fusion are still two open issues in existing edge-aware models. In this paper, we devise a novel SOD network with edge-guided learning and specific aggregation, named ELSA-Net, to cooperatively address these two issues. First, we propose the edge-guided learning strategy, which utilizes edge cues as low-level guidance to improve saliency prediction. Specifically, we design a two-stream model that uses a saliency branch and an edge branch to detect the interior and the boundary of salient objects, respectively. Then, an edge-guided interaction module (EGI) is further designed to achieve feature enhancement by embedding edge information into the saliency branch as the spatial weights. In addition, two specific aggregation modules are proposed for the progressive fusion of multi-level features in the above two streams, thus making full use of semantic and detailed information. The high-level interactive fusion module (HIF) leverages the correlation between two deeper features to obtain more powerful global contexts. The low-level weighted fusion module (LWF) focuses on complementing fine information by selectively integrating input features. Extensive experiments show that the proposed approach outperforms 19 state-of-the-art methods on five datasets, which validates its effectiveness both quantitatively and qualitatively.

Abstract:
Recent years have witnessed a growing interest in compressed video action recognition due to the rapid growth of online videos. It remarkably reduces storage by replacing raw videos with sparsely sampled RGB frames and other compressed motion cues (motion vectors and residuals). However, existing compressed video action recognition methods face two main issues: first, the inefficiency caused by using coarse-level information at full resolution, and second, the disturbance caused by noisy dynamics in motion vectors. To address these two issues, this paper proposes a dynamic spatial focus method for efficient compressed video action recognition (CoViFocus). Specifically, we first use a lightweight two-stream architecture to localize the task-relevant patches for both the RGB frames and motion vectors. Then the selected patch pair is processed by a high-capacity two-stream deep model for the final prediction. Such a patch selection strategy crops out the irrelevant motion noise in motion vectors and reduces the spatial redundancy of the inputs, leading to the high efficiency of our method in the compressed domain. Moreover, we find that the motion vectors help our method address a potential static issue, in which the focus patches get stuck at regions related to static objects rather than target actions; this further improves our method. Extensive results on both the HMDB-51 and UCF-101 datasets demonstrate the effectiveness and efficiency of our method in compressed video action recognition tasks.

Abstract:
Guided depth map super-resolution (GDSR) is one of the mainstream methods in depth map super-resolution, as high-resolution color images can guide the reconstruction of the depth maps and are often easy to obtain. However, how to make full use of the guidance information extracted from the color image to improve the depth map reconstruction remains a challenging problem. In this paper, we first design a multi-scale feedback module (MF) that extracts multi-scale features and alleviates the information loss in network propagation. We further propose a novel multi-scale feedback network (MSF-Net) for guided depth map super-resolution, which can better extract and refine the features by sequentially joining MF blocks. Specifically, our MF block uses parallel sampling layers and feedback links between multiple time steps to better learn information at different scales. Moreover, an inter-scale attention module (IA) is proposed to adaptively select and fuse important features at different scales. Meanwhile, depth features and the corresponding color features interact through a cross-domain attention conciliation module (CAC) after each MF block. We evaluate the performance of our proposed method on both synthetic and real captured datasets. Extensive experimental results validate that the proposed method achieves state-of-the-art performance in both objective and subjective quality.

Abstract:
Currently, many robust principal component analysis (PCA) methods with low-dimensional representations have been proposed to improve the overall performance of image recognition. However, most existing methods are still sensitive to noise in the selection of structure information. To overcome this problem, we propose a novel structure for PCA, called arbitrary triangle structure adaptive mean PCA (ATAM-PCA). On the basis of ensuring that the variance, the reconstruction error, and the input data measured with the flexible $\ell_{2,p}$-norm can establish a triangle structure, ATAM-PCA maximizes the summation of the difference between the variance and reconstruction error of each projected data sample, which not only improves the universality and robustness of the overall structure but also preserves the rotational invariance and geometry of the data. Moreover, ATAM-PCA applies the adaptive mean strategy to perform the data centralization task, which further reduces the negative impact of noise. Finally, we design a fast iterative algorithm for solving ATAM-PCA. Extensive experimental results based on several image databases show the effectiveness and advantages of our method.
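
For reference, a commonly used definition of the $\ell_{2,p}$-norm of a data matrix $X = [x_1, \dots, x_n]$ with samples $x_i$ as columns is given below. The abstract does not state the exact form used in ATAM-PCA, so this is the standard definition rather than the paper's formulation.

```latex
\|X\|_{2,p} = \Bigl( \sum_{i=1}^{n} \|x_i\|_2^{\,p} \Bigr)^{1/p},
\qquad 0 < p \le 2 .
```

Setting $p = 2$ recovers the Frobenius norm, while smaller $p$ reduces the influence of samples with large norm, which is why such norms are used to limit the effect of outliers.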

Abstract:
Semantic alignment aims to establish pixel correspondences between images based on semantic consistency. It can serve as a fundamental component for various downstream computer vision tasks, such as style transfer and exemplar-based colorization. Many existing methods use local features and their cosine similarities to infer semantic alignment. However, they struggle with significant intra-class variation of objects, such as differences in appearance and size. In other words, contents with the same semantics tend to look significantly different. To address this issue, we propose a novel deep neural network whose core lies in global feature enhancement and adaptive multi-scale inference. Specifically, two modules are proposed: an enhancement transformer for enhancing semantic features with global awareness, and a probabilistic correlation module for adaptively fusing multi-scale information based on the learned confidence scores. We use the unified network architecture to achieve two types of semantic alignment, namely, cross-object semantic alignment and cross-domain semantic alignment. Experimental results demonstrate that our method achieves competitive performance on five standard cross-object semantic alignment benchmarks and outperforms the state of the art in cross-domain semantic alignment.

Abstract:
Graph Matching has recently emerged as an attractive technique applied to various computer vision tasks. Graph Matching based multi-label image classification, in particular, treats each image as a bag of instances and reformulates the classification task as an instance-label matching selection problem, achieving state-of-the-art results on diverse benchmarks. However, the generalization and scalability of such a learned model cannot be well guaranteed due to its manually predetermined graph structure and high-dimension embedding of dense connections between instances and labels. To address these limitations, in this work, we propose a novel Transformer Driven Matching Selection framework for Multi-Label Image Classification (C-TMS), where instance structural relationships, class-wise global dependencies, and the co-occurrence possibility of varying instance-label assignments are simultaneously taken into consideration in a unified and adaptive manner. Moreover, the parallelization capability of the Transformer enables efficient computation, making our model scalable to large-scale datasets. Specifically, we first represent instances and labels as nodes in the visual space and label space respectively, and then compute the hidden representation of each node in its individual space by applying self-attention over its entire neighborhood. Subsequently, cross-attention is adopted to excavate the correct assignments between instances and labels, and to interpret how classifying each label depends on the instances within an image and its interaction with other labels. Finally, an asymmetric focal loss is designed to optimize the instance-label correspondence and read out image-level category confidences. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of our proposed method.

Abstract:
Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS by exploiting the common fate principle that visual elements tend to be perceived as a group if they possess the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, dataset) and performed in an online fashion. Experiments on the DAVIS$_{16}$, FBMS, and SegTrackV2 datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by a margin of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using an online deep subspace clustering to tackle the motion grouping, our method achieves higher accuracy at 3× faster inference time compared to the SoTA online UVOS method, making a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/CluterNet.

Abstract:
The counting task, which plays a fundamental role in numerous applications (e.g., crowd counting, traffic statistics), aims to predict the number of objects with various densities. Existing object counting tasks are designed for a single object class. However, in the real world it is inevitable to encounter newly arriving data with new classes. We name this scenario evolving object counting. In this paper, we build the first evolving object counting dataset and propose a unified object counting network as the first attempt to address this task. The proposed network consists of two key components: a class-agnostic mask module and a class-incremental module. The class-agnostic mask module learns a generic object occupation prior by predicting a class-agnostic binary mask (e.g., 1 denotes that an object exists at the given position in an image and 0 otherwise). The class-incremental module is used to handle new classes and provides discriminative class guidance for density map prediction. The combined outputs of the class-agnostic mask module and image feature extractor are used to predict the final density map. When new classes arrive, we first add new neural nodes to the last regression and classification layers of the class-incremental module. Then, instead of retraining the model from scratch, we utilize knowledge distillation to help the model retain and consolidate what it has previously learned. We also employ a support sample bank to store a small number of typical training samples for each class, which are used to prevent the model from forgetting key information from old data. With this design, our model can efficiently and effectively adapt to new classes while maintaining good performance on already-seen data without large-scale retraining. Extensive experiments on the collected dataset demonstrate favorable performance. The dataset and code will be available at: https://github.com/Tanyjiang/EOCO.
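
The class-incremental step described above (new output nodes plus knowledge distillation from the previous model) can be sketched with the classical temperature-softened distillation loss applied to the old-class logits. This is an assumption for illustration: the abstract does not say which outputs are distilled or which loss is used, and `distillation_loss` and its `temperature` parameter are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by a temperature (higher means softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL divergence from the old model's outputs to the new model's outputs.

    old_logits: outputs of the frozen previous model (old classes only).
    new_logits: outputs of the extended model; the first len(old_logits)
                entries correspond to the old classes, the rest to new ones.
    """
    n_old = len(old_logits)
    teacher = softmax_with_temperature(old_logits, temperature)
    student = softmax_with_temperature(new_logits[:n_old], temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student))
```

The loss is zero when the extended model reproduces the previous model's behavior on the old classes, so adding it to the training objective discourages forgetting while the new output nodes are learned.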

Abstract:
RGB-guided depth completion aims at predicting dense depth maps from sparse depth measurements and corresponding RGB images, where how to effectively and efficiently exploit the multi-modal information is a key issue. Guided dynamic filters, which generate spatially-variant depth-wise separable convolutional filters from RGB features to guide depth features, have been proven to be effective in this task. However, the dynamically generated filters require massive model parameters, computational costs and memory footprints when the number of feature channels is large. In this paper, we propose to decompose the guided dynamic filters into a spatially-shared component multiplied by content-adaptive adaptors at each spatial location. Based on the proposed idea, we introduce two decomposition schemes $\mathcal{A}$ and $\mathcal{B}$, which decompose the filters by splitting the filter structure and using spatial-wise attention, respectively. The decomposed filters not only maintain the favorable properties of guided dynamic filters as being content-dependent and spatially-variant, but also reduce model parameters and hardware costs, as the learned adaptors are decoupled from the number of feature channels. Extensive experimental results demonstrate that the methods using our schemes outperform state-of-the-art methods on the KITTI dataset, and rank 1st and 2nd on the KITTI benchmark at the time of submission. Meanwhile, they also achieve comparable performance on the NYUv2 dataset. In addition, our proposed methods are general and could be employed as plug-and-play feature fusion blocks in other multi-modal fusion tasks such as RGB-D salient object detection.
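
The decomposition idea (a spatially-shared component multiplied by a per-location adaptor) can be illustrated with a small sketch. The element-wise product over kernel positions and the counting function below are assumptions made for illustration; they do not reproduce either of the paper's two schemes.

```python
def compose_filter(shared, adaptor):
    """Build the depth-wise filter for one spatial location.

    shared:  shared[c][k], one kernel per channel, learned once and reused
             at every location (k indexes the kernel positions).
    adaptor: adaptor[k], predicted from RGB features at this location and
             independent of the number of channels.
    """
    return [[w * a for w, a in zip(kernel, adaptor)] for kernel in shared]


def generated_values(height, width, channels, kernel_area, decomposed):
    """Number of filter values that must be predicted per image."""
    if decomposed:
        # Only the adaptors are content-dependent; channels drop out.
        return height * width * kernel_area
    # Fully dynamic filters: one kernel per location and per channel.
    return height * width * channels * kernel_area
```

For a 2×2 feature map with 64 channels and 3×3 kernels, a fully dynamic filter bank needs 2304 predicted values, whereas the decomposed form under these assumptions needs 36, plus the 576 shared weights that do not depend on the input.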

Abstract:
Tampered images can easily be used for illegal activities, such as spreading rumors, economic fraud, fabricating false news, and illegally obtaining experience benefits. With the development of artificial intelligence (AI), image manipulation technology has also improved, and more and more everyday retouching software adopts AI technology. So far, however, there is no AI-based tampered image dataset. To address this gap, we propose a dataset, IPM15K. It utilizes the most advanced image processing technology and contains a total of 15,000 doctored vital images. This dataset could also serve as a catalyst for progress in many vision tasks, e.g., localization, segmentation, and alpha-matting. Additionally, we propose an effective multi-feature fusion identification network (MFI-Net) to identify these challenging images. Our model consists of four modules: the detail extraction module (DEM), which utilizes different sizes of convolutions and perceptual fields to extract more valuable information about tampered locations; the multi-branch attention fusion module (MAFM), which fully exploits contextual information of different levels to capture subtle traces of tampering; the feature decoder component (FDC), which combines fused features to identify tampered regions; and the detail enhancement block (DEB), which further supplements the detailed information of the detected regions. Extensive experiments on three public datasets and the proposed dataset show that MFI-Net outperforms various state-of-the-art (SOTA) manipulation detection baselines.

Abstract:
Adaptive live video streaming applications utilize a predefined collection of bitrate-resolution pairs, known as a bitrate ladder, for simplicity and efficiency, eliminating the need for additional run-time to determine the optimal pairs during the live streaming session. These applications do not incorporate two-pass encoding methods due to increased latency. However, an optimized bitrate ladder could result in lower storage and delivery costs and improved Quality of Experience (QoE). This paper presents a Just Noticeable Difference (JND)-aware constrained Variable Bitrate (cVBR) Two-pass Per-title encoding Scheme (JTPS) designed specifically for live video streaming. JTPS predicts a content- and JND-aware bitrate ladder using low-complexity features based on Discrete Cosine Transform (DCT) energy and optimizes the constant rate factor (CRF) for each representation using random forest-based models. The effectiveness of JTPS is demonstrated using the open source video encoder x265, with an average bitrate reduction of 18.80% and 32.59% for the same PSNR and VMAF, respectively, compared to the standard HTTP Live Streaming (HLS) bitrate ladder using Constant Bitrate (CBR) encoding. The implementation of JTPS also resulted in a 68.96% reduction in storage space and an 18.58% reduction in encoding time for a JND of six VMAF points.
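
One way to read "JND-aware" ladder construction is that a representation is only worth keeping if its predicted quality exceeds the previously kept one by at least one JND. The sketch below implements that pruning rule under this assumption; the function name, the ladder values, and the rule itself are hypothetical and not taken from the JTPS description.

```python
def prune_ladder(ladder, jnd):
    """Drop representations that are perceptually too close to the previous one.

    ladder: list of (bitrate_kbps, predicted_vmaf) pairs sorted by bitrate.
    jnd:    minimum VMAF gain required to keep a representation.
    """
    kept = []
    for bitrate, vmaf in ladder:
        # Always keep the lowest rung; keep others only if they add >= 1 JND.
        if not kept or vmaf - kept[-1][1] >= jnd:
            kept.append((bitrate, vmaf))
    return kept


# Made-up ladder for illustration; not measurements from the paper.
example = [(145, 60), (300, 64), (600, 70), (1600, 75), (3400, 82), (6000, 90)]
pruned = prune_ladder(example, jnd=6)
```

Fewer representations mean fewer encodes and less stored data, which is consistent with the storage and encoding-time savings the abstract reports for a JND of six VMAF points.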

Abstract:
Salient object detection (SOD) aims to identify the most prominent regions in images. However, the large model sizes, high computational costs, and slow inference speeds of existing RGB-D SOD models have hindered their deployment on real-world embedded devices. To address this issue, we propose a novel method named AirSOD, which is committed to lightweight RGB-D SOD. Specifically, we first design a hybrid feature extraction network, which includes the first three stages of MobileNetV2 and our Parallel Attention-Shift convolution (PAS) module. Using the novel PAS module enables capturing both long-range dependencies and local information to enhance the representation learning while significantly reducing the number of parameters and computational complexity. Secondly, we propose a Multi-level and Multi-modal feature Fusion (MMF) module to facilitate feature fusion, and a Multi-path enhancement for Feature Refinement (MFR) decoder for feature integration. The proposed method significantly reduces the model size by 63%, decreases the computational complexity by 43%, and improves the inference speed by 43% compared with the cutting-edge model (MobileSal). We test our AirSOD on six widely-used RGB-D SOD datasets. Extensive experimental results demonstrate that our method obtains satisfactory performance. The source codes will be made available.

Abstract:
Multimodal image fusion is one of the important research directions in the field of multimodal fusion. This technique realizes image and data enhancement by using complementary multimodal images and is widely used in medicine, industry, security and fire protection, automatic driving, and consumer electronics. In this work, we propose a transformer-based universal fusion (TUFusion) algorithm with multidomain fusion capability. The advantage of the TUFusion algorithm lies in its hybrid transformer and convolutional neural network (CNN) encoder structure and a new composite attention fusion strategy, which together integrate global and local information. Compared with classical state-of-the-art multimodal image fusion methods, the experimental results on multidomain data sets show that the TUFusion algorithm has a certain universality in image fusion. Meanwhile, the proposed TUFusion algorithm achieves good values on peak signal-to-noise ratio (PSNR), root mean square error (RMSE), and structural similarity index measure (SSIM). The code of the TUFusion algorithm in this article is available at https://github.com/windrunners/TUFusion.

Abstract:
The current studies of Scene Graph Generation (SGG) focus on solving the long-tailed problem for generating unbiased scene graphs. However, most de-biasing methods over-emphasize the tail predicates and underestimate head ones throughout training, thereby wrecking the representation ability of head predicate features. Furthermore, these impaired features from head predicates harm the learning of tail predicates. In fact, the inference of tail predicates heavily depends on the general patterns learned from head ones, e.g., “standing on” depends on “on”. Thus, these de-biasing SGG methods can neither achieve excellent performance on tail predicates nor satisfying behaviors on head ones. To address this issue, we propose a Dual-branch Hybrid Learning network (DHL) to take care of both head predicates and tail ones for SGG, including a Coarse-grained Learning Branch (CLB) and a Fine-grained Learning Branch (FLB). Specifically, the CLB is responsible for learning expertise and robust features of head predicates, while the FLB is expected to predict informative tail predicates. Furthermore, DHL is equipped with a Branch Curriculum Schedule (BCS) to make the two branches work well together. Experiments show that our approach achieves a new state-of-the-art performance on VG and GQA datasets and makes a trade-off between the performance of tail predicates and head ones. Moreover, extensive experiments on two downstream tasks (i.e., Image Captioning and Sentence-to-Graph Retrieval) further verify the generalization and practicability of our method. Our code is available at https://github.com/aa200647963/SGG-DHL/.

Abstract:
In recent years, multi-kernel learning (MKL) methods have been widely used in performing nonlinear data subspace clustering tasks, benefiting from the fact that they do not require the selection and tuning of predefined kernels. However, the effect of raw noise on the data structure in the feature space has been neglected in most MKL studies so far. In this paper, we propose a robust subspace clustering method called purity kernel tensor low-rank learning (KTLL), which effectively isolates noise transfer from the original data space to the high-dimensional feature space. Specifically, we construct the kernel pool obtained by MKL as a primitive third-order kernel tensor, separate the corrupted information in the feature space, and use the separated pure kernel tensor to learn the optimal affinity matrix. The tensor learning of the kernel pool can effectively mine the higher-order correlations among different kernel matrices, thus improving the clustering performance of KTLL. We have conducted extensive experiments to compare KTLL with state-of-the-art MKL and deep subspace clustering algorithms, and our results demonstrate the superiority of KTLL.

Abstract:
Ground-to-aerial (G2A) geo-localization remains extremely challenging due to the drastic appearance and geometry differences between ground and aerial views, especially when their relative orientation is unknown. In this paper, we focus on the challenging problem of unaligned G2A geo-localization, where the query ground-level image is not perfectly orientation-aligned with respect to reference aerial imagery. We cast this problem as a metric embedding task and propose a decoupled hierarchical (DeHi) architecture to progressively learn meaningful multi-grained features. Specifically, DeHi first leverages a CNN to extract high-level semantic features, and then introduces a novel orthogonally factorized transformer model consisting of part-level and global transformer encoders to learn part-level and global feature descriptors sequentially. To enhance representation power, cross-level connections are introduced to enrich part-level and global descriptors with CNN features, and the pooled part-level descriptor is combined with the global descriptor to construct the final query representation. Furthermore, such a decoupled hierarchical architecture allows for incorporating multi-level deep supervision. We introduce two part-level losses combined with one cross-level loss to complement the widely used global retrieval loss. Extensive experiments on standard benchmark datasets show significant gains in recall rates compared with the previous state-of-the-art. Remarkably, DeHi improves the recall rate @top-1 from 78.59% to 82.38% (+3.79%) and from 72.91% to 77.94% (+5.03%) on CVUSA and CVACT datasets, respectively, under random orientation misalignments. Besides, DeHi maintains competitive inference efficiency with fewer parameters compared to existing transformer-based methods.
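
The recall rate @top-k quoted above is the fraction of ground-level queries whose true aerial match appears among the k most similar reference embeddings. A minimal sketch of that metric follows; the toy embeddings, the function name `recall_at_k`, and the use of a plain dot product as the similarity are illustrative assumptions, not details taken from the paper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recall_at_k(query_embs, ref_embs, gt_indices, k=1):
    """Fraction of queries whose ground-truth reference is in the top-k matches."""
    hits = 0
    for query, gt in zip(query_embs, gt_indices):
        # Rank all references by similarity to this query, best first.
        ranked = sorted(range(len(ref_embs)),
                        key=lambda i: dot(query, ref_embs[i]),
                        reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy data (hypothetical): three aerial references and three ground queries.
refs = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
queries = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
ground_truth = [0, 1, 2]

r1 = recall_at_k(queries, refs, ground_truth, k=1)
r2 = recall_at_k(queries, refs, ground_truth, k=2)
print(f"recall@1 = {r1:.4f}, recall@2 = {r2:.4f}")
```

In this toy case the third query ranks the wrong reference first, so it counts as a miss at k=1 but as a hit at k=2.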

Abstract:
Most existing Vision Transformer-based frameworks for weakly supervised semantic segmentation utilize class activation maps to generate pseudo masks. Although it mitigates the class-agnostic issue, this approach still suffers from misclassification and noise in segmentation results. To overcome these limitations, we propose an attention-based framework named Cross-block Sparse Class Token Contrast (CB-SCTC), which incorporates a Dynamic Sparse Attention module (DSA) and a Cross-block Class Token Contrast scheme (CB-CTC). Specifically, the proposed Cross-block Class Token Contrast scheme forces diversity between the final class tokens by learning from the lower similarity of the class tokens in the relatively shallower blocks. Moreover, the Dynamic Sparse Attention module is designed to post-process the output from the softmax function in the attention mechanism to reduce noise. Extensive experiments prove the proposed framework is a valid alternative to class activation maps. Our framework demonstrates competitive mIoU scores on the PASCAL VOC 2012 (val: 75.5%, test: 75.2%) and MS COCO 2014 (val: 46.9%) datasets. Our code is available at https://github.com/Jingfeng-Tang/CB-SCTC.

Abstract:
Weakly supervised Referring Expression Grounding (REG) aims to localize the target entity in an image based on a given expression, where the mapping between image regions and expressions is unknown during training. It faces two primary challenges. Firstly, conventional methods involve selecting regions to generate reconstructed texts for computing the backpropagation loss between regions and expressions. However, semantic deviations in text reconstruction may result in significant cross-modal bias, leading to substantial losses even in cases of correctly matched regions. Secondly, the absence of region-level ground truth in weakly supervised REG results in a lack of stable and reliable supervision during training. To tackle these challenges, we propose a Progressive Semantic Reconstruction Network (PSRN), which utilizes a two-level matching-reconstruction process based on the key triad and adaptive phrases, respectively. We leverage progressive semantic reconstruction with a three-staged training strategy to mitigate the deviations in the reconstructed texts. Additionally, we introduce a Constrained Interactions operation and an Attention Coordination mechanism to facilitate additional bidirectional supervision between the two matching processes. Experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate that the proposed PSRN achieves competitive results. Our source code will be released at https://github.com/5jiahe/psrn.

Abstract:
Face recognition systems have raised concerns due to their vulnerability to different presentation attacks, and system security has become an increasingly critical concern. Although many face anti-spoofing (FAS) methods perform well in intra-dataset scenarios, their generalization remains a challenge. To address this issue, some methods adopt domain adversarial training (DAT) to extract domain-invariant features. Differently, in this paper, we propose a domain adversarial attack (DAA) method by adding perturbations to the input images, which makes them indistinguishable across domains and enables domain alignment. Moreover, since models trained on limited data and types of attacks cannot generalize well to unknown attacks, we propose a dual perceptual and generative knowledge distillation framework for face anti-spoofing that utilizes pre-trained face-related models containing rich face priors. Specifically, we adopt two different face-related models as teachers to transfer knowledge to the target student model. The pre-trained teacher models are not from the task of face anti-spoofing but from perceptual and generative tasks, respectively, which implicitly augment the data. By combining both DAA and dual-teacher knowledge distillation, we develop a dual teacher knowledge distillation with domain alignment framework (DTDA) for face anti-spoofing. The advantage of our proposed method has been verified through extensive ablation studies and comparison with state-of-the-art methods on public datasets across multiple protocols.

Abstract:
Although multi-modal large language models possess impressive cross-modal reasoning and prediction capabilities, they lack a unified and rigorous evaluation standard. In this paper, we introduce a future event prediction task to assess the cross-modal temporal prediction capabilities of these models. This task requires the model to generate descriptions of events that may occur in the future based on input video. To tackle this new task, we propose an object-centric cross-modal knowledge reasoning framework, which combines a basic information encoder, an adaptive multi-segment filter, a spatial-temporal relation encoder, a vision-text interaction module, and a pre-trained large language model decoder. The adaptive multi-segment filter selectively captures critical visual information in videos, enhancing the model’s focus on relevant features. The spatial-temporal relation encoder decomposes and associates the objects and scene information in the video. Additionally, the vision-text interaction module enhances the connection between visual sequences and their corresponding textual narratives, ensuring semantic coherence and consistency. To evaluate our framework, we constructed a dataset containing descriptions, dialogues of future events, and object-centric event reasoning chains. Experimental results indicate that the proposed framework outperforms all previous methods for future event prediction. Ablation studies further demonstrate the effectiveness of the designed modules.

Abstract:
Video coding has become more and more important since high-resolution and high-quality videos have been used in a variety of application areas. Deblocking filter (DBF) is a video coding technology which can improve both video quality and coding efficiency. However, its hardware architecture design suffers from huge computations and high memory requirements. Moreover, the latest Versatile Video Coding (VVC) standard extends DBF with several complex enhancements, which makes the design more difficult. In this paper, a high-throughput and memory-efficient DBF hardware architecture for VVC systems is presented. By analyzing the DBF algorithm, we firstly propose a unified filter core to perform edge filtering process with low complexity, and two resource sharing techniques are utilized to reduce hardware costs. Furthermore, we propose a whole DBF architecture to process all the edges in a coding tree unit (CTU). To improve its throughput, we propose novel pre-calculation processing flow and double processing flow to fully utilize pipelining and parallel processing techniques. At the same time, to reduce its memory requirements, we propose four novel data reuse approaches to fully utilize intermediate data reusabilities. Synthesis results show that our proposed hardware architecture can support real-time VVC DBF processing of 7680×4320 at 158 frames/s at 500 MHz working frequency. The hardware costs are only 163.2k gate count and three two-port on-chip SRAMs with data width of 128 bits and depth of 32. Compared with other state-of-the-art works for previous standards, our proposed VVC DBF hardware architecture achieves good results in performance, area efficiency and memory efficiency.
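
The throughput and memory figures quoted above can be put in per-cycle terms with a short calculation. The sketch below uses only numbers stated in the abstract; counting luma samples only (chroma would add to the load) is a simplifying assumption, and the function names are illustrative.

```python
# Figures quoted in the abstract.
WIDTH, HEIGHT = 7680, 4320
FPS = 158
CLOCK_HZ = 500_000_000
SRAM_COUNT, SRAM_WIDTH_BITS, SRAM_DEPTH = 3, 128, 32

def samples_per_cycle(width, height, fps, clock_hz):
    """Average number of luma samples the filter must finish per clock cycle."""
    return width * height * fps / clock_hz

def sram_bytes(count, width_bits, depth):
    """Total on-chip SRAM capacity in bytes."""
    return count * width_bits * depth // 8

throughput = samples_per_cycle(WIDTH, HEIGHT, FPS, CLOCK_HZ)
memory = sram_bytes(SRAM_COUNT, SRAM_WIDTH_BITS, SRAM_DEPTH)
print(f"{throughput:.2f} luma samples per cycle")
print(f"{memory} bytes of on-chip SRAM")
```

Under these assumptions the design must sustain roughly 10.5 luma samples per cycle while holding only 1536 bytes in SRAM, which illustrates why the abstract stresses parallel processing and data reuse.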

Abstract:
Existing pulmonary nodule detection methods often train models in a fully-supervised setting that requires strong labels (i.e., bounding box labels) as label information. However, manual annotation of bounding boxes in CT images is very time-consuming and labor-intensive. To alleviate the annotation burden, in this paper, we investigate pulmonary nodule detection by leveraging both strong labels and weak labels (i.e., center point labels) for training, and propose a novel hybrid-supervised pulmonary nodule detection (HND) method. The training of HND involves a heterogeneous teacher-student learning framework in two stages. In the first stage, we design a point-based consistency calibration network (PCC-Net) as a teacher, which is pre-trained to generate high-quality pseudo bounding box labels given point-augmented CT images as inputs. In the second stage, we develop an information bottleneck-guided pulmonary nodule detection network (IBD-Net) as a student to perform pulmonary nodule detection. In particular, we introduce information bottleneck to learn reliable pulmonary nodule-specific heatmaps under the guidance of PCC-Net, largely enhancing the model’s interpretability and improving the final detection performance. Based on the above designs, our method can effectively detect pulmonary nodule regions with only a limited number of bounding box labels. Experimental results on the public pulmonary nodule detection dataset LUNA16 show that our HND method achieves an excellent balance between the annotation cost and the detection performance.

Abstract:
3D object detection is a fundamental task in scene understanding. Numerous research efforts have been dedicated to better incorporate Hough voting into the 3D object detection pipeline. However, due to the noisy, cluttered, and partial nature of real 3D scans, existing voting-based methods often receive votes from the partial surfaces of individual objects together with severe noises, leading to sub-optimal detection performance. In this work, we focus on the distributional properties of point clouds and formulate the voting process as generating new points in the high-density region of the distribution of object centers. To achieve this, we propose a new method to move random 3D points toward the high-density region of the distribution by estimating the score function of the distribution with a noise conditioned score network. Specifically, we first generate a set of object center proposals to coarsely identify the high-density region of the object center distribution. To estimate the score function, we perturb the generated object center proposals by adding normalized Gaussian noise, and then jointly estimate the score function of all perturbed distributions. Finally, we generate new votes by moving random 3D points to the high-density region of the object center distribution according to the estimated score function. Extensive experiments on two large scale indoor 3D scene datasets, SUN RGB-D and ScanNet V2, demonstrate the superiority of our proposed method. The code will be released at https://github.com/HHrEtvP/DiffVote.
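
The idea of moving random 3D points toward the high-density region by following a score function (the gradient of the log density) can be sketched in a few lines. The paper estimates the score with a noise conditioned score network; the sketch below swaps in the closed-form score of a single isotropic Gaussian around a known center, so `gaussian_score`, `move_points`, the step size, and the toy center are all illustrative assumptions.

```python
import random

def gaussian_score(point, center, sigma):
    """Score (gradient of the log density) of an isotropic Gaussian at `point`."""
    return [(c - p) / sigma ** 2 for p, c in zip(point, center)]

def move_points(points, center, sigma=1.0, step=0.1, n_steps=50):
    """Move each point along the score so that it drifts toward high density."""
    moved = []
    for point in points:
        p = list(point)
        for _ in range(n_steps):
            score = gaussian_score(p, center, sigma)
            p = [pi + step * si for pi, si in zip(p, score)]
        moved.append(p)
    return moved

def distance(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rng = random.Random(0)
center = [1.0, 2.0, 0.5]  # hypothetical object center
random_points = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
votes = move_points(random_points, center)
print(max(distance(v, center) for v in votes))
```

With a learned score in place of `gaussian_score`, the same update loop would pull points toward wherever the estimated object-center distribution is dense.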

Abstract:
Interactive Video Object Segmentation (iVOS) is inherently demanding, requiring real-time interaction between humans and computers. Enhancing user experience involves considerations such as user input habits, segmentation quality, running time, and memory consumption. However, existing methods compromise user experience by employing a single input mode and exhibiting slow running speeds. Specifically, these approaches restrict user interaction to a single frame, limiting the expression of user intent. To overcome these limitations and better align with user habits, we introduce a framework that facilitates flexible input modes by ID-queried concurrent propagation (IDPro). In particular, we have devised the Across-Frame Interaction Module (AFI), allowing users to freely annotate various objects across multiple frames. The AFI module transfers scribble information across interactive frames, generating multi-frame masks. Additionally, we leverage an id-queried mechanism to process multiple objects. To achieve more efficient propagation and a lightweight model, we propose a truncated re-propagation strategy, replacing the previous multi-round fusion module, which employs an across-round memory that stores crucial interaction information. Our SwinB-IDPro attains a new state-of-the-art performance on DAVIS 2017 (89.6%, $\mathcal{J}\&\mathcal{F}$@60). Furthermore, our R50-IDPro exhibits over 3× faster performance than the leading competitor in challenging multi-object scenarios.

Abstract:
Highly degraded images greatly challenge existing algorithms to detect objects of interest in adverse scenarios, such as rain, fog, and underwater. Recently, researchers have developed sophisticated deep architectures in order to enhance image quality. Unfortunately, the visually appealing output of the enhancement module does not necessarily generate high accuracy for deep detectors. Another feasible solution for low-quality image detection is to transform it into a domain adaptation problem. Typically, these approaches invoke complicated training strategies such as adversarial learning and graph matching. False detection is likely to occur in local regions of a low-quality image. In this paper, we propose a simple yet effective strategy with two learners for low-quality image detection. We devise the crux learner to generate cruxes that have great impacts on detection performance. The catch-up learner with a simple residual transfer mechanism maps the feature distributions of crux regions to those favouring a deep detector. These two learners can be plugged into any CNN-based feature extraction networks, e.g., ResNetXT101 and ResNet50, and yield high detection accuracy on various degraded scenarios. Extensive experiments on several public datasets demonstrate that our method achieves more promising results than state-of-the-art detection approaches. The codes: https://github.com/xiaoDetection/learning-cruxes-to-push.

Abstract:
Semantic Scene Completion (SSC) requires a comprehensive perception of both the geometry and semantics across the entire 3D scene. In the domain of autonomous driving, the majority of existing SSC methods rely on single-modal images (e.g., MonoScene, TPVformer) or point clouds (e.g., S3CNet, JS3C-Net), without taking into account the complementary information from bimodal sources. In this work, we propose an Image and Point Cloud continuous fusion in Voxel Network (IPVoxelNet) to address SSC within the voxelized space. IPVoxelNet represents images and point clouds within a unified voxelized space and utilizes the Image and Point Cloud Fusion (IPF) layers for continuous fusion of bimodal features. Specifically, IPVoxelNet utilizes pixel-to-voxel reprojection to map pixels into 3D space, leveraging the dense semantics of images. Unordered point clouds are represented in voxel space through regularization. IPVoxelNet independently learns the geometry and semantics of each modality. Additionally, we propose cross-modal knowledge distillation to transfer geometric information from point clouds to images. We validate our model on the challenging SemanticKITTI and nuScenes-Occupancy datasets, achieving state-of-the-art results across multiple classes. IPVoxelNet demonstrates competitive performance in both geometry (SC IoU) and semantics (mIoU).

Abstract:
Chaotic maps have attracted wide attention in the field of cyberspace security. However, many shortcomings of chaos-based applications stem from the dynamic degradation. In this article, an evolutionary digital chaotic (EDC) model is proposed to overcome this problem. Using the especially designed mutation sequences, the EDC model can generate non-degenerate chaotic maps on digital devices. The effectiveness of the EDC model is proven through theoretical analysis, and further numerical simulations confirm the superiority of the chaotic maps generated by the EDC model in terms of the dynamical characteristics. This suggests that the EDC model can significantly contribute to the promotion and confidence in chaos-based applications. Moreover, a dynamic xorshift operation is proposed to bridge the gap between chaotic characteristics and security. To investigate the practical application, a general scheme for constructing the pseudorandom number generator (PRNG) is designed. Performance analyses demonstrate that the PRNG has a powerful ability to produce high-quality pseudorandom sequences in different implementation precision.
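
For background on the xorshift operation mentioned above, the classic fixed-shift xorshift generator is sketched here. The abstract does not specify how the proposed dynamic xorshift operation or the EDC-based PRNG work, so this is only the well-known 32-bit variant with shifts 13, 17 and 5, shown as an assumption-laden point of reference rather than the paper's scheme.

```python
MASK32 = 0xFFFFFFFF

def xorshift32(state):
    """One step of the classic 32-bit xorshift generator (shifts 13, 17, 5)."""
    state &= MASK32
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state

def xorshift_stream(seed, count):
    """Return `count` successive outputs starting from a nonzero seed."""
    if seed & MASK32 == 0:
        raise ValueError("seed must be nonzero: zero is a fixed point of xorshift")
    outputs = []
    state = seed
    for _ in range(count):
        state = xorshift32(state)
        outputs.append(state)
    return outputs

stream = xorshift_stream(1, 5)
print(stream)
```

The fixed shift amounts are what make this generator fully predictable from its state, which is presumably the kind of gap a dynamic variant driven by chaotic sequences is meant to narrow.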

Abstract:
Existing deep learning (DL)-based magnetic resonance imaging (MRI) retrospective motion correction (MoCo) models are typically task-specific, which makes them challenging to generalize to different scenarios w.r.t. motions, modalities, planes, and scanner centers. This limitation occurs since the motions of each patient vary, and collecting diverse paired/unpaired motion data is generally costly and infeasible. To deal with this problem, we propose the Equivariant Imaging Prior (EIP) framework to generalize the MoCo tasks toward various scenarios. In this paper, the traditional MRI MoCo tasks, specifically for the multi-scenarios, can be treated as a mask-varying compressed sensing self-supervised problem for MRI reconstruction with corrupted k-space data. To the best of our knowledge, this framework is the first attempt to handle multiple MRI MoCo scenarios with one single DL model. Specifically, stochastic subsampling and modality augmentation are employed for data preparation. Then, a domain generalization-friendly net is carefully designed and an equivariant imaging task is leveraged to learn the mapping from corrupted data to clean images. The experimental results show that the proposed EIP framework achieves impressive adaptability across generalizable MoCo tasks, including but not limited to multi-motion, multi-modality, multi-center, and multi-plane. Furthermore, our EIP demonstrates similar or superior performance to several state-of-the-art models trained in a supervised manner, extending to even motion estimation on the multi-coil raw data. The code is available: https://github.com/wangzhiwen-scu/EIP4MoCo.

Abstract:
Recently, Transformer-based few-shot classification methods have been widely exploited. However, they only leverage feature information at a single scale, resulting in weak feature representations, which cannot fully capture the rich information contained in a limited number of images regarding diverse objects with different scales, even those belonging to the same category. To mitigate this issue, we propose a multi-scale feature sets matching scheme in vision Transformer for few-shot classification, and name it FSViT, which can sufficiently extract discriminative features from the small number of labeled support examples. Concretely, we establish a patch-based multi-scale feature representation based on the feature extractors of FSViT, where we introduce an attention-aware grid pooling operation to merge adjacent patches with various scales to obtain multi-scale feature sets. Moreover, we devise a multi-scale patch matching metric to aggregate the measurement of similarity over the multi-scale feature sets for few-shot classification. Extensive experiments demonstrate the effectiveness of the proposed FSViT in both 1-shot and 5-shot scenarios on standard single-domain and cross-domain few-shot classification, especially improving the state-of-the-art recognition accuracy by 1.27% and 1.33% on average on the Mini-ImageNet and CIFAR-FS datasets, respectively. The code of FSViT is available at https://github.com/codeshop715/FSViT.

Abstract:
The proliferation of high-dimensional complex data in various fields such as multimedia, social media, and sensor networks has led to an increasing demand for real-time clustering algorithms. This article presents a novel two-stage approach for complex data streams. In the online stage, angular margins are introduced to constrain the mapping of input data, enhancing the directional characteristics of the resulting data representation. In the offline stage, we propose a unique clustering approach grounded in angular density to uncover spatial relationships within the data. This approach utilizes two distinct strategies for angular density clustering. The Neighbor Selection based on Angular Relations strategy defines the angular density, which significantly enhances the algorithm’s discriminative ability. The Density-Priority Cluster Selection strategy determines the generation of clusters, ensuring the reliability of clustering. We also introduce a novel data expiration mechanism that optimizes computational costs and memory usage by discarding data objects from stable clusters. Experimental evaluations on four diverse datasets, including speaker diarization and video face clustering tasks, demonstrate the superior performance of our proposed method over state-of-the-art online clustering techniques. Furthermore, our method achieves comparable performance to offline clustering methods, highlighting its effectiveness and efficiency in real-time clustering applications. The source code for the proposed algorithms is accessible at https://github.com/sssssuda/DRSCDM.
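
One simple way to picture a density defined by angular relations is to count, for each embedding, how many other embeddings lie within a fixed angle of it. The abstract does not give the paper's exact definition, so the sketch below is a hypothetical stand-in: `angular_density`, the 30 degree cutoff, and the toy 2D points are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_density(points, max_angle_deg=30.0):
    """For each point, count the other points within `max_angle_deg` of it."""
    cos_threshold = math.cos(math.radians(max_angle_deg))
    densities = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if j != i and cosine(p, q) >= cos_threshold
        )
        densities.append(neighbours)
    return densities

# Toy embeddings (hypothetical): three share a direction, two are isolated.
points = [[1.0, 0.0], [0.95, 0.1], [0.9, 0.2], [0.0, 1.0], [-1.0, 0.1]]
densities = angular_density(points)
print(densities)
```

Because only directions matter here, vector length has no effect on the count, which matches the abstract's emphasis on directional characteristics of the representation.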

Abstract:
Free-viewpoint human body reenactment aims to generate authentic and coherent poses for a source subject based on a target body pose skeleton. While current methods are proficient in reproducing existing poses, they falter in generating novel poses, often yielding blurry results. In this paper, we propose a method for open-set pose synthesis, utilizing multi-view images to generate novel poses from arbitrary viewpoints. Our method begins with building the neural radiance fields (NeRF) using multi-view images. While this NeRF is adept at rendering specific free-viewpoint refined poses, it struggles with sharp results for novel poses. To address this, we introduce a 2D novel pose diffusion (2D-NPD) module and a view-consistent NeRF optimization (VCNeRF-O) strategy. The 2D-NPD performs body reenactment in the 2D domain to generate a set of refined novel pose images. In particular, we introduce a motion adapter tailored for the stable diffusion (SD) model to generate novel poses while preserving the cloth texture. To ensure seamless motion and image clarity, we further devise a dual warp loss function for the motion adapter. Moreover, to generate fine-grained novel poses while maintaining viewpoint consistency, we develop an innovative VCNeRF-O to optimize the NeRF. Experiments demonstrate that our approach outperforms existing techniques in terms of texture quality and consistency in the open-set synthesis of novel poses.

Abstract:
The target recognition of synthetic aperture radar (SAR) data generally faces the issue of limited observational samples in practical applications. Recent few-shot SAR target recognition techniques based on meta-learning, which mainly focus on intricate meta-learning models without considering SAR imaging characteristics during model training, show promise. To address this issue, a novel few-shot transfer learning paradigm named causal intervention and parameter-free reasoning (CIPR) is proposed for SAR target recognition. In the proposed framework, causal intervention pretraining (CIP), which emphasizes causal features of SAR images, is developed to diminish spurious correlations caused by confounders. Moreover, variational inference approximates intricate alterations in SAR imaging angles and background clutter in a generative manner. To make predictions of the unlabelled query set without additional learnable parameters, a parameter-free label reasoning model based on optimal transport, which integrates label knowledge and effectively leverages the distribution characteristics of causal features, is introduced. Experiments on the moving and stationary target acquisition and recognition (MSTAR) dataset demonstrate that the proposed method achieves superior performance and has preferable robustness to large depression angle discrepancies.

Abstract:
The key factor hindering the performance improvement of current camouflaged object detection (COD) models is the lack of discriminability of features at fine granularity. We solve this problem from two complementary perspectives. First, complex scenes result in the discriminative feature representations of camouflaged objects being present at different scales and semantic abstraction levels. Therefore, a mechanism is needed to increase the diversity of features to integrate more information potentially beneficial for COD. Second, appearance similarity between objects and environments will inevitably lead to similarity in features. Enhancing feature diversity alone is not enough to solve the above problems. Therefore, it is necessary to give the model semantic perception capabilities to expand the subtle discrepancies between objects and environments in feature embedding. Inspired by the first point, we propose a cross-scale interaction module (CSIM) that utilizes cross-attention between different scales to enhance the diversity of feature representations. Regarding the second point, semantic guided feature learning (SGFL) is proposed to encourage the model to expand feature discrepancies through explicit supervision. Experiments on four popular COD datasets show that our method outperforms recent SOTA methods. In addition, polyp segmentation experiments show that it is also effective for other COD-like tasks.

Abstract:
While abundant research has been conducted on improving high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their image quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing the two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using the maximum a posteriori estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The codes will be made publicly available at https://github.com/h4nwei/2AFC-LMMs.
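
Aggregating 2AFC outcomes into one global quality score per image by maximum a posteriori estimation can be sketched as follows. The abstract does not state the preference model or prior, so this sketch assumes a Bradley-Terry (logistic) preference model with a zero-mean Gaussian prior, fitted by plain gradient ascent; `map_scores`, the learning rate, and the toy win counts are all hypothetical.

```python
import math

def map_scores(n_items, wins, prior_var=1.0, lr=0.05, n_iters=2000):
    """MAP quality scores from pairwise counts.

    wins[(i, j)] is the number of times item i was preferred over item j.
    """
    q = [0.0] * n_items
    for _ in range(n_iters):
        # Gradient of the Gaussian log-prior pulls every score toward zero.
        grad = [-qi / prior_var for qi in q]
        for (i, j), count in wins.items():
            p_ij = 1.0 / (1.0 + math.exp(-(q[i] - q[j])))
            # Gradient of count * log(p_ij) with respect to q[i] and q[j].
            grad[i] += count * (1.0 - p_ij)
            grad[j] -= count * (1.0 - p_ij)
        q = [qi + lr * gi for qi, gi in zip(q, grad)]
    return q

# Toy 2AFC outcomes (hypothetical): image 0 usually wins, image 2 usually loses.
wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
scores = map_scores(3, wins)
print([round(s, 3) for s in scores])
```

The prior keeps the scores finite and centred even when one image wins every comparison, which is one practical reason to prefer a MAP estimate over a plain maximum likelihood fit.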

Abstract:
Contrastive learning for self-supervised skeleton-based action recognition has recently received attention. It has been observed that local crops, containing partial action sequences, can predict action patterns, which is advantageous for skeleton-based action recognition. This paper proposes a Global and Local Contrastive Learning framework (skeleton-logoCLR) with two contrastive learning routes, Global-to-Global and Global-to-Local, which utilize the similarity between global and local crops of the same skeleton sequence. Specifically, in the Global-to-Global route, we design Temporal Attention Crop-Resize (TACR) to learn global semantic information by maximizing the retention of the action region in the temporal dimension. In the Global-to-Local route, the proposed Skeleton-logo Augmentation is devised to concatenate two local crops from different sequences for local semantic learning. Moreover, instead of fusing directly, the losses of the two routes are combined in a cascaded manner through the Self-Adaptive Training Strategy (SATS) to achieve stronger generalization performance. Extensive experiments are conducted on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The results demonstrate that the proposed method achieves remarkable performance.

Abstract:
In Visual Question Answering (VQA), addressing language prior bias, where models excessively rely on superficial correlations between questions and answers, is crucial. This issue becomes more pronounced in real-world applications with diverse domains and varied question-answer distributions during testing. To tackle this challenge, Test-time Adaptation (TTA) has emerged, allowing pre-trained VQA models to adapt using unlabeled test samples. Current state-of-the-art models select reliable test samples based on fixed entropy thresholds and employ self-supervised debiasing techniques. However, these methods struggle with diverse answer spaces linked to different question types and may fail to identify biased samples that still leverage relevant visual context. In this paper, we propose Question type-guided Entropy Minimization and Debiasing (QED) as a solution for test-time VQA model adaptation. Our approach involves adaptive entropy minimization based on question types to improve the identification of fine-grained and unreliable samples. Additionally, we generate negative samples for each test sample and label them as biased if their answer entropy change rate significantly differs from positive test samples, subsequently removing them. We evaluate our approach on two public benchmarks, VQA-CP v2, and VQA-CP v1, and achieve new state-of-the-art results, with overall accuracy rates of 48.13% and 46.18%, respectively.
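As a rough illustration of entropy-based sample selection with a threshold that depends on the question type, consider the sketch below. It only mirrors the general idea stated in the abstract; the threshold values, the `default` fallback, and the function names are hypothetical and are not taken from QED.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(samples, thresholds, default=0.5):
    """Return indices of test samples whose answer entropy is below the
    threshold assigned to their question type.

    samples: list of (question_type, answer_probabilities) pairs.
    thresholds: hypothetical per-question-type entropy thresholds.
    """
    keep = []
    for idx, (qtype, probs) in enumerate(samples):
        if entropy(probs) < thresholds.get(qtype, default):
            keep.append(idx)
    return keep
```

A per-type threshold matters because the maximum possible entropy grows with the size of the answer space, so a single fixed value treats yes/no questions and open-ended questions inconsistently.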

Abstract:
Misalignments between multi-modality images pose challenges in image fusion, manifesting as structural distortions and edge ghosts. Existing efforts commonly resort to registering first and fusing later, typically employing two separate stages for registration, i.e., coarse registration and fine registration. Both stages directly estimate the respective target deformation fields. This paper contends that the separate two-stage registration lacks compactness, and the direct estimation of their target deformation fields falls short in accuracy. To tackle these challenges, we introduce IMF, a framework for improving misaligned multi-modality image fusion. Central to IMF is a One-stage Progressive Dense Registration (OPDR) scheme, which accomplishes the coarse-to-fine registration through only a one-stage optimization. Specifically, two pivotal components are involved in OPDR, a dense Deformation Field Fusion (DFF) module and a Progressive Feature Fine (PFF) module. The DFF aggregates the predicted multi-scale deformation sub-fields at the current scale, while the PFF progressively refines the remaining misaligned features. Together, they effectively and accurately estimate the final deformation fields. In addition, we develop a Transformer-Conv-based Fusion (TCF) subnetwork that considers local and long-range feature dependencies, allowing us to capture more informative features from the registered infrared and visible images for the generation of high-quality fused images. Extensive experimental analysis demonstrates the superiority of the proposed method in the fusion of misaligned cross-modality images. The code will be available at https://github.com/wdhudiekou/IMF.

Abstract:
The Intersection over Union (IoU) has been widely employed in various stages of object detection owing to its ability to quantify the similarity between boxes objectively. However, in densely packed scenes full of crowded and small-sized objects, adjacent positive boxes often exhibit high levels of overlap. This overlap interference compromises the consistency between quality evaluation and confidence, leading to ambiguous box prediction within the previous IoU-based models. To address this issue, we design a novel learning paradigm tailored for Dense scenes based on IoU, called DeIoU. This approach effectively suppresses unnecessary overlap between predicted boxes and thereby enhances representation learning for non-salient objects. Specifically, it consists of a dense box regression loss $\mathcal{L}_{\mathrm{DeIoU}}$ and a one-to-many (O2M) label matching strategy guided by DeIoU. These components focus on calibrating the position and shape prediction quality during the model training, learning distinguishable object features by penalizing overlap interference between neighboring boxes. Extensive experiments on four object detection datasets including SKU-110K, CrowdHuman, MS COCO 2017, and DIOR, demonstrate that our DeIoU-based learning strategy outperforms other state-of-the-art methods. Notably, the proposed method delivers a substantial improvement (average 1.3 AP and 1.8 MR$^{-2}$) across popular detectors on SKU-110K and CrowdHuman while exhibiting distinct competitiveness on small objects within natural scenes.
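For reference, the plain IoU that this abstract builds on can be computed as in the sketch below. This is the standard box-overlap measure only, not the DeIoU loss or the O2M matching strategy, whose definitions are not given in the abstract; boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x1 <= x2` and `y1 <= y2`.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

In dense scenes, two neighboring ground-truth boxes can already have a high IoU with each other, which is the overlap interference the abstract refers to.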

Abstract:
Despite rapid progress of end-to-end optimization for single-image dehazing, a long-standing open problem is non-homogeneous haze, which lies at the core of the differences between synthetic and real hazy images. The atmospheric scattering model (ASM) has been widely adopted to model the degradation process of hazy images, but it is based on the assumption of homogeneous haze. In realistic scenarios, non-homogeneous haze often makes it more difficult to estimate the transmission map in ASM, resulting in undesired artifacts in the restored images. To address the issue of non-homogeneous haze, we propose to model the uncertainty in the estimation of the transmission map and develop a spatially adaptive learning module for ASM correction. Specifically, we present an approach to enhancing the well-known Dark Channel Prior (DCP) by relaxing the constraint with the transmission map in the DCP-net. Assuming the availability of paired training data, we have developed a strategy to address vulnerability in the DCP, leading to a more accurate estimation of the transmission map. Then, we explore the uncertainty between the estimated transmission map and the target transmission map (ground truth) to reformulate the ASM for the presence of non-homogeneous haze. A robust and accurate estimated transmission map can boost the final dehazing performance of our DCP-net. Experiments on three popular synthetic and real non-homogeneous datasets show that our proposed approach achieves better results on both synthetic scenes and real non-homogeneous scenes. The code is available at https://see.xidian.edu.cn/faculty/wsdong/Projects/Projects/project_dehazing_TCSVT2024.htm
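The standard atmospheric scattering model relates a hazy pixel I to the clear scene radiance J through I = J·t + A·(1 − t), where t is the transmission and A the atmospheric light. The sketch below shows this forward model and its direct inversion on flat lists of pixel intensities. It illustrates only the classical ASM, not the uncertainty-based reformulation proposed in the paper; the lower bound `t_min` is an assumed safeguard against division by very small transmissions.

```python
def synthesize_haze(clear, transmission, airlight):
    """Forward ASM: I = J * t + A * (1 - t), applied per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def recover_scene(hazy, transmission, airlight, t_min=0.1):
    """Invert the ASM: J = (I - A) / max(t, t_min) + A, applied per pixel.

    t_min is an assumed floor that keeps the division numerically stable.
    """
    return [
        (i - airlight) / max(t, t_min) + airlight
        for i, t in zip(hazy, transmission)
    ]
```

Because the recovered radiance divides by the transmission, any error in the estimated transmission map is amplified where haze is dense, which is why an accurate transmission estimate matters so much for the final result.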

Abstract:
Visible-Infrared (VIS-IR) object detection is a challenging task that combines visible and infrared data to provide the category and location of objects in a scene. The core of this task is therefore to combine complementary information in the visible and infrared modalities to improve the detection results. Existing methods mainly suffer from an insufficient ability to perceive and combine visible-infrared modal information and have difficulty balancing the optimization directions of the fusion and detection tasks. To solve these problems, we propose MMI-Det, a multi-modal fusion method for visible and infrared object detection. The method can provide a good combination of complementary information in the visible-infrared modalities and output accurate and robust object information. Specifically, to improve the ability of the model to perceive the environment at the visible-infrared image level, we design the Contour Enhancement Module. Furthermore, to extract complementary information from the VIS and IR modalities, we design the Fusion Focus Module, which extracts different frequency spectral features of the visible and infrared modalities and focuses on the key information of the object at different spatial locations. Moreover, we design the Contrast Bridge Module to improve the ability to extract modality-invariant features in the visible-infrared scene. Finally, to ensure that our model can balance the optimization directions of image fusion and object detection, we design the Info Guided Module to improve the effectiveness of the model’s training optimization. We conduct extensive experiments on the public FLIR, M3FD, LLVIP, TNO and MSRS datasets; compared with previous methods, our method achieves better performance with powerful multi-modal information perception capabilities.

Abstract:
Hyperspectral image (HSI) clustering is an important but challenging task. State-of-the-art (SOTA) methods usually rely on superpixels; however, they do not fully utilize the spatial and spectral information in the 3-D structure of HSI, and their optimization targets are not clustering-oriented. In this work, we first use 3-D and 2-D hybrid convolutional neural networks to extract the high-order spatial and spectral features of HSI through pre-training, and then design a superpixel graph contrastive clustering (SPGCC) model to learn discriminative superpixel representations. Reasonable augmented views are crucial for contrastive clustering, and conventional contrastive learning may hurt the cluster structure since different samples are pushed away in the embedding space even if they belong to the same class. In SPGCC, we design two semantic-invariant data augmentations for HSI superpixels: pixel sampling augmentation and model weight augmentation. Then sample-level alignment and clustering-center-level contrast are performed for better intra-class similarity and inter-class dissimilarity of superpixel embeddings. We perform clustering and network optimization alternately. Experimental results on several HSI datasets verify the advantages of the proposed SPGCC compared to SOTA methods. Our code is available at https://github.com/jhqi/spgcc.

Abstract:
This paper presents a novel Transformer architecture for zero-shot learning (ZSL), termed TransZSL, which can characterize hierarchical semantic-aware parts. It consists of an adaptive token refinement (ATR), a hierarchical token aggregation (HTA), and semantic-aware prototypes (SAP). Firstly, the ViT is used as the backbone that provides comprehensive local information without missing details. To address the different degrees of noise caused by large appearance variations, the ATR is proposed to highlight important tokens and suppress useless ones adaptively. However, due to the complex image structure, some important tokens may be incorrectly discarded. Therefore, a random perturbation is proposed to reactivate discarded tokens randomly, reducing the risk of missing discriminative information. Secondly, dataset descriptions contain both low- and high-level attributes. To this end, the HTA aggregates complementary hierarchical tokens from multiple ViT layers. Thirdly, semantically similar content may be distributed in different tokens. To overcome this issue, the SAP is proposed to group semantically identical tokens into one prototype, focusing on semantic-aware parts. Besides, diversity loss is used to encourage networks to learn diverse prototypes that discover diverse parts. Both qualitative and quantitative results on several challenging tasks demonstrate the usefulness and effectiveness of our proposed methods.

Abstract:
Weakly supervised person search aims to detect and identify a person with only bounding box annotations. Recent approaches have focused on learning person relations in a single model, ignoring the conflicts between the detection and Re-ID heads, along with the influence of background elements, which may lead to noisy pseudo labels and inaccurate Re-ID features. To address this challenge, we introduce a novel framework named Knowledge Consistency Distillation (KCD) for weakly supervised person search, which explores the capabilities of an advanced unsupervised person re-identification (Re-ID) model to mitigate the conflicts and background influences. We propose hierarchical consistency alignments, including feature-level, cluster-level, and instance-level consistency alignment, to synchronize the knowledge from the state-of-the-art unsupervised Re-ID model. Specifically, the feature-level consistency aligns the features through both context and relation alignment. The cluster-level consistency aligns the teacher cluster information by reusing its OIM module. To tackle the inconsistency problem between student instances and teacher cluster centroids, we incorporate pseudo-label refinement to assist the student model in comprehending the teacher’s knowledge at the cluster level while mitigating the negative effects of noisy labels. Finally, an instance-level consistency loss weighted by the similarity between the instance and its corresponding cluster is proposed to align the positive instance correlations. Our approach aims to train a one-step weakly supervised model for person search by exploiting the characteristics of unsupervised person Re-ID. Extensive experiments illustrate that our method achieves state-of-the-art performance on two widely used person search datasets, CUHK-SYSU and PRW. Our code will be available on GitHub at https://github.com/zongyi1999/KCD.

Abstract:
Despite the large progress in supervised learning with neural networks, there are significant challenges in obtaining high-quality, large-scale and accurately labelled datasets. In such contexts, how to learn in the presence of noisy labels has received more and more attention. Addressing this relatively intricate problem to attain competitive results predominantly involves designing mechanisms that select samples that are expected to have reliable annotations. However, these methods typically involve multiple off-the-shelf techniques, resulting in intricate structures. Furthermore, they frequently make implicit or explicit assumptions about the noise modes/ratios within the dataset. Such assumptions can compromise model robustness and limit its performance under varying noise conditions. Unlike these methods, in this work, we propose an efficient and effective framework with minimal hyperparameters that achieves SOTA results in various benchmarks. Specifically, we design an efficient and concise training framework consisting of a subset expansion module responsible for exploring non-selected samples and a model training module to further reduce the impact of noise, called NoiseBox. Moreover, diverging from common sample selection methods based on the “small loss” mechanism, we introduce a novel sample selection method based on the neighbouring relationships and label consistency in the feature space. Without bells and whistles, such as model co-training, self-supervised pre-training and semi-supervised learning, and with robustness concerning the settings of its few hyper-parameters, our method significantly surpasses previous methods on both CIFAR10/CIFAR100 with synthetic noise and real-world noisy datasets such as Red Mini-ImageNet, WebVision, Clothing1M and ANIMAL-10N.

Abstract:
To further improve the performance of Versatile Video Coding (VVC), a neural network based multi-level in-loop filtering framework for luma and chroma is presented in this letter, which includes Reference pixel Level (RL), Coding tree unit Level (CL), and Frame Level (FL). The neural network based filters in these levels can be flexibly enabled. In RL, the coding performance upper bound is analyzed and asymmetric convolution is designed. In CL, the pixels located at the bottom and rightmost have been assigned greater weights for loss calculation during training. In addition, the co-located luma is adopted in CL and FL chroma filtering for guiding chroma enhancement due to the high correlation between them. For the architecture of neural network, two input channel fusion schemes are combined to enjoy both of their benefits. Extensive experimental results show that the proposed multi-level in-loop filtering method can achieve 6.87%, 32.8%, and 36.9% bit rate reductions on average for Y, U, and V components under all intra configuration, which outperforms the state-of-the-art works.

Abstract:
Human-Object Interaction (HOI) detection aims to infer interactions between humans and objects, and it is very important for scene analysis and understanding. Existing methods usually focus on exploring instance-level (e.g., object appearance) or interaction-level (e.g., action semantics) features to conduct interaction prediction. However, most of these methods only consider self-triplet feature aggregation, which may lead to learning ambiguity without exploring the cross-triplet context exchange. In this paper, from both visual and textual perspectives, we propose a novel method to jointly explore self- and cross-triplet interaction context clues for HOI detection. First, we employ a graph neural network to perform self-triplet aggregation, where human and object features represent graph nodes, and the visual interaction feature and textual prior knowledge act as two different edges. Furthermore, we explore cross-triplet context exchange by incorporating symbiotic and layout relationships among different HOI triplets. Extensive experiments on two benchmarks demonstrate that our proposed method outperforms state-of-the-art methods, achieving 40.32 mAP on HICO-DET and 69.1 mAP on V-COCO.

Abstract:
Ground-truth RGBD data are fundamental for a wide range of computer vision applications; however, those labeled samples are difficult to collect and time-consuming to produce. A common solution to overcome this lack of data is to employ graphic engines to produce synthetic proxies; however, those data often do not reflect real-world images, resulting in poor performance of the trained models at the inference step. In this paper we propose a novel training pipeline that incorporates Diffusion4D (D4D), a customized 4-channel diffusion model able to generate realistic RGBD samples. We show the effectiveness of the developed solution in improving the performance of deep learning models on the monocular depth estimation task, where the correspondence between RGB and depth map is crucial to achieving accurate measurements. Our supervised training pipeline, enriched by the generated samples, outperforms training on synthetic and original data, achieving an RMSE reduction of (8.2%, 11.9%) and (8.1%, 6.1%) respectively on the indoor NYU Depth v2 and the outdoor KITTI dataset.

Abstract:
Recent studies have shown that lip shape and movement can be used as an effective biometric feature for speaker authentication. By using a random prompt text scheme, a lip-based authentication system can also achieve good liveness detection performance in laboratory scenarios. However, with the increasingly widespread use of mobile applications, the authentication system may face additional practical difficulties such as complex backgrounds and limited user samples, which degrade the authentication performance of current methods. To confront these problems, a new deep neural network, i.e., the Triple-feature Disentanglement Network for Visual Speaker Authentication (TDVSA-Net), is proposed in this paper to extract discriminative and disentangled lip features for visual speaker authentication in the random prompt text scenario. Three decoupled lip features, including the content feature inferring the speech content, the physiological lip feature describing the static lip shape and appearance, and the behavioral lip feature depicting the unique pattern in lip movements during utterance, are extracted by TDVSA-Net and fed into corresponding modules to authenticate both the prompt text and the speaker’s identity. Experimental results demonstrate that, compared with several SOTA visual speaker authentication methods, the proposed TDVSA-Net can extract more discriminative and robust lip features, which boost the content recognition and identity authentication performance against both human imposters and DeepFake attacks.

Abstract:
Skeleton-based human motion prediction aims to forecast future skeleton frames conditioned on an observed skeleton sequence. Unlike previous methods that focus on human motion prediction for atomic actions, we observe that people often perform composite actions, which consist of atomic actions that happen simultaneously. Considering the large number of action types, it is more laborious to collect composite actions than atomic actions. This paper presents a practical composite human motion prediction task, whose training data contains only atomic actions while the test data contains both atomic and composite actions. To evaluate this task, we collect a large-scale Composite HumAn Motion Prediction (CHAMP) dataset, whose training data has 16 types of atomic actions and whose test data has 50 types of composite actions. Despite the success of previous human motion prediction methods using Graph Convolutional Networks (GCN), these methods achieve inferior performance on our CHAMP dataset due to the huge domain gap between the training and test data. To solve this problem, we present a composite human motion prediction framework containing three modules. First, a Composite Motion Synthesis (CMS) module is designed to generate synthesized composite human actions from atomic actions. Second, a Composite GCN module is presented to predict human motion by modeling different human body parts. Third, a human body partition policy network is used to choose the best partition strategy for both the CMS and Composite GCN modules. Extensive experiments on the CHAMP dataset verify the effectiveness of our framework, which clearly outperforms GCN-based methods.

Abstract:
Multi-camera vehicle tracking is a fundamental task for city traffic management to count traffic flow or monitor roads. This paper focuses on multi-camera tracking on the highway, which is more challenging than tracking on city streets because of fast-moving vehicles, tiny vehicles with similar appearance, longer tracking distances, and lighting intensity changes in dark tunnels. We propose a practical Appearance-Parsing Spatio-Temporal Trajectory Matching Network (ASTM-Net) based on the global appearance matching of local trajectories for addressing cross-camera tracking tasks on the highway. Specifically, considering environmental disturbance and the similar appearance of small vehicles, we propose a multiple appearance-attribute parsing (MAP) module consisting of a Bi-propagation top-down (Bi-TD) block and an appearance re-identification (ARe-ID) block to obtain salient global appearance-attribute features from a given video sequence. To address discrete tracking fragments caused by occlusion, we develop an appearance-joint-tracking (AJT) mechanism to merge the isolated tracklets with target interaction and occlusion handling. We then exploit an appearance-informed spatio-temporal matching (ASTM) module to achieve multi-camera tracklet-to-target assignment, which employs the spatio-temporal consistency relation for intra-camera trajectory correction and coarse inter-camera tracklet correlation, and aggregates the appearance matrix of local trajectories for assigning global trajectory IDs. Finally, in order to evaluate the proposed ASTM-Net, a new dataset collected on the highway, named HST, is established. We verify ASTM-Net on HST and three other public datasets, i.e., CityFlow, UA-DETRAC, and Synthehicle; the experimental results demonstrate the effectiveness and robustness of the proposed method.

Abstract:
Recently, fully-connected tensor network (FCTN) decomposition, which factorizes the target tensor into a series of interconnected factor tensors, has drawn growing focus on multi-dimensional visual data processing. However, the lack of clear physical interpretation for the factor tensors hinders us from introducing handcrafted regularizers to deeply explore the potential of FCTN decomposition. To tackle this issue, we suggest a unimode hierarchical nonlinear (UHN) decomposition for each factor tensor, which can adaptively capture the complex nonlinear structure and implicitly regularize factor tensors. With this UHN decomposition of the factor tensors, we naturally propose a nested fully-connected tensor network (N-FCTN) decomposition. Attributed to the adaptive and implicit regularization inherent in UHN decomposition of factor tensors, the proposed N-FCTN decomposition is expected to perform favorably against the original FCTN decomposition. Based on the proposed N-FCTN decomposition, we build a multi-dimensional visual data recovery model and provide a theoretical error bound between the recovered tensor by our model and the underlying tensor. To address the resulting non-convex and nonlinear optimization problem, we develop an efficient proximal alternating minimization (PAM)-based algorithm and establish its theoretical convergence guarantee. Extensive experimental results on multi-spectral images, color videos, and light field data demonstrate the superior recovery performance of the proposed method compared to the state-of-the-art methods.

Abstract:
Weakly supervised object localization (WSOL) aims to train instance-level locators by exploiting accessible image-level labels. By multiplying channel-wise features with classification weights and then adding them together, most prior works follow the pipeline of the Class Activation Map (CAM) to collect the semantic responses, thereby highlighting regions that contribute to class prediction to achieve WSOL. However, CAM-based methods treat the class contributions of all pixel positions in a channel equally and assign biased, dominant weights to the discriminative channels. This fails to express the fine-grained pixel-level semantic response of each channel and to model the complex contextual relations between channels, resulting in the mixing of activation values between non-discriminative foreground regions and the background. To alleviate these issues, we present a Local Semantic activation enhancement and Global Spatial correlation mining network (LSGS-Net) for accurate WSOL. Specifically, we first propose a local activation generation module to explicitly learn the semantic response of each pixel position from channels. Then, we design a regularization loss to supervise the consistency between similar local activations, which utilizes the cross-image information to improve the accuracy of local activations. We further propose a K-nearest Neighbors graph module to capture the spatial correlation between different local activations, which can adaptively assign more proper weights when fusing all local activations. In the inference stage, the bounding box is determined with a foreground threshold. Extensive experiments show that LSGS-Net achieves significant and consistent improvement with various backbones on the CUB, ILSVRC, and OpenImages benchmarks, with 97.5% and 75.3% GT-Known LOC on CUB and ILSVRC, respectively. For segmentation quality on OpenImages, LSGS-Net exceeds the SOTA method by 1.2% pIoU and 1.9% PxAP.
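The CAM pipeline described at the start of this abstract (weight each feature channel by its classification weight, sum over channels, then threshold) can be sketched as follows. This shows only the baseline CAM that the paper argues against, not LSGS-Net itself; the nested-list layout `features[channel][row][col]` and the min-max normalization before thresholding are assumptions made for illustration.

```python
def class_activation_map(features, weights):
    """Baseline CAM: M[y][x] = sum_k weights[k] * features[k][y][x]."""
    height = len(features[0])
    width = len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for k, channel in enumerate(features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weights[k] * channel[y][x]
    return cam


def foreground_mask(cam, threshold):
    """Min-max normalize the map and keep positions at or above threshold."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span >= threshold for v in row] for row in cam]
```

Note that every position in channel k receives the same scalar weight `weights[k]`, which is the equal per-pixel treatment within a channel that the abstract identifies as a limitation.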

Abstract:
Making full use of spatial-temporal information is the key factor for removing compressed video artifacts. Recently, many deep learning-based compression artifact reduction methods have emerged. Among them, a series of methods based on deformable convolution have shown excellent capabilities in spatio-temporal feature extraction. However, local deformable offset prediction and pixel-wise inter-frame feature alignment in the unidirectional form limit the full utilization of temporal features in the existing method. Additionally, compressed video shows inconsistent degrees of distortion on different frequency components, and their restoration difficulty is also nonuniform. For the above problems presented by existing methods, we propose an enlarged motion-aware and frequency-aware network (EMAFA) to further extract spatio-temporal information and enhance information of different frequency components. To perceive different degrees of motion artifacts between compressed frames as accurately as possible, we design a bidirectional dense propagation pattern with pixel-wise and patch-wise deformable convolution (PIPA) module in the feature domain. In addition, we propose a multi-scale atrous deformable alignment (MSADA) module to enrich spatio-temporal features in image domain. Moreover, we design a multi-direction frequency enhancement (MDFE) module with multiple direction convolution to enhance the features of different frequency components. The experimental results show that the proposed method performs better than the state-of-the-art methods in both objective evaluation and visual perception experience. Supplementary experiments for Internet Streamed Video with hybrid-distortion demonstrate that our method also exhibits considerable generalizability for quality enhancement.

Abstract:
With the rapid development of imaging sensor technology in the field of remote sensing, multi-modal remote sensing data fusion has emerged as a crucial research direction for land cover classification tasks. While diffusion models have made great progress in generative modeling and image classification tasks, existing models primarily focus on single-modality and single-client control, that is, the diffusion process is driven by a single modality in a single computing node. To facilitate the secure fusion of heterogeneous data from clients, it is necessary to enable distributed multi-modal control, such as merging the hyperspectral data of organization A and the LiDAR data of organization B privately on each base station client. In this study, we propose a multi-modal collaborative diffusion federated learning framework called FedDiff. Our framework establishes a dual-branch diffusion model feature extraction setup, where the data of the two modalities are fed into separate branches of the encoder. Our key insight is that diffusion models driven by different modalities are inherently complementary in terms of potential denoising steps on which bilateral connections can be built. Considering the challenge of private and efficient communication between multiple clients, we embed the diffusion model into the federated learning communication structure, and introduce a lightweight communication module. Qualitative and quantitative experiments validate the superiority of our framework in terms of image quality and conditional consistency. To the best of our knowledge, this is the first instance of deploying a diffusion model into a federated learning framework, achieving both privacy protection and strong performance for heterogeneous data. Our FedDiff surpasses existing methods in terms of performance on three multi-modal datasets, achieving an average classification accuracy of 96.77% while reducing the communication cost.

Abstract:
With the rapid advancement of three-dimensional (3D) sensing technology, the point cloud has emerged as one of the most important approaches for representing 3D data. However, quality degradation inevitably occurs during the acquisition, transmission, and processing of point clouds. Therefore, point cloud quality assessment (PCQA) with automatic visual quality perception is particularly critical. In the literature, graph convolutional networks (GCNs) have achieved promising performance in point cloud-related tasks. However, they cannot fully characterize the nonlinear high-order relationships of such complex data. In this paper, we propose a novel no-reference (NR) PCQA method with hypergraph learning. Specifically, a dynamic hypergraph convolutional network (DHCN) composed of a projected image encoder, a point group encoder, a dynamic hypergraph generator, and a perceptual quality predictor is devised. First, the projected image encoder and the point group encoder are used to extract feature representations from projected images and point groups, respectively. Then, using the feature representations obtained by the two encoders, dynamic hypergraphs are generated during each iteration, aiming to constantly update the interactive information between the vertices of hypergraphs. Finally, we design the perceptual quality predictor to conduct quality reasoning on the generated hypergraphs. By leveraging the interactive information among hypergraph vertices, feature representations are well aggregated, resulting in a notable improvement in the accuracy of quality prediction. Experimental results on several point cloud quality assessment databases demonstrate that our proposed DHCN can achieve state-of-the-art performance. The code will be available at: https://github.com/chenwuwq/DHCN.

Affiliations: School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, China; Department of General Surgery, Xiangya Hospital, Central South University, Changsha, China; School of Cyber Science and Engineering, MoE KLINNS Laboratory, Xi’an Jiaotong University, Xi’an, China; Hunan Frontline Medical Technology Company Ltd., Changsha, China; Department of Orthopaedics, The Second Affiliated Hospital of Xi’an Jiaotong University, Xi’an, China

Abstract:
Recent advances in deep learning have greatly facilitated the automated segmentation of ultrasound images, which is essential for nodule morphological analysis. Nevertheless, most existing methods depend on extensive and precise annotations by domain experts, which are labor-intensive and time-consuming. In this study, we suggest using simple aspect ratio annotations taken directly from ultrasound clinical diagnoses for automated nodule segmentation. Specifically, an asymmetric learning framework is developed by extending the aspect ratio annotations with two types of pseudo labels, i.e., conservative labels and radical labels, to train two asymmetric segmentation networks simultaneously. Subsequently, a conservative-radical-balance strategy (CRBS) is proposed to combine radical and conservative labels in a complementary manner. An inconsistency-aware dynamically mixed pseudo-labels supervision (IDMPS) module is introduced to address the challenges of over-segmentation and under-segmentation caused by the two types of labels. To further leverage the spatial prior knowledge provided by clinical annotations, we also present a novel loss function, namely the clinical anatomy prior loss. Extensive experiments on two clinically collected ultrasound datasets (thyroid and breast) demonstrate the superior performance of our proposed method, which can achieve comparable and even better performance than fully supervised methods using ground truth annotations.

Abstract:
Although recurrent network-based optical flow estimation methods have shown great success in recent years, most of these methods have difficulty handling large displacements and occlusions because the existing recurrent networks are usually restricted to coarse-resolution single-scale models while ignoring the multiscale features brought by hierarchical concepts in previous coarse-to-fine approaches. In this paper, we propose an adaptive-aware correlation recurrent network for optical flow estimation, named ACR-Net, which preserves fine motion features with a single-scale resolution recurrent framework and adaptively incorporates multiscale features at different stages to achieve high-accuracy optical flow estimation. First, our proposed self-adaptation scale-aware correlation module can incorporate the adaptive correlation of multiscale inter- and intra-motion features, which makes the features more discriminative for capturing long-range dependencies between pixels. Second, our presented adaptive-aware motion module can effectively extract the required features of different kinds of motion from multilevel correspondence. Third, our introduced cross-guide motion and fusion modules can accurately guide the propagation of reliable pixels towards unreliable pixels and dynamically determine the most suitable expression to address the occlusion challenges. Comprehensive experiments demonstrate that ACR-Net outperforms existing two-view models, striking a good balance between speed and accuracy and achieving the best performance on the MPI-Sintel final pass and KITTI-2015 test datasets. Source code is available at https://github.com/PCwenyue/ACR-Net-TCSVT.

Abstract:
The effectiveness of object detection is significantly hampered in challenging nighttime or rainy scenarios. This is due to the severe domain shifts between daytime and adverse-visual images. Previous methods have demonstrated that using image-to-image translation methods for data augmentation can effectively address domain shifts, but they may still fail in preserving image objects when faced with extreme adverse images like rainy nights. In addition, achieving diversity in the generated results remains challenging. To this end, we propose a Progressive Adverse Image Translation (PAIT) framework that tackles domain shifts by generating diverse and detail-preserving images. The main contributions of this paper are as follows. 1) We propose a novel PAIT framework, which incorporates an iterative mapping module and a slicing layer. This framework enables the progressive generation of increasingly challenging images in a fine-to-coarse manner. 2) To preserve the details of the images, we innovatively introduce an iterative mapping module to generate smooth style transform curves. 3) To enhance the diversity of synthesized images, a simple but efficient end-to-end optimization method is proposed. 4) We found a strong correlation between the style diversity of augmented images and the performance of the detection model through a quantitative analysis, highlighting the crucial role of style diversity in enhancing the model’s generalizability. Our framework achieves state-of-the-art performance on multiple challenging visual datasets, surpassing the current state-of-the-art methods by 27%(+8.0AP). Moreover, our approach and modules can be easily extended to different detectors and other domain adaptation methods, making it a versatile solution for object detection in adverse visual environments. Our code will be available at https://github.com/ssunguotu/Diverse-Aug.

Abstract:
Dynamic hand gesture authentication aims to recognize users’ identity through the characteristics of their hand gestures. How to extract favorable features for verification is the key to success. Cross-modal knowledge distillation is an intuitive approach that can introduce additional modality information in the training phase to enhance the target modality representation, improving model performance without incurring additional computation in the inference phase. However, most previous cross-modal knowledge distillation methods directly transfer information from one modality to another one without considering the modality gap. In this paper, we propose a novel translation mechanism in cross-modal knowledge distillation that can effectively mitigate the modality gap and utilize the information from the additional modality to enhance the target modality representation. In order to better transfer modality information, we propose a novel modality fusion-enhanced non-local (MFENL) module, which can fuse the multi-modal information from the teacher network and enhance the fused features based on the modality input into the student network. We use cascaded MFENL modules as the translator based on the proposed cross-modal knowledge distillation method to learn an enhanced RGB representation for dynamic hand gesture authentication. Extensive experiments on the SCUT-DHGA dataset demonstrate that our method has compelling advantages over the state-of-the-art methods. The code is available at https://github.com/SCUT-BIP-Lab/TranslationCKD.

Abstract:
Currently, learning-based multi-view stereo (MVS) has been dominated by the pipeline of 3D cost volume and regularization network over the static cost volume for depth regression. However, this methodology is plagued by heavy time and memory consumption, which greatly hinders the applications of these methods for real-world high-resolution images. To address these challenges, we present Effi-MVS+, an efficient multi-scale dynamic cost volume based MVS method. Firstly, instead of constructing a static cost volume and predicting a probability distribution map for depth regression, we update the depth map by iteratively predicting depth residuals. In each iteration, we construct a lightweight dynamic cost volume by encoding local matching and regularization information. The dynamic cost volume is subsequently processed using a 2D convolution-based GRU, which owns significant advantages in computational complexity and efficiency. Secondly, we propose a cross-scale propagation mechanism to enhance the multi-scale dynamic cost volume. This mechanism facilitates the progressive aggregation of multi-scale information, thereby providing enhanced matching and regularization information. Thirdly, to further improve the efficiency, we provide a reliable initial depth map to launch the framework and guarantee fast convergence. Extensive experiments on the DTU and Tanks & Temples benchmarks demonstrate the superiority of our method, which outperforms other state-of-the-art methods by a large margin in terms of reconstruction quality, speed, and memory usage. Code will be released at https://github.com/npucvr/Effi-MVS-plus.

Abstract:
The human visual system (HVS) cannot perceive the pixel intensity change below a certain threshold which is also known as the just noticeable difference (JND). Conventional JND prediction models mainly follow a two-step pipeline by first modeling the diverse masking effects based on the findings of the HVS and then fusing the results of different masking effect models into an overall JND map. However, due to the insufficient understanding of the HVS properties at the current stage, it is difficult to devise accurate computational models to characterize the complex masking effects. Moreover, the reasonability of the manually designed fusion schemes also lacks justification. In this work, we rethink the JND estimation problem from a fresh perspective by conceptualizing the JND as the difference map between the pristine image and its corresponding Critical Perceptual Lossless (CPL) counterpart. Building on this insight, we introduce a deep residual learning framework called ResJND to learn the discrepancies between the pristine image and its CPL counterpart, aiming to predict JND map implicitly. To support the training of our proposed ResJND model, we construct a dedicated CPL image dataset called CPL-Set which comprises a collection of pristine images and their corresponding CPL images selected by thorough subjective experiments. Comprehensive experiments have conclusively shown that our ResJND model excels at accurately predicting the JND map. Additionally, it demonstrates superior performance in associated applications, such as JND-guided noise injection, JND-guided image compression, and distortion visibility prediction. Codes are available at: https://github.com/Knife646/ResJND.
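The definition used above, the JND as a difference map between a pristine image and its CPL counterpart, can be illustrated with a minimal sketch. It only shows the target that a model such as ResJND would learn to predict, not the network itself. The function name `jnd_map`, the nested-list grayscale image representation, and the use of the absolute difference are assumptions made for illustration; the abstract does not specify them.

```python
def jnd_map(pristine, cpl):
    """Per-pixel difference map between a pristine image and its CPL counterpart.

    Both images are nested lists of grayscale intensities (an assumed,
    simplified representation). Taking the absolute value is also an
    assumption of this sketch.
    """
    same_height = len(pristine) == len(cpl)
    same_widths = all(len(a) == len(b) for a, b in zip(pristine, cpl))
    if not (same_height and same_widths):
        raise ValueError("images must have the same shape")
    return [
        [abs(p - c) for p, c in zip(row_p, row_c)]
        for row_p, row_c in zip(pristine, cpl)
    ]


# Toy 2x2 example with made-up intensities.
pristine_image = [[100, 120], [130, 140]]
cpl_image = [[103, 118], [130, 146]]
example_jnd = jnd_map(pristine_image, cpl_image)
```

Each entry of `example_jnd` is the intensity change that, under this definition, stays at the edge of visibility for the corresponding pixel.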

Abstract:
This paper presents a hyperspectral image (HSI) reconstruction technique based on physics-driven optimization of multispectral filter array (MSFA) patterns. The encoding of HSIs using an MSFA and their decoding through deep learning has gained increasing attention. However, previous studies have seldom explored pattern optimization from a physical perspective during the encoding process. In this paper, we apply a spectral sensitivity function (SSF) response model to generate the MSFA, and the goal of encoder optimization extends from SSF to physical structural parameters. To fully utilize spatial and spectral information in the decoding process, we design an end-to-end dual-branch spatial-spectral fusion network (DSFNet). By jointly optimizing the MSFA with the SSF response model and DSFNet, the proposed method significantly improves the reconstruction accuracy of HSI. When compared with existing HSI reconstruction methods, our proposed approach achieves state-of-the-art performance in both metric and visual quality.

Abstract:
Intra prediction is a vital tool in video coding that eliminates the spatial redundancy within a frame to enhance compression efficiency. Conventional intra prediction methods employ multiple directional prediction modes to describe textures in local areas. Recently, research on neural network-based intra prediction has achieved great success. The block-context pairs are divided into multiple clusters according to a predefined relationship, and a corresponding network is trained and applied for each cluster. However, the networks in these methods adopt fixed parameters to predict diverse image blocks, making it hard to cope with various textures in natural images. Inspired by recent works on parameter prediction, in this paper, we propose a meta-network-based intra prediction method, called MetaIP, that dynamically customizes the network parameters for each block sample in a given cluster. MetaIP consists of a meta-subnetwork and a prediction subnetwork. For an image block, the meta-subnetwork takes its neighboring reference pixels and some auxiliary information (e.g., quantization parameter) as inputs to generate customized parameters first. Then, the prediction subnetwork uses the customized parameters to infer the predicted block. MetaIP can generate multiple sets of network parameters corresponding to multiple modes for an image block. The optimal mode is determined by the rate-distortion optimization. MetaIP is integrated into VVC to assist or replace the directional prediction modes to evaluate its performance. The experimental results demonstrate that MetaIP with four prediction modes achieves an average of 3.84% and 1.96% bitrate saving for the luma component over VTM-17.0 when assisting or replacing VVC intra modes, respectively.
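The mode decision by rate-distortion optimization mentioned above can be sketched as choosing the candidate that minimizes the standard Lagrangian cost J = D + λR. This is a minimal illustration rather than the VTM implementation; the mode names, distortion values, bit counts, and the helper `select_mode` are hypothetical.

```python
def select_mode(candidates, lam):
    """Return the name of the candidate with the lowest cost J = D + lam * R.

    candidates: list of (mode_name, distortion, rate_bits) tuples.
    lam: Lagrange multiplier trading distortion against rate.
    """
    best = min(candidates, key=lambda c: c[1] + lam * c[2])
    return best[0]


# Hypothetical candidates: two network-generated modes and one directional mode.
modes = [
    ("meta_0", 120.0, 30),
    ("meta_1", 100.0, 45),
    ("angular_18", 140.0, 20),
]
low_lambda_choice = select_mode(modes, lam=1.0)   # costs: 150, 145, 160
high_lambda_choice = select_mode(modes, lam=3.0)  # costs: 210, 235, 200
```

A larger λ penalizes rate more heavily, so the cheaper-to-signal mode wins even though its distortion is higher.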

Abstract:
There is an urgent need from various multimedia applications to efficiently compress point clouds. The Moving Picture Experts Group has released a standard platform called geometry-based point cloud compression (G-PCC). However, its k-nearest neighbor (k-NN) based attribute prediction has limited efficiency for point clouds with rich texture and directional information. To overcome this problem, we propose a texture-aware attribute predictive coding framework based on a point cloud diffusion model. In our work, attribute intra prediction is solved as a diffusion-based interpolation problem, and a general attribute predictor is developed. It is theoretically proven that the G-PCC k-NN based predictor is a degenerate case of the proposed diffusion-based solution. First, a point cloud is represented as two levels of detail, with seeds as the inpainting mask and non-seed points to be predicted. Second, we design point cloud partial difference operators to perform energy-minimizing attribute inpainting from seeds to unknowns. Smooth attribute interpolation can be achieved via an iterative diffusion process, and an adaptive early termination is proposed to reduce complexity. Third, we propose a structure-adaptive attribute predictive coding scheme, where edge-enhancing anisotropic diffusion is employed to perform texture-aware attribute prediction. Finally, the attributes of seeds are encoded beforehand, and the prediction residuals of the remaining points are progressively encoded into the bitstream. Experiments show that the proposed scheme surpasses the state-of-the-art by an average of 14.14%, 17.52%, and 17.87% BD-BR gains on the coding of Y, U, and V components, respectively. Subjective results on attribute reconstruction quality also verify the advantage of our scheme.
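For context, the k-NN based baseline that the abstract contrasts against can be sketched as an inverse-distance-weighted average of the attributes of the k nearest already-coded points. This is a simplified illustration in the spirit of k-NN attribute prediction, not the G-PCC reference code and not the proposed diffusion-based predictor; the function name `knn_predict`, the seed layout, and the `eps` guard are assumptions.

```python
import math


def knn_predict(query, seeds, k=3, eps=1e-9):
    """Predict an attribute at `query` from the k nearest seed points.

    query: (x, y, z) position of the point to predict.
    seeds: list of ((x, y, z), attribute) pairs with known attributes.
    Weights are inverse Euclidean distances; eps avoids division by zero.
    """
    nearest = sorted(seeds, key=lambda s: math.dist(query, s[0]))[:k]
    weights = [1.0 / (math.dist(query, pos) + eps) for pos, _ in nearest]
    total = sum(weights)
    return sum(w * attr for w, (_, attr) in zip(weights, nearest)) / total


# Hypothetical seeds: two equidistant neighbours and one far outlier.
seed_points = [
    ((1.0, 0.0, 0.0), 10.0),
    ((-1.0, 0.0, 0.0), 20.0),
    ((5.0, 0.0, 0.0), 100.0),
]
predicted = knn_predict((0.0, 0.0, 0.0), seed_points, k=2)
```

Because the weights depend only on distance, such a predictor ignores edge direction and texture, which is the limitation the abstract attributes to k-NN prediction.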

Abstract:
The human pose transfer task aims to generate synthetic person images that preserve the style of reference images while accurately aligning them with the desired target pose. However, existing methods based on generative adversarial networks (GANs) struggle to produce realistic details and often face spatial misalignment issues. On the other hand, methods relying on denoising diffusion models require a large number of model parameters, resulting in slower convergence rates. To address these challenges, we propose a self-calibration flow-guided module (SCFM) to establish precise spatial correspondence between reference images and target poses. This module facilitates the denoising diffusion model in predicting the noise at each denoising step more effectively. Additionally, we introduce a multi-scale feature fusing module (MSFF) that enhances the denoising U-Net architecture through a cross-attention mechanism, achieving better performance with a reduced parameter count. Our proposed model outperforms state-of-the-art methods on the DeepFashion and Market-1501 datasets in terms of both the quantity and quality of the synthesized images. Our code is publicly available at https://github.com/zylwithxy/SCFM-guided-DDPM.

Abstract:
Large language models (LLMs) have shown their capabilities in understanding contextual and semantic information regarding knowledge of instance appearances. In this paper, we introduce a novel approach to utilize the strengths of LLMs in understanding contextual appearance variations and to leverage this knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of the crucial tasks directly related to our safety (e.g., intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus that includes numerous narratives describing various appearances of pedestrians and other instances. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. Subsequently, we perform a task-prompting process to obtain appearance elements which are guided representative appearance knowledge relevant to a downstream pedestrian detection task. The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).

Abstract:
Single image-based 3D shape retrieval (IBSR), which aims to find the corresponding 3D shape in a shape repository for a given single 2D image, has recently attracted considerable academic interest. However, state-of-the-art methods neglect the discrepancy in the image domain due to unavoidable occlusion. The occluded image representations, acting as noise, may perturb the alignment of the normal 2D representations with the 3D representations, resulting in occlusion-sensitive image-shape retrieval. To tackle this crucial challenge, in this paper, we propose a novel Occlusion-invariant PErception Network (OPEN) to learn occlusion-invariant image representations and image-shape correspondence. Specifically, we propose a hard occlusion example mining strategy to sample a hard image pair. Then, to enforce the consistency between normal and occluded 2D images, we propose an Occlusion-invariant Image Consistency (OIC) based on hard image pairs, which gathers 2D image representations of the same instance while pushing away other 2D image representations. In addition, to prevent the 3D representations from being perturbed by the occluded 2D representations, we design an Occlusion-invariant Correspondence Consistency (OCC) based on hard image pairs, which pulls the image-specific 3D shape embedding derived by an attention mechanism close to the other 2D image representation of the same instance. The combination of OIC and OCC leads to accurate 2D-3D shape matching in challenging occluded scenarios. Our OPEN outperforms state-of-the-art methods by 6% to 11% in terms of Top-1 retrieval accuracy on several representative benchmark datasets.

Abstract:
Object detection via deep neural networks has undergone considerable advancements in recent years. Yet, the detection of smaller objects, specifically those with a few pixels (i.e., < 32^2 pixels), is still challenging compared with large objects (i.e., > 96^2 pixels). Existing methods commonly apply high-resolution features or complex super-resolution strategies based on the two-stage Faster Region Convolutional Neural Network (RCNN). They sequentially apply localization and classification stages after a shared feature map extracted by one single backbone network. However, these methods suffer from low detection accuracy for small objects, high computational overhead, and wasted hardware resources. In this paper, we develop a high-accuracy and real-time small object detection system with negligible computational overhead and low hardware idleness. At the software level, we propose a two-stage Coarse-to-Fine Decoupling RCNN (CFD RCNN) with three techniques: 1) decoupling the shared backbone for localization and classification to achieve high accuracy for both tasks; 2) a training method using backbone feature upsampling for localization with low computational overhead; 3) an object cropping strategy from the original high-resolution image for high-accuracy classification. At the hardware level, we propose a virtualized FPGA accelerator with the Dynamic Resource Allocation (DRA) strategy. The DRA strategy reallocates the hardware resources, considering the workload and resource preference of each stage in CFD RCNN to reduce hardware idleness. Extensive experiments on the TT100K and GTSDB datasets using a Xilinx ZCU102 FPGA show that the proposed small object detection system can achieve a 2.9% improvement in mean average precision (mAP) compared with state-of-the-art (SOTA) algorithms and raise the throughput from 18.9 FPS to more than 26.0 FPS (about 1.37×) compared with existing accelerators.

Abstract:
With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative Sentinel-Guided Zero-Shot Learning (SG-ZSL) paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model, and a generator that links the two models. The teacher model serves as a sentinel on behalf of the data owner, replacing real data, to guide the student model at the AI service provider’s end during training. Considering the disparity of knowledge space between the teacher and student, we introduce two variants of the teacher model: the omniscient and the quasi-omniscient teachers. Under these teachers’ guidance, the student model seeks to match the teacher model’s performance and explores domains that the teacher has not covered. To trade off between privacy and performance, we further introduce two training protocols with distinct security levels, white-box and black-box, enhancing the paradigm’s adaptability. Despite the inherent challenges of real data absence in the SG-ZSL paradigm, it consistently performs well in ZSL and GZSL tasks, notably under the white-box protocol. Our comprehensive evaluation further attests to its robustness and efficiency across various setups, including the stringent black-box training protocol.

Abstract:
Novel-view synthesis with sparse input views is important for practical applications such as AR/VR and autonomous driving. Many works in this field have already integrated depth information into NeRF, utilizing depth priors for assistance in geometric and spatial understanding. However, most existing work tends to either overlook the inaccuracies in depth maps or only handle them roughly, limiting the effectiveness of the synthesis. To address this issue, we propose a depth-guided robust point cloud fusion NeRF for sparse input synthesis. We first construct a point cloud for each input view, with a novel point cloud representation based on learnable matrices and vectors. Then, through an additional lightweight scene fusion network, we fuse the point clouds from each input view to build a point cloud of the entire scene. By optimizing the point cloud representation and scene fusion network, inaccuracies in the depth map can be adjusted and refined, thereby achieving a more precise perception of the overall scene. Each voxel in the scene is determined by referencing the fused point cloud to establish its density and appearance. Experimental results demonstrate that our method outperforms state-of-the-art baselines.

Abstract:
Spatio-temporal action detection networks need to simultaneously extract and fuse spatial and temporal features, which often makes existing models bloated and difficult to run in real time or deploy on edge devices. This paper introduces an efficient and real-time spatio-temporal action detection model, YOWOv3. This model uses efficient 3D and 2D backbone networks to separately extract spatial and spatio-temporal features from sequential information. A lightweight spatio-temporal feature fusion module, designed by deeply integrating convolution and self-attention mechanisms, further enhances the extraction of spatio-temporal features. We refer to this module as the CFACM (Channel Fusion & Attention Convolution Mix) module. Our approach not only is lighter than the latest efficient spatio-temporal action detection models, reducing the model size by 24%, but also improves the mAP on the UCF101-24 dataset by 1.35% while maintaining excellent speed, thus achieving a balance between accuracy and speed. Furthermore, existing models often use 3D convolutions to extract temporal information, which may be limited on certain devices, such as Apple’s M series processors. To mitigate the potential issue of 3D convolution operators not being supported during edge deployment of spatio-temporal action detection models, we employ a spatio-temporal shift module containing only 2D convolutions. This enables the model to acquire temporal information and inject the obtained temporal features into multi-level spatio-temporal feature extraction models. This not only liberates the model from the constraints of 3D convolution operations but also enhances the model’s balance between accuracy and speed, resulting in state-of-the-art performance among lightweight networks using only 2D convolutions.
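The general idea of acquiring temporal information without 3D convolutions can be sketched with a channel shift along the time axis, in the style of temporal shift modules: a slice of channels is taken from the next frame, another slice from the previous frame, and the rest are left in place, so a subsequent 2D convolution sees features from neighbouring frames. The abstract does not specify the internals of the shift module used in YOWOv3, so the function `temporal_shift`, the `fold` parameter, and the zero padding at clip boundaries are assumptions of this sketch.

```python
def temporal_shift(frames, fold):
    """Shift a fraction of channels along the time axis.

    frames: list over time, each entry a list of C channel values
            (spatial dimensions are omitted to keep the sketch small).
    fold:   number of channels shifted in each direction.
    Channels [0, fold) come from the next frame, channels [fold, 2*fold)
    come from the previous frame, and the rest are unchanged. Positions
    with no source frame are zero-padded.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                source = t + 1
            elif c < 2 * fold:
                source = t - 1
            else:
                source = t
            if 0 <= source < num_frames:
                shifted[t][c] = frames[source][c]
    return shifted


# Toy clip: 3 frames, 3 channels each.
clip = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
shifted_clip = temporal_shift(clip, fold=1)
```

The shift itself has no parameters and no multiplications, which is why this family of modules is attractive for edge deployment.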

Abstract:
Re-detection is a necessary capability for long-term tracking. Target candidate proposals from the whole image can provide a chance to reset tracking when it fails due to tracking drift or target invisibility. In this paper, we propose a unified local-global tracker based on a single transformer architecture with shared weights, which can not only search in a continuous local region but also provide target candidates from the global image in every frame. The requirements of both long-term and short-term scenarios can be addressed using a unified model. A simple proposal selection scheme is adopted to properly select the candidate proposals for re-detection, to assist tracking and obtain better performance. The scheme re-evaluates all high-quality proposals using a transformer-based embedding network once the predicted state of the local tracking is not sufficiently accurate. To capture appearance variations through online updates with minimal risk, a long-term-friendly dynamic template update scheme is also designed. Extensive experiments on three short-term tracking benchmarks and six long-term benchmarks demonstrate the effectiveness of our proposed tracker. Our tracker achieves results comparable to those of the state-of-the-art. The proposed tracker also balances performance and speed well, achieving an average speed of approximately 25 fps on the LaSOT test set.

Abstract:
Text-based Person Retrieval aims to search for the target pedestrian image in video surveillance footage or a large image database given a text description. Previous works have recognized the significance of mining local information in images and descriptions and performing fine-grained alignment. These approaches adopt hard division or auxiliary networks for locating local visual regions. However, these two approaches are not flexible enough for various images and may even introduce noise. Meanwhile, Vision-Language Pre-training models like CLIP exhibit strong generalization and zero-shot abilities, which offer a viable way to address this issue. In this paper, we propose a novel Fine-Granularity Alignment model with Semantics-Centric Visual Division (SCVD). Our method contains a Semantics Deconstructor (SD), a Cross-modal Guided Interaction (CGI) module, and a Dynamic Focus Alignment (DFA) module. The SD aims to extract fine-grained semantic prompts from the raw description that are easy for CLIP to understand. In CGI, we propose a Text-Guided Visual Localization (TVL) module to generate local visual representations according to the semantic prompts and a Vision-Guided Semantics Reconstruction (VSR) module to integrate the prompts into the textual representation. Finally, the DFA is used to align fine-grained vision-text information. Extensive experiments demonstrate that our proposed framework significantly outperforms current state-of-the-art methods in terms of the Rank@1 metric on three benchmarks by an absolute gain of 6.56%, 8.93%, and 11.53%, respectively. Our code is available at https://github.com/tujun233/SCVD.git.

Abstract:
Despite extensive exploration of more powerful multi-object tracking (MOT) frameworks, the impact of frequent occlusion has remained a formidable challenge. In this work, we present a novel MOT framework with Authenticity Hierarchizing and Occlusion Recovery (AHOR), that strikingly handles occlusion and demonstrates superior precision and adaptability. Specifically, through an in-depth analysis of the classical tracking-by-detection (TBD) paradigm, we fully upgrade three aspects. Firstly, we propose an Existence Score that provides a more accurate depiction of detection authenticity under occlusion, enhancing the effectiveness and robustness of the hierarchical association. Secondly, we present an ingeniously devised pre-processing method in conjunction with a Recovery Intersection over Union (RIoU) for location similarity measurement, addressing the adverse effects of occlusion-induced disparity between visible and true object regions. Lastly, we introduce an Occluded Person Re-identification Module (ODReID) that extracts appearance features from the restricted visible region, overcoming the critical dependence on object quality. Results of extensive experiments demonstrate that our AHOR achieves state-of-the-art performance on MOT17, MOT20, DanceTrack, and VisDrone test sets.

Abstract:
Recent advances in NeRF-based 3D-aware GANs have achieved outstanding performance, especially in the realm of human facial representations, making projection of facial images back into their latent space superior and preferable compared to 2D GAN inversion. However, the direct application of 2D GAN inversion techniques to 3D GANs raises challenges due to potential appearance distortions and geometric inconsistencies. To tackle these issues, this work presents a novel integrated framework that combines a composite inversion pipeline in both the SS and W+ spaces and integrates a contrastive-based training strategy, ensuring proficient disentanglement within the module. Moreover, we design a facial semantic manipulation technique based on dimensional analysis of the latent code, which is fully compatible with the proposed 3D GAN inversion pipeline. Comprehensive experimental validations substantiate the effectiveness of the proposed approach in executing 3D-aware face inversion and semantic editing tasks, presenting a robust technological solution for a diverse array of downstream digital human modeling applications.

Affiliations: Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China; School of Information Management, Jiangxi University of Finance and Economics, Nanchang, Jiangxi, China; Multimedia Laboratory, ByteDance Inc., Shenzhen, China; Institute for Quantum Information and the State Key Laboratory of High Performance Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, China

Abstract:
Investigating how people perceive virtual reality (VR) videos in the wild (i.e., those captured by everyday users) is a crucial and challenging task in VR-related applications due to complex authentic distortions localized in space and time. Existing panoramic video databases only consider synthetic distortions, assume fixed viewing conditions, and are limited in size. To overcome these shortcomings, we construct the VR Video Quality in the Wild (VRVQW) database, containing 502 user-generated videos with diverse content and distortion characteristics. Based on VRVQW, we conduct a formal psychophysical experiment to record the scanpaths and perceived quality scores from 139 participants under two different viewing conditions. We provide a thorough statistical analysis of the recorded data, observing significant impact of viewing conditions on both human scanpaths and perceived quality. Moreover, we develop an objective quality assessment model for VR videos based on pseudocylindrical representation and convolution. Results on the proposed VRVQW show that our method is superior to existing video quality assessment models. We have made the database and code available at https://github.com/limuhit/VR-Video-Quality-in-the-Wild.

Abstract:
Burst denoising aims to generate a clean image based on a sequence of noisy frames of the same scene captured in quick succession. However, relative motions inevitably happen between frames due to the movements of scenes or cameras, which would lead to blur and ghosting in the generated images. To address this issue, in this paper we propose a novel Efficient Burst Denoising Network (EBDNet) by integrating optical flow estimation with a kernel prediction network in an end-to-end manner. First, a lightweight Denoising Optical Flow Estimation (DOFE) module is presented for both burst feature and image alignment, which helps reduce the effect of noise on optical flow estimation. Building upon the aligned burst features and frames, a new fast Fourier convolution-enhanced kernel prediction module is introduced to merge the complementary information. It employs an encoder-decoder architecture with a well-designed feature enrichment block, which exploits the multi-level information from the encoder to boost the decoder features from both spatial and frequency domain views. Extensive experiments demonstrate that the proposed network achieves the best performance compared with state-of-the-art methods while maintaining reasonably low computing complexity.

Abstract:
The problem of video demoiréing is a new challenge in video restoration. Unlike image demoiréing, which involves removing static and uniform patterns, video demoiréing requires tackling dynamic and varied moiré patterns while maintaining video details, colors, and temporal consistency. It is particularly challenging to model moiré patterns for videos with camera or object motions, where separating moiré from the original video content across frames is extremely difficult. Nonetheless, we observe that the spatial distribution of moiré patterns is often sparse on each frame, and their long-range temporal correlation is not significant. To fully leverage this phenomenon, a sparsity-constrained spatial self-attention scheme is proposed to concentrate on removing sparse moiré efficiently for each frame without being distracted by dynamic video content. The frame-wise spatial features are then correlated and aggregated via the local temporal cross-frame-attention module to produce temporal-consistent high-quality moiré-free videos. The above decoupled spatial and temporal transformers constitute the Spatio-Temporal Decomposition Network, dubbed STD-Net. For evaluation, we present a large-scale video demoiréing benchmark featuring various real-life scenes, camera motions, and object motions. We demonstrate that our proposed model can effectively and efficiently achieve superior performance on video demoiréing and single image demoiréing tasks. The proposed dataset is released at https://github.com/FZU-N/LVDM.

Affiliations: Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Center for Big Data and Population Health of the Institute of Health and Medicine, Anhui University, Hefei, China

Abstract:
Existing Transformer-based RGB-Thermal (RGBT) tracking methods either use cross-attention to fuse the two modalities, or use self-attention and cross-attention to model both modality-specific and modality-sharing information. However, the significant appearance gap between modalities limits the feature representation ability of certain modalities during the fusion process. To address this problem, we propose a novel Progressive Fusion Transformer called ProFormer, which progressively integrates single-modality information into the multimodal representation for robust RGBT tracking. In particular, ProFormer first uses a self-attention module to collaboratively extract the multimodal representation. Then, ProFormer introduces two cross-attention modules to interact it with the features of the dual modalities for enhancing modality-specific information in the multimodal representation. In addition, we propose a dynamically guided learning algorithm that adaptively employs the well-performing branches to guide the learning of other branches, to improve the representation ability of each branch. Extensive experiments demonstrate that our proposed ProFormer achieves a new state-of-the-art performance on RGBT210, RGBT234, LasHeR, and VTUAV datasets.

Abstract:
Free-view image compression has attracted attention due to the rapid development of 3D vision applications. However, as far as we know, no end-to-end learned compression model has been proposed for free-view image sequences. Most existing learned compression models are limited and only applicable to image sequences with simple horizontal and vertical translations, such as stereo and light field image compression models. In this paper, we first propose an end-to-end network, FICNet, to improve free-view image compression performance, effectively eliminating the spatial redundancy among multiple views. In our method, a differentiable depth prediction module is introduced to explore spatial correlation and achieve end-to-end training. Besides, we demonstrate a strategy of multi-view reference to alleviate the hole problem in depth-based prediction, and a filter network is designed to improve the prediction accuracy further. A residual fusion network with multi-level complementary features is also utilized to enhance the reconstruction quality. Extensive experiments show that our model can perform favorably in generating more refined predictive images and achieves up to a 16.23% BD-rate improvement compared to the state-of-the-art method 3D-HEVC.

Abstract:
Transformer-based deep learning networks are revolutionizing our society. The convolution and attention co-designed (CAC) Transformers have demonstrated superior performance compared to the conventional Transformer-based networks. However, CAC Transformer networks contain various nonlinear functions, such as softmax and complex activation functions, which require high precision hardware design yet typically with significant cost in area and power consumption. To address these challenges, SoftAct, a compact and high-precision algorithm-hardware co-designed architecture, is proposed to implement both softmax and nonlinear activation functions in CAC Transformer accelerators. An improved softmax algorithm with penalties is proposed to maintain precision in hardware. A stage-wise full zero detection method is developed to skip redundant computation in softmax. A compact and reconfigurable architecture with a symmetrically designed linear fitting module is proposed to achieve nonlinear functions. The SoftAct architecture is designed in an industrial 28-nm CMOS technology with the MobileViT-xxs network classifying the ImageNet-1k dataset as the benchmark. Compared with the state of the art, SoftAct improves up to 5.87% network accuracy under 8-bit quantization, 153.2× area efficiency, and 1435× overall efficiency.
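
As background for the softmax discussion above, here is a minimal sketch of the standard numerically stable softmax, plus an illustration of why many exponential terms vanish at low bit widths. This is not the SoftAct algorithm, whose penalty scheme and stage-wise zero detection are not specified in the abstract; `stable_softmax`, `quantized_exp_terms`, and the 8-bit rounding rule are assumptions made for illustration.

```python
import math


def stable_softmax(logits):
    """Softmax with the usual max-subtraction trick to avoid overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantized_exp_terms(logits, bits=8):
    """Round each shifted exponential to an unsigned integer of the given width.

    Hypothetical quantizer: scales exp(x - max) from [0, 1] to [0, 2**bits - 1].
    """
    peak = max(logits)
    scale = (1 << bits) - 1
    return [round(math.exp(x - peak) * scale) for x in logits]


print(stable_softmax([1000.0, 1000.0]))        # no overflow despite large inputs
print(quantized_exp_terms([0.0, -1.0, -10.0]))  # the smallest term rounds to zero
```

Terms that round to zero contribute nothing to the softmax sum at this precision, which suggests one reason a hardware design might detect and skip them.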

Affiliations: School of Electronic and Information Engineering, Beihang University, Beijing, China; School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University, Xi’an, China; CEDEO.net, Villar Dora, Italy; Department of Computer Science, University of Rochester, Rochester, NY, USA; Department of Electrical and Electronic Engineering, University of Surrey, Guildford, U.K.; Microsoft, Redmond, WA, USA; Department of Electrical and Electronic Engineering, Imperial College London, London, U.K.; Department of Software and Information Science, Iwate Prefectural University, Takizawa, Iwate, Japan

Abstract:
Our world is becoming rapidly dependent on data of increasing complexity, diversity, and volume which calls for robust and powerful tools to process such big data. Probabilistic generative models fulfill this goal by learning latent characteristic data relations, especially for the recent emergence of large-scale deep generative models that are able to create realistic content, namely, artificial intelligence-generated content (AIGC). The applications of AIGC span across various domains, and witness rich potential in multimedia content creation, including dialog generation, text-to-speech conversion, image/video generation, and cross-modal content generation.

Abstract:
In this work, we are dedicated to text-guided image generation and propose a novel framework, i.e., CLIP2GAN, by leveraging CLIP model and StyleGAN. The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN, which is realized by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is further used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised learning way. In the inference stage, since CLIP can embed both image and text into a shared feature embedding space, we replace CLIP image encoder in the training architecture with CLIP text encoder, while keeping the following mapping network as well as StyleGAN model. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding mapped text features of an attribute to a mapped CLIP image feature, we can effectively edit the attribute to the image. Extensive experiments demonstrate the superior performance of our proposed CLIP2GAN compared to previous methods.

Abstract:
One-shot talking head generation has no explicit head movement reference, so it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unrealistic talking head performance. Other works construct a one-to-one mapping between audio signals and head motion sequences, introducing ambiguous correspondences into the mapping since people can behave differently in head motions when speaking the same content. This unreasonable form of mapping fails to model the diversity and produces either nearly static or exaggerated head motions, which are unnatural and strange. Therefore, the one-shot talking head generation task is actually a one-to-many ill-posed problem, as people present diverse head motions when speaking. Based on the above observation, we propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space that contains rich and various clip-level head motion features. Each basis of the space represents a feature of meaningful head motion in a clip rather than just a frame, thus providing more coherent and natural motion changes in talking heads. The driving audio is mapped into the motion space, around which various motion features can be sampled within a reasonable range to achieve the one-to-many mapping. Besides, the landmark constraint and time window feature input improve the accuracy of expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural and realistic head motions under a reasonable one-to-many mapping paradigm compared with other methods.

Abstract:
With the rapid development of multimedia services and the dramatic growth of video data volume, efficient video representation and AI-generated content (AIGC) become critical parts of future multimedia communication systems. Sketch graph is a structured abstraction of key textures in an image, and video sketch graph further exploits the temporal continuity of videos to achieve a sparse representation. Sketch-based representation has potential applications in communication systems for both human subjective perception and machine vision tasks, and provides a new idea for AIGC. However, current video sketch extraction methods rely on human assistance and correction, and cannot be applied to end-to-end communication systems. We design a novel framework for spatiotemporal sketch extraction based on deep learning methods. In the proposed framework, sketch extraction and sparse coding are performed at the sender side using structural and temporal features of the video. The original videos are generatively reconstructed at the receiver side or applied to downstream machine vision tasks. We validate the performance of the proposed method on Cityscapes dataset with different metrics. Experiments show that our proposed framework can be end-to-end adapted to video communication tasks in different scenarios and can achieve efficient video characterization and transmission. Moreover, our proposed method enables sketch-based end-to-end AIGC for video generation.

Abstract:
This paper focuses on generating Inverse Synthetic Aperture Radar (ISAR) images from optical images, in particular for orbit space targets. ISAR images are widely applied in space target observation and classification tasks. However, because collecting ISAR samples is expensive, training deep learning-based ISAR image classifiers with insufficient samples and generating ISAR samples from emulated optical images via image translation techniques have attracted increasing attention. Image translation has achieved significant success and popularity in the computer vision, remote sensing and data generation communities. However, most existing methods extract explicit pixel-level features and do not perform effectively when translating to domains with specific implicit features, as is the case for ISAR images. We propose a meta-learning based domain prior for implicit feature modelling and apply it to CycleGAN and UNIT models to realize effective translations between the ISAR and optical domains. Two representative implicit features, the ISAR scattering distribution feature from the physical domain and the classification identifying feature from the task domain, are elaborately formulated with explicit modelling in statistical form. A meta-learning based training scheme is introduced to leverage the mutual knowledge of domain priors across different samples, and thus allows few-shot learning capacity with dramatically reduced training samples. Extensive simulations validate that the obtained ISAR images have better visible-authenticity and training-effectiveness than those of existing image translation approaches on various synthetic datasets. Source codes are available at https://github.com/XYLGroup/MLDP.

Abstract:
In the field of intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays a vital role in distinguishing intricate subcategories within broader categories. However, this task is inherently challenging due to the complex granularity of category subdivisions and the limited availability of data for each category. To address these challenges, this work proposes CSDNet, a pioneering framework that effectively explores contrastive learning and self-distillation to learn discriminative representations specifically designed for Ultra-FGVC tasks. CSDNet comprises three main modules: Subcategory-Specific Discrepancy Parsing (SSDP), Dynamic Discrepancy Learning (DDL), and Subcategory-Specific Discrepancy Transfer (SSDT), which collectively enhance the generalization of deep models across instance, feature, and logit prediction levels. To increase the diversity of training samples, the SSDP module introduces adaptive augmented samples to spotlight subcategory-specific discrepancies. Simultaneously, the proposed DDL module stores historical intermediate features by a dynamic memory queue, which optimizes the feature learning space through iterative contrastive learning. Furthermore, the SSDT module effectively distills subcategory-specific discrepancies knowledge from the inherent structure of limited training data using a self-distillation paradigm at the logit prediction level. Experimental results demonstrate that CSDNet outperforms current state-of-the-art Ultra-FGVC methods, emphasizing its powerful efficacy and adaptability in addressing Ultra-FGVC tasks.

Abstract:
Learning with noisy labels has become more and more popular because of the expensive costs of collecting high-quality labels. To avoid the decrease in model performance caused by incorrect annotations, some existing methods try to select reliable samples based on the local structure of nearest neighbors in the feature space. However, the information from local neighbors is unreliable when encountering extremely noisy cases, and selecting samples only using the feature space may result in clear noise accumulation. To this end, we propose a Dual-Space Collaborative Learning (DSCL) framework to boost classification accuracy by jointly using the complementary information from both semantic and feature spaces. Specifically, a collaborative selection module is designed by constructing a set of global prototypes and high-confidence semantic predictions, which enhances the robustness of the sample selection process. Moreover, a collaborative regularization module is constructed by the bidirectional adjustment between the semantic and feature spaces, which effectively alleviates the noise accumulation issue caused by sample selection bias in a single space. By simultaneously utilizing the two modules, our method improves the accuracy of sample selection and mitigates the degradation caused by noisy labels. Extensive experimental results indicate the superior performance of DSCL compared with various baselines. The source codes of this paper are available at https://github.com/DarrenZZhang/DSCL

Abstract:
Empowered by the sophisticated long-range dependency modeling ability of Transformer, tracking performance has seen a dynamic increase in recent years. Approaches in this vein leverage the Transformer feature to integrate the information of target and search regions while neglecting the superior local representation extracted by their CNN backbone. To address this, we introduce a BIdirectional inTeraction mechanism between CNN and Transformer features for visual tracking, termed BIT-Tracker, which admits a comprehensive fusion of local and global representations, and thus boosts tracking performance. The first ingredient of BIT-Tracker is an aggregation of multi-level Transformer features to achieve a better global modeling ability. In order to combine the merits of both local and global representations, our second ingredient performs a bi-directional interaction between CNN and Transformer features, where the interaction is achieved via either querying the CNN feature from the Transformer feature or querying the Transformer feature from the CNN feature. Afterwards, the outputs from both directions are fused to predict the temporal locations of targets. Extensive experiments demonstrate the effectiveness of the proposed feature aggregation and bi-directional interaction modules. Impressively, BIT-Tracker achieves leading performance on eight tracking benchmarks and outperforms SOTA results by salient margins. Code will be made available.

Abstract:
Multimodal video sentiment analysis aims to integrate multiple modal information to analyze the opinions and attitudes of speakers. Most previous work focuses on exploring the semantic interactions of intra- and inter-modality. However, these works ignore the reliability of multimodality, i.e., modalities tend to contain noise, semantic ambiguity, missing modalities, etc. In addition, previous multimodal approaches treat different modalities equally, largely ignoring their different contributions. Furthermore, existing multimodal sentiment analysis methods directly regress sentiment scores without considering ordinal relationships within sentiment categories, with limited performance. To address the aforementioned problems, we propose a trustworthy multimodal sentiment ordinal network (TMSON) to improve performance in sentiment analysis. Specifically, we first devise a unimodal feature extractor for each modality to obtain modality-specific features. Then, an uncertainty distribution estimation network is customized, which estimates the unimodal uncertainty distributions. Next, Bayesian fusion is performed on the learned unimodal distributions to obtain multimodal distributions for sentiment prediction. Finally, an ordinal-aware sentiment space is constructed, where ordinal regression is used to constrain the multimodal distributions. Our proposed TMSON outperforms baselines on multimodal sentiment analysis tasks, and empirical results demonstrate that TMSON is capable of reducing uncertainty to obtain more robust predictions.
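
The Bayesian fusion step can be illustrated with a minimal sketch under a strong simplifying assumption: each unimodal distribution is a one-dimensional Gaussian over the sentiment score, and fusion is the normalized product of those Gaussians. The abstract does not state TMSON's actual distribution family or fusion rule, so `fuse_gaussians` and the example numbers are hypothetical.

```python
def fuse_gaussians(means, variances):
    """Fuse 1-D Gaussians by multiplying their densities.

    The product of Gaussians is again Gaussian: precisions (1 / variance) add,
    and the fused mean is the precision-weighted average of the input means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return fused_mean, 1.0 / total_precision


# Hypothetical example: a confident text prediction and an uncertain audio one.
mean, var = fuse_gaussians(means=[0.0, 4.0], variances=[1.0, 3.0])
print(mean, var)
```

Under this assumption, a modality with high uncertainty (large variance) contributes less to the fused mean, and the fused variance is smaller than that of any single modality.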

Abstract:
Batch steganography, which involves image selection and payload allocation, has gained increasing attention due to the security demands of data hiding in real-world scenarios. However, due to the predefined selection mechanism, the chosen images are always complex, which means that the diversity of the selected cover set is limited. In this paper, we develop a diverse and secure batch steganography scheme including model-based generation and double-layered payload assignment. To construct the diverse image set, we use the Kullback-Leibler (KL) divergence to quantify the diversity increment and, relying on steganographic distortion, we select multiple image subsets (classes) to create the diverse cover set, in which each subset is modeled as a normal distribution with proper model parameters. Depending on the distortion of each image subset, we assign the payload to all subsets with between-class allocation. Moreover, for the assigned payload of each subset, we introduce a linear model to achieve the within-class allocation. Finally, we obtain a diverse cover set along with a suitable payload. Extensive experiments demonstrate the practicality of the proposed method in terms of diversity and, compared with other selection methods, higher security against multiple steganalytic tools.
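
Since each subset is modeled as a normal distribution, the KL divergence between two subsets has a closed form. The sketch below shows that textbook formula for the univariate case; how the paper turns it into a diversity increment is not given in the abstract, and the name `kl_normal` is an assumption made for illustration.

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for univariate normals P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2)."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


# Identical distributions have zero divergence; shifting the mean increases it.
print(kl_normal(0.0, 1.0, 0.0, 1.0))
print(kl_normal(0.0, 1.0, 1.0, 1.0))
```

A candidate subset whose distribution lies far from those already selected yields a larger divergence, which is the intuition behind using KL as a diversity measure.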

Abstract:
This paper develops an extremely robust solution for absolute pose estimation with a known prior gravity direction by motion decoupling. Absolute pose estimation is a fundamental problem in computer vision, and recently a known prior vertical direction has commonly been applied to help solve the pose estimation problem. In this paper, we explore the geometrical constraints of absolute pose estimation with a known direction. We find that the rigid pose can be decoupled with the help of the known direction. Thereby, absolute pose estimation algorithms, which decouple rigid motion, are proposed. Notably, in real applications, there may be imperfect inputs, i.e., outliers, due to incorrect 2D-3D matches. Unfortunately, these outliers may lead to unacceptable results. To suppress the outliers, the decoupled absolute pose estimation problem is solved by a branch-and-bound algorithm and global voting, which can provide the optimal solution with provable guarantees. Moreover, in the extreme case, the proposed method can solve the absolute pose estimation problem without knowing the 2D-3D correspondences, which is also known as simultaneous camera pose and correspondence estimation. To demonstrate the feasibility and the superiority of the proposed methods, comprehensive comparison experiments are conducted. The source code is available at https://github.com/Liu-Yinlong/algorithm-for-PnP-with-known-vertical-direction.

Abstract:
Micro-video venue recognition aims to predict the venue category where a micro-video was filmed. Different from traditional long videos, which contain rich temporal context, venue prediction for micro-videos is difficult due to their limited duration (generally within 6s). Existing works usually extract features of each modality from a global perspective for prediction, neglecting the semantics carried by local objects. To this end, we propose Multi-Modal and Multi-Granularity Object Relations (M2ORE) to address the above issues, which learns multi-granularity interactive semantics between venues and multimodal semantic objects to help understand venues. Specifically, M2ORE comprises two modules. It first extracts semantic objects of different modalities, i.e., visual objects in keyframes and keywords in texts, and models the affiliation relationship between semantic objects and venues and the co-occurrence relationship among semantic objects, forming a heterogeneous venue-object relation graph. Then, to obtain the interactive semantics between venues and objects from the relation graph, a novel Parallel-Graph Inference Model (Parallel-GIM) is proposed, which updates the representation of nodes through graph propagation and fuses multi-level features (local-global-multimodal) through the devised hierarchical attention mechanism. Finally, the probability distribution of venues can be obtained through a multi-layer perceptron with the comprehensive features of the venue nodes. Extensive experiments on a real-world micro-video dataset demonstrate the superiority of the proposed M2ORE.

Abstract:
Recent development in computing power has resulted in performance improvements on holistic (non-occluded) person Re-Identification (ReID) tasks. Nevertheless, the precision of recent methods diminishes when a pedestrian is obstructed by obstacles. Within the realm of 2D space, the loss of information from obstructed objects continues to pose significant challenges in the context of person ReID. A person is a 3D non-grid object, and thus semantic representation learning in only 2D space limits the understanding of occluded persons. In the present work, we propose a network based on 3D multi-view learning, allowing it to acquire geometric and shape details of an occluded pedestrian from 3D space. Simultaneously, it capitalizes on advancements in 2D-based networks to extract semantic representations from 3D multi-views. Specifically, the surface random selection strategy is proposed to convert 2D RGB images into 3D multi-views. Using this strategy, we build four extensive 3D multi-view data collections for person ReID. After that, Pedestrian 3D Shape Understanding for Person Re-Identification via Multi-View Learning (MV-3DSReID) is proposed for identifying the person by learning person geometry and structure representation from the groups of multi-view images. In comparison to alternative data formats (e.g., 2D RGB, 3D point cloud), multi-view images complement each other’s detailed features of the 3D object by adjusting rendering viewpoints, thus facilitating a more comprehensive understanding of the person for both holistic and occluded ReID situations. Experiments on occluded and holistic ReID tasks demonstrate performance levels comparable to state-of-the-art methods, validating the effectiveness of our proposed approach in tackling challenges related to occlusion. The code is available at https://github.com/hangjiaqi1/MV-TransReID.

Abstract:
Accurately recovering the dense 3D mesh of both hands from monocular images poses considerable challenges due to occlusions and projection ambiguity. Most of the existing methods extract features from color images to estimate the root-aligned hand meshes, which neglect the crucial depth and scale information in the real world. Given the noisy sensor measurements with limited resolution, depth-based methods predict 3D keypoints rather than a dense mesh. These limitations motivate us to take advantage of these two complementary inputs to acquire dense hand meshes on a real-world scale. In this work, we propose an end-to-end framework for recovering dense meshes for both hands, which employs single-view RGB-D image pairs as input. The primary challenge lies in effectively utilizing two different input modalities to mitigate the blurring effects in RGB images and noise in depth images. Instead of directly treating depth maps as additional channels for RGB images, we encode the depth information into the unordered point cloud to preserve more geometric details. Specifically, our framework employs ResNet50 and PointNet++ to derive features from RGB and point cloud, respectively. Additionally, we introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales, which demonstrates superior efficacy compared to previous fusion strategies. Furthermore, we employ a GCN-based decoder to process the fused features and recover the corresponding 3D pose and dense mesh. Through comprehensive ablation experiments, we have not only demonstrated the effectiveness of our proposed fusion algorithm but also outperformed the state-of-the-art approaches on publicly available datasets. To reproduce the results, we will make our source code and models publicly available at https://github.com/zijinxuxu/PDFNet.

Abstract:
Despite the considerable effort devoted to highly generalizable blind image quality assessment (BIQA), the generalization performance of the state-of-the-art metrics remains limited when facing new visual scenes. A straightforward way to address the dilemma is labeling a great number of images from the new scene and subsequently training a new model, which is quite labor-intensive and costly. Hence, there is an urgent need to mitigate the dependency on labeled samples by designing a data-efficient BIQA algorithm. Motivated by the above facts, this paper presents an Active Learning-based IQA (AL-IQA) framework, which reduces the requirement for training samples by selecting representative images from two perspectives: distortion and content. Specifically, in terms of distortion, we design distortion prompts and adopt Contrastive Language-Image Pre-Training (CLIP) to predict image distortion in a zero-shot manner. Then, we employ a curriculum learning-inspired strategy to select samples with gradually increasing difficulty (measured by the prediction uncertainty of CLIP), in order to facilitate model training. Meanwhile, in terms of content, we adopt distribution matching-based dataset distillation to distill unlabeled images into several high-density informative synthetic images. Then, feature distances between unlabeled images and distilled images are compared to identify images with the most representative content. Finally, Borda count is adopted to capture a consensus of both distortion and content through weighted counting, and prompt tuning is utilized for adapting the model to the IQA task. Extensive experiments are conducted on five IQA datasets, and the results demonstrate that the proposed AL-IQA not only effectively reduces the number of training samples but also achieves state-of-the-art prediction accuracy and generalization performance. The source code is available at https://github.com/esnthere/AL-IQA.
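
The consensus step can be sketched as a weighted Borda count over two rankings of the same candidate images, one by distortion and one by content. This is a generic illustration and not the AL-IQA implementation: the function name `weighted_borda`, the point scheme (n - 1 points for first place down to 0 for last), the weights, and the image identifiers are all assumptions.

```python
def weighted_borda(rankings, weights):
    """Combine several rankings (best first) into one consensus ordering.

    Each ranking awards n - 1 points to its top item down to 0 for its last,
    scaled by that ranking's weight. Ties are broken by item name.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + weight * (n - 1 - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of three unlabeled images.
by_distortion = ["img_a", "img_b", "img_c"]
by_content = ["img_b", "img_c", "img_a"]
print(weighted_borda([by_distortion, by_content], weights=[1.0, 1.0]))
```

With equal weights, `img_b` ranks first because it places well in both lists, whereas `img_a` leads only one of them; changing the weights shifts the balance between the two criteria.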

Abstract:
Few-shot object detection (FSOD) aims to detect novel targets with only a few instances of the associated samples. Although combinations of distillation techniques and meta-learning paradigms have been acknowledged as the primary strategies for FSOD tasks, the existing distillation methods exhibit inherent biases and sensitivity to novel class variability. A critical hurdle for FSOD distillation is the difficulty in ensuring appropriate knowledge learned from the teacher model during the fine-tuning stage. Furthermore, coarse distillation procedures risk misalignment between the learned and actual distributions. This misalignment could potentially negate the benefits of positive cases and impede the detector’s evolution. To address these deficiencies, we propose a novel self-distillation paradigm exclusively for the fine-tuning stage (SD-FSOD). Our methods integrate a Distribution Prototype Extractor (DPE) and Self-Distillation Memory (SDM), promoting feature distribution consistency during distillation. In detail, the DPE module reliably initializes the weights of the detector, ensuring a robust class distribution for the distillation process. Meanwhile, the SDM module utilizes decoupling techniques to divide the distillation tasks into two sub-task branches, allowing the student model to independently learn and share precise features through isolated distillation processes. The synergistic integration of feature calibration techniques and the continuous self-distillation paradigm distinctly enhances the fine-tuning process, which shows the superiority of the FSOD self-distillation methodologies. The extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our proposed approach produces significant improvements and achieves state-of-the-art (SOTA) performance.

Abstract:
This paper presents a fresh paradigm for protecting facial privacy via an invertible image obfuscation framework that incorporates multiple characteristics including anonymity, diversity, reversibility, security, and lightweight all at once. We name the framework PRO-Face S, an acronym for Privacy-preserving Reversible Obfuscation of Face images via Secure flow. The core of the proposed framework is a flow-based generative model (or invertible neural network), which takes as input a face image along with its pre-obfuscated form, and outputs the privacy-protected image that visually mirrors the pre-obfuscated one. The pre-obfuscation applied can be in various forms with different types and strengths. The invertibility of the flow-based model ensures that the original image can be easily recovered from the protected image in high fidelity. An elaborate secret key mechanism is devised to securely guide the mutual transformations of privacy protection and image recovery, such that the correct recovery is only possible upon the availability of the correct secret, pre-specified by the user in the protection stage. Two modes of wrong recovery are investigated to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on multiple image datasets demonstrate the superiority of the proposed framework over state-of-the-art methods.

Abstract:
Conditional generative adversarial networks (cGANs) aim to synthesize diverse images given the input conditions and the latent codes, but they are prone to map an input to a single output regardless of the variations in latent code, which is also well known as the mode collapse problem of cGANs. To alleviate the problem, in this paper, we investigate explicitly enhancing the statistical dependency between the latent code and the synthesized image in cGANs by utilizing mutual information neural estimators to estimate and maximize the conditional mutual information (CMI) between them given the input condition. The method provides a new perspective from information theory to improve diversity for cGANs and can facilitate many existing conditional image synthesis frameworks with a simple neural estimator extension. Moreover, our studies show that several key designs, including the neural estimator choice, the neural estimator’s network design, and the sampling strategy, are crucial to the success of the method. Extensive experiments on four popular conditional image synthesis tasks, including class-conditioned image generation, paired and unpaired image-to-image translation, and text-to-image generation, demonstrate the effectiveness and superiority of the proposed method.
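
One common neural estimator of mutual information is based on the Donsker-Varadhan lower bound, I(X; Y) ≥ E_joint[T] − log E_marginal[exp(T)], where T is a learned critic. The sketch below evaluates that bound from critic scores only; it is an illustrative assumption and not necessarily the estimator adopted in the paper. For the conditional case, the "marginal" scores would come from pairs in which the latent code is resampled while the input condition is kept fixed.

```python
import numpy as np

def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: mean(T_joint) - log(mean(exp(T_marginal)))."""
    joint_scores = np.asarray(joint_scores, dtype=float)
    marginal_scores = np.asarray(marginal_scores, dtype=float)
    # Log-mean-exp computed stably by factoring out the maximum score.
    m = marginal_scores.max()
    log_mean_exp = m + np.log(np.mean(np.exp(marginal_scores - m)))
    return float(joint_scores.mean() - log_mean_exp)

# Hypothetical critic outputs: high on true pairs, low on shuffled pairs.
bound_dependent = dv_lower_bound([1.0, 1.0], [0.0, 0.0])
# A critic that cannot tell the two apart yields a bound of zero.
bound_independent = dv_lower_bound([0.3, 0.3], [0.3, 0.3])
```

Maximizing this bound with respect to both the critic and the generator encourages the synthesized image to depend on the latent code.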

Abstract:
The goal of image rescaling is to embed the information from high-resolution images into low-resolution images and then reconstruct the high-resolution images in reverse. Existing methods either focus on small scaling factors or do not generalize well to natural images with diverse content in extreme settings, i.e., using extreme scaling factors (e.g., 16× and 32×). When performing extreme rescaling, previous methods often fail to produce plausible high-quality results due to insufficient cues in low-resolution images. In this work, we propose an extreme natural image rescaling framework that exploits the rich generative prior integrated into a GAN model trained on large-scale natural images to reduce the ambiguity of extreme upscaling. Considering the invertible bijective transformation between quantized features and the low-resolution image, we develop an invertible feature recovery module that generates a semantically sound low-resolution image while maximizing the preservation of useful features for the subsequent upscaling. Furthermore, we propose a multi-scale refinement module that explicitly introduces supervised ground-truth information to mitigate unpleasant artifacts and distortions. Extensive experiments show that the proposed rescaling framework formulated by the above components achieves significantly better visual performance than state-of-the-art methods.

Abstract:
Depth completion is a long-standing challenge in computer vision, where classification-based methods have made tremendous progress in recent years. However, most existing classification-based methods rely on pre-defined pixel-shared and discrete depth values as depth categories. This representation fails to capture the continuous depth values that conform to the real depth distribution, leading to depth smearing in boundary regions. To address this issue, we revisit depth completion from the clustering perspective and propose a novel clustering-based framework called CluDe which focuses on learning the pixel-wise and continuous depth representation. The key idea of CluDe is to iteratively update the pixel-shared and discrete depth representation to its corresponding pixel-wise and continuous counterpart, driven by the real depth distribution. Specifically, CluDe first utilizes depth value clustering to learn a set of depth centers as the depth representation. While these depth centers are pixel-shared and discrete, they are more in line with the real depth distribution compared to pre-defined depth categories. Then, CluDe estimates offsets for these depth centers, enabling their dynamic adjustment along the depth axis of the depth distribution to generate the pixel-wise and continuous depth representation. Extensive experiments demonstrate that CluDe successfully reduces depth smearing around object boundaries by utilizing pixel-wise and continuous depth representation. Furthermore, CluDe achieves state-of-the-art performance on the VOID datasets and outperforms classification-based methods on the KITTI dataset.
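
The two-stage idea (shared depth centers obtained by clustering, then per-pixel offsets) can be sketched as follows. This is a simplified illustration and not the CluDe implementation: plain 1-D k-means stands in for the depth value clustering, and the probability-weighted sum that turns shifted centers into a depth value is an assumption made for this example.

```python
import numpy as np

def cluster_depth_centers(depth_values, k, iters=20):
    """1-D k-means (Lloyd's algorithm) over observed depth values."""
    # Deterministic initialization at evenly spaced quantiles.
    centers = np.quantile(depth_values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.argmin(np.abs(depth_values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = depth_values[assign == j]
            if members.size:  # keep the old center if a cluster is empty
                centers[j] = members.mean()
    return np.sort(centers)

def pixelwise_depth(centers, probs, offsets):
    """Per-pixel depth from shared centers shifted by per-pixel offsets.

    centers: (K,) pixel-shared centers; probs and offsets: (H, W, K).
    """
    return np.sum(probs * (centers + offsets), axis=-1)

# Toy data: sparse depths drawn from a near and a far surface.
sparse_depths = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
centers = cluster_depth_centers(sparse_depths, k=2)
# One pixel with hypothetical network outputs (probabilities and offsets).
depth_map = pixelwise_depth(centers, np.array([[[0.5, 0.5]]]), np.array([[[0.2, -0.2]]]))
```

The centers alone are discrete and shared by all pixels; the offsets are what make the final representation continuous and pixel-specific.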

Abstract:
The estimation of depth from 4D light field images is a fundamental problem for perceiving and reconstructing environmental scenes. While learning-based methods have achieved remarkable results in this field, most of them rely on supervised learning, which faces significant challenges in real-world applications due to the lack of sufficient available ground truth depth maps. In this paper, we propose an unsupervised learning architecture based on a generative adversarial learning model for light field image depth estimation (OALFGAN). Specifically, our approach involves a multi-scale deep convolutional generative adversarial network learning system that includes a sparse-to-dense cascaded multi-scale generator and a discriminator, which decomposes the problem of generating high-quality images into more manageable sub-problems. To address the issue of violations of photometric consistency that may be caused by occlusion, we introduce a spatial-angular attention module that adaptively extracts view features with fewer occlusions and richer textures to generate more accurate disparity maps. Furthermore, we design a loss function that incorporates adaptive angular entropy consistency, symmetry loss, and edge-aware loss based on the distribution regularity and self-constraint of light field images to further optimize occlusion and disparity discontinuity issues and improve the reliability of the final depth prediction. Our proposed method demonstrates superior performance over existing methods on synthetic datasets, both quantitatively and qualitatively. Moreover, our proposed method exhibits excellent generalization performance on real-world datasets, demonstrating the effectiveness of our approach.

Abstract:
Visual content is increasingly being processed by machines for various automated content analysis tasks instead of being consumed by humans. Despite the existence of several compression methods tailored for machine tasks, few consider real-world scenarios with multiple tasks. In this paper, we aim to address this gap by proposing a task-switchable pre-processor that optimizes input images specifically for machine consumption prior to encoding by an off-the-shelf codec designed for human consumption. The proposed task-switchable pre-processor adeptly maintains relevant semantic information based on the specific characteristics of different downstream tasks, while effectively suppressing irrelevant information to reduce bitrate. To enhance the processing of semantic information for diverse tasks, we leverage pre-extracted semantic features to modulate the pixel-to-pixel mapping within the pre-processor. By switching between different modulations, multiple tasks can be seamlessly incorporated into the system. Extensive experiments demonstrate the practicality and simplicity of our approach. It significantly reduces the number of parameters required for handling multiple tasks while still delivering impressive performance. Our method showcases the potential to achieve efficient and effective compression for machine vision tasks, supporting the evolving demands of real-world applications.

Abstract:
We present PU-Mask, a virtual mask-based network for 3D point cloud upsampling. Unlike existing upsampling methods, which treat point cloud upsampling as an “unconstrained generative” problem, we propose to address it from the perspective of “local filling”, i.e., we assume that the sparse input point cloud (i.e., the unmasked point set) is obtained by locally masking the original dense point cloud with virtual masks. Therefore, given the unmasked point set and virtual masks, our goal is to fill the point set hidden by the virtual masks. Specifically, because the masks do not actually exist, we first locate and form each virtual mask by a virtual mask generation module. Then, we propose a mask-guided transformer-style asymmetric auto-encoder (MTAA) to restore the upsampled features. Moreover, we introduce a second-order unfolding attention mechanism to enhance the interaction between the feature channels of MTAA. Next, we generate a coarse upsampled point cloud using a pooling technique that is specific to the virtual masks. Finally, we design a learnable pseudo Laplacian operator to calibrate the coarse upsampled point cloud and generate a refined upsampled point cloud. Extensive experiments demonstrate that PU-Mask is superior to the state-of-the-art methods. Our code will be made available at: https://github.com/liuhaoyun/PU-Mask.

Abstract:
Existing end-to-end depth representation in embodied AI is often task-specific and lacks the benefits of the emerging pre-training paradigm due to limited datasets and training techniques for RGB-D videos. To address the challenge of obtaining a robust and generalized depth representation for embodied AI, we introduce a unified RGB-D video dataset (UniRGBD) and a novel time-aware contrastive (TAC) pre-training approach. UniRGBD addresses the scarcity of large-scale depth pre-training datasets by providing a comprehensive collection of data from diverse sources in a unified format, enabling convenient data loading and accommodating various data domains. We also design an RGB-Depth alignment evaluation procedure and introduce a novel Near-K accuracy metric to assess the scene understanding capability of the depth encoder. The TAC pre-training approach then fills the gap in depth pre-training methods suitable for RGB-D videos by leveraging the intrinsic similarity between temporally proximate frames. TAC incorporates a soft label design that acts as valid label noise, enhancing depth semantic extraction and promoting diverse and generalized knowledge acquisition. Furthermore, the adjustments in perspective between temporally proximate frames facilitate the extraction of invariant and comprehensive features, enhancing the robustness of the learned depth representation. Additionally, the inclusion of temporal information stabilizes training gradients and enables spatio-temporal depth perception. Comprehensive evaluation of RGB-Depth alignment demonstrates the superiority of our approach over state-of-the-art methods. We also conduct uncertainty analysis and a novel zero-shot experiment to validate the robustness and generalization of the TAC approach. Moreover, our TAC pre-training demonstrates significant performance improvements in various embodied AI tasks, providing compelling evidence of its efficacy across diverse domains.
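
To make the idea of time-aware soft labels concrete, the sketch below builds contrastive targets that decay with the temporal distance between frames and scores them against feature similarities. The abstract does not specify the actual TAC label design, so the exponential decay, the parameters `tau` and `temperature`, and both function names are assumptions for illustration only.

```python
import numpy as np

def temporal_soft_labels(num_frames, tau=1.0):
    """Row i is a distribution over frames j that decays with |i - j|."""
    idx = np.arange(num_frames)
    weights = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    return weights / weights.sum(axis=1, keepdims=True)

def soft_contrastive_loss(features, targets, temperature=0.1):
    """Cross-entropy between soft targets and a softmax over cosine similarities."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T / temperature
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

# Hypothetical clip of four frames with 8-dimensional depth features.
soft_targets = temporal_soft_labels(4, tau=1.0)
frame_features = np.random.default_rng(0).normal(size=(4, 8))
loss_value = soft_contrastive_loss(frame_features, soft_targets)
```

Compared with one-hot targets, neighboring frames receive non-zero weight, so the encoder is not penalized for treating temporally close frames as similar.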

Abstract:
Unsupervised domain adaptation aims to leverage labeled data from a source domain to learn a classifier for an unlabeled target domain. Amongst its many variants, open set domain adaptation (OSDA) is perhaps the most challenging one, as it further assumes the presence of unknown classes in the target domain. In this paper, we study OSDA with a particular focus on enriching its ability to traverse across larger domain gaps, and we show that existing state-of-the-art methods suffer a considerable performance drop in the presence of larger domain gaps, especially on a new dataset (PACS) that we re-purposed for OSDA. Exploring this is pivotal for OSDA as with increasing domain shift, identifying unknown samples in the target domain becomes harder for the model, thus making negative transfer between source and target domains more challenging. Accordingly, we propose a Mutual-to-Separate (MTS) framework to address the larger domain gaps. Essentially we design two networks – (a) Sample Separation Network (SSN): which is trained to learn a hyperplane for separating unknown samples from known ones, and (b) Distribution Matching Network (DMN): which is trained to maximise domain confusion between source and target domains without unknown samples under the guidance of the SSN. The key insight lies in how we exploit the mutually beneficial information between these two networks. On closer observation, we see that SSN can reveal which samples in the target domain belong to the unknown class by instance weighting whereas, DMN pushes apart the samples that most likely belong to the unknown class in the target domain, which in turn reduces the difficulty of SSN in identifying unknown samples. It follows that (a) and (b) will mutually supervise each other and alternate until convergence, which can better align the source and target domains in the shared label space. Extensive experiments on five datasets (Office-31, Office-Home, PACS, VisDA, and mini-DomainNet) demonstrate the efficiency of the proposed method. Detailed ablation experiments also validate the effectiveness of each component and the generality of the proposed framework. Codes are available at: https://github.com/PRIS-CV/Mutual-to-Separate.

Abstract:
Dynamic multi-person mesh recovery has broad applications in sports broadcasting, virtual reality, and video games. However, current multi-view frameworks rely on a time-consuming camera calibration procedure. In this work, we focus on multi-person motion capture with uncalibrated cameras, which mainly faces two challenges: one is that inter-person interactions and occlusions introduce inherent ambiguities for both camera calibration and motion capture; the other is that a dynamic multi-person scene lacks dense correspondences that can be used to constrain sparse camera geometries. Our key idea is to incorporate motion prior knowledge to simultaneously estimate camera parameters and human meshes from noisy human semantics. We first utilize human information from 2D images to initialize intrinsic and extrinsic parameters. Thus, the approach does not rely on any other calibration tools or background features. Then, a pose-geometry consistency is introduced to associate the detected humans from different views. Finally, a latent motion prior is proposed to refine the camera parameters and human motions. Experimental results show that accurate camera parameters and human motions can be obtained through a one-step reconstruction. The code is publicly available at https://github.com/boycehbz/DMMR.

Abstract:
Accurate segmentation of 3D dental models derived from intra-oral scanners (IOS) is one of the key steps in many digital dental applications such as orthodontics and implants. However, it is difficult to accurately segment individual teeth and gums in 3D dental models due to the following problems: 1) the shapes and appearances of adjacent teeth are very similar, so they are easily misidentified; 2) the boundary between teeth and gums is often indistinct, especially in orthodontic patients with abnormalities such as missing and crowded teeth. To solve these problems, a Dual-Branch Geometric Attention Network (DBGANet) for 3D tooth segmentation is proposed, which can capture tooth geometric structure and detailed boundary information from multi-view geometric features encoded by 3D coordinates and normal vectors. The framework contains two branches, i.e., the C-branch and the N-branch. First, centroid-guided separable attention is designed in the C-branch to learn global context information by modeling the spatial dependencies of tooth point clouds, which can capture the overall geometric structure of teeth to better distinguish adjacent teeth with similar appearance. Then, Gaussian neighbor attention is designed in the N-branch to encode normal vectors to highlight detailed differences between geometric features at different points, which helps to refine the boundaries of teeth and gingiva for more accurate and smooth tooth segmentation. Extensive experiments on real-patient datasets of 3D dental models demonstrate that the proposed DBGANet significantly outperforms state-of-the-art methods.

Abstract:
Occlusion perturbation presents a significant challenge in person re-identification (re-ID), and existing methods that rely on external visual cues require additional computational resources and only consider the issue of missing information caused by occlusion. In this paper, we propose a simple yet effective framework, termed Erasing, Transforming, and Noising Defense Network (ETNDNet), which treats occlusion as a noise disturbance and solves occluded person re-ID from the perspective of adversarial defense. In the proposed ETNDNet, we introduce three strategies: Firstly, we randomly erase the feature map to create an adversarial representation with incomplete information, enabling adversarial learning of identity loss to protect the re-ID system from the disturbance of missing information. Secondly, we introduce random transformations to simulate the position misalignment caused by occlusion, training the extractor and classifier adversarially to learn robust representations immune to misaligned information. Thirdly, we perturb the feature map with random values to address noisy information introduced by obstacles and non-target pedestrians, and employ adversarial gaming in the re-ID system to enhance its resistance to occlusion noise. Without bells and whistles, ETNDNet has three key highlights: (i) it does not require any external modules with parameters, (ii) it effectively handles various issues caused by occlusion from obstacles and non-target pedestrians, and (iii) it designs the first GAN-based adversarial defense paradigm for occluded person re-ID. Extensive experiments on six public datasets fully demonstrate the effectiveness, superiority, and practicality of the proposed ETNDNet. The code will be released at https://github.com/nengdong96/ETNDNet.
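
The three perturbations named above (erasing, transforming, noising) can be sketched on a feature map as follows. This covers only how the perturbed representations might be produced, not the adversarial training itself, and it is not taken from the ETNDNet code: the patch size, the use of cyclic shifts as the "transformation", the uniform noise, and all function names are assumptions for this example.

```python
import numpy as np

def random_patch(height, width, size, rng):
    """Pick a random size x size window inside a height x width map."""
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    return slice(top, top + size), slice(left, left + size)

def erase(feat, rng, size=2):
    """Zero a random spatial patch of a (C, H, W) map: missing information."""
    out = feat.copy()
    rows, cols = random_patch(feat.shape[1], feat.shape[2], size, rng)
    out[:, rows, cols] = 0.0
    return out

def transform(feat, rng, max_shift=2):
    """Cyclically shift the map along H and W: position misalignment."""
    dy, dx = (int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
    return np.roll(feat, shift=(dy, dx), axis=(1, 2))

def add_noise(feat, rng, size=2):
    """Overwrite a random patch with uniform random values: occlusion noise."""
    out = feat.copy()
    rows, cols = random_patch(feat.shape[1], feat.shape[2], size, rng)
    out[:, rows, cols] = rng.random((feat.shape[0], size, size))
    return out

rng = np.random.default_rng(0)
ones_map = np.ones((3, 4, 4))
erased_map = erase(ones_map, rng)
ramp_map = np.arange(48, dtype=float).reshape(3, 4, 4)
shifted_map = transform(ramp_map, rng)
noisy_map = add_noise(ones_map, rng)
```

Because all three operate directly on features, none of them needs an external module with learnable parameters, which matches highlight (i) in the abstract.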

Abstract:
Stable imaging in adverse environments (e.g., total darkness) makes thermal infrared (TIR) cameras a prevalent option for night scene perception. However, the low contrast and lack of chromaticity of TIR images are detrimental to human interpretation and subsequent deployment of RGB-based vision algorithms. Therefore, it makes sense to colorize the nighttime TIR images by translating them into the corresponding daytime color images (NTIR2DC). Despite the impressive progress made in the NTIR2DC task, how to improve the translation performance of small object classes is under-explored. To address this problem, we propose a generative adversarial network incorporating feedback-based object appearance learning (FoalGAN). Specifically, an occlusion-aware mixup module and corresponding appearance consistency loss are proposed to reduce the context dependence of object translation. As a representative example of small objects in nighttime street scenes, we illustrate how to enhance the realism of traffic light by designing a traffic light appearance loss. To further improve the appearance learning of small objects, we devise a dual feedback learning strategy to selectively adjust the learning frequency of different samples. In addition, we provide pixel-level annotation for a subset of the Brno dataset, which can facilitate the research of NTIR image understanding under multiple weather conditions. Extensive experiments illustrate that the proposed FoalGAN is not only effective for appearance learning of small objects, but also outperforms other image translation methods in terms of semantic preservation and edge consistency for the NTIR2DC task. Compared with the state-of-the-art NTIR2DC approach, FoalGAN achieves at least 5.4% improvement in semantic consistency and at least 2% lead in edge consistency.

Abstract:
Network pruning has been widely studied to reduce the complexity of deep neural networks (DNNs) and hence speed up their inference. Unfortunately, most existing pruning methods ignore the changes in the model’s robustness before and after pruning, which makes pruned models vulnerable in dynamically perturbed environments (e.g., autonomous driving). Only a few works have explored the robustness of pruned models, and they consider adversarial attacks that differ significantly from perturbations in real-world scenarios. To bridge the gap between real-world applications and existing studies, in this work, we propose an adversarial pruning scheme, which automatically identifies and preserves robust channels to obtain robust pruned models that are suitable for practical deployment in dynamically perturbed environments. Specifically, to simulate real-world perturbations, we first employ multi-type adversarial attack samples and adversarial perturbation samples generated by an adversarial perturbation generator to create mixed noise samples. Then, we propose a plug-and-play feature scoring module and a novel contribution difference loss to evaluate the robustness of intermediate features dynamically. Next, to leverage robust intermediate features to identify robust channels, we develop a simple but effective gating mechanism that evaluates the robustness of channels and preserves robust channels during training. Lastly, we compress the model in a layer-wise or block-wise manner. Compared to existing methods, our scheme enhances the robustness of the pruned model in a broader sense, making it better able to withstand dynamic perturbations in the real world. Extensive experimental results on well-known dataset benchmarks and popular network architectures demonstrate the effectiveness of our method.

Abstract:
By hiding the front-facing camera below the display panel, Under-Display Camera (UDC) provides users with a full-screen experience. However, due to the characteristics of the display, images taken by UDC suffer from significant quality degradation. Methods have been proposed to tackle UDC image restoration and advances have been achieved. However, there are still no specialized methods and datasets for restoring UDC face images, which may be the most common problem in the UDC scene. To this end, considering color filtering, brightness attenuation, and diffraction in the imaging process of UDC, we propose a two-stage UDC Degradation Model Network, named UDC-DMNet, to synthesize UDC images by modeling the processes of UDC imaging. Then we use UDC-DMNet and high-quality face images from FFHQ and CelebA-Test to create UDC face training datasets FFHQ-P/T and testing datasets CelebA-Test-P/T for UDC face restoration. We also propose a novel dictionary-guided transformer network named DGFormer. Introducing the facial component dictionary and the characteristics of the UDC image in the restoration makes DGFormer capable of addressing blind face restoration in UDC scenarios. Experiments show that our DGFormer and UDC-DMNet achieve state-of-the-art performance.

Abstract:
Video object segmentation (VOS) has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction. However, methods based on deep neural networks are vulnerable to adversarial examples, i.e., inputs attacked by almost human-imperceptible perturbations, with which the adversary (i.e., attacker) fools the segmentation model into making incorrect pixel-level predictions. This raises security issues in highly demanding tasks, because small perturbations to the input video result in potential attack risks. Though adversarial examples have been extensively used for classification, they are rarely studied in video object segmentation. Existing related methods in computer vision either require prior knowledge of categories or cannot be directly applied due to their special design for certain tasks, failing to consider the pixel-wise region attack. Hence, this work develops an object-agnostic adversary that has adversarial impacts on VOS by first-frame attacking via hard region discovery. In particular, the gradients from the segmentation model are exploited to discover the easily confused region, in which it is difficult to distinguish the pixel-wise objects from the background in a frame. This provides a hardness map that helps to generate perturbations with stronger adversarial power for attacking the first frame. Empirical studies on three benchmarks indicate that our attacker significantly degrades the performance of several state-of-the-art video object segmentation models.

Abstract:
The rapid advancement of artificial intelligence (AI) technology has led to the prioritization of standardizing the processing, coding, and transmission of video using neural networks. To address this priority area, the Moving Picture, Audio, and Data Coding by Artificial Intelligence (MPAI) group is developing a suite of standards called MPAI-EEV for “end-to-end optimized neural video coding.” The aim of this AI-based video standard project is to compress the number of bits required to represent high-fidelity video data by utilizing data-trained neural coding technologies. This approach is not constrained by how data coding has traditionally been applied in the context of a hybrid framework. This paper presents an overview of recent and ongoing standardization efforts in this area and highlights the key technologies and design philosophy of EEV. It also provides a comparison and report on some primary efforts such as the coding efficiency of the reference model. Additionally, it discusses emerging activities such as learned Unmanned-Aerial-Vehicles (UAVs) video coding which are currently planned, under development, or in the exploration phase. With a focus on UAV video signals, this paper addresses the current status of these preliminary efforts. It also indicates development timelines, summarizes the main technical details, and provides pointers to further points of reference. The exploration experiment shows that the EEV model performs better than the state-of-the-art video coding standard H.266/VVC in terms of perceptual evaluation metric.

Abstract:
In video coding, inter prediction aims to reduce temporal redundancy by using previously encoded frames as references. The quality of reference frames is crucial to the performance of inter prediction. This paper presents a deep reference frame generation method to optimize the inter prediction in Versatile Video Coding (VVC). Specifically, reconstructed frames are sent to a well-designed frame generation network to synthesize a picture similar to the current encoding frame. The synthesized picture serves as an additional reference frame inserted into the reference picture list (RPL) to provide a more reliable reference for subsequent motion estimation (ME) and motion compensation (MC). The frame generation network employs optical flow to predict motion precisely. Moreover, an optical flow reorganization strategy is proposed to enable bi-directional and uni-directional predictions with only a single network architecture. To reasonably apply our method to VVC, we further introduce a normative modification of the temporal motion vector prediction (TMVP). Integrated into the VVC reference software VTM-15.0, the deep reference frame generation method achieves coding efficiency improvements of 5.22%, 3.61%, and 3.83% for the Y component under random access (RA), low delay B (LDB), and low delay P (LDP) configurations, respectively. The proposed method has been discussed in Joint Video Exploration Team (JVET) meeting and is currently part of Exploration Experiments (EE) for further study.

Abstract:
This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework where human head crops must first be detected and then be fed into a subsequent GL-D sub-network, which is further followed by an additional object detector for GO-D. In contrast, we reframe the gaze following detection task as detecting human head locations and their gaze followings simultaneously, aiming to jointly detect human gaze location and gaze object in a unified and single-stage pipeline. To this end, we propose GTR, short for Gaze following detection TRansformer, streamlining the gaze following detection pipeline by eliminating all additional components, leading to the first unified paradigm that unites GL-D and GO-D in a fully end-to-end manner. GTR enables an iterative interaction between holistic semantics and human head features through a hierarchical structure, inferring the relations of salient objects and human gaze from the global image context and resulting in impressive accuracy. Concretely, GTR achieves a 12.1 mAP gain (25.1%) on GazeFollowing and an 18.2 mAP gain (43.3%) on VideoAttentionTarget for GL-D, as well as a 19 mAP improvement (45.2%) on GOO-Real for GO-D. Meanwhile, unlike existing systems that detect gaze following sequentially due to the need for a human head as input, GTR has the flexibility to comprehend any number of people’s gaze followings simultaneously, resulting in high efficiency. Specifically, GTR introduces over a 9× improvement in FPS, and the relative gap becomes more pronounced as the number of humans grows.

Abstract:
Recently, memory-based methods have exhibited remarkable performance in Video Object Segmentation (VOS) by employing non-local pixel-wise matching between the query and memory. Nevertheless, these methods suffer from two limitations: 1) non-local pixel-wise matching can result in the incorrect segmentation of background distractor objects, and 2) memory features with substantial temporal redundancy consume significant computing resources and reduce the inference speed. To address these limitations, we first propose a local attention mechanism to suppress background features, and we introduce a novel training framework based on contrastive learning to ensure the network learns reliable and robust pixel-wise correspondence between query and memory. We adaptively determine whether to update the memory based on the variation of foreground objects. Next, we propose a dynamic memory bank, which utilizes a lightweight and differentiable soft modulation gate to determine the number of memory features to remove along the temporal dimension. This allows efficient and flexible management of memory features. Our network achieves competitive results (e.g., 92.1% on DAVIS 2016 val, 87.6%/81.3% on DAVIS 2017 val/test, 87.0% on YouTube-VOS 2018 val) compared with the state-of-the-art methods while maintaining a faster inference speed of 25+ FPS. Moreover, our network demonstrates a favorable balance between performance and speed when dealing with the long-time video dataset.

Abstract:
This paper addresses an important and valuable problem, open-world object detection (OWOD) in autonomous driving scenarios, which aims to detect objects under both domain-agnostic and category-agnostic settings simultaneously. Existing OWOD algorithms mainly focus either on detecting pre-defined object categories under various conditions (domain-agnostic) or on zero-shot object detection (category-agnostic), treating the two settings separately. The knowledge gap between seen and unseen object categories poses challenges for models optimized with supervision from only the seen object categories. The domain difference across scenarios poses a further challenge in aligning observations with different appearances. To address these two challenges simultaneously, we propose Instance Dictionary Learning (IDL) for more robust and accurate OWOD performance. We first design a pre-training procedure to build up the mappings between region features and category semantic embeddings by introducing instance contrastive learning. The joint vision-semantic space is formulated through the more detailed instance-level “Dictionary”, which expresses the region-category correspondences and helps link the seen and unseen object categories. Domain discrimination is further designed to extract domain-invariant feature representations seamlessly in the subsequent training procedure. The proposed IDL can detect unseen categories from unseen domains without any bounding box annotations, with no obvious performance drop on seen categories. Comprehensive experiments show that our method achieves new state-of-the-art OWOD performance over previous algorithms.

Abstract:
Vehicle Re-Identification (ReID) aims to find images of the same vehicle from different videos. It remains a challenging task in the video analysis field due to the huge appearance discrepancy of the same vehicle in cross-view matching and the subtle differences between similar vehicles in same-view matching. In this paper, we propose a Co-occurrence Attention Net (CAN) to deal with these two challenges. Specifically, CAN consists of two branches, a main branch and an aware branch. The main branch is in charge of extracting global features that are consistent in most views. These features encode holistic information such as color and pose; however, they cannot handle cross/same-view hard cases, as shown in Fig. 1. Therefore, the aware branch is designed to focus on the local details and viewpoint information, which can become an important complement for those hard cases. Considering that the positions of local areas such as wheels and logos change with the viewpoint, an Aware Attention Module is introduced to find the hidden relationship among local areas and seamlessly combine the viewpoint information. Then, CAN is trained by a partition-and-reunion-based loss, which can narrow the intra-class distance and increase the inter-class distance. Further, an adaptive co-occurrence view emphasis strategy is adopted to fully utilize the learned features. Experimental results on three widely used datasets, VeRi-776, VehicleID and VERI-Wild, demonstrate the effectiveness of our method and its competitive performance with other state-of-the-art methods.

Abstract:
Occlusions and complex backgrounds are common factors that hinder many computer vision applications. In a street scene, the challenge of accurately predicting pedestrian trajectories comes from the complexity of human behavior and the diversity of the external environment. It is difficult, if not impossible, to extract relevant information to accurately predict pedestrian trajectories in dynamic scenes. Synthetic aperture imaging (SAI) uses an array of cameras to mimic a camera with a large virtual convex lens by projecting images of a scene from different views onto a virtual focal plane. It is commonly used to reconstruct occluded objects, and in a street scene, can provide observation of pedestrians occluded by other objects and pedestrians. In this paper, we propose a joint prediction method based on autofocusing of SAI to predict pedestrian trajectories in dynamic scenes. The main contributions of this paper include: 1) The task of pedestrian trajectory prediction in dynamic scenarios is redefined as pedestrian trajectory prediction and SAI autofocusing from a practical but more challenging perspective. 2) The proposed method is based on an existing SAI-based method to extract information in heavily occluded views, which can obtain more accurate results but with less computational cost and without using other sensors such as LiDAR or depth cameras. 3) A new pedestrian trajectory prediction model, an attention-based trajectory prediction variational autoencoder (ATP-VAE), is proposed to extract complex human behavior and social interactions in dynamic scenes through a new Intention Attention Unit. The experimental results on multiple public datasets show that the proposed method achieves state-of-the-art results in the first-person perspective and in aerial view.

Abstract:
This paper presents a novel framework, named Global-Local Correspondence Framework (GLCF), for visual anomaly detection with logical constraints. Visual anomaly detection has become an active research area in various real-world applications, such as industrial anomaly detection and medical disease diagnosis. However, most existing methods focus on identifying local structural degeneration anomalies and often fail to detect high-level functional anomalies that involve logical constraints. To address this issue, we propose a two-branch approach that consists of a local branch for detecting structural anomalies and a global branch for detecting logical anomalies. To facilitate local-global feature correspondence, we introduce a novel semantic bottleneck enabled by the visual Transformer. Moreover, we develop feature estimation networks for each branch separately to detect anomalies. Our proposed framework is validated using various benchmarks, including industrial datasets, Mvtec AD, Mvtec Loco AD, the logical dataset DigitAnatomy, and the newly proposed Mvtec AAD dataset. Experimental results show that our method outperforms existing methods, particularly in detecting logical anomalies.

Abstract:
Visual tracking from the ground view and the UAV view has received increasing attention due to its wide range of practical applications. These two tasks have strong complementary benefits in the description of the target object, such as detailed appearance in the ground view and global motion information in the UAV view, and their combination has the potential to allow the tracking system to be more robust. However, no work has studied this problem in-depth, and it is challenging to accurately combine the ground view information and the UAV view information. To fill the gap and address the challenge, we propose a new computer vision task called UAV-Ground visual tracking. Considering the lack of relevant data and methods, we first propose a unified video dataset called UGVT, which includes 210 pairs of UAV and ground high-resolution video sequences with a total of more than 204K frames, which can be used as a comprehensive evaluation platform for relevant tracking methods. Secondly, based on the newly constructed dataset, we propose a co-learning method called MvCL to fuse the information of ground and UAV views. It first associates the same tracking target in the two views based on cross-attention operation and then fuses the complementary information of the two views. In particular, as a plug-and-play module based on Transformer structure, this method can be flexibly embedded into different tracking frameworks. Extensive experiments are conducted on the newly created dataset. The results demonstrate the effectiveness of the proposed method in improving the robustness of the tracking system compared with 10 state-of-the-art tracking methods and also indicate the prospect and significance of potential UAV-Ground visual tracking research. The dataset is available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code/.

Abstract:
Semantic segmentation is a significant task for remote sensing interpretation, which takes advantage of contextual semantic information to classify each pixel into a specific category. Most current methods apply convolutional neural networks (CNN) to learn feature representation from remote sensing images, which may ignore the global dependencies due to the limitation of convolutional kernels. Inspired by the global feature learning ability of Transformer, we propose a novel deep model called dual-path feature aware network (DPFANet), which combines the structure of CNN and Transformer for semantic segmentation of remote sensing images. DPFANet aims to learn effective modeling ability from local to global features of images. Simultaneously, an adaptive feature fusion network is developed to fuse features from dual-path networks. Moreover, an edge optimization block is applied to constrain the edge features, whose purpose is to obtain more representative features for segmentation. Experimental results on three public remote sensing datasets verify that our proposed network yields better segmentation performance compared to other related methods.

Abstract:
Scene graph generation aims to recognize objects and infer the relationships between them, which can provide a comprehensive understanding of image visual perception. However, the long-tailed distribution of relations remains challenging for scene graph generation. This paper proposes a novel framework that joins knowledge-driven and data-driven learning to address the long-tail issue in scene graph generation. The proposed framework consists of two modules: the relation inference module and the prior knowledge learning module. The relation inference module aims to learn the relational features of entity pairs in images and the structural features of scene graphs. The prior knowledge learning module aims to learn the triplet representation from the knowledge graph and use it as prior knowledge to provide logical guidance and constraints for relation inference. This gives relation inference a prior bias that shifts the bias towards head categories to reasonable categories, thereby mitigating the long-tail problem. Experimental results indicate that the proposed framework outperforms existing methods on the Visual Genome dataset and that the generated scene graph relations are logically reasonable.

Abstract:
Capturing sufficient global context and rich spatial structure information is critical for dense prediction tasks. Convolutional Neural Network (CNN) is particularly adept at modeling fine-grained local features, while Transformer excels at modeling global context information. It is evident that CNN and Transformer exhibit complementary characteristics. Exploring the design of a network that efficiently fuses these two models to fully leverage their strengths and achieve more accurate detection is a promising and worthwhile research topic. In this paper, we introduce a novel CNN-Transformer Iterative Fusion Network (CTIF-Net) for salient object detection. It efficiently combines CNN and Transformer to achieve superior performance by using a parallel dual encoder structure and a feature iterative fusion module. Firstly, CTIF-Net extracts features from the image using the CNN and the Transformer, respectively. Secondly, two feature convertors and a feature iterative fusion module are employed to combine and iteratively refine the two sets of features. The experimental results on multiple SOD datasets show that CTIF-Net outperforms 17 state-of-the-art methods, achieving higher performance in various mainstream evaluation metrics such as F-measure, S-measure, and MAE value. Code can be found at https://github.com/danielfaster/CTIF-Net/.

Abstract:
Deep unrolling architectures have revitalized compressive sensing (CS) by seamlessly blending deep neural networks with traditional optimization-based reconstruction algorithms. In pursuit of an efficient and deep interpretable approach, we propose LTwIST for the CS problem, a novel deep unrolling framework that draws inspiration from the well-known two-step iterative shrinkage thresholding (TwIST) algorithm. LTwIST uses a trainable sensing matrix to adaptively learn structural information in images, and introduces a customized U-block architecture to solve the proximal mapping of nonlinear transformations connected with the sparsity-inducing regularizer. Specifically, each iteration recovery step of LTwIST corresponds to an iterative update step of the traditional TwIST algorithm. Moreover, the proposed method is designed to learn all the parameters, such as shrinkage thresholds and step sizes, end-to-end without manual tuning. As a result, LTwIST obviates the need for manual parameter optimization, allows for high-quality image recovery and provides unambiguous interpretability. In addition, our proposed LTwIST is also applicable to CS-based magnetic resonance imaging and exhibits a strong reconstruction performance. Extensive experiments on several public benchmark datasets demonstrate that the proposed LTwIST outperforms existing state-of-the-art deep CS methods by considerable margins in terms of quality evaluation metrics and visual performance. Our code is available on LTwIST.
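
For readers unfamiliar with the classical algorithm that LTwIST unrolls, the following is a minimal sketch of one TwIST update, x_{t+1} = (1 - alpha) x_{t-1} + (alpha - beta) x_t + beta Gamma(x_t), with Gamma(x) = Psi(x + A^T (y - A x)). It is not the authors' network: soft-thresholding is assumed here as a stand-in for the learned U-block proximal mapping, the sensing matrix is a fixed list of rows rather than a trainable one, and all function names are hypothetical.

```python
import math


def matvec(A, x):
    """Return A @ x for a dense matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Return A^T @ r without forming the transpose explicitly."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v, lam):
    """Proximal map of lam * ||.||_1 (assumed stand-in for a learned prox)."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]


def gamma(A, y, x, lam):
    """One iterative shrinkage step: Psi(x + A^T (y - A x))."""
    residual = [yi - axi for yi, axi in zip(y, matvec(A, x))]
    grad_step = [xi + gi for xi, gi in zip(x, rmatvec(A, residual))]
    return soft_threshold(grad_step, lam)


def twist_step(A, y, x_prev, x_curr, lam, alpha, beta):
    """Two-step update combining the two previous iterates with Gamma(x_t)."""
    g = gamma(A, y, x_curr, lam)
    return [
        (1.0 - alpha) * p + (alpha - beta) * c + beta * gi
        for p, c, gi in zip(x_prev, x_curr, g)
    ]
```

Setting alpha = beta = 1 reduces `twist_step` to plain iterative shrinkage thresholding, which is a convenient sanity check. In an unrolled network, `lam`, `alpha`, and `beta` would be per-layer learnable parameters instead of hand-set constants.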

Abstract:
Few-shot semantic segmentation (FSS) aims to segment objects of unseen classes in query images with only a few annotated support images. Existing FSS algorithms typically focus on mining category representations from the single-view support to match semantic objects of the single-view query. However, with limited annotated samples, single-view matching struggles to perceive the varying characteristics of novel objects, which results in a restricted learning space for novel categories and further induces biased segmentation with degraded parsing performance. To address this challenge, inspired by semantic transform invariance, this paper proposes a new few-shot segmentation framework to break the bias and perform invariant segmentation in a multi-view matching manner. Specifically, original and transformed support features from different perspectives with the same semantics are fused in a learnable manner to obtain the transform invariance prototype with a stronger category representation ability. Simultaneously, to provide better parsing guidance, the Transform Invariance Guidance Mask Generation (TIGM) module is proposed to integrate prior knowledge from different perspectives. Finally, segmentation predictions from varying views are complementarily merged in the Transform Invariance Semantic Prediction (TISP) module to decide the uncertain area and yield precise segmentation predictions. Extensive experiments on both the PASCAL-5^i and COCO-20^i datasets demonstrate the effectiveness of our approach and show that our method achieves state-of-the-art performance. Code is available at https://github.com/caoql98/BBD.

Abstract:
Traditional end-to-end video coding is typically featured with sparsely distributed operational rate-distortion (R-D) points. This creates daunting challenges to rate control which is typically regarded as the indispensable coding optimization module. To tackle this problem, this paper proposes high efficiency rate control for end-to-end scale-adaptive video coding which enables the conversion from sparsely to densely distributed R-D points. The proposed scheme does not increase the number of models in the sparse-to-dense conversion and provides more flexibility in end-to-end video coding thereby leading to better coding performance. More specifically, R-D analyses for scale-adaptive coding are first conducted, shedding light on the design of the rate control algorithm. Subsequently, generalized R-D models are presented, based on which high efficiency rate control is achieved. Extensive experimental results provide evidence of the efficiency of the proposed method in terms of R-D performance, control accuracy and computational complexity.

Abstract:
The past few years have witnessed great success in applying deep learning to enhance the perceptual quality of compressed video. These methods usually perform frame-by-frame quality enhancement, incurring high computational complexity. Low-complexity perceptual quality enhancement is addressed in this paper, motivated by the observation of temporal correlations among video frames. We propose to decompose video content into temporal low-frequency and high-frequency components, and to focus the enhancement on the temporal low-frequency component, which may significantly reduce the computational complexity. Specifically, we employ the temporal wavelet transform (TWT) for the temporal frequency analysis, and build a TWT-based multiple-input multiple-output perceptual quality enhancement scheme. First, we use a motion estimation method on the input video to acquire the motion information, and then use TWT to obtain the temporal low- and high-frequency components. Second, we design a deep network to enhance the quality of the temporal low-frequency component. Finally, the temporal high-frequency component and the enhanced temporal low-frequency component are combined by the temporal wavelet inverse transform (TWIT) to generate the enhanced video. Experimental results show that our method achieves comparable perceptual quality to that of the state-of-the-art methods, but reduces the computational complexity to 1/13.
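
The decompose, enhance, and recombine pipeline described above can be illustrated with a small sketch. It assumes a one-level, unnormalized Haar wavelet along the time axis and omits the motion estimation step entirely; the abstract does not state which wavelet or how many levels the authors use, and `enhance` is a hypothetical placeholder for the deep network.

```python
def temporal_haar_forward(frames):
    """Split an even-length list of frames into temporal low/high bands.

    Each frame is a flat list of pixel values. Consecutive frame pairs
    (a, b) map to low = (a + b) / 2 and high = (a - b) / 2.
    """
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(p + q) / 2 for p, q in zip(a, b)])
        high.append([(p - q) / 2 for p, q in zip(a, b)])
    return low, high


def temporal_haar_inverse(low, high):
    """Rebuild the frame sequence using a = low + high, b = low - high."""
    frames = []
    for l, h in zip(low, high):
        frames.append([p + q for p, q in zip(l, h)])
        frames.append([p - q for p, q in zip(l, h)])
    return frames


def enhance_video(frames, enhance):
    """Enhance only the temporal low band, then invert the transform."""
    low, high = temporal_haar_forward(frames)
    enhanced_low = [enhance(frame) for frame in low]
    return temporal_haar_inverse(enhanced_low, high)
```

In this sketch a one-level split halves the number of frames passed to `enhance`, which is the intuition behind the complexity saving; the reported 1/13 figure comes from the authors' full scheme, not from this simplification.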

Abstract:
For moving cameras, the video content changes significantly, which leads to inaccurate prediction in traditional inter prediction and results in limited compression efficiency. To solve these problems, first, we propose a camera pose-based background modeling (CP-BM) framework that uses the camera motion and the textures of reconstructed frames to model the background of the current frame. Compared with the reconstructed frames, the predicted background frame generated by CP-BM is more geometrically similar to the current frame in position and is more strongly correlated with it at the pixel level; thus, it can serve as a higher-quality reference for inter prediction, and the compression efficiency can be improved. Second, to compensate the motion of the background pixels, we construct a pixel-level motion vector field that can accurately describe various complex motions with only a small overhead. Our method is more general than other motion models because it has more degrees of freedom, and when the degrees of freedom are decreased, it encompasses other motion models as special cases. Third, we propose an optical flow-based depth estimation (OF-DE) method to synchronize the depth information at the codec, which is used to build the motion vector field. Finally, we integrate the overall scheme into the High Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) reference software HM-16.7 and VTM-10.0. Experimental results demonstrate that in HM-16.7, for in-vehicle video sequences, our solution has an average Bjøntegaard delta bit rate (BD-rate) gain of 8.02% and reduces the encoding time by 20.9% due to the superiority of our scheme in motion estimation. Moreover, in VTM-10.0 with affine motion compensation (MC) turned off and turned on, our method has average BD-rate gains of 5.68% and 0.56%, respectively.

Abstract:
Fast stereo-based 3D object detectors have made great progress recently. However, they suffer from inferior accuracy. We argue that the main reason is the poor geometry-aware feature representation in 3D space. To solve this problem, we propose an efficient stereo geometry network (ESGN). The key to our ESGN is an efficient geometry-aware feature generation (EGFG) module. Our EGFG module first uses a stereo correlation and reprojection module to construct multi-scale stereo volumes in camera frustum space, and then employs a multi-scale bird’s eye view (BEV) projection and fusion module to generate multiple geometry-aware features. In these two steps, we adopt deep multi-scale information fusion for discriminative geometry-aware feature generation, without any complex aggregation networks. In addition, we introduce a deep geometry-aware feature distillation scheme to guide stereo feature learning with a LiDAR-based detector. The experiments are performed on the classical KITTI dataset. On the KITTI test set, our ESGN outperforms the fast state-of-the-art detector YOLOStereo3D by 5.14% on mAP3d at 62 ms. To the best of our knowledge, our ESGN achieves the best trade-off between accuracy and speed. We hope that our efficient stereo geometry network can provide more possible directions for fast 3D object detection.

Abstract:
Medical image segmentation is widely used in clinical diagnosis, and methods based on convolutional neural networks have been able to achieve high accuracy. However, it is still difficult to extract global context features, and the models have too many parameters to be applied clinically. In this regard, we propose a novel network structure to improve the traditional encoder-decoder network model, which saves parameters while maintaining segmentation accuracy. We improve the feature extraction efficiency by constructing an encoder module that can simultaneously extract local features and global continuity information. A novel attention module is designed to optimize segmentation boundary regions while improving training efficiency. The feature transfer structure of the decoding part is also improved, which fully integrates the features of different levels to restore the spatial resolution more finely. We evaluate our model on seven different medical segmentation datasets: the 2018 Data Science Bowl Challenge (DSBC2018), the 2018 Lesion Boundary Segmentation Challenge (ISIC2018), the Gland Segmentation in Colon Histology Images Challenge (GlaS), Kvasir-SEG, CVC-ClinicDB, Kvasir-Instrument and Polypgen. Extensive experimental results show that our model can achieve good segmentation performance while maintaining a small number of parameters and a low computational load, which can further facilitate the generalization of the theoretical approach to clinical practice. Our code will be released at https://github.com/caijilia/ERDUnet.

Abstract:
Remote photoplethysmography (rPPG) is a vital way of measuring heart rate (HR) to reflect human physical and mental health, which is useful for diagnosing cardiovascular and neurological diseases. Many non-contact HR estimation methods have been proposed gradually in recent years, but the majority of approaches are based on a single-modal HR information source, resulting in ineffective and unsatisfactory estimation results due to noise and insufficient information. This paper proposes a novel information-enhanced network for HR estimation based on multimodal (e.g., RGB and NIR) sources to address these problems. In the network, context and modal difference information are sequentially enhanced from spatiotemporal and modal views for accurately describing HR-aware features, while maximum frequency information is enhanced for inhibiting heartbeat noise. Specifically, a context-enhanced video Swin-Transformer (CET) module is exploited to extract useful rPPG signal features from facial visible-light and near-infrared videos. Then, a novel modal difference enhanced fusion (MDEF) module is designed to acquire a fused rPPG signal, which is taken as the input of the frequency-enhanced estimation (FEE) module to obtain the corresponding HR value. These three modules are integrated and jointly learned in an end-to-end way, and the multimodal combinations can provide highly complementary information for estimating HR value. Experimental and evaluation results on three multimodal datasets show that the proposed model achieves a superior effect compared to the state-of-the-art methods.

Abstract:
Anchor-free detection methods identify different objects by perceiving bounding box keypoints without predefined anchor boxes, and have attracted much attention due to their straightforward design and comparable performance. Currently, most anchor-free methods detect bounding box corners to regress object locations. In cluttered environments, the bounding box corners may lie in background regions, which have limited relation with the object itself. In addition, the relationships between object keypoints are always neglected, potentially affecting the perceptibility of the detector for high-precision object detection. In this paper, we propose the Keypoint Relational Regression Network (KRRNet) to detect object keypoints with semantic relations instead of bounding box corners. The relational regression head is designed to enhance the keypoint relationship exploration capability and reason about accurate object locations. Moreover, the random background sampling strategy is proposed to sample negative background points around foreground object regions and form point pairs with object keypoints. Then, KRRNet can explicitly learn discriminative feature embedding from contrastive learning to pull close the positive pairs and push apart the negative pairs, resisting the influence of surrounding complex environments. KRRNet can be trained on one Nvidia RTX 3090 GPU and achieves a single-scale test AP of 48.9% and a multi-scale test AP of 50.6% on the MS-COCO test-dev with the backbone of Hourglass-104, surpassing state-of-the-art bottom-up anchor-free detectors using the same backbone.
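
The idea of pulling positive pairs together and pushing negative pairs apart can be sketched with an InfoNCE-style loss. This is an assumed, commonly used contrastive objective and not necessarily the loss used in KRRNet, which the abstract does not specify; embeddings are assumed to be unit-normalized, and the function names and the `temperature` default are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor keypoint embedding.

    The loss is small when the anchor is more similar to its positive
    (another keypoint of the same object) than to every negative
    (e.g. a sampled background point).
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]
```

For example, an anchor aligned with its positive and orthogonal to a background embedding yields a loss near zero, while swapping the two yields a large loss, which is the gradient signal that separates object keypoints from nearby background.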

Affiliations: Department of Computer and Information Science, Faculty of Science and Technology, PAMI Research Group, University of Macau, Macau, China; School of Electrical and Electronic Engineering, Nanyang Technological University, Jurong West, Singapore; Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China; Institute of Intelligent Manufacturing, Nanjing Tech University, Nanjing, China

Abstract:
Low-rank matrix recovery is a major challenge in machine learning and computer vision, particularly for large-scale data matrices, as popular methods involving nuclear norm and singular value decomposition (SVD) are associated with high computational costs and biased estimators. To overcome this challenge, we propose a novel approach to learning low-rank matrices based on the matrix volume and a nonconvex logarithmic function. The matrix volume is the product of all the nonzero singular values of a matrix and has unique geometric properties and connections with other convex and nonconvex functions. We establish a generalized nonconvex regularization problem using the penalty function strategy and introduce an accelerated proximal alternating linearized minimization (AccPALM) algorithm with double acceleration, which combines Nesterov’s acceleration and a power strategy. The algorithm reduces computational costs and has provable convergence results under the Kurdyka-Łojasiewicz (KŁ) inequality with mild conditions. Our approach shows superior accuracy, efficiency, and convergence behavior compared to other low-rank matrix learning methods on robust matrix completion (RMC) and low-rank representation (LRR) tasks. We analyze the impact of algorithm parameters on convergence and performance and present visually appealing results to further demonstrate the effectiveness of our approach. The proposed methodology represents a promising advance in the field of low-rank matrix recovery, and its effectiveness has been validated via extensive numerical experiments. The source code for the proposed algorithms is accessible at https://github.com/ZhangHengMin/AccPALMcodes.
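
The definition of the matrix volume given in the abstract can be written out as follows. The notation (vol, sigma_i, r) is chosen here for illustration and may differ from the paper's; the logarithm identity and the full-column-rank identity are standard facts that follow directly from the definition.

```latex
\operatorname{vol}(X) = \prod_{i=1}^{r} \sigma_i(X), \qquad r = \operatorname{rank}(X),
\qquad
\log \operatorname{vol}(X) = \sum_{i=1}^{r} \log \sigma_i(X).

% If X has full column rank, all singular values are nonzero and
\operatorname{vol}(X) = \sqrt{\det\!\left(X^{\top} X\right)}.

% Worked example: X = \operatorname{diag}(3, 2, 0) has rank 2, so
\operatorname{vol}(X) = 3 \cdot 2 = 6, \qquad \|X\|_{*} = 3 + 2 = 5 .
```

The logarithm turns the product of singular values into a sum, which is one way to see the link between the volume and logarithmic nonconvex surrogates of the rank.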

Abstract:
Category-level object pose estimation aims to predict the 6D pose and 3D metric size of objects from given categories. Due to significant intra-class shape variations among different instances, existing methods have mainly focused on estimating dense correspondences between observed point clouds and their canonical representations, i.e., normalized object coordinate space (NOCS). Subsequently, a similarity transformation is applied to recover the object pose and size. Despite these efforts, current approaches still cannot fully exploit the intrinsic geometric features of individual instances, thus limiting their ability to handle objects with complex structures (e.g., cameras). To overcome this issue, this paper introduces GPT-COPE, which leverages a graph-guided point transformer to explore distinctive geometric features from the observed point cloud. Specifically, our GPT-COPE employs a Graph-Guided Attention Encoder to extract multiscale geometric features in a local-to-global manner and utilizes an Iterative Non-Parametric Decoder to aggregate the multiscale geometric features from finer scales to coarser scales without learnable parameters. After obtaining the aggregated geometric features, the object NOCS coordinates and shape are regressed through the shape prior adaptation mechanism, and the object pose and size are obtained using the Umeyama algorithm. The multiscale network design enables perceiving the overall shape and structural information of the object, which is beneficial for handling objects with complex structures. Experimental results on the NOCS-REAL and NOCS-CAMERA datasets demonstrate that our GPT-COPE achieves state-of-the-art performance and significantly outperforms existing methods. Furthermore, our GPT-COPE shows superior generalization ability compared to existing methods on the large-scale in-the-wild dataset Wild6D and achieves better performance on the REDWOOD75 dataset, which involves objects with unconstrained orientations.
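
The final step, recovering scale, rotation, and translation from point correspondences, is a least-squares similarity alignment. The sketch below solves that problem in 2D only, using complex numbers for a closed form; it is a planar illustration of the problem the Umeyama algorithm solves, not the 3D SVD-based procedure itself, and the function name is hypothetical.

```python
import cmath


def fit_similarity_2d(src, dst):
    """Least-squares fit of dst ~ c * src + t for 2D points as complex numbers.

    The complex factor c = s * exp(i * theta) encodes scale s and rotation
    theta; t is the translation. Returns (scale, angle_in_radians, t).
    """
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two corresponding points")
    mu_src = sum(src) / n
    mu_dst = sum(dst) / n
    # Center both point sets so that translation drops out of the fit.
    src_c = [p - mu_src for p in src]
    dst_c = [q - mu_dst for q in dst]
    denom = sum(abs(p) ** 2 for p in src_c)
    if denom == 0:
        raise ValueError("source points are all identical")
    # Minimizer of sum |dst_c - c * src_c|^2 over complex c.
    c = sum(p.conjugate() * q for p, q in zip(src_c, dst_c)) / denom
    t = mu_dst - c * mu_src
    return abs(c), cmath.phase(c), t
```

In the pose estimation setting, `src` would play the role of predicted canonical (NOCS) coordinates and `dst` the observed points, with the recovered scale giving the metric size. The 3D case additionally needs an SVD of the cross-covariance matrix and a sign correction to rule out reflections, which the complex-number form excludes by construction.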

Abstract:
Decoders play significant roles in recovering scene depths. However, the decoders used in previous works ignore the propagation of multilevel lossless fine-grained information, cannot adaptively capture local and global information in parallel, and cannot perform sufficient global statistical analyses on the final output disparities. In addition, the process of mapping from a low-resolution (LR) feature space to a high-resolution (HR) feature space is a one-to-many problem that may have multiple solutions. Therefore, the quality of the recovered depth map is low. To this end, we propose a high-quality decoder (HQDec), with which multilevel near-lossless fine-grained information, obtained by the proposed adaptive axial-normalized position-embedded channel attention sampling module (AdaAxialNPCAS), can be adaptively incorporated into a LR feature map with high-level semantics utilizing the proposed adaptive information exchange scheme. In the HQDec, we leverage the proposed adaptive refinement module (AdaRM) to model the local and global dependencies between pixels in parallel and utilize the proposed disparity attention module to model the distribution characteristics of disparity values from a global perspective. To recover fine-grained HR features with maximal accuracy, we adaptively fuse the high-frequency information obtained by constraining the upsampled solution space utilizing the local and global dependencies between pixels into the HR feature map generated from the nonlearning method. Extensive experiments demonstrate that each proposed component improves the quality of the depth estimation results over the baseline results, and the developed approach achieves state-of-the-art results on the KITTI and DDAD datasets. The code and models will be publicly available at HQDec.

Abstract:
Manga is a fashionable Japanese-style comic form that is composed of black-and-white strokes and is generally displayed as raster images on digital devices. Typical mangas have simple textures, wide lines, and few color gradients, properties that make them well suited to vectorization and to the merits of vector graphics, e.g., adaptive resolutions and small file sizes. In this paper, we propose MARVEL (MAnga’s Raster to VEctor Learning), a primitive-wise approach for vectorizing raster gray-level mangas by Deep Reinforcement Learning (DRL). Unlike previous learning-based methods, which predict vector parameters for an entire image, MARVEL introduces a new perspective that regards an entire manga as a collection of basic primitives (stroke lines) and designs a DRL model to decompose the target image into a primitive sequence for accurate vectorization. To improve vectorization accuracy and decrease file sizes, we further propose a stroke accuracy reward to predict accurate stroke lines, and a pruning mechanism to avoid generating erroneous and repeated strokes. Extensive subjective and objective experiments show that MARVEL generates impressive results and reaches state-of-the-art performance.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality image retrieval task. Compared to visible modality person re-identification that handles only the intra-modality discrepancy, VI-ReID suffers from an additional modality gap. Most existing VI-ReID methods achieve promising accuracy in a supervised setting, but the high annotation cost limits their scalability to real-world scenarios. Although a few unsupervised VI-ReID methods already exist, they typically rely on intra-modality initialization and cross-modality instance selection, despite the additional computational time required for intra-modality initialization. In this paper, we study the fully unsupervised VI-ReID problem and propose a novel cross-modality hierarchical clustering and refinement (CHCR) method by promoting modality-invariant feature learning and improving the reliability of pseudo-labels. Unlike conventional VI-ReID methods, CHCR does not rely on any manual identity annotation and intra-modality initialization. First, we design a simple and effective cross-modality clustering baseline that clusters between modalities. Then, to provide sufficient inter-modality positive sample pairs for modality-invariant feature learning, we propose a cross-modality hierarchical clustering algorithm to promote the clustering of inter-modality positive samples into the same cluster. In addition, we develop an inter-channel pseudo-label refinement algorithm to eliminate unreliable pseudo-labels by checking the clustering results of three channels in the visible modality. Extensive experiments demonstrate that CHCR outperforms state-of-the-art unsupervised methods and achieves performance competitive with many supervised methods.

Abstract:
Recent learned image compression models surpass manually designed methods in rate-distortion performance by introducing nonlinear transforms and end-to-end optimization. However, there is still a lack of quantitative measurements that efficiently evaluate the latent representations inferred by learned image compression models. To address this problem, we develop novel measurements of the robustness and importance of the latent representations. We first propose an admissible range that can be efficiently estimated via gradient ascent and descent for establishing the empirical distribution of latent representations. Consequently, the in-distribution region within the admissible range is derived to measure the robustness and channel importance of latent representations of natural images. Visualization shows that the statistics of latent representations differ significantly in robustness and linearity inside and outside the in-distribution region. To the best of our knowledge, this paper proposes the first statistically meaningful measurements for learned image compression and successfully applies them, in a training-free fashion, to corruption alleviation during successive image compression and to post-training pruning. Compared with existing methods, the shrunk in-distribution constraint derived from the in-distribution region achieves superior robustness and rate-distortion performance in successive compression. The channel importance allows post-training pruning to achieve comparable rate-distortion performance with a reduction of up to 60% in entropy coding time.

Abstract:
Recently, there have been significant advancements in learning-based image compression methods, which now surpass traditional coding standards. Most of them prioritize achieving the best rate-distortion performance for a particular compression rate, which limits their flexibility and adaptability in various applications with complex and varying constraints. In this work, we explore the potential of resolution fields in scalable image compression and propose the reciprocal pyramid network (RPN) that fulfills the need for more adaptable and versatile compression. Specifically, RPN first builds a compression pyramid and generates the resolution fields at different levels in a top-down manner. The key design lies in the cross-resolution context mining module between adjacent levels, which performs feature enriching and distillation to mine meaningful contextualized information and remove unnecessary redundancy, producing informative resolution fields as residual priors. Scalability is achieved by progressive bitstream reuse and by incorporating resolution fields that vary across levels. Furthermore, between adjacent compression levels, we explicitly quantify the aleatoric uncertainty from the bottom decoded representations and develop an uncertainty-guided loss to update the upper-level compression parameters, forming a reverse pyramid process that forces the network to focus on the textured pixels with high variance for more reliable and accurate reconstruction. Combining resolution field exploration and uncertainty guidance in a pyramid manner, RPN can effectively achieve spatial and quality scalable image compression. Experiments show the superiority of RPN over existing classical and deep learning-based scalable codecs. Code will be available at https://github.com/JGIroro/RPNSIC.

Abstract:
In multiview video coding, the coding performance highly depends on the quality of the reference frames. In view of this, a step-wise reference frame generation network (SWGNet) is designed to improve the quality of the reference frame for efficient multiview video coding. In particular, a frame-level to block-level learning paradigm is proposed to step-wisely generate a high-quality reference frame. In the frame-level stage, by exploiting parallax correlations between temporal and inter-view references on the basis of image alignment, a parallax-guided frame-level synthesis module is proposed to generate an elementary reference frame. Then, in the block-level stage, a transformer-based block-level aggregation module is designed to further refine the texture details of the reference frame by modeling long-range dependencies among pixels. The proposed SWGNet is integrated into 3D-HEVC, and extensive experiments demonstrate that the proposed method achieves significant bitrate saving compared with 3D-HEVC.

Abstract:
Focusing on the difficulty of absolute rotation globalization in large-scale rotation averaging problems, this paper proposes a novel hierarchical pipeline, termed IRAv3+, based on multiple Connected Dominating Sets (CDSs). Specifically, the proposed method not only obtains the graph clusters for local rotation averaging like other cluster-based methods, but also generates a subset via connected dominating set extraction, which serves as a reference for rotation globalization. To facilitate the rotation globalization, two key techniques are proposed: 1) to provide a more reliable global reference, instead of a single CDS, multiple CDSs are randomly selected and united; 2) to give a more accurate local-to-global alignment estimation, instead of using the relative rotation measurements of the edges shared between local clusters and the global reference, the absolute rotations of the vertices common to them are used. Experiments on the 1DSfM dataset demonstrate the effectiveness of the proposed IRAv3+ and its advantages over existing cluster-based rotation averaging methods and other state-of-the-art methods.
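
The idea of selecting several randomized CDSs and uniting them can be sketched as follows. This is a generic greedy heuristic written for illustration, not the extraction procedure used in IRAv3+; it assumes a connected, undirected view graph given as a dictionary of neighbor sets, and the names `greedy_cds`, `united_cds`, and `is_connected_dominating` are hypothetical.

```python
import random

def greedy_cds(adj, rng):
    """Grow a connected dominating set from a random start vertex.

    adj: dict mapping each vertex to the set of its neighbors
         (assumed undirected and connected).
    """
    start = rng.choice(sorted(adj))
    cds = {start}
    covered = {start} | adj[start]
    while len(covered) < len(adj):
        # Candidates are covered vertices adjacent to the current set,
        # so adding one keeps the set connected.
        frontier = sorted(covered - cds)
        rng.shuffle(frontier)  # random tie-breaking between equal gains
        best = max(frontier, key=lambda v: len(adj[v] - covered))
        cds.add(best)
        covered |= adj[best]
    return cds

def united_cds(adj, num_sets, seed=0):
    """Union of several randomized CDSs, used as one global reference set."""
    rng = random.Random(seed)
    union = set()
    for _ in range(num_sets):
        union |= greedy_cds(adj, rng)
    return union

def is_connected_dominating(adj, subset):
    """Check that subset dominates the graph and induces a connected subgraph."""
    dominated = set(subset)
    for v in subset:
        dominated |= adj[v]
    if dominated != set(adj):
        return False
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] & subset:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == set(subset)
```

The union of two CDSs of the same graph is again a CDS, because every vertex of one set is either contained in or adjacent to the other, which is why uniting several random draws still yields a valid reference subgraph.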

Abstract:
Post-training neural network quantization (PTQ) is an effective model compression technology that has revolutionized the deployment of deep neural networks on various edge devices. It provides easy-to-use characteristics and allows for generating a quantized model based on a pre-trained counterpart without re-training. Typical PTQ approaches maintain output consistency through layer-wise calibration. However, these approaches still suffer from performance degradation primarily caused by feature quantization in ultra-low bitwidth conditions. To address this issue, we propose a prepositive feature quantization framework that decouples adjacent layers and calibrates the interaction between feature and parameter quantization perturbations. Additionally, we present a feature-loss-aware optimization strategy to solve the corresponding calibration problem. To validate the effectiveness of our method, we conducted extensive experiments on the ImageNet benchmark dataset. Our approach demonstrates a noticeable improvement in PTQ performance under the 2-bit condition.
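
To make the 2-bit setting concrete, the following minimal sketch shows plain uniform quantization of a list of values. It illustrates only the generic quantizer that PTQ methods start from, not the prepositive feature quantization framework or its calibration; the function name `quantize_uniform` is hypothetical.

```python
def quantize_uniform(values, num_bits):
    """Map each value to the nearest of 2**num_bits evenly spaced levels.

    The levels span [min(values), max(values)]; this min-max range choice
    is an assumption made for the sketch, not a calibrated range.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** num_bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]

features = [0.0, 0.1, 0.4, 0.55, 0.9, 1.0]
print(quantize_uniform(features, num_bits=2))
```

With two bits only four levels are available, so each value can move by up to half a step. This coarse rounding of features is the kind of perturbation the abstract identifies as the main source of degradation at ultra-low bitwidths.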

Abstract:
Under-Display Camera (UDC) is an emerging feature of cellphone. This technology makes full-screen cellphones possible by hiding the front-facing camera below the display panel, which is in contrast to the conventional designs that place the camera in a bezel or punch-hole on the screen border. However, this novel imaging paradigm also causes degradation. The display panel attenuates and diffracts incoming light, so the images captured by UDC contain multiple artifacts, such as blurring, color shift, and low intensities. This paper proposes a lightweight deep learning approach to restore UDC images in a blind setting. The restoration network uses cross-scale modulation to exploit complementary information from multi-scale representations and capture the self-similarity across scales, aiming to find the cues for recovering distortion-free images. To facilitate the deployment of this scheme across mobile devices, especially on those with limited memory space and computing power, we compress the restoration network by reducing architectural redundancy. An adaptive distillation algorithm is designed to exploit knowledge from a pre-trained full-size model. The proposed work also interprets the behavior of the neural network in utilizing local and non-local information to restore UDC images. The proposed algorithm is evaluated over three datasets of the images captured by the cameras below different types of display panels. The results of comparative experiments demonstrate that our algorithm shows comparable or superior performance to the competing ones that are much heavier in parameter amount and computational complexities.

Abstract:
Structure-from-Motion is a technology used to obtain scene structure through image collection, which is a fundamental problem in computer vision. For unordered Internet images, SfM is very slow due to the lack of prior knowledge about image overlap. For sequential images, knowing the large overlap between adjacent frames, SfM can adopt a variety of acceleration strategies, which are only applicable to sequential data. To further improve the reconstruction efficiency and break the gap of strategies between these two kinds of data, this paper presents an efficient covisibility-based incremental SfM. Different from previous methods, we exploit covisibility and registration dependency to describe the image connection which is suitable to any kind of data. Based on this general image connection, we propose a unified framework to efficiently reconstruct sequential images, unordered images, and the mixture of these two. Experiments on the unordered images and mixed data verify the effectiveness of the proposed method, which is three times faster than the state-of-the-art on feature matching, and an order of magnitude faster on reconstruction without sacrificing the accuracy. The source code is publicly available at https://github.com/openxrlab/xrsfm.

Abstract:
It is challenging to achieve generalized zero-shot action recognition. Different from the conventional zero-shot tasks which assume that the instances of the source classes are absent in the test set, the generalized zero-shot task studies the case that the test set contains both the source and the target classes. Due to the gap between visual feature and semantic embedding as well as the inherent bias of the learned classifier towards the source classes, the existing generalized zero-shot action recognition approaches are still far less effective than traditional zero-shot action recognition approaches. Facing these challenges, a novel transductive learning with prior knowledge (TLPK) model is proposed for generalized zero-shot action recognition. First, TLPK learns the prior knowledge which assists in bridging the gap between visual features and semantic embeddings, and preliminarily reduces the bias caused by the visual-semantic gap. Then, a transductive learning method that employs unlabeled target data is designed to overcome the bias problem in an effective manner. To achieve this, a target semantic-available approach and a target semantic-free approach are devised to utilize the target semantics in two different ways, where the target semantic-free approach exploits prior knowledge to produce well-performed semantic embeddings. By exploring the usage of the aforementioned prior-knowledge learning and transductive learning strategies, TLPK significantly bridges the visual-semantic gap and alleviates the bias between the source and the target classes. The experiments on the benchmark datasets of HMDB51 and UCF101 demonstrate the effectiveness of the proposed model compared to the state-of-the-art methods. The source code of this work can be found in https://mic.tongji.edu.cn

Abstract:
Advanced human sensing technologies based on radio frequency (RF) signals have gained widespread attention in recent years. However, due to the sparsity and incompleteness of RF signals, fine-grained RF-based multi-person 3D pose estimation has progressed more slowly. In this paper, we present RF-based Pose Machine (RPM 2.0) for multi-person 3D pose estimation using RF signals. Specifically, we first develop a lightweight anchor-free detector module to locate and crop regions of interest from horizontal and vertical RF signals. Afterward, we treat the horizontal and vertical millimeter-wave radars as “RF cameras” with different viewing angles and propose a Multi-view Fusion Network to unproject the RF signals into a unified latent feature space, and then calculate the correlation for weighted fusion. Finally, a Spatio-Temporal Attention Network is designed to reconstruct the multi-person 3D skeleton sequences, in which the spatial attention module recovers invisible body parts using non-local correlations among joints and the temporal attention module refines the 3D pose sequences using temporal coherency learned from frame queries. We evaluate the performance of the proposed RPM 2.0 and state-of-the-art methods on a large-scale dataset with multi-person 3D pose labels and corresponding radar signals. The experimental results show that RPM 2.0 outperforms all of the baseline methods, locating multi-person 3D key points with an average error of 73 mm and generalizing well to new conditions such as occlusion and low illumination.

Abstract:
With the increasing diversity of visual tracking tasks, object tracking in RGB and thermal (RGB-T) modalities has received widespread interest. Most existing RGB-T tracking methods improve tracking performance mainly by integrating hierarchically complementary information from RGB and thermal modalities; however, they are insufficient in handling tracking failures due to the lack of re-detection capability. To address this issue, we propose a new RGB-T tracking method with online learning samples and adaptive object recovery. First, the features of RGB and thermal modalities are concatenated for robust appearance modeling. Second, a multimodal fusion strategy is designed to stably integrate reliable information from the modalities, and similarity is used to measure tracking confidence. Finally, a detector with online learning of positive and negative samples and adaptive recovery is developed to correct unreliable tracking results. Numerical results on five recent large-scale benchmark datasets demonstrate that the proposed tracker achieves competitive performance compared to other state-of-the-art methods.

Abstract:
Capturing high dynamic range (HDR) images (videos) is attractive because it can reveal the details in both dark and bright regions. Since mainstream screens only support low dynamic range (LDR) content, a tone mapping algorithm is required to compress the dynamic range of HDR images (videos). Although image tone mapping has been widely explored, video tone mapping is lagging behind, especially for deep-learning-based methods, due to the lack of HDR-LDR video pairs. In this work, we propose a unified framework (IVTMNet) for unsupervised image and video tone mapping. To improve unsupervised training, we propose a domain- and instance-based contrastive learning loss. Instead of using a universal feature extractor, such as VGG, to extract the features for similarity measurement, we propose a novel latent code, which is an aggregation of the brightness and contrast of extracted features, to measure the similarity of different pairs. In total, we construct two negative pairs and three positive pairs to constrain the latent codes of tone-mapped results. For the network structure, we propose a spatial-feature-enhanced (SFE) module to enable information exchange and transformation of nonlocal regions. For video tone mapping, we propose a temporal-feature-replaced (TFR) module to efficiently utilize the temporal correlation and improve the temporal consistency of video tone-mapped results. We construct a large-scale unpaired HDR-LDR video dataset to facilitate the unsupervised training process for video tone mapping. Experimental results demonstrate that our method outperforms state-of-the-art image and video tone mapping methods. Our code and dataset are available at https://github.com/cao-cong/UnCLTMO.

Affiliations: College of Computer Science and Technology, Qingdao University, Qingdao, China; Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, and the School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China; College of Engineering, Ocean University of China, Qingdao, China; School of Mechanical and Aerospace Engineering, Nanyang Technological University (NTU), Jurong West, Singapore; College of Computer Science, Nankai University, Tianjin, China

Abstract:
Underwater image quality is seriously degraded due to the insufficient light in water. Although artificial illumination can assist imaging, it often brings non-uniform illumination phenomenon. To this end, we develop an illumination channel sparsity prior (ICSP) guided variational framework for non-uniform illumination underwater image restoration. Technically, the illumination channel sparsity prior is built on the observation that the illumination channel of a uniform-light underwater image in HSI color space contains few pixels whose intensity is very low. Then according to the Retinex theory, we design a variational model with L0 norm term, constraint term, and gradient term, by integrating the proposed ICSP into an extended underwater image formation model. Such three regularizations are effective in enhancing the brightness, correcting color distortion, and revealing structures and fine-scale details. Meanwhile, we exploit a fast numerical algorithm on the base of the alternating direction method of multipliers (ADMM) to accelerate solving this optimization problem. We also collect a benchmark dataset, namely NUID that contains 925 real underwater images of different non-uniform illumination. Extensive experiments demonstrate that our proposed method is effective in terms of qualitative and quantitative comparisons, ablation studies, convergence analysis, and applications. The code and dataset are available at https://github.com/Hou-Guojia/ICSP.

Abstract:
Image-to-image translation methods have progressed from only considering the image-level information to integrating the global- and instance-level information. However, only the foreground instances are refined, and the background semantics are taken as an entire feature, which causes a substantial loss of the semantic information in the translation. Additionally, the insufficient quality of the translated semantic regions also leads to an unsatisfactory performance of the object recognition or visual odometry tasks in which the translated images/videos are further used. In this paper, we propose a novel generative adversarial network for panoptic-level image-to-image translation (PanopticGAN). The proposed method has three advantages: 1) the extracted panoptic perception (i.e., the foreground instances and background semantic regions) as content codes are aligned with the sampled panoptic style codes, which considers the panoptic-level information to avoid the semantic information loss, and the latent space of each object has a rich fusion of content and style codes to generate the higher-fidelity results; 2) a feature masking module is proposed to extract the representations within each object contour by masks for sharpening the object boundaries; 3) the improved fidelity of the translated semantic regions further contributes to enhancing the performance of the object recognition or visual odometry tasks that the translated images/videos are used in. In this paper, we also annotate a compact panoptic segmentation dataset for the thermal-to-color translation task. Extensive experiments are conducted to demonstrate the effectiveness of our PanopticGAN over the latest methods.

Abstract:
The demand to implement semantic segmentation networks on mobile devices has increased dramatically. However, existing real-time semantic segmentation methods still suffer from a large number of network parameters, unsuitable for mobile devices with limited memory resources. The reason mainly arises from the fact that most existing methods take the backbone networks (e.g., ResNet-18 and MobileNet) as an encoder. To alleviate this problem, we propose a novel Reparameterizable Channel & Dilation (RCD) block and construct a considerably lightweight yet effective encoder by stacking several RCD blocks according to three guidelines. The strengths of the proposed encoder result in the abilities not only to extract discriminative feature representations via channel convolutions and dilated convolutions, but also to reduce computational burdens while maintaining segmentation accuracy with the help of re-parameterization technique. Except for encoder, we also present a simple but effective decoder that adopts an across-resolution fusion strategy to fuse multi-scale feature maps generated from the encoder instead of a bottom-up pathway fusion. With such an encoder and a decoder, we provide a Reparameterizable Across-resolution Fusion Network (RAFNet) for real-time semantic segmentation. Extensive experiments demonstrate that our RAFNet achieves a promising trade-off between segmentation accuracy, inference speed and network parameters. Specifically, our RAFNet with only 0.96M parameters obtains 75.3% mIoU at 107 FPS and 75.8% mIoU at 195 FPS on Cityscapes and CamVid test sets for full-resolution inputs, respectively. After quantization and deployment on a Xilinx ZCU104 device, our RAFNet obtains a favorable segmentation performance with only 1.4W power.
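
The re-parameterization technique mentioned above relies on the linearity of convolution: parallel branches used during training can be folded into one kernel for inference. The sketch below shows this for a 1D toy case with a 3-tap branch and a 1-tap branch. It is a generic illustration of structural re-parameterization and does not reproduce the RCD block; the helper names `conv1d_same` and `merge_branches` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation with output length equal to input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def merge_branches(kernel3, kernel1):
    """Fold a 1-tap branch into the center of a 3-tap kernel."""
    merged = list(kernel3)
    merged[1] += kernel1[0]
    return merged

signal = [1.0, 2.0, 3.0, 4.0]
kernel3, kernel1 = [0.5, -1.0, 0.25], [2.0]

# Training-time form: two parallel branches whose outputs are summed.
two_branch = [
    a + b
    for a, b in zip(conv1d_same(signal, kernel3), conv1d_same(signal, kernel1))
]
# Inference-time form: a single convolution with the merged kernel.
single_branch = conv1d_same(signal, merge_branches(kernel3, kernel1))
```

Both forms produce the same output up to floating-point rounding, while the merged form needs only one convolution, which is the source of the inference-time savings such designs aim for.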

Abstract:
From the perspective of computer vision, both visual saliency and object detection have attracted considerable attention in the field of traffic scene perception. However, these two tasks are often seen as independent missions, and their correlations have rarely been explored. In real driving scenarios, drivers mainly care about the salient objects closely related to the current driving task under the guidance of visual selective attention. This process highly integrates saliency perception and object detection, leading to efficient and quick decision-making, thus achieving safe driving. In this study, with reference to human drivers’ perception of traffic scenes, we focus on detecting fixated objects within the regions attracting the drivers’ attention. Firstly, we build a new fixated object detection dataset based on drivers’ fixations, which can serve as a benchmark for studying traffic object detection from the driver’s point of view. Then, we propose a fixated object detection model based on saliency prior, named FOD-Net. FOD-Net takes advantage of the predicted salient regions as saliency priors to guide the detection of the fixated objects that are closely relevant to the driving task, thus improving detection accuracy. Experimental results on the proposed dataset show that FOD-Net achieves a mAP value of 78.4% with a small number of model parameters, which is higher than other state-of-the-art models. Our work combines the driver’s attention mechanism with object detection to narrow the gap between visual saliency and object detection in traffic scenes, showing potential supplemental or referential value for developing high-intelligence assisted/automatic driving systems. The dataset and code are available at https://github.com/YiShi701/Fixated-object-detection.

Abstract:
Due to their imaging mechanisms and techniques, some depth images inevitably have low visual quality or foregrounds that are inconsistent with their corresponding RGB images. Directly using such depth images will deteriorate the performance of RGB-D salient object detection (SOD). In view of this, a novel RGB-D SOD model is presented, which follows the principle of calibration-then-fusion to effectively suppress the influence of these two types of depth images on the final saliency prediction. Specifically, the proposed model is composed of two stages, i.e., an image generation stage and a saliency reasoning stage. The former generates high-quality and foreground-consistent pseudo depth images via an image generation network, while the latter first calibrates the original depth information with the aid of those newly generated pseudo depth images and then performs cross-modal feature fusion for the final saliency reasoning. In particular, in the first stage, a Two-steps Sample Selection (TSS) strategy is employed to select reliable depth images from the original RGB-D image pairs as supervision information to optimize the image generation network. Afterwards, in the second stage, a Feature Calibrating and Fusing Network (FCFNet) is proposed to achieve the calibration-then-fusion of cross-modal information for the final saliency prediction, which is achieved by a Depth Feature Calibration (DFC) module, a Shallow-level Feature Injection (SFI) module and a Multi-modal Multi-scale Fusion (MMF) module. Moreover, a loss function, i.e., Region Consistency Aware (RCA) loss, is presented as an auxiliary loss for FCFNet to facilitate the completeness of salient objects together with the reduction of background interference by considering the local regional consistency in the saliency maps. Experiments on six benchmark datasets demonstrate the superiority of our proposed RGB-D SOD model over state-of-the-art methods.

Abstract:
In this work, we propose a large-scale dataset, VRAI, and an effective Orientation Adaptive and Salience Attentive (OASA) Network for vehicle re-identification (ReID) in aerial imagery. The VRAI dataset includes two subsets: VRAI-Image, which contains over 137,000 images of 13,000 vehicle instances, and VRAI-Video, which comprises more than 14,000 video trajectories of 7,000 identities. To our best knowledge, this is the largest dataset for UAV-based vehicle ReID, and the first dataset proposed for video-based ReID under UAV views. Based on the VRAI dataset, we design an OASA network to address two crucial challenges of vehicle ReID in aerial imagery. Firstly, the significant vehicle orientation variations in aerial images could cause great vehicle pattern deformations, making it difficult to identify vehicles across UAV views. To overcome this challenge, in our OASA, an orientation adaptive dynamic convolution module is designed, which constructs customized kernels for each vehicle instance to extract orientation-invariant features. Besides, the unique vertical view and long focal length of the UAV platform often render many salient vehicle attributes, such as logos and license plates, invisible, which brings a great challenge to ReID models to extract distinguishable vehicle features. To address this issue, in the OASA, we design a transformer-based salience attentive module (Trans-Attn) that guides the model to focus on subtle yet discriminative clues of vehicle instances in aerial imagery. Through extensive experiments, both of our designed modules are verified effective. Besides, our OASA model outperforms state-of-the-art algorithms both on our VRAI dataset and other surveillance-based datasets. Our VRAI dataset is available in https://github.com/JiaoBL1234/VRAI-Dataset.

Abstract:
View synthesis aims to learn a view transformation and synthesize the target views from a single or multiple source views. Although previous view synthesis methods have obtained promising performance, they heavily rely on the supervision of the target view. In this paper, we propose an unsupervised single-view synthesis network (USVS-Net) to learn the view transformation without the supervision of the target view. Specifically, with the usage of only a single source view, a style-guidance view synthesis model is proposed to learn an intrinsic representation, which intends to describe the object from a reference pose. With the intrinsic representation, the view transformation is learned to boost the learning of the unsupervised single-view synthesis. Then, taking the style-guidance view synthesis model as the teacher, a prior-distillation view synthesis model is further presented as the student to learn a more direct view transformation. By utilizing the proposed method, high-quality target views are synthesized in a time-efficient manner. Experiments on both synthetic and real-scene datasets show that despite the lack of supervision of the target view, the proposed method achieves promising results compared with the existing view synthesis methods.

Abstract:
Due to the complex imaging mechanism, underwater images often suffer from multiple degradation issues, such as color cast, blurry detail, and low contrast, which affect the extraction of valuable information. To deal with these degradation issues, a simple yet effective underwater image quality improvement method based on color, detail and contrast restoration (CDCR) is developed, which consists of three key modules: a well-preserved finding-driven color balance module (CBM), a linear saturation transformation-based discriminant function-based detail restoration module (DRM), and a transmission minimization-oriented contrast restoration module (CRM). First, the CBM explores a well-preserved channel finding and employs a channel compensation strategy to balance the color differences among three color channels. Second, the DRM uses a piecewise underwater image saturation estimation strategy, which takes the various spectral properties of water into account and designs an additional linear saturation transformation-based discriminant function to prevent the transmission from being under-estimated. Finally, the CRM estimates a global backscatter light based on transmission minimization and further improves the contrast by locally removing the backscatter light of the base layer. Our restored image is appealing in its natural color, fine details, and high contrast. Extensive experiments on three underwater image enhancement datasets show that our CDCR achieves better results than state-of-the-art methods: compared with the second-best method, the average PCQI and UIQM values of our method increase by 5.7% and 0.2%, and the average Blur and DFAD values of our method decrease by 8.0% and 5.3%. Meanwhile, experiments further suggest that the rate of new visible edges and the quality of contrast restoration of our CDCR increase by at least 7.7% and 51.2%, respectively, in most tested sandstorm and foggy images, which demonstrates that our method has a good generalization capability for sandstorm and foggy image restoration.

Abstract:
Long-term spatial-temporal frame prediction focuses on predicting future image frames precisely, which has numerous applications in real-world scenarios. Existing deep learning prediction models mainly rely on advanced neural network architectures to model complicated spatial-temporal features, but make little effort to explore high-order correlations to better capture long-term dynamics. As a result, their long-term predictions suffer from inaccurate visual and motion details. In this article, we propose a high-order prediction model for long-term frame prediction, which improves the appearance and motion details by designing special high-order correlation modules in two aspects. First, to enhance the appearance details of predicted frames, we propose a high-order appearance encoder module, where high-order appearance features can be effectively captured with a carefully designed Non-local ConvLSTM. Second, to guarantee the motion accuracy of predicted sequences, we carefully design a high-order motion encoder module, which can accurately capture and preserve the high-order motion patterns with adaptive motion extractors and progressive memory banks, respectively. Comprehensive experiments are conducted on six challenging datasets from real-world scenarios, which demonstrate the effectiveness and superiority of our proposed method over state-of-the-art methods.

Abstract:
Steganalysis feature selection is highly effective at improving detection efficiency and reducing time and space costs. However, a single evaluation criterion for features and a subjective selection basis often cause valuable features to be neglected, which restricts the improvement of detection accuracy. To alleviate this predicament, this paper proposes a steganalysis feature selection method based on multidimensional evaluation and dynamic threshold allocation (MEDTA method). Firstly, to measure the feature components’ contribution degree to detection, the concept of partial entropy for steganalysis (ste-pe) is defined and utilized to measure the mutual information between feature components. On this basis, the evaluation criterion for steganalysis feature components’ contribution degree is proposed, and the theoretical basis is given. Secondly, to measure the functional similarity of the feature components in distinguishing between cover images and stego images, by applying the property of cosine similarity between vectors, the evaluation criterion for steganalysis feature components’ contribution angle is proposed. Then, according to Occam’s Razor, a multidimensional evaluation criterion based on contribution degree and contribution angle is proposed, which provides a basis for feature selection. In addition, to allocate the threshold for feature selection, this paper proposes a dynamic threshold allocation model, which combines the merits of several function models. Finally, feature selection with multidimensional evaluation and dynamic threshold allocation is proposed, which can achieve a comprehensive evaluation and objective selection for steganalysis features. Extensive experiments conducted on the BOSSbase1.01 image database demonstrate that the proposed MEDTA method not only achieves highly competitive or even better performance in detection accuracy and feature dimension reduction compared with the state-of-the-art methods, but also removes the dependence on classifiers, so that the efficiency of feature extraction and steganalysis is improved.
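The abstract above bases its "contribution angle" criterion on the cosine similarity between vectors. The following is a minimal sketch of that underlying quantity only, not the MEDTA criterion itself, whose exact definition the abstract does not give. The function names and the idea of treating each feature component as a vector of values over a set of images are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between two vectors in degrees; the cosine is clamped to guard against rounding."""
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(c))

# Hypothetical feature components, each observed over the same four images.
component_a = [0.1, 0.4, 0.3, 0.9]
component_b = [0.2, 0.8, 0.6, 1.8]   # a scaled copy of component_a
component_c = [0.9, 0.1, 0.0, 0.0]

print(angle_degrees(component_a, component_b))  # near 0: functionally similar
print(angle_degrees(component_a, component_c))  # larger: less similar
```

A small angle suggests two components behave alike, which is the kind of functional similarity the abstract says the contribution angle is meant to capture.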

Abstract:
As online collaborations become more prevalent, screen content has become increasingly important in real-time video communications. To reduce communication costs, the H.265/HEVC standard introduced the Screen Content Coding (SCC) extension, which achieves significant bit savings but comes with a higher encoding complexity. There is a need for ultrafast SCC encoding to meet the demands of real-time applications. Our key idea is to predict the rate-distortion (RD) cost of each possible coding unit under each possible mode, rather than performing actual coding to obtain the RD cost. Specifically, we construct neural networks to predict RD costs for intra prediction, palette, and normal intra block copy (IBC) modes. For IBC merge mode, we conduct motion compensation trials and use a linear regression network for prediction. Using the predicted RD costs, we create a partition-mode map set that determines not only block partitioning but also optimal modes, significantly reducing encoding complexity. Our experimental results demonstrate that our method achieves a more than 90% reduction in encoding time with an average 9.4% BD-rate increase compared to the HEVC-SCC reference software in the all-intra configuration.
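The decision step described above, choosing partitions and modes from predicted RD costs instead of from actual encoding, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's partition-mode map construction: the function names, the dictionary layout, and the cost values are hypothetical, and the sketch ignores the bits a real encoder spends on signalling the split itself.

```python
def best_mode(predicted_costs):
    """Return (mode, cost) for the mode with the smallest predicted RD cost."""
    mode = min(predicted_costs, key=predicted_costs.get)
    return mode, predicted_costs[mode]

def decide_partition(parent_costs, children_costs):
    """Compare coding a unit whole against splitting it into sub-units.

    parent_costs:   {mode: predicted RD cost} for the unsplit coding unit
    children_costs: list of {mode: predicted RD cost}, one per sub-unit
    """
    parent_mode, parent_cost = best_mode(parent_costs)
    child_choices = [best_mode(costs) for costs in children_costs]
    split_cost = sum(cost for _, cost in child_choices)
    if split_cost < parent_cost:
        return {"split": True, "modes": [m for m, _ in child_choices], "cost": split_cost}
    return {"split": False, "modes": [parent_mode], "cost": parent_cost}

# Hypothetical predicted costs for one coding unit and its four sub-units.
parent = {"intra": 100.0, "palette": 80.0, "ibc": 90.0}
children = [
    {"intra": 30.0, "palette": 15.0, "ibc": 25.0},
    {"intra": 18.0, "palette": 22.0, "ibc": 15.0},
    {"intra": 15.0, "palette": 40.0, "ibc": 35.0},
    {"intra": 20.0, "palette": 15.0, "ibc": 16.0},
]
print(decide_partition(parent, children))
```

Because every cost comes from a predictor, the encoder never has to run the candidate modes it rejects, which is where the time saving reported in the abstract would come from.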

Abstract:
Pedestrian trajectory prediction is an essential task in real-world applications, aimed at predicting plausible future trajectories based on limited observations. In this work, we rethink the standard evaluation metric of the pedestrian trajectory prediction task: Minimum-of-N Average Displacement Error (MoN-ADE). For multi-modal prediction models that generate multiple trajectories for each pedestrian, this metric typically evaluates the model by only considering the one that is closest to the ground-truth trajectory. However, such an evaluation protocol cannot comprehensively evaluate the predictive ability of the model, and may encourage models to generate high-variance and dispersed trajectory distributions. This is impractical, especially for real-world scenes such as autonomous driving that require precise and convergent trajectory predictions. To address these limitations, we design a novel metric towards comprehensive evaluation in pedestrian trajectory prediction, which moves beyond the traditional reliance on the closest prediction. Specifically, we replace the Minimum-of-N strategy with a Random-Sampling-K strategy to calculate the expectations of the minimum ADE and formulate a novel metric: Area Under the Curve (AUC). Furthermore, motivated by the proposed metric, we introduce a novel objective function named K-Ensemble Loss, which guides the state-of-the-art models to optimize the whole prediction distribution and reduce the uncertainty caused by the high-variance predictions. Extensive experiments on three real-world datasets demonstrate that the proposed metric and objective function are effective and flexible.
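To make the contrast between the two protocols concrete, the sketch below computes MoN-ADE and an expected minimum ADE over random subsets of K predictions. It is only one plausible reading of the abstract: the exact enumeration over subsets, the trapezoidal area over K = 1..N, and all function names are assumptions, and the paper's actual AUC definition may differ.

```python
import math
from itertools import combinations

def ade(pred, gt):
    """Average Euclidean displacement between two equal-length 2D trajectories."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def min_of_n_ade(preds, gt):
    """MoN-ADE: error of the single prediction closest to the ground truth."""
    return min(ade(p, gt) for p in preds)

def expected_min_ade(preds, gt, k):
    """Expected minimum ADE when k of the N predictions are drawn uniformly at random.

    Computed exactly by enumerating all size-k subsets, so only suitable for small N.
    """
    errors = [ade(p, gt) for p in preds]
    subsets = list(combinations(errors, k))
    return sum(min(s) for s in subsets) / len(subsets)

def auc_over_k(preds, gt):
    """Trapezoidal area under the expected-min-ADE curve for k = 1..N (assumed definition)."""
    curve = [expected_min_ade(preds, gt, k) for k in range(1, len(preds) + 1)]
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))

gt = [(0.0, 0.0), (1.0, 0.0)]
preds = [
    [(0.0, 0.0), (1.0, 0.0)],  # exact match, ADE 0
    [(0.0, 1.0), (1.0, 1.0)],  # ADE 1
    [(0.0, 3.0), (1.0, 3.0)],  # ADE 3
]
print(min_of_n_ade(preds, gt))
print([expected_min_ade(preds, gt, k) for k in (1, 2, 3)])
print(auc_over_k(preds, gt))
```

MoN-ADE here is zero because one prediction is perfect, while the expectation at small K still reflects the two dispersed predictions, which is the behaviour the abstract argues a comprehensive metric should expose.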

Abstract:
Oriented object detection has garnered significant attention. However, rotational symmetry and discontinuity at boundaries can confuse networks, leading to discontinuous loss and regression inconsistency. In this paper, we propose an efficient multi-directional object detection framework named Direction Prediction Redefinition (DPR). We describe the angle variation of rotated bounding boxes (B_r) as changes in the dimensions of horizontal bounding boxes (B_h). Specifically, we generate two sets of horizontal bounding boxes by predicting the center points of the corresponding boundaries within the rotated bounding box, thereby avoiding boundary issues caused by angle prediction. To further achieve robust rotated boundary representation, we propose the Joint Scale Representation method and the State Feature Encoding module, which are used to eliminate outliers in rotated boundaries and guide the correct selection of horizontal bounding box vertices, respectively. Moreover, we further abstract DPR as Multiple Trigonometric functions based DPR (DPR-MT). This method maps a single angle into four sets of trigonometric functions and considers them as the four sides of the horizontal bounding box. This approach predicts angles in the form of horizontal bounding boxes without complex operations, making it plug-and-play. Experimental results and visual analysis on challenging datasets further verify the effectiveness and competitiveness of our proposed method.

Abstract:
Satellite video object tracking involves tracking a specified tiny object within a wide scene. The insufficient appearance features of these tiny objects pose significant challenges to appearance-based object trackers, particularly in situations involving occlusion, target blur, and similar interferences. In this paper, a novel Graph Association MOtion-aware tracker (GAMO) is proposed for tiny objects in satellite videos, which integrates motion and spatial relationship information. First, a Gaussian motion estimator is proposed that decouples motion into velocity and direction, rather than using traditional x-y movement modeling. This estimator predicts the object’s position and estimates motion uncertainty with a directional motion probability map. Furthermore, the estimated motion serves as a prior to guide the proposal sampling. A probabilistic proposal sampling module is designed that samples candidate bounding boxes according to the directional motion probability map, focusing on the region where the target is most likely to appear. Additionally, we implement a graph association module to model and propagate the spatial relationships between the target and neighboring objects over time. This relationship information assists the appearance features in distinguishing the target from similar interferences. Experiments on the Skysat-1, SV248S, and VISO datasets demonstrate the superiority of the proposed tracker. GAMO leverages motion and surrounding information, resulting in significant improvements with minimal computational overhead. The code and results will be publicly available at https://github.com/Midkey/GAMO.

Abstract:
The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Learning discriminative representations can be challenging due to large shape variations of point sets in local regions and incomplete surfaces from a global perspective, which can be made even more severe in the context of unsupervised domain adaptation (UDA). Specifically, traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries, which greatly limits their cross-domain generalization. Recently, transformer-based models have achieved impressive performance gains in a range of image-based tasks, benefiting from their strong generalization capability and scalability stemming from capturing long-range correlations across local patches. Inspired by such successes of visual transformers, we propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well trained on massive images, which can significantly empower cross-domain representations with consistent topological priors of objects. To this end, we establish a parameter-frozen pre-trained transformer module shared between 2D teacher and 3D student models, complemented by an online knowledge distillation strategy for semantically regularizing the 3D student model. Furthermore, we introduce a novel self-supervised task centered on reconstructing masked point cloud patches using corresponding masked multi-view image features, thereby enabling the model to incorporate 3D geometric information. Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art UDA performance for point cloud classification. The source code of this work is available at https://github.com/zou-longkun/RPD.git.

Abstract:
Finger vein authentication, recognized for its high security and specificity, has become a focal point in biometric research. Traditional methods predominantly concentrate on vein feature extraction for discriminative modeling, with a limited exploration of generative approaches. Existing methods often fail to obtain authentic vein patterns through segmentation and consequently suffer from verification failures. To fill this gap, we introduce DiffVein, a unified diffusion model-based framework which simultaneously addresses vein segmentation and authentication tasks. DiffVein is composed of two dedicated branches: one for segmentation and the other for denoising. For better feature interaction between these two branches, we introduce two specialized modules to improve their collective performance. The first, a mask condition module, incorporates the semantic information of vein patterns from the segmentation branch into the denoising process. The second, a Semantic Difference Transformer (SD-Former), employs Fourier-space self-attention and cross-attention modules to extract category embedding before feeding it to the segmentation task. In this way, our framework allows for a dynamic interplay between diffusion and segmentation embeddings, so that the vein segmentation and authentication tasks can inform and enhance each other in the joint training. To further optimize our model, we introduce a Fourier-space Structural Similarity (FSSIM) loss function, which is tailored to improve the denoising network’s learning efficacy. Extensive experiments on the USM and THU-MVFV3V datasets substantiate DiffVein’s superior performance, setting new benchmarks in both vein segmentation and authentication tasks.

Abstract:
No-Reference Image Quality Assessment (NR-IQA), a subset of IQA techniques, is critical in scenarios where reference images are unavailable. With advancements in camera technology and computer vision, IQA datasets have evolved significantly in distortion types, image contents, and domains. This highlights the need for a broad study of NR-IQA continual learning, optimizing on a sequence of tasks, in both in-domain and domain-transfer settings. In this paper, we introduce the Channel Modulation Kernel (CMKernel) as a solution to enhance NR-IQA continual learning from two perspectives. Firstly, CMKernel encodes channel attention information for both in-domain and domain-transfer scenarios. By imposing constraints on CMKernels of successive models, the channel attention distillation loss effectively mitigates the divergence between old and new models. Secondly, in the context of the domain-transfer setting, a significant challenge lies in training a robust and transferable base model from the general domain for subsequent continual learning across specific domains. To tackle this, we introduce CMKernel-based multi-dataset learning to acquire a generative model. By dynamically weighting convolutional channels, the base model learns more equally from mixed datasets, enhancing its performance for subsequent incremental tasks. Comprehensive experiments validate the superiority of CMKernel in both in-domain and domain-transfer continual learning settings, showcasing its efficacy in addressing the evolving challenges of NR-IQA in diverse image contexts.

Abstract:
Object detection in remote sensing images has garnered significant attention due to its wide applications in real-world scenarios. However, most existing oriented object detectors still suffer from complex backgrounds and varying angles, which limits further performance improvement. In this paper, we propose a novel oriented detector with Hierarchical mask prompting and Robust integrated regression, termed HRDet. Specifically, to cope with the first issue, we construct a hierarchical mask prompting module consisting of a semantic mask prediction branch and a hierarchical Softmax technique. The former aims to isolate object instances from cluttered interferences guided by coarse box-wise masks, while the latter propagates differentiated features for adjacent layers using hierarchical attentive weights. To deal with the second issue, we strive for robust integrated regression and formulate an efficient oriented IoU loss, explicitly measuring the discrepancies of three geometric factors in oriented regression, i.e., the central point distance, side length, and angle. This loss is intended to overcome the problem that existing IoU-based losses are invariant during the regression of varying angles. We apply these two strategies to a simple one-stage detection pipeline, achieving a new level of trade-off between speed and accuracy. Extensive experiments on four large aerial imagery datasets, DOTA-v1.0, DOTA-v2.0, DIOR-R, and HRSC2016, demonstrate that our HRDet significantly improves the accuracy of the one-stage detector over refine-stage counterparts while maintaining the efficiency advantage. The source code will be available at https://github.com/yanqingyao1994/HRDet.

Abstract:
Trusted open set recognition aims to classify known classes and reject unknown ones, and to output an uncertainty estimate that measures the reliability of recognition results, thus extending the application scenarios of traditional open set recognition methods to risk-sensitive fields. Current methods assume that the covariate distribution of the known classes remains constant during training and testing. However, due to the common occurrence of covariate shift in practical applications, existing methods often suffer from limited generalization. To this end, a causal evidence learning framework, highlighted by the controllable Evidential Uncertainty Guided Adversarial Data Augmentation (EUG-ADA) and Causal Adversarial Disentanglement (CausalAD) strategies, is proposed to support trusted open set recognition under covariate shift. Specifically, EUG-ADA generates high-quality augmentation samples to increase training data diversity, guided by controllable evidential uncertainty and constrained by semantic consistency. Moreover, it is complemented by CausalAD, which learns causal representations through causal intervention, mitigating the risk of misrecognition of unknown classes caused by the model’s reliance on shortcuts for prediction. The combined effect of EUG-ADA and CausalAD enables the model to learn more generalized and robust causal evidence for trusted open set recognition. Finally, extensive experimental results on both real-world and synthetic data validate the effectiveness of the proposed method, demonstrating that it improves not only open set recognition performance under covariate shift but also the reliability of uncertainty estimates. The code is released at https://github.com/ScorpioBao/CEL-OSR.

Abstract:
Semi-supervised video object segmentation (VOS) is a highly challenging task, which relies on the initial frame’s mask as a segmentation reference in a video sequence to classify each pixel in subsequent frames. However, the guidance provided by the first frame is limited due to the diverse types of segmentation targets and uncertain appearance changes. Consequently, it is crucial to retain useful information during the segmentation process and employ this information for model iteration optimization, enabling the model to better adapt to rapidly changing segmentation objectives. In this work, we propose a multi-scale adaptive model optimization strategy, which incorporates a contextual relevance enhancement module to enforce object correlation by emphasizing feature similarity across adjacent frames. Additionally, we introduce a keyframe discrimination module to deal with the segmentation challenges in scenarios involving significant target changes. Moreover, we also introduce a multi-scale memory screening module to automatically screen and select global-local optimization features for ensuring the model’s generalization performance. Extensive experiments show that the proposed method achieves state-of-the-art performance on DAVIS and large-scale Youtube-VOS 2018/2019 datasets without relying on synthetic training data or first-frame fine-tuning.

Abstract:
When the number of available training views is limited, NeRF and 3DGS quickly overfit during optimization and learn the wrong scene geometry. A common solution to this challenge is to provide depth priors as supervision to correct scene geometry. In this work, we present Geometric Regularized 3D Gaussian Splatting (GeoRGS), a priors-independent method for improving novel view synthesis from sparse inputs. We analyze the problems of the density control strategy in 3DGS with sparse inputs, and find that correcting the erroneous Gaussian growth trend at the beginning of training is effective in mitigating overfitting. Based on this analysis, we propose two geometric regularization methods that do not require prior information. One is based on selecting seed patches of 3D Gaussians from the scene, which guides growth to form correct scene geometry, while the other focuses on regularizing depth similarity between object surfaces and edges. GeoRGS achieves state-of-the-art performance in novel view synthesis from sparse inputs on the LLFF, Blender, RealEstate10K and MipNeRF360 datasets, while also demonstrating significantly faster training speeds and rendering efficiency compared to other baselines.

Abstract:
This paper presents a novel single-image object counting method based on block co-saliency density map estimation, called free-to-count everything network (F2CENet). Image block co-saliency attention is introduced to promote density estimation adaptation, so that the learned model can accurately count objects in an input image of arbitrary size without requiring manually labeled few-shot examples. The proposed network also outperforms existing crowd counting methods based on geometry-adaptive kernels in complex scenes. A novel module generates multi-level and multi-scale block correlation maps to guide the co-saliency density map estimation. Co-saliency attention maps are then fused to accurately locate block-wise salient objects under the guidance of the initial cues. Hence, accurate density maps are generated via comprehensive learning of internal relations in block co-salient features and progressive optimization of local details with saliency-oriented scene understanding. Results from extensive experiments on existing density map estimation datasets with arbitrary challenges verify the effectiveness of the proposed F2CENet and show that it outperforms various state-of-the-art few-shot and crowd counting methods. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), which are commonly used for counting tasks, serve as the evaluation metrics. On a dataset containing sufficiently large and diverse categories used for few-shot and crowd counting, the average MAE and RMSE are 10.88% and 8.44% lower than those of the state-of-the-art methods.
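The two evaluation metrics named above have standard definitions, shown here as a short sketch for per-image object counts. The function names and the sample counts are illustrative and do not come from the paper.

```python
import math

def mean_absolute_error(predicted, actual):
    """MAE: average absolute difference between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_square_error(predicted, actual):
    """RMSE: square root of the average squared difference; penalizes large misses more."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical counts for three test images.
predicted_counts = [10, 20, 33]
true_counts = [12, 20, 30]

print(mean_absolute_error(predicted_counts, true_counts))
print(root_mean_square_error(predicted_counts, true_counts))
```

RMSE is never smaller than MAE on the same data, and the gap between them grows when a few images have much larger counting errors than the rest.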

Abstract:
Through formulating the image restoration as a generation problem, the conditional diffusion model has been applied to low-light image enhancement (LIE) to restore the details in dark regions. However, in the previous diffusion model based LIE methods, the conditions used for guiding generation are degraded images, such as low-light image, signal-to-noise ratio map and color map, which suffer from severe degradation and are simply fed into diffusion model by rigidly concatenating with the noise. To avoid using degraded conditions resulting in sub-optimal performance in recovering details and enhancing brightness, we use the image intrinsic components originating from the Retinex model as guidance, whose multi-scale features are flexibly integrated into the diffusion model, and propose a novel conditional diffusion model for LIE. Specifically, the input low-light image is decomposed into reflectance and illumination by a Retinex decomposition module, where two components contain abundant physical property and lighting conditions of the scene. Then, we extract the latent features from two conditions through a component-dependent feature extraction module, which is designed according to the physical property of components. Finally, instead of previous rigid concatenation manner, a well-designed feature fusion mechanism is equipped to adaptively embed generative conditions into diffusion model. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods, and is capable of effectively restoring the local details while brightening the dark regions. Our codes are available at https://github.com/Knossosc/ICCDiff.
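The Retinex model referred to above treats an image as the element-wise product of reflectance and illumination. The sketch below uses a classical hand-crafted initialization (illumination as the per-pixel maximum over the color channels) purely to illustrate that relationship; it is not the paper's Retinex decomposition module, and the function names and the `eps` guard are assumptions.

```python
def retinex_decompose(image, eps=1e-6):
    """Split an RGB image (rows of (r, g, b) tuples in [0, 1]) into reflectance and illumination.

    Illumination is approximated by the per-pixel channel maximum, a common
    simple choice; reflectance is the image divided by that illumination.
    """
    illumination = [[max(pixel) for pixel in row] for row in image]
    reflectance = [
        [tuple(c / max(light, eps) for c in pixel) for pixel, light in zip(row, light_row)]
        for row, light_row in zip(image, illumination)
    ]
    return reflectance, illumination

def recompose(reflectance, illumination):
    """Rebuild the image as reflectance multiplied element-wise by illumination."""
    return [
        [tuple(c * light for c in pixel) for pixel, light in zip(row, light_row)]
        for row, light_row in zip(reflectance, illumination)
    ]

# A hypothetical 1x2 low-light image.
low_light = [[(0.2, 0.4, 0.1), (0.05, 0.05, 0.1)]]
reflectance, illumination = retinex_decompose(low_light)
print(illumination)
print(reflectance)
print(recompose(reflectance, illumination))
```

Multiplying the two components back together recovers the input, which is the property that lets the components serve as a description of the scene's material and lighting separately.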

Abstract:
Lane lines play a crucial role in the traffic system. However, due to the diversity of lane categories, road conditions and weather environments, as well as the different aspect ratios of lane lines, lane detection algorithms face many challenges. This paper proposes a multi-dimensional feature refinement method for complex scene lane detection based on start point guidance. Because lane lines often traverse the entire image, capturing sufficient context is crucial, and lane lines also have specific local patterns that require detailed low-level features for accurate localization. We propose global feature refinement (GFR) and lane aware gather (LAG) to refine the features along two dimensions: global enhancement and local refinement. To generate high-quality anchors, we predict the start point coordinate of lane instances through start point coordinate prediction (SPCP). To better fit 2D lane detection, we adopt the more general penalty LaneIoU (PLIoU) as the loss function to evaluate the predicted results. Experimental results demonstrate that the proposed method performs well in lane detection tasks in complex scenes and is highly competitive with existing methods.

Abstract:
Hippocampus-based episodic memory replay is a promising class incremental learning (CIL) method, but it must address the problem of catastrophic forgetting. However, most current studies have ignored background information provided by the parahippocampal gyrus. Therefore, in this paper, topological data analysis (TDA) is proposed for the first time to simulate the biological function of the parahippocampal gyrus in order to obtain multi-scale background information, such as geometric and topological information. Based on this idea, a novel CIL method via semantic information mapping and background information calibrating (SiBiCIL) is proposed. It takes categorical prototypes as semantic information and uses the constructed mapping function to generate the prototype classifiers. The classification results are then calibrated using background information extracted from the point-clouded data. In addition to the cross-entropy loss and the knowledge distillation loss, a difference loss is defined to maintain the distinctiveness between the prototype classifiers. These losses collectively contribute to updating the model and retaining previous classes’ knowledge, thereby mitigating catastrophic forgetting. Compared to other methods with a total number of exemplars no more than 500, the experimental results on the CIFAR-10, CIFAR-100, ImageNet-100, and Protein_family datasets show that the highest percentage increases in average classification accuracy are 49.72%, 7.15%, 1.96%, and 12.35%, respectively, while the forgetting rate concurrently decreases by 32.35%, 9.80%, 12.50%, and 5.41%. Moreover, SiBiCIL significantly reduces the episodic memory buffer budget by an average of 76.39%.

Abstract:
In recent years, the uploading of massive numbers of personal images has increased security risks, mainly privacy breaches and copyright infringement. Adversarial examples provide a novel solution for protecting image privacy, as they can evade detection by deep neural network (DNN)-based recognizers. However, the perturbations in adversarial examples are typically meaningless and therefore cannot be extracted as traceable information to support copyright protection. In this paper, we design a dual protection scheme for image privacy and copyright via traceable adversarial examples. Specifically, a traceable adversarial model is proposed, which can be used to embed invisible copyright information into images for copyright protection while fooling DNN-based recognizers for privacy protection. Inspired by the training method of generative adversarial networks (GANs), a new dynamic adversarial training strategy is designed, which allows our model to achieve stable multi-objective learning. Experimental results show that our scheme is exceptionally robust in the face of a variety of noise conditions and image processing methods, while exhibiting good model migration and defense robustness.

Abstract:
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components: the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. To enable the network to incorporate long-range temporal structure, we introduce a segment modeling strategy (S2TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks: object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.

Abstract:
With the rapid development of multimedia processing and deep learning technologies, especially in the field of video understanding, video quality assessment (VQA) has achieved significant progress. Although researchers have moved from designing efficient video quality mapping models to various research directions, in-depth exploration of the effectiveness-efficiency trade-offs of spatio-temporal modeling in VQA models remains insufficient. Considering that videos contain highly redundant information, this paper investigates this problem from the perspective of joint spatial and temporal sampling, aiming to determine the minimum amount of information that must be kept when feeding videos into VQA models with an acceptable performance sacrifice. To this end, we drastically sample the video’s information from both spatial and temporal dimensions, and the heavily squeezed video is then fed into a stable VQA model. Comprehensive experiments regarding joint spatial and temporal sampling are conducted on six public video quality databases, and the results demonstrate that the VQA model maintains acceptable performance even when most of the video information is discarded. Furthermore, with the proposed joint spatial and temporal sampling strategy, we make an initial attempt to design an online VQA model, which is instantiated with a spatial feature extractor, a temporal feature fusion module, and a global quality regression module that are each as simple as possible. Through quantitative and qualitative experiments, we verify the feasibility of an online VQA model obtained by simplifying the model and reducing its input.
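As a rough illustration of joint spatial and temporal sampling, the sketch below keeps every `temporal_stride`-th frame and every `spatial_stride`-th pixel along each axis, then reports how much of the input survives. Uniform striding is an assumption made for simplicity; the abstract does not say how the paper samples, and the function names are hypothetical.

```python
def sample_video(frames, temporal_stride, spatial_stride):
    """Subsample a video given as a list of frames, each a list of pixel rows."""
    return [
        [row[::spatial_stride] for row in frame[::spatial_stride]]
        for frame in frames[::temporal_stride]
    ]

def kept_fraction(original, sampled):
    """Fraction of the original pixel values that remain after sampling."""
    def count(video):
        return sum(len(row) for frame in video for row in frame)
    return count(sampled) / count(original)

# A hypothetical 8-frame, 4x4 grayscale clip; each value encodes (frame, row, column).
video = [
    [[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
    for t in range(8)
]
squeezed = sample_video(video, temporal_stride=4, spatial_stride=2)
print(len(squeezed), len(squeezed[0]), len(squeezed[0][0]))
print(kept_fraction(video, squeezed))
```

With a temporal stride of 4 and a spatial stride of 2, the retained fraction is 1/4 of the frames times 1/4 of the pixels per frame, which shows how quickly the two sampling dimensions compound.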

Abstract:
Video object detection remains a challenging task due to appearance degradation in certain frames. Existing studies usually aggregate temporal information from multiple frames to enhance the object’s appearance representation. Although significant detection performance has been achieved, there are still two shortcomings: (1) The spatial context information within each frame is not fully exploited, which can provide additional decision support when objects are corrupted; (2) In the feature alignment phase, traditional methods tend to employ one-to-one or one-to-global temporal alignment strategies, overlooking the local temporal correlation of objects. To address the above issues, we propose a Joint Spatial and Temporal Feature Enhancement Network (JSTFE-Net) for video object detection, which can jointly utilize spatial-temporal information. First, we present a novel local-global context enhancement module to effectively encode intra-frame spatial context information. This module can enhance the learning of both local details and global semantic information of objects, thereby facilitating accurate object perception within the spatial domain. Second, we develop a deformable temporal sampling module, which adaptively samples correlated temporal information according to the motion information between frames. In addition, to improve the aggregation of temporal-correlated sampled features from multiple frames, we devise an attention-based temporal aggregation block, which dynamically fuses these feature points based on their temporal similarity with the corresponding object feature point. Note that our JSTFE-Net can be effortlessly plugged into image object detectors and state-of-the-art video object detectors. Extensive experiments on the ImageNet VID dataset show that the proposed JSTFE-Net can consistently and significantly improve performance, demonstrating its effectiveness in video object detection.

Abstract:
Understanding human intentions (e.g., emotions) from videos has received considerable attention recently. Video streams generally constitute a blend of temporal data stemming from distinct modalities, including natural language, facial expressions, and auditory clues. Despite the impressive advancements of previous works via attention-based paradigms, the inherent temporal asynchrony and modality heterogeneity challenges remain in multimodal sequence fusion, causing adverse performance bottlenecks. To tackle these issues, we propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations (MEA) to refine multimodal features and leverage the complementarity across distinct modalities. On the one hand, MEA introduces a predictive self-attention module to capture reliable context dynamics within modalities and reinforce unique features over the modality-exclusive spaces. On the other hand, a hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities over the modality-agnostic space. Meanwhile, a double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner. Eventually, we propose a decoupled graph fusion mechanism to enhance knowledge exchange across heterogeneous modalities and learn robust multimodal representations for downstream tasks. Numerous experiments are implemented on three multimodal datasets with asynchronous sequences. Systematic analyses show the necessity of our approach.

Abstract:
Student engagement in online learning is an important indicator for measuring learning effectiveness. Because facial video data of students during online learning contains a wider range of information, such as temporal information, current research has begun to focus on obtaining student engagement from video data. These studies primarily rely on supervised learning methods and have achieved a certain degree of success. However, the longstanding lack of large-scale, high-quality labeled data, as well as the time-consuming and laborious sample labeling work, has to some extent hindered their further improvement. To solve this problem, this paper proposes a self-supervised learning method, Facial Masked Autoencoder (FMAE), which is used to construct a student engagement recognition model. This method uses a masked autoencoder to process a large number of unlabeled facial videos, and performs self-supervised pre-training by learning masked facial features from the reconstruction process. To help the encoder learn better from masked faces, a new facial masking strategy and a reconstruction module are proposed. With this method, the model can not only focus on important facial regions, but also obtain more accurate appearance features and spatio-temporal details. Experiments demonstrate that the proposed method achieves excellent results on the DAiSEE and EmotiW datasets, showing its potential in the task of student engagement recognition.

Abstract:
Unsupervised domain adaptation (UDA) has become an appealing approach for knowledge transfer from a labeled source domain to an unlabeled target domain. However, when the classes in the source and target domains are imbalanced, most existing UDA methods experience a significant performance drop, as the decision boundary usually favors the majority classes. Some recent class-imbalanced domain adaptation (CDA) methods aim to tackle the challenge of biased label distribution by exploiting pseudo-labeled target samples during the training process. However, these methods suffer from unreliable pseudo labels and error accumulation during training. In this paper, we propose a pairwise adversarial training approach for class-imbalanced domain adaptation. Unlike conventional adversarial training, in which the adversarial samples are obtained from the $\ell_p$ ball of the original samples, we generate adversarial samples from the interpolated line of the aligned pairwise samples from the source and target domains. The pairwise adversarial training (PAT) is a novel data-augmentation method which can be integrated into existing UDA models to tackle the CDA problem. Inspired by noise injection, we also extend the pairwise adversarial training to noisy pairwise adversarial training (nPAT), in which random noise is injected into the generation of the adversarial samples. In our study, we evaluate our proposed methods as well as the baselines on three major benchmark datasets, namely Office-Home, DomainNet and Office-31. For Office-Home and Office-31, we sample the data according to the Reversely-unbalanced Source and Unbalanced Target (RS-UT) protocol so that the class distribution can be imbalanced. The extensive experimental results show that UDA models integrated with our proposed nPAT can achieve prominent improvements on most tasks compared to the baseline methods as well as the state-of-the-art CDA methods. Our nPAT achieves average accuracies of 66.56% and 80.22% on Office-Home and DomainNet, respectively, which are higher than those of the second-best methods. Besides, experiments also show that our method is robust to the unreliability of the pseudo labels.
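
The contrast with $\ell_p$-ball attacks can be sketched as follows: instead of perturbing a sample freely within a ball, the search is restricted to the line segment between a source sample and its paired target sample. Everything below is a simplified assumption for illustration: the sign-based ascent on the interpolation coefficient, the finite-difference gradient, the step size, and the Gaussian noise model are not taken from the paper.

```python
import random

def interpolate(x_src, x_tgt, lam):
    """Point on the line from x_src (lam=0) to x_tgt (lam=1)."""
    return [(1.0 - lam) * s + lam * t for s, t in zip(x_src, x_tgt)]

def pairwise_adversarial_sample(x_src, x_tgt, loss_fn, step=0.1, iters=5,
                                noise_std=0.0, seed=0):
    """Move along the source-target line in the direction that raises loss_fn.

    noise_std > 0 adds Gaussian noise to the result (a stand-in for the
    noisy variant); all hyperparameters here are hypothetical.
    """
    lam, eps = 0.0, 1e-4
    for _ in range(iters):
        # Finite-difference estimate of d(loss)/d(lam).
        grad = (loss_fn(interpolate(x_src, x_tgt, lam + eps))
                - loss_fn(interpolate(x_src, x_tgt, lam - eps))) / (2 * eps)
        direction = (grad > 0) - (grad < 0)  # sign of the gradient
        lam = min(1.0, max(0.0, lam + step * direction))
    x_adv = interpolate(x_src, x_tgt, lam)
    if noise_std > 0:
        rng = random.Random(seed)
        x_adv = [v + rng.gauss(0.0, noise_std) for v in x_adv]
    return x_adv, lam

# Toy loss that grows with the first coordinate.
toy_loss = lambda x: x[0]
x_adv, lam = pairwise_adversarial_sample([0.0, 0.0], [1.0, 1.0], toy_loss)
```

In actual training the loss would be the classifier's loss and its gradient would come from automatic differentiation; the toy loss above only shows that the sample slides along the pairwise line rather than leaving it.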

Abstract:
The integration of multi-modality images significantly enhances the clarity of critical details for object detection. Valuable semantic data from object detection enriches the fusion process of these images. However, the potential reciprocal relationship that could enhance their mutual performance remains largely unexplored and underutilized, despite some semantic-driven fusion methodologies catering to specific application needs. To address these limitations, this study proposes a mutually reinforcing, dual-task-driven fusion architecture. Specifically, our design integrates a feature-adaptive interlinking module into both image fusion and object detection components, effectively managing the inherent feature discrepancies. The core idea is to channel distinct features from both tasks into a unified feature space after feature transformation. We then design a feature-adaptive selection module to generate features rich in target semantic information and compatible with the fusion network. Finally, effective combination and mutual enhancement of the two tasks are achieved through an alternating training process. A diverse range of swift evaluations is performed across various datasets to corroborate the potential efficiency of our framework, actualizing visible advancements in both fusion effectiveness and detection accuracy.

Abstract:
The recovery of 3D human meshes from monocular images has developed significantly in recent years. However, existing models usually ignore spatial and temporal information, which might lead to mesh and image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion through an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) to allow the model to focus on the entire input sequence rather than just the target frame. This method can remarkably improve the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page https://yw0208.github.io/staf/.

Abstract:
Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: i) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. ii) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masked strategy to reconstruct motion residuals of masked frames, thereby explicitly exploring dynamic motion cues among sign pose sequences. Moreover, in SA, we embed our framework with global semantic awareness by aligning the embeddings of different augmented samples from the input sequence in the shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks. The source code is publicly available at https://github.com/sakura/MASA.

Abstract:
Image summary, an abridged version of the original visual content, can be used to represent the scene. Thus, tasks such as scene classification, identification, indexing, etc., can be performed efficiently using the unique summary. Saliency is the most commonly used technique for generating the relevant image summary. However, the definition of saliency is subjective in nature and depends upon the application. Existing saliency detection methods using RGB-D data mainly focus on color, texture, and depth features. Consequently, the generated summary contains either foreground objects or non-stationary objects. However, applications such as scene identification require stationary characteristics of the scene, unlike state-of-the-art methods. This paper proposes a novel volumetric saliency-guided framework for indoor scene classification. The results highlight the efficacy of the proposed method.

Abstract:
Underwater imaging systems have evolved into essential hardware equipment for developing and utilizing marine resources. However, the complex underwater physical environment often leads to severe quality degradation of underwater visual perception. To address these issues, we design a principal component fusion method of foreground and background to enhance an underwater image, named PCFB. Specifically, we present a color balance-guided color correction strategy that removes color distortion by equalizing the pixel values of the a and b channels of the CIELab color model. Subsequently, we implement a percentile maximum-based contrast enhancement strategy and a multilayer transmission map estimated dehazing strategy on the color-corrected image to yield the contrast-enhanced foreground and dehazed background sub-images. Finally, we employ a principal component analysis fusion method to reconstruct a high-visibility underwater image by integrating the advantages of the foreground contrast-enhanced sub-image and the background dehazed sub-image. Comprehensive experiments on three datasets demonstrate that our PCFB surpasses state-of-the-art methods both qualitatively and quantitatively. Moreover, our PCFB exhibits outstanding generalization capabilities for addressing hazy and low-light images. The code is publicly available at: https://www.researchgate.net/publication/381259520_2024-PCFB.

Abstract:
Zero-shot temporal action detection (ZS-TAD), aiming to recognize and detect new and unseen video actions, is an emerging and challenging task with limited solutions. Recent studies have adapted the vision-language pre-trained model CLIP for this task in a parameter-efficient fine-tuning fashion to achieve open-vocabulary detection. However, they suffer from insufficient vision-text alignment because of the dual-stream structure of CLIP and yield inferior TAD results due to the lack of accurate action prior. In this paper, we target the above limitations and propose to learn multimodal Prompts and Text-Enhanced Actionness (mProTEA) for ZS-TAD. Specifically, we insert learnable layer-wise prompts into the vision and text branches of the frozen CLIP and establish a strong coupling between them, resulting in multimodal prompts that can boost cross-modal alignment. To ease computation costs, we propose to conduct multimodal prompt learning on an image recognition dataset with rich concepts (e.g., ImageNet) first and then keep them frozen during TAD fine-tuning. For improving TAD, we introduce text-enhanced actionness modeling, where we leverage the concise semantics of text to assist the calculation of class-agnostic actionness scores, to offer accurate prior information for both action classification and localization. With the above designs, our mProTEA excels in extensive TAD experiments, surpassing the strong competitor STALE by 5.1% on ActivityNet under the zero-shot setting and achieving state-of-the-art performance in conventional supervised scenarios. Ablation studies confirm the effectiveness of our proposals and show superior domain generalization of multimodal prompts learned on ImageNet against the other 10 image recognition datasets.

Abstract:
Existing H.265/HEVC selective encryption (SE) schemes do not take into account the semantic features of input videos, nor do they adjust the encryption syntax elements according to the sensitivity of video content, which greatly limits their applicability. In this paper, we propose a chaos-based tunable H.265/HEVC SE scheme with semantic understanding. First, a deep hashing network is employed to identify content-sensitive videos by analyzing the semantic features of video sequences. Then, the non-sensitive videos and the retrieved sensitive ones are encrypted with different encryption strengths, respectively. Specifically, for non-sensitive videos, seven syntax elements with bypass-coded bins are selected for encryption at a constant bit rate. Hence, the encrypted bitstream keeps exactly the same compression ratio. To provide heavier visual distortion for content-sensitive videos, the regular-coded bins of four syntax elements and the intra prediction mode (IPM) are encrypted based on their corresponding encoding characteristics as well. Additionally, the selected syntax elements are all masked using a keystream generated by a chaotic system to ensure real-time constraints. Experimental results demonstrate that our suggested scheme offers format compatibility and is secure against all common attacks. Meanwhile, it outperforms state-of-the-art SE schemes in terms of security strength. Furthermore, the proposed scheme can be flexibly used in a wide range of applications according to the user’s requirements for encryption strength and bit rate.
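
The masking step described above can be illustrated with a toy keystream generator. The abstract does not name the chaotic system, so the logistic map, its parameter `r`, the burn-in length, and the thresholding rule below are all assumptions used as stand-ins; this sketch is not cryptographically secure and is not the paper's scheme.

```python
def logistic_keystream(seed, n_bits, r=3.99, burn_in=100):
    """Generate n_bits key bits from the logistic map x <- r * x * (1 - x).

    `seed` must lie strictly between 0 and 1 and acts as the secret key.
    """
    x = seed
    for _ in range(burn_in):       # discard the transient iterations
        x = r * x * (1.0 - x)
    bits = []
    for _ in range(n_bits):
        x = r * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits

def mask_bins(bins, seed):
    """XOR a list of binary bins with the keystream (same call decrypts)."""
    keystream = logistic_keystream(seed, len(bins))
    return [b ^ k for b, k in zip(bins, keystream)]

# Hypothetical bypass-coded bins of a syntax element.
plain_bins = [1, 0, 1, 1, 0, 0, 1, 0]
cipher_bins = mask_bins(plain_bins, seed=0.3141)
recovered = mask_bins(cipher_bins, seed=0.3141)
```

Because XOR maps each bin to exactly one bin, the masked sequence has the same length as the original, which is consistent with the constant bit rate claimed for bypass-coded bins.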

Abstract:
In real-world scenarios, haze is diverse and complex. However, current dehazing research usually focuses solely on specific categories or the removal of common white haze, and frequently lacks the ability to adapt across various unknown haze types. In this study, our emphasis is on constructing a model that shows excellent adaptability across diverse haze conditions. Unlike approaches that solely rely on network structure design to enhance model adaptability, we comprehensively improve dehazing model adaptability from three key aspects: constructing the multitype haze dataset from designed haze degradation models, designing the network architecture, and formulating training strategies suitable for cross-scene generalization. Firstly, to meet the diverse haze training data requirements, we design a multitype haze degradation model to generate more realistic pairs of hazy images. Secondly, to ensure thorough haze removal and natural restoration of texture details in the recovered images, we construct a dual-branch ensemble network framework by leveraging pre-trained clear image prior features and the characteristics of 2D discrete wavelet priors. Finally, to further enhance the adaptability for removing various types of haze, we employ a sample reweighting decorrelation strategy during the network training phase to eliminate dependencies between haze and haze-free background features. Through extensive experiments, our approach shows remarkable performance across diverse haze scenarios. Our method not only outperforms state-of-the-art scene-specific dehazing methods in typical scenarios like daytime and nighttime, but also excels in handling challenging scenarios such as dusty conditions and color haze. See more results at https://github.com/fyxnl/Image-dehazing-CGID.

Abstract:
The recently invented retina-inspired spike camera has shown great potential for capturing dynamic scenes. However, reconstructing high-quality images from the binary spike data remains a challenge due to noise in the camera. This paper proposes SpikeODE, a novel approach to reconstructing clear images by exploring temporal-spatial correlation to suppress noise. The main idea of our method is to restore the continuous dynamic process of real scenes in a latent space and learn the temporal correlations in a fine-grained manner. Furthermore, to model the dynamic process more effectively, we design a conditional ODE where the latent state of each timestamp is conditioned on the observed spike data. Subsequently, forward and backward inferences are conducted through the ODE to investigate the correlations between the representation of the target timestamp and the information from both past and future contexts. Additionally, we incorporate a Unet structure with a pixel-wise attention mechanism at each level to learn spatial correlations. Experimental results demonstrate that our method outperforms state-of-the-art methods across several metrics.

Abstract:
Video data refers to digital information in the form of a series of frames or images representing continuous motion captured by a video recording device. In various domains such as security, sports, education, and entertainment, a significant amount of video data is generated and stored daily. However, analyzing these videos manually is challenging due to their intrinsic characteristics, including large-scale, redundancy, contextual dependencies, and multimodality. Consequently, researchers have extensively explored visualization techniques to address these complexities. In this investigation, we review the state-of-the-art techniques in video visualization and visual analysis. Initially, we provide an overview of the design space for video visualization and visual analysis techniques. Subsequently, we organize and classify these techniques based on visual analysis tasks and application scenarios, providing detailed descriptions within each category. Drawing upon a comprehensive review of existing research, we provide a critical evaluation and propose potential opportunities for future research. Additionally, we have developed a web-based survey browser for convenient exploration of our created classification framework and the associated scholarly articles (https://zjutvis.github.io/VOVideo/).

Abstract:
Point cloud semantic understanding with fewer point-wise annotations is an ongoing challenge that has yet to be fully addressed in the literature. Although previous approaches have achieved some success with weak supervision, our research reveals that even basic bounding box annotations and subcloud-level tags can provide valuable information for point cloud semantic segmentation. We propose a framework using Bounding boxes and Subcloud-level Tags for Semantic Segmentation, named BSTS. Our method explores local topological structures and geometric priors within and outside bounding boxes to produce reliable pseudo labels. Once bounding boxes of instances are provided for a point cloud, raw points can be divided into three categories: potential foreground points, ambiguous points, and clear background points. To ensure the reliability of the pseudo labels derived from weak supervision, we utilized an Attention-based Self-Training (AST) pipeline and the Point Class Activation Maps (PCAMs) technique. Subsequently, the segmentation network is trained using the generated pseudo labels. Experiments are conducted on two widely used large-scale benchmarks, including S3DIS and ScanNet. Our method achieves competitive semantic performance with the fully-supervised counterpart via low-cost bounding box annotations and subcloud-level tags.

Abstract:
People normally watch 360° videos through a head-mounted display, inside which only the content of viewports can be seen. Therefore, viewport proposal, which refers to detecting potential viewport candidates, plays an important role in many 360° video processing tasks. In this paper, we advance viewport proposal by further aligning the predicted viewports across frames for individual subjects. This provides a better methodology and a deeper perspective for learning human perceptual behaviours on 360° videos. Specifically, we first analyze three 360° video datasets and obtain several findings on human consistency, objectness and motion of viewports. Inspired by these findings, we propose a bi-directional transformer approach, named BiT, for 360° video viewport proposal and alignment. Specifically, BiT is composed of a multi-level residual module, a bi-directional encoder-decoder module and a spherical matching module. In this way, the viewports can be well proposed and aligned by considering multi-level, bi-directional and non-local information. Moreover, the viewports aligned by BiT are used to refine the viewports and, in return, improve viewport proposal accuracy. Finally, we validate that our BiT approach is superior on viewport proposal, compared with the state-of-the-art approaches. Besides, the aligned viewports from BiT are verified to be effective in multiple applications, such as saliency prediction, trajectory prediction and perceptual video compression.

Abstract:
Person search aims to locate target pedestrians from scene images, involving detection and re-identification. The former seeks to separate the background and focus on the commonality between pedestrians, while the latter aims to identify the target and focus on the difference between pedestrians. To address the paradox of detection and re-identification in search tasks, we propose an efficient Tri-Hybrid person search model utilizing the feature hierarchy design. Our model introduces three feature hybrid models for various feature levels. Before the RoI-Align, we present “Spatial-Channel Hybrid” (SCH) and “Token-Channel Hybrid” (TCH). SCH perceives the boundary frame of pedestrians at multiple scales, thereby enhancing the information disparity between pedestrians and the background and refining the accuracy of the detection frame. TCH uses multi-layer perceptrons (MLP) and blends token and channel features, emphasizing detecting fine-grained semantic information for pedestrians. The interaction of multi-scale perception and fine-grained semantic information enhances the details of detected pedestrians, making them more suitable for similarity measurement in pedestrian matching. After the RoI-Align, we design the “CNN-Transformer Hybrid” to amalgamate global and local features to extract more comprehensive detailed features. Extensive experimental results on CUHK-SYSU and PRW demonstrate the effectiveness of the proposed method over the state-of-the-art performance. Specifically, our method achieves comparable performance on two benchmark datasets, CUHK-SYSU and PRW, with mAP scores of 94.62% and 57.84%, respectively.

Abstract:
Recent advances in point cloud completion make it possible to simultaneously recover complete shapes and fine details from partial point clouds captured by professional 3D devices, such as Lidar, or consumer cameras, such as iPhones. Despite significant progress, the potential utilization of self-projected views from partial inputs and the effective reduction of noise in generated point clouds remain under-explored. In this paper, we propose a novel point cloud completion method that leverages self-projected view augmentation and implicit field constraints. Specifically, we introduce a cross-view augmentation (CVA) module and a cross-modal fusion (CMF) module to enhance information interaction and integration at the image and modality levels, respectively. We also propose a bidirection-aware refinement block to improve detail and completeness by considering both complete-to-partial detail perception and partial-to-complete structure perception paths. Additionally, we address the issue of noise reduction from the perspective of implicit field constraints. We evaluate our method on several baseline datasets, including PCN, ShapeNet55/34 and KITTI (car). Extensive experiments demonstrate that our method outperforms state-of-the-art methods, achieving improvements of 0.11 CD-$\ell_1$, 0.015 DCD and 0.009 F-score on the standard PCN test set. Furthermore, our approach effectively reduces noise in the generated point clouds, showcasing its promising potential for practical applications.

Abstract:
Manual scribbles have been introduced to RGB-D Salient Object Detection (SOD) as a credible indicator for salient regions and backgrounds, helping to strike a balance between detection accuracy and labeling efficiency. Previous works address this task by constructing loss functions on semantics, edges, and structures to distinguish salient pixels from the background. However, the local representations extracted by CNNs or Transformers, together with the incomplete scribble annotations, are ineffective in capturing the global contexts of salient objects, and thus cause inaccurate predictions in cluttered regions. In this paper, we propose a local-global representation learning framework by incorporating multi-perception information to boost scribble-based RGB-D SOD. Our system is composed of three sub-modules: Local Representation Aggregation (LRA), Global Representation Initialization (GRI) and Dual Transformer Decoder (DTD). The LRA module first integrates multi-scale, multi-modal local representations extracted from RGB images and depth maps. The GRI module then learns inter- and intra-image representations to capture the global contexts of salient regions from different aspects. Finally, the DTD module alternately updates local-global representations through a dual Transformer architecture. Experimental results on six benchmarks demonstrate that the proposed method performs favorably against state-of-the-art scribble-based RGB-D SOD approaches and is competitive with the fully-supervised approaches.

Abstract:
Anomaly inspection aims at identifying various defects in real time on modern industrial production lines. However, due to insufficient anomaly data, existing detectors cannot effectively accomplish the classification of defects, thereby failing to provide guidance for subsequent production. To address it, we propose TF2, a few-shot text-free training-free defect image generation method, which jointly models the image distribution of class-agnostic defects and backgrounds, achieving efficient semantic enhancement. Firstly, we propose the Response Alignment Strategy, which merges the reversed latent space of both defect-free and defective samples, generating new defect images not limited to textual descriptions yet with consistent content. Moreover, we introduce the Defect Moving Strategy and the Regional Average Loss to merge the reversed latent space of the moving areas and enhance the variability of detail features, increasing both the location and content diversity of defects. Extensive experiments demonstrate the superiority of our model over the state-of-the-art competitors. The metrics indicate that our generated anomaly data focuses on balancing both image quality and diversity, effectively improving the performance of downstream anomaly inspection tasks.

Abstract:
The Versatile Video Coding (VVC) standard introduces a quad-tree with a nested multi-type tree (QTMT) partition structure to improve the rate-distortion (RD) performance, but this leads to a substantial increase in encoding complexity. Previous studies have labeled partition modes of CUs using hard targets (i.e., one-hot labels) generated by VVC reference software (VTM), which are challenging for neural networks to predict accurately. Furthermore, in earlier works, the VVC restrictions are not incorporated into the convolutional neural network (CNN), so its predictive capacity is not fully exploited. In this paper, we propose a novel soft-target and restriction-aware neural network (STRANet) to address these issues. Firstly, inspired by the observation that a CU may split differently under various circumstances, we collect these RD costs and precisely estimate the probability of each partition mode to generate a soft target. Secondly, our neural network incorporates QP and restriction type through attention modules so as to output predictions that are standard-compliant with simple post-processing. Thirdly, a Window Attention Module, a combination of CNN and attention mechanism, is adopted to further enhance performance on GPU. Through the application of these methods, STRANet reduces encoding time by 51.84% and 61.00% with 0.44% and 0.84% Bjøntegaard delta bit-rate (BD-BR) increase, superior to state-of-the-art methods. The code has been released at https://github.com/cppppp/STRANet.
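
One plausible way to turn per-mode RD costs into a soft target is a temperature-scaled softmax over negated costs, so that cheaper partition modes receive higher probability. This is only a sketch under that assumption: the abstract does not give the actual estimation formula, and the cost values and the temperature below are made up for illustration.

```python
import math

def soft_target(rd_costs, temperature=1.0):
    """Map {partition mode: RD cost} to a probability distribution.

    Lower cost -> higher probability. Subtracting the minimum cost keeps
    the exponentials numerically stable without changing the result.
    """
    best = min(rd_costs.values())
    weights = {mode: math.exp(-(cost - best) / temperature)
               for mode, cost in rd_costs.items()}
    total = sum(weights.values())
    return {mode: w / total for mode, w in weights.items()}

# Hypothetical RD costs for the six QTMT options of one CU.
costs = {"no_split": 10.0, "QT": 10.5, "BT_H": 12.0,
         "BT_V": 11.0, "TT_H": 13.0, "TT_V": 12.5}
target = soft_target(costs, temperature=1.0)
best_mode = max(target, key=target.get)
```

A one-hot label would put all mass on `no_split` here; the soft target instead keeps substantial probability on `QT`, reflecting that its RD cost is nearly as good.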

Abstract:
Accurate repetitive action counting has crucial applications in the era of AI-assisted universal fitness. Existing methods are prone to large errors in spatially fine-grained action counting scenarios. In this study, we propose a joint-wise temporal self-similarity periodic selection network (JTSPS-Net) with a human skeleton as its input. Periodic knowledge is embedded in skeleton joint units and selected in a coarse-to-fine manner to focus on the temporal repetition that occurs in the local space. The proposed JTSPS-Net adopts a temporal multiscale fusion strategy to better handle videos with various lengths. To maintain the interpretability of the model, we design an impulse map regression module that uses one random frame per action unit as its labels. Furthermore, to fill the action counting gap in real physical fitness scenarios and to scale up the current repetition count dataset, we construct a high-quality dataset named FitnessRep, which consists of 2,110 fitness videos collected in realistic scenarios. Experiments demonstrate that the proposed JTSPS-Net outperforms the state-of-the-art approach on our dataset and two other public datasets, especially on fine-grained action samples. In addition, it has a good ability to generalize to repetitive actions belonging to unseen categories.

Abstract:
Graph Neural Networks (GNNs) have attracted increasing attention for multimodal Emotion Recognition in Conversation (ERC) due to their good performance in contextual understanding. However, most existing GNN-based methods face two challenges: 1) How to explore and propagate appropriate information in a conversational graph. Typical GNNs in ERC neglect to mine the emotion commonality and discrepancy in the local neighborhood, and thus learn similar embeddings for connected nodes. However, the embeddings of these connected nodes are supposed to be distinguishable as they belong to different speakers with different emotions. 2) Most existing works apply simple concatenation or a co-occurrence prior for modality combination, failing to fully capture the emotional information of multiple modalities in relationship modeling. In this paper, we propose a multimodal Decoupled Distillation Graph Neural Network (D2GNN) to address the above challenges. Specifically, D2GNN decouples the input features into emotion-aware and emotion-agnostic ones at the emotion category level, aiming to capture emotion commonality and implicit emotion information, respectively. Moreover, we design a new message passing mechanism to separately propagate emotion-aware and emotion-agnostic knowledge between nodes according to speaker dependency in two GNN-based modules, exploring the correlations of utterances and alleviating the similarities of embeddings. Furthermore, a multimodal distillation unit is applied to obtain distinguishable embeddings by aggregating unimodal decoupled features. Experimental results on two ERC benchmarks demonstrate the superiority of the proposed model. Code is available at https://github.com/gityider/D2GNN.

Abstract:
Monocular 3D object detection is a promising yet ill-posed task for autonomous vehicles due to the lack of accurate depth information. Cross-modality knowledge distillation could effectively transfer depth information from LiDAR to an image-based network. However, the modality gap between image and LiDAR seriously limits its accuracy. In this paper, we systematically investigate the negative transfer problem induced by the modality gap in cross-modality distillation for the first time, including not only the architecture inconsistency issue but, more importantly, the feature overfitting issue. We propose a selective learning approach named MonoSTL to overcome these issues, which encourages positive transfer of depth information from LiDAR while alleviating the negative transfer on the image-based network. On the one hand, we utilize similar architectures to ensure spatial alignment of features between image-based and LiDAR-based networks. On the other hand, we develop two novel distillation modules, namely Depth-Aware Selective Feature Distillation (DASFD) and Depth-Aware Selective Relation Distillation (DASRD), which selectively learn positive features and relationships of objects by integrating depth uncertainty into feature and relation distillations, respectively. Our approach can be seamlessly integrated into various CNN-based and DETR-based models, where we take three recent models on KITTI and a recent model on NuScenes for validation. Extensive experiments show that our approach considerably improves the accuracy of the base models and thereby achieves the best accuracy compared with all recently released SOTA models. The code is released at https://github.com/DingCodeLab/MonoSTL.
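
One way to picture "integrating depth uncertainty into feature distillation" is to down-weight locations whose depth is uncertain. The sketch below is only an assumption about the general idea: the `exp(-uncertainty)` weighting, the function name, and the toy features are hypothetical and are not taken from the DASFD module itself.

```python
import math

def uncertainty_weighted_distill_loss(student_feats, teacher_feats, depth_uncertainty):
    """Weighted mean squared error between per-location feature vectors.

    Assumption: each location is weighted by exp(-uncertainty), so features at
    locations with unreliable depth contribute less to the distillation loss.
    """
    weights = [math.exp(-u) for u in depth_uncertainty]
    total_weight = sum(weights)
    loss = 0.0
    for s, t, w in zip(student_feats, teacher_feats, weights):
        sq_err = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        loss += w * sq_err
    return loss / total_weight

# Two locations with 2-D features; the student disagrees with the teacher at location 1.
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, 0.0], [1.0, 1.0]]

loss_confident = uncertainty_weighted_distill_loss(student, teacher, [0.0, 0.0])
loss_uncertain = uncertainty_weighted_distill_loss(student, teacher, [0.0, 5.0])
```

When the mismatching location has high depth uncertainty, its contribution to the loss shrinks, which is the selective behaviour the abstract describes in words.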

Abstract:
All-day self-supervised monocular depth estimation has strong practical significance for autonomous systems to continuously perceive the 3D information of the world. However, night-time scenes pose the challenges of weak texture and violation of the brightness consistency assumption, due to low illumination and varying lighting, respectively, so most existing self-supervised models can only handle day-time scenes. To address this problem, we propose a unified self-supervised monocular depth estimation framework that can handle all-day scenarios, which has three features: 1) an Illumination Compensation PoseNet (ICP) is designed, which is based on the classic Phong illumination theory and compensates for lighting changes in adjacent frames by estimating per-pixel transformations; 2) a Dual-Axis Transformer (DAT) block is proposed as the backbone network of the depth encoder, which infers the depth of local low-illumination areas through spatial-channel dual-dimensional global context information of night-time images; 3) a cross-layer Adaptive Fusion Module (AFM) is introduced between multiple DAT blocks, which learns attention weights between different layer features and adaptively fuses cross-layer features using the learned weights, enhancing the complementarity of different layer features. This work was evaluated on multiple datasets, including RobotCar, Waymo and KITTI, achieving state-of-the-art results in both day-time and night-time scenarios.

Abstract:
Most existing research on vehicle re-identification (Re-ID) focuses on supervised methods, while unsupervised methods that can take advantage of massive unlabeled data are underexplored. Due to the similarity of tasks, unsupervised person Re-ID methods that employ clustering to generate pseudo labels for model training can achieve good performance on the unsupervised vehicle Re-ID task. However, vehicles exhibit higher intra-ID compactness and inter-ID separability within a camera than persons, which has not been exploited to reduce pseudo label noise for unsupervised vehicle Re-ID. To address this issue, we propose a camera-aware differentiated clustering with focal contrastive learning (CDF) method for the unsupervised vehicle Re-ID task. Unlike the conventional global clustering approach that adopts a uniform processing strategy for pseudo-label generation, a camera-aware differentiated clustering (CDC) approach is designed to reduce label noise. In CDC, the entire clustering process is divided into two stages, inter-camera and intra-camera clustering, and each stage adopts a different clustering strategy that is carefully designed according to the differences in feature distribution within and across cameras. By considering the distribution of pseudo labels generated by CDC, a measure for calculating the reliability of inter-camera and intra-camera pseudo labels is further designed, and a focal contrastive learning loss is proposed to improve the model’s ID discrimination ability within and across cameras. Extensive experiments on VeRi-776 and VERI-Wild demonstrate the effectiveness of each designed component and the superiority of CDF.

Abstract:
Recent CNN-driven face super-resolution (FSR) technologies have achieved excellent breakthroughs by incorporating facial prior knowledge. However, most of them suffer from some obvious limitations. They always estimate facial priors from input low-resolution (LR) faces or coarsely enhanced LR faces, obtaining unfaithful priors that cannot be adequately exploited. This may bring noticeable artifacts to the target results, especially for large scaling factors, deteriorating the fidelity and naturalness and generating suboptimal reconstructed results. In this paper, we propose a two-stage prior-guided FSR approach to learn facial prior knowledge from the optimal SR results of stage one and explore the complementarity between priors to further guide more accurate reconstruction in stage two. Specifically, we develop an efficient local and global interactive hybrid network incorporating facial semantic and geometric priors for more discriminative results. To reach this, we devise a multiscale interconnected symmetric encoder-decoder architecture composed of Prior Interaction-Integration Modules (PIIMs), the Coarse-to-fine Feature Refinement Module (CFRM), and Feature Aggregation Modulation Modules (FAMMs). The encoder concentrates on hierarchically extracting multiscale features. The CFRM is devised to explore the potential correlations between the encoder and the decoder and further guide the refinement and reinforcement of the encoded features. The decoder aims to take full advantage of informative multiscale encoded features to reconstruct high-quality SR representations. Comprehensive evaluation and visualization results on four benchmark datasets demonstrate the superiority of the proposed PLGNet over current state-of-the-art methods. The source code of PLGNet will be available at https://github.com/lil808/PLGNet.git.

Abstract:
Composed image retrieval aims to search for a target image by concurrently understanding composed inputs consisting of a reference image and complementary modification text. It aims to find a shared latent space where the representation of the composed inputs is close to the desired target image. Most previous methods capture the one-to-one correspondence between the composed inputs and target image, which encodes the composed inputs and the target image into single points in the feature space. However, the one-to-one correspondence cannot effectively handle this task due to the inherent ambiguity problem arising from the various semantic meanings and data uncertainty. Specifically, the composed inputs and target image always exhibit various semantic meanings, affecting the retrieval results. Moreover, given the composed inputs (resp. target image), there are multiple target images (resp. composed inputs) that equally make sense. In this paper, we propose a novel method termed Set of Diverse Queries with Uncertainty Regularization (SDQUR) to solve this inherent ambiguity problem. First, we utilize diverse queries to adaptively aggregate the composed inputs and target image into multiple deterministic embeddings that capture different semantic meanings in the triplet affecting the retrieval process. It can exploit the deterministic many-to-many correspondence within each triplet through these set-based queries. Moreover, we provide an uncertainty regularization module to encode the composed inputs and target image into Gaussian distributions. Multiple potential positive candidates are sampled from the distributions for probabilistic many-to-many correspondence. Through the complementary deterministic and probabilistic many-to-many correspondences, we achieve consistent improvements on the standard FashionIQ, CIRR, and Shoes benchmarks, surpassing the state-of-the-art methods by a large margin.
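
Sampling several positive candidates from a Gaussian embedding can be sketched as follows. This is a minimal illustration assuming a diagonal Gaussian parameterised by a mean and a log-variance per dimension; the function name, parameterisation, and values are hypothetical, and the abstract does not specify how the distribution is predicted.

```python
import math
import random

def sample_candidates(mu, log_var, num_samples, rng):
    """Draw candidate embeddings from N(mu, diag(exp(log_var))).

    Each sample is mu + std * noise with standard normal noise, which is the
    usual reparameterised form of sampling from a diagonal Gaussian.
    """
    std = [math.exp(0.5 * lv) for lv in log_var]
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, std)]
            for _ in range(num_samples)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]             # hypothetical embedding mean
log_var = [-2.0, -2.0, -2.0]      # hypothetical per-dimension log-variance
candidates = sample_candidates(mu, log_var, num_samples=5, rng=rng)
```

A larger predicted variance spreads the candidates further from the mean, so an ambiguous query covers a wider region of the embedding space than a single point would.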

Abstract:
Binary Neural Networks (BNNs) using 1-bit weights and activations are emerging as a promising approach for mobile devices and edge computing platforms. Concurrently, traditional Neural Architecture Search (NAS) has gained widespread usage in automatically designing network architectures. However, the computation involved in binary NAS is more complex than in NAS due to the substantial information loss incurred by binary modules, and different binary spaces are required for different tasks. To address these challenges, a universal binary neural architecture search (UBNAS) algorithm is proposed. In this paper, the ApproxSign function is used to reduce the gradient error and accelerate the convergence in binary network searching and training. Moreover, UBNAS adopts a novel search space consisting of operations appropriate for the binary methods. To improve the original space operation module, we explore the effect of diverse structures for various modules and ultimately obtain a universal binary network structure. Additionally, the channel sampling ratio is adjusted to balance the advantages of different operations, and an early stopping strategy is implemented to significantly reduce the computational burden associated with searching. We perform extensive experiments on the CIFAR10 and ImageNet datasets, and the results demonstrate the effectiveness of the proposed method.
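
The abstract does not define ApproxSign, so the sketch below assumes the piecewise-quadratic form commonly used in binary networks (as popularised by Bi-Real Net): the forward pass still applies the hard sign, while the backward pass uses the derivative of the smooth approximation instead of the zero-almost-everywhere gradient of sign. The function names are illustrative.

```python
def sign(x):
    """Forward binarization: map a real value to -1 or +1."""
    return 1.0 if x >= 0 else -1.0

def approx_sign(x):
    """Piecewise-quadratic approximation of sign(x) (assumed form)."""
    if x < -1:
        return -1.0
    if x < 0:
        return 2 * x + x * x
    if x < 1:
        return 2 * x - x * x
    return 1.0

def approx_sign_grad(x):
    """Derivative of approx_sign, used as the surrogate gradient of sign."""
    if -1 <= x < 0:
        return 2 + 2 * x
    if 0 <= x < 1:
        return 2 - 2 * x
    return 0.0

# The surrogate gradient is largest near zero and vanishes outside [-1, 1).
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
forward = [sign(x) for x in samples]
backward = [approx_sign_grad(x) for x in samples]
```

Compared with a clipped identity (gradient 1 inside [-1, 1]), this triangular gradient follows the shape of the sign function more closely, which is the usual argument for a smaller gradient error.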

Abstract:
Novel view synthesis from existing inputs remains a research focus in computer vision. Predicting views becomes more challenging when only a limited number of views are available. This challenge is commonly referred to as the few-shot view synthesis problem. Recently, various strategies have emerged for few-shot view synthesis, such as transfer learning, depth supervision, and regularization constraints. However, transfer learning relies on massive scene data, depth supervision is affected by input depth quality, and regularization causes increased computational costs or impaired generalization. To address these issues, we propose a new few-shot view synthesis framework called FewarNet that introduces trend regularization to leverage depth structural features and a warping loss to supervise depth estimation, possessing the advantages of existing few-shot strategies, enabling high-quality novel view prediction with generalization and efficiency. Specifically, FewarNet consists of three stages: fusion, warping, and rectification. In the fusion stage, a fusion network is introduced to estimate depths using scene priors from coarse depths. In the warping stage, the predicted depths are used to guide the warping of the input views, and a distance-weighted warping loss is proposed to correctly guide depth estimation. To further improve prediction accuracy, we propose trend regularization which imposes penalties on depth variation trends to provide depth structural constraints. In the rectification stage, a rectification network is introduced to refine occluded regions in each warped view to generate novel views. Additionally, a rapid view synthesis strategy that leverages depth interpolation is designed to improve efficiency. We validate the method’s effectiveness and generalization on various datasets. Given the same sparse inputs, our method demonstrates superior performance in quality and efficiency over state-of-the-art few-shot view synthesis methods.

Abstract:
Deep learning-based watermarking frameworks have received extensive research attention in recent years. The main structure of this framework consists of an encoder, a noise layer and a decoder (Encoder-NoiseLayer-Decoder). However, such a framework has the major drawback that it requires visible markers to locate a watermarked image, which compromises the imperceptibility of watermarking. To address this restriction, a novel Lite localization network based on Lite-HRNet is proposed. In order to generate high-quality watermarked images, we design the Double U-Net Encoder (DUE), which can better hide the watermarking information in image pixels that are invisible to the human eye. Meanwhile, to improve robustness, two bicubic interpolation operations are added to the noise layer to increase the types of distortion. In addition, to further enhance the performance of the watermarking algorithm, a novel discriminator-based WGAN-GP loss function is designed to guide the training of the model. Numerous experiments demonstrate the superior performance of our proposed scheme in terms of localization, visual quality, and robustness. The proposed scheme shows better results compared to state-of-the-art algorithms.

Abstract:
LiDAR, as an excellent sensor, can provide positions, motion states, and other objective attribute information of objects in the 3D world. Inevitably, the inherent sparsity of point cloud and the problem of occlusion tend to cause incomplete semantic and geometry information of long-range small objects, posing challenges to 3D object detection. The multi-view models take advantage of the complementary information among bird’s eye view (BEV), range view (RV), and other views to alleviate the above issues. However, most of the existing methods coarsely learn the views’ features and neglect the learning of semantic information, which further leads to unsatisfactory detection performance. To this end, this paper proposes a Local-to-Global Semantic Learning Network (LGSLNet) for multi-view 3D object detection from point cloud. The proposed LGSLNet can effectively learn semantic information to explore the local semantics contained in various channels of RV features and to fuse them with BEV features. It has two branches with different backbones. In the BEV branch, the voxels quantized from the point cloud are extracted by sparse convolutional networks and compressed to BEV features. In the RV branch, a multi-scale backbone with semantic-aware convolution (SAC) is designed to learn the local semantic information of the RV. It allows for adaptation to the 3D location using the auxiliary network. In the fusion module, the bidirectional cross-view channel attention (Bi-CCA) is designed to compensate for the semantic information between multiple views and aggregate new RV and BEV features. Extensive experiments on the KITTI, ONCE, and nuScenes 3D object detection datasets demonstrate the superiority of our proposed method.

Abstract:
To obtain high-quality positron emission tomography (PET) images while minimizing radiation hazards, various methods have been developed to acquire standard-dose PET (SPET) images from low-dose PET (LPET) images. Recent efforts mainly focus on improving the denoising quality by utilizing multi-modal inputs. However, these methods exhibit certain limitations. First, they neglect the varied significance of each modality in denoising. Second, they rely on inflexible voxel-based representations, failing to explicitly preserve intricate structures and contexts in images. To alleviate these problems, we propose a 3D Point-based Multi-modal Context Clusters GAN, namely PMC2-GAN, for obtaining high-quality SPET images from LPET and magnetic resonance imaging (MRI) images. Specifically, we transform the 3D image into unorganized points to flexibly and precisely express its complex structure. Moreover, a self-context clusters (Self-CC) block is devised to explore fine-grained contextual relationships of the image from the perspective of points. Additionally, considering the diverse importance of different modalities, we introduce a cross-context clusters (Cross-CC) block, which prioritizes PET as the primary modality while regarding MRI as the auxiliary one, to effectively integrate the knowledge from the two modalities. Overall, built on the integration of Self-CC and Cross-CC blocks, our PMC2-GAN follows a GAN architecture. Extensive experiments validate the superiority of our method.

Abstract:
Skeleton-based action recognition has broad prospects owing to the fact that skeleton data is more robust to scene noise and camera view changes. Recently, researchers have mainly aimed to explore deep-learning feature engineering with competitive recognition accuracy for skeleton actions. However, a high-performance recognition network is usually built by stacking complex feature extraction modules, which introduces massive computational costs. In this work, we design a powerful and universal action knowledge distillation paradigm based on decoupled knowledge distillation for transferring action knowledge from heavy teachers to lightweight students more robustly. We construct a network architecture space consisting of shrunken versions of the outdated 2s-AGCN and search for several robust students. On this basis, this paradigm is further developed into a powerful decoupled knowledge embedded graph convolutional network (DKE-GCN), which outperforms the teacher significantly on three public datasets and achieves state-of-the-art performance. In addition, a light-DKE-GCN is designed to achieve performance comparable to the teacher with 16× fewer parameters, 26× fewer FLOPs and 8× the FPS.
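
Decoupled knowledge distillation, on which the paradigm is based, splits the classical distillation loss into a target-class term and a non-target-class term that are weighted separately. The sketch below shows that general decomposition for a single sample; the weights `alpha` and `beta`, the temperature, and the logits are illustrative values, and the exact loss used for DKE-GCN is not given in the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def decoupled_kd_loss(student_logits, teacher_logits, target,
                      alpha=1.0, beta=8.0, temperature=4.0):
    """alpha * target-class KD + beta * non-target-class KD for one sample."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    # Target-class term: binary distribution (target vs. all other classes).
    tckd = kl_divergence([p_t[target], 1.0 - p_t[target]],
                         [p_s[target], 1.0 - p_s[target]])
    # Non-target term: distributions renormalised over the non-target classes only.
    rest_t = [p / (1.0 - p_t[target]) for i, p in enumerate(p_t) if i != target]
    rest_s = [p / (1.0 - p_s[target]) for i, p in enumerate(p_s) if i != target]
    nckd = kl_divergence(rest_t, rest_s)
    return (alpha * tckd + beta * nckd) * temperature ** 2

teacher = [3.0, 2.0, 1.0, 0.0]    # hypothetical teacher logits, class 0 is the label
student = [1.0, 2.0, 3.0, 0.0]    # hypothetical student logits
loss_mismatch = decoupled_kd_loss(student, teacher, target=0)
loss_match = decoupled_kd_loss(teacher, teacher, target=0)
```

Weighting the two terms independently lets the non-target knowledge be emphasised even when the teacher is very confident about the target class.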

Abstract:
Recent advancements have facilitated the simultaneous processing of multiple dense prediction tasks, utilizing diverse correlations between these tasks. However, many of these advances predominantly focus on a singular or fixed task interaction, leading to negative transfer effects. In this paper, we introduce an end-to-end model called the Adaptive Task-Wise Message Passing Network (ATMPNet) for multi-task learning. Our proposed model focuses on excavating comprehensive spatial messages among tasks in an adaptive manner. To achieve this, ATMPNet incorporates the Adaptive Spatial Message Interaction (ASMI) module, which models various local spatial message interactions and global interactions among tasks. ASMI explores potential spatial relationships by generating a task-specific message pool for each target task. Furthermore, we propose an Adaptive Task Message Passing (ATMP) module, a novel method for aggregating messages. The ATMP module generates refined global-local messages from each message pool and adaptively transfers them to the corresponding target tasks through a well-designed message passing scheme. We conduct extensive experiments on the NYUD-v2 and PASCAL-Context datasets to evaluate the effectiveness of ATMPNet. The results demonstrate the state-of-the-art performance of our proposed model in handling multi-task learning scenarios. Code will be made publicly available.

Abstract:
In Coded Aperture Snapshot Spectral Imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover 3D signals from 2D compressive measurements. Among these methods, deep unfolding networks exhibit the benefits of interpretability and high efficiency, but they still have some notable shortcomings. Firstly, existing methods primarily exploit the spatial-spectral domain information of HSIs, neglecting exploration of the frequency domain, which is also beneficial to 3D HSIs. Secondly, current unfolding networks make limited use of information between different stages, failing to fully explore their relevance and thereby limiting the effectiveness of the overall framework. To address these issues, in this paper, we propose an integrated framework with dual-domain feature fusion and multi-level memory enhancement. Specifically, the former represents the first attempt to utilize frequency domain information in the feature space of HSIs, overcoming the limitation of spatial-spectral domain features and thereby improving the data expression ability of the network by extracting dual-domain features. Simultaneously, our verification experiments also show that the proposed dual-domain feature representation can indeed extract complementary feature information in HSIs. Moreover, the latter aims to use the structural characteristics of the U-Net network to fully extract the correlation of information between different stages by designing a multi-level memory enhancement network. Extensive experimental results on various datasets validate the superiority of the proposed approach in both subjective and objective outcomes. Our proposed method achieves an average improvement of 0.4 dB over the best counterpart method. The code is available at https://github.com/yingyangke/DFFMM.

Abstract:
Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along spatial-channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group while the cross-group token-mixer interacts with previously decoded groups. Alternate arrangement of two token-mixers enables global contextual reference. To further expedite the network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.

Abstract:
Owing to its practical significance in smart video surveillance systems, Text-Based Person Search (TBPS), which refers to searching for pedestrian images of interest given natural language sentences, has recently become a research hotspot. To help researchers quickly grasp the developments of this important task, we comprehensively summarize the recent research advances of TBPS from two perspectives, i.e., Feature Extraction (FE) and Semantic Alignments (SA). Specifically, the FE mainly consists of pre-processing approaches and end-to-end frameworks, and the SA could be briefly divided into cross-modal attention mechanisms, non-attention alignments, training objectives, and generative approaches. Afterwards, we elaborate on four widely used benchmarks and the evaluation criterion for TBPS. Comparisons and analyses among the state-of-the-art (SOTA) solutions are provided based on these large-scale benchmarks. Finally, we point out some future research directions that need to be further addressed, which will greatly facilitate the practical applications of TBPS.

Abstract:
Non-exemplar class-incremental learning refers to continually classifying new and old classes without storing samples of old classes. Since only new class samples are available, catastrophic forgetting of old knowledge often occurs. In this paper, we propose an effective non-exemplar method called RAMF consisting of Random Auxiliary classes augmentation and Mixed Features. On the one hand, we design a novel random auxiliary classes augmentation method, where one augmentation is randomly selected from three augmentations and applied to inputs to generate augmented samples and extra class labels. By extending the data and label space, the model can learn more diverse and transferable representations, which can prevent the model from being biased towards learning task-specific features and facilitate the transfer among different tasks. In short, when learning new tasks, the random auxiliary classes augmentation will reduce the change of feature space and improve model generalization. On the other hand, we propose to replace the new features with mixed features for model optimization, since only using new features will largely affect the previous representation embedded in the old feature space. Instead, by mixing new and old features, the cosine similarity is improved by reducing the angle between the current and old features, which allows for better stability over long-term incremental learning without increasing the computational complexity. We have conducted extensive experiments on three benchmarks, CIFAR-100, TinyImageNet and ImageNet-Subset, where our method outperforms the state-of-the-art non-exemplar methods and is comparable to high-performance replay-based methods.

Abstract:
With the emergence of Vision Transformers, attention-based modules have demonstrated comparable or superior performance in comparison to CNNs on various vision tasks. However, limited research has been conducted to explore the potential of the self-attention module in learning the global and local geometric information for key-points based motion segmentation. This paper thus presents a new method, named GIET, that utilizes geometric information in the Transformer network for key-points based motion segmentation. Specifically, two novel local geometric information embedding modules are developed in GIET. Unlike the traditional convolution operators which model the local geometric information of key-points within a fixed-size spatial neighbourhood, we develop a Neighbor Embedding Module (NEM) by aggregating the feature maps of k-Nearest Neighbors (k-NN) for each point according to the semantic similarity between the input key-points. NEM not only augments the network’s ability of local feature extraction of the points’ neighborhoods, but also characterizes the semantic affinities between points in the same moving object. Furthermore, to investigate the geometric relationships between the points and each motion, a Centroid Embedding Module (CEM) is devised to aggregate the feature maps of cluster centroids that correspond to the moving objects. CEM can effectively capture the semantic similarity between points and the centroids corresponding to the moving objects. Subsequently, the multi-head self-attention mechanism is exploited to learn the global geometric information of all the key-points using the aggregated feature maps obtained from the two embedding modules. Compared to the convolution operators or self-attention mechanism, the proposed simple Transformer-like architecture can optimally utilize both the local and global geometric properties of the input sparse key-points. Finally, the motion segmentation task is formulated as a subspace clustering problem using the Transformer architecture. The experimental results on three motion segmentation datasets, including KT3DMoSeg, AdelaideRMF, and FBMS, demonstrate that GIET achieves state-of-the-art performance.
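
The neighbour aggregation idea behind NEM can be sketched as choosing each key-point's k nearest neighbours in feature space rather than within a fixed spatial window. The cosine similarity, mean pooling, and concatenation used below are assumptions chosen for illustration; the abstract does not specify these details, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def knn_aggregate(features, k):
    """For each point, append the mean feature of its k most similar points."""
    out = []
    for i, f in enumerate(features):
        # Rank all other points by feature similarity, most similar first.
        ranked = sorted(((cosine(f, g), j) for j, g in enumerate(features) if j != i),
                        reverse=True)
        neighbours = [features[j] for _, j in ranked[:k]]
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        out.append(f + mean)   # concatenate own feature with the neighbourhood summary
    return out

# Four key-points with 2-D features forming two groups (e.g., two moving objects).
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
embedded = knn_aggregate(points, k=1)
```

Because neighbours are selected by similarity, points on the same moving object reinforce one another even when they are far apart in the image.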

Abstract:
Weakly Supervised Semantic Segmentation (WSSS), which leverages only limited supervision to train segmentation networks, has received increasing attention recently. Meanwhile, the Transformer utilizes a multi-head attention mechanism to overcome the limited receptive fields of convolutional neural networks (CNNs). As a result, a large amount of WSSS work has been built upon the Transformer architecture. However, most of these works overlook the over-smoothing issue (i.e., the patch tokens in deep layers tend to be similar, indicating that the learned information is nearly identical), which affects the accuracy of the Class Activation Map (CAM) in representing pixel-level semantic information. In this paper, an Attention-based Layer Fusion scheme with Token Masking is proposed to address the above issue. The core modules include a Weighted Layer Attention Aggregation (WLA) module and a Random-Masked Class Token Refinement (RMTR) module. The former integrates the attention information of different layers in an adaptive way, which refines the CAM to obtain a refined object contour. The latter introduces random masks to aggregate higher-level semantic information. Extensive experiments have been conducted on public datasets, in which our network shows superior performance over other state-of-the-art methods.

Abstract:
Recently, facial attribute editing has drawn increasing attention and has achieved significant progress due to Generative Adversarial Networks (GANs). Since paired images before and after editing are not available, existing methods typically perform the editing and reconstruction tasks simultaneously, and transfer facial details learned from reconstruction to editing by sharing the latent representation space and weights. In this way, they cannot preserve non-targeted regions well during editing. In addition, they usually introduce skip connections between the encoder and decoder to improve image quality at the cost of attribute editing ability. In this paper, we propose a novel method called InterGAN with high-frequency compensation to alleviate the above problems. Specifically, we first propose the cross-task interaction (CTI) to fully explore the relationships between the editing and reconstruction tasks. The CTI includes two translations: style translation adjusts the mean and variance of feature maps according to style features, and conditional translation uses the attribute vector as a condition to guide feature map transformation. Together they provide effective information interaction that keeps the irrelevant regions unchanged. Furthermore, without using skip connections between the encoder and decoder, we propose the high-frequency compensation module (HFCM) to improve image quality. The HFCM tries to collect potentially lost information from the input images and from each down-sampling layer of the encoder, and then re-injects it into subsequent layers to alleviate the information loss. Ablation analysis shows the effectiveness of the proposed CTI and HFCM. Extensive qualitative and quantitative experiments on CelebA-HQ demonstrate that the proposed method outperforms state-of-the-art methods in both attribute editing accuracy and image quality.
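
As an illustration of what "adjusting the mean and variance of feature maps according to style features" can look like, here is a minimal sketch in the spirit of adaptive instance normalization. This is an assumption about the general mechanism, not the InterGAN implementation; `channel_stats` and `style_translate` are hypothetical names, and a feature-map channel is flattened to a plain list.

```python
import math


def channel_stats(values, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one feature-map channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def style_translate(channel, style_mean, style_std):
    """Normalize the channel to zero mean and unit variance, then re-scale and
    shift it with statistics taken from the style features."""
    mean, std = channel_stats(channel)
    return [style_std * (v - mean) / std + style_mean for v in channel]


content = [1.0, 2.0, 3.0, 4.0]
stylized = style_translate(content, style_mean=10.0, style_std=2.0)
```

The output keeps the spatial pattern of the content channel while taking on the first- and second-order statistics supplied by the style branch.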

Abstract:
The aim of end-to-end sign language translation (SLT) is to interpret continuous sign language (SL) video sequences into coherent natural language sentences without any intermediary annotations, i.e., glosses. However, end-to-end SLT suffers from several intractable issues: (i) the temporal correspondence constraint loss problem between SL videos and glosses, and (ii) the weakly supervised sequence labeling problem between long SL videos and sentences. To address these issues, we propose an adaptive video representation enhanced Transformer (AVRET), with three extra modules: adaptive masking (AM), local clip self-attention (LCSA) and adaptive fusion (AF). Specifically, we utilize the first AM module to generate a special mask that adaptively drops out temporally important SL video frame representations to enhance the SL video features. Then, we pass the masked video feature to the Transformer encoder consisting of LCSA and masked self-attention to learn clip-level and continuous video-level feature information. Finally, the output feature of the encoder is fused with the temporal feature of the AM module via the AF module, and the second AM module is used to generate more robust feature representations. Besides, we add weakly supervised loss terms to constrain these two AM modules. To promote Chinese SLT research, we further construct CSL-FocusOn, a Chinese continuous SLT dataset, and share its collection method. It covers many common scenarios, and provides SL sentence annotations and multi-cue images of signers. Our experiments on the CSL-FocusOn, PHOENIX14T, and CSL-Daily datasets show that the proposed method achieves competitive performance on the end-to-end SLT task without using glosses in training. The code is available at https://github.com/LzDddd/AVRET.

Abstract:
Weather prediction plays a crucial role in human development. Recently, deep learning has demonstrated promising prospects in weather forecasting by integrating convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, two main challenges still exist in multiple weather condition prediction. The first challenge considers multiple weather condition correlations in predictions. The second challenge is how to model long- and short-range spatial dependencies under multiple weather conditions. A novel operator named as tensor-based long- and short-range convolution (TLS-Conv) is proposed to address these challenges. Within this operator, the node & relation attention is utilized to identify the contributions of spatial grid points and weather conditions for prediction. Additionally, the adaptive tensor graph convolution (ATGCN) is tailored to dynamically capture long-range spatial dependencies within multiple weather conditions. Finally, the traditional convolution is integrated with the ATGCN to model both long- and short-range spatial dependencies and weather condition correlations. Building upon the TLS-Conv, the tensor-based long- and short-range convolution for multiple weather prediction (TLS-MWP) model is proposed to predict multiple weather conditions. Extensive experiments are conducted under real-world weather conditions to evaluate its performance. These results unequivocally demonstrate that TLS-MWP surpasses previous methods. The code is available on GitHub at: https://github.com/xuguangning1218/TLS_MWP.

Abstract:
The demand for security surveillance has grown exponentially, making video anomaly detection particularly crucial. Existing image-domain based anomaly detection algorithms face implementation challenges due to several drawbacks, including latency during long-distance transmission, the need for complete decoding, and the complexity of network inference structures. Moreover, current frame prediction methods using generative models suffer from low prediction quality and mode collapse. To tackle these challenges, we propose VADiffusion, a compressed domain information guided conditional diffusion framework. VADiffusion adopts a dual-branch structure that combines motion vector reconstruction and I-frame prediction, effectively addressing the limitations of the reconstruction method in identifying sudden anomalies and the struggles of the frame prediction method in detecting persistent anomalies. Furthermore, our proposed framework incorporates the diffusion model into the realm of video anomaly detection, thereby improving the stability and accuracy of the model. Specifically, we employ sparse sampling of the compressed video, utilizing I-frames to capture appearance information and motion vectors to represent motion-related details. Different from the existing independent two-branch mechanism, we adopt a reconstruction-assisted prediction strategy, leveraging I-frames and the reconstructed motion vectors from the reconstruction branch as conditions for the diffusion model utilized in frame prediction. Ultimately, we perform decision fusion of reconstruction and prediction branches to determine anomalies. Through extensive experiments, we demonstrate that our algorithm achieves an effective trade-off between detection accuracy and model complexity. The source code is publicly released at https://github.com/LHaoooo/VADiffusion.
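
The final decision-fusion step can be sketched as follows. The abstract does not specify the fusion rule, so this is an assumption: per-frame errors from each branch are min-max normalized and combined with a weight `w`. The function names and the weight are hypothetical.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(recon_err, pred_err, w=0.5):
    """Weighted sum of normalized reconstruction and prediction errors.
    A higher fused score means the frame is more likely anomalous."""
    r = min_max(recon_err)
    p = min_max(pred_err)
    return [w * a + (1.0 - w) * b for a, b in zip(r, p)]


# Per-frame errors from the two branches (made-up numbers for illustration).
fused = fuse_scores([0.0, 1.0, 2.0], [4.0, 2.0, 0.0], w=0.5)
```

Normalizing before fusing keeps one branch from dominating simply because its error is measured on a larger scale.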

Abstract:
Due to its high flexibility and remarkable performance, low-rank approximation has been widely studied for color image denoising. However, existing methods usually ignore the cross-channel difference or the spatial variation of noise, which limits their capacity for real-world color image denoising. To overcome these drawbacks, this paper proposes a double-weighted truncated nuclear norm minus truncated Frobenius norm minimization (DtNFM) model, and applies it to color image denoising by exploiting the nonlocal self-similarity prior. The proposed DtNFM model has two merits. First, it models and utilizes both the cross-channel difference and the spatial variation of noise. This provides sufficient flexibility for handling the complex distribution of noise in real-world images. Second, the proposed DtNFM model provides a close approximation to the underlying clean matrix since it can treat different rank components flexibly. To solve the DtNFM model, an efficient algorithm is devised within the framework of the alternating direction method of multipliers (ADMM). Meanwhile, the truncated nuclear norm minus truncated Frobenius norm regularized least squares subproblem is discussed in detail, and the results show that its global optimum can be directly obtained in closed form. Therefore, the DtNFM model can be efficiently solved by a single ADMM. Rigorous mathematical derivation proves that the solution sequences generated by our proposed algorithm converge to a single critical point. Extensive experiments on synthetic and real noise datasets demonstrate that the proposed method outperforms many state-of-the-art color image denoising methods. MATLAB code is available at https://github.com/wangzhi-swu/DtNFM.

Abstract:
Compressive Spectral Imaging (CSI) techniques have attracted considerable attention among researchers for their ability to simultaneously capture spatial and spectral information using low-cost, compact optical components. A prominent example of CSI techniques is the Dual-Camera Coded Aperture Snapshot Spectral Imaging (DC-CASSI), which involves reconstructing hyperspectral images from CASSI measurements and uncoded panchromatic or RGB images. Despite its significance, the reconstruction process in DC-CASSI is challenging. Conventional DC-CASSI techniques rely on different models to explore the similarity between uncoded images and hyperspectral images. Nevertheless, two main issues persist: i) the effective utilization of spatial information from RGB images to guide the reconstruction process, and ii) the enhancement of spectral consistency of recovered images when using panchromatic/RGB images, which inherently lack precise spectral information. To address these challenges, we propose a novel Prior images guided generative autoEncoder (PiE) model. The PiE model leverages RGB images as prior information to enhance spatial details and designs a generative model to improve spectral quality. Notably, the generative model is optimized in a self-supervised manner. Comprehensive experimental results demonstrate that the proposed PiE method outperforms existing techniques, achieving state-of-the-art performance.

Abstract:
Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution to mitigate such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) using perceptual algorithms, they are susceptible to the performance of first-stage perceptual algorithms and may result in error propagation. In this paper, we introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervised signal of our method is derived from languages rather than orthogonal one-hot vectors, providing a more comprehensive representation. Further, concerning visual representation, we propose to model the high frequency of driving videos in the temporal domain. This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and achieving high generalization on the DADA dataset.
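
Aligning a video clip with text prompts is commonly scored by cosine similarity followed by a softmax. The sketch below shows that generic recipe only; it assumes clip and prompt embeddings are already available from some encoders, and the function names, the two-prompt setup, and the temperature value are illustrative assumptions rather than details of TTHF.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def prompt_probabilities(clip_emb, prompt_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities between one clip
    embedding and each text-prompt embedding."""
    c = normalize(clip_emb)
    logits = [sum(a * b for a, b in zip(c, normalize(p))) / temperature
              for p in prompt_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Index 0: a "normal driving" prompt, index 1: an "anomaly" prompt (toy embeddings).
probs = prompt_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The probability assigned to the anomaly prompt can then serve as a per-clip anomaly score.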

Abstract:
Recent years have witnessed a variety of applications of gaming video coding, while how to improve its coding efficiency has been relatively under-explored. The state-of-the-art video coding standard, Versatile Video Coding (VVC), adopts Reference Picture Resampling (RPR), which allows the frame resolution to vary during encoding/decoding. The great flexibility supported by RPR motivates us to develop a bidirectional quality impulse based guidance scheme, in an effort to fully exploit the potential of RPR in gaming video coding. The design philosophy involves reducing the data volume on the encoder side through selective downsampling, and enhancing reconstruction by harnessing quality conveyance from neighboring frames. More specifically, a new RPR structure is developed based on the underlying philosophy that a periodic quality impulse can boost the quality of the whole sequence. On top of the developed structure, we propose a bidirectional guidance model that faithfully enhances video quality by resorting to frames with quality impulse. Experimental results show that the proposed scheme achieves significant bit-rate savings for gaming videos.

Abstract:
Occluded person Re-identification (ReID) aims to match occluded and holistic pedestrian images across different camera views. This task presents two primary challenges. First, it is crucial to accurately capture pedestrian foregrounds from seriously occluded person images. Second, a noticeable information asymmetry exists between the partial body in occluded images and the complete body in corresponding holistic images, which could cause the ReID model to underestimate their similarities. To address these challenges, we introduce a contrastive pedestrian attentive and correlation learning (CpaCol) model. Within CpaCol, we first design a Contrastive Pedestrian Attention (ContrastAttn) module to capture pedestrian foregrounds from occluded images. In this process, we notice that most existing attention-based methods only supervise the final predictions with identity loss yet neglect its causality with the generated attention maps, which could mislead the model to capture some salient yet pedestrian-irrelevant noises as discriminative clues. To rectify this, we integrate contrastive learning into our ContrastAttn module to guide it to learn the semantic divergence between pedestrian foregrounds and noises, thereby capturing pedestrian foregrounds more accurately. Besides, we propose a correlation learning module, where we tailor an effective dense feature correlation learning tool, 4D convolution, to enable it to adapt to pedestrian images and capture corresponding clues between comparing images. By focusing more on corresponding clues, our model could avoid overemphasizing the inherent information asymmetry between occluded and holistic images, thereby improving re-identification. Empowered by these modules, our CpaCol achieves state-of-the-art performance on three relevant ReID settings, i.e., occluded, partial, and holistic ReID. Our code is available in https://github.com/nwpugaoliying/CpaCol.

Abstract:
The joint task of video moment retrieval and video highlight detection is challenging, since it requires building a model that not only captures contextual information between sequences in time but also has the ability to understand and judge significance. This paper addresses these problems from three aspects. Firstly, we design a parameter-free cross-modal statistical correlation interaction method. A novel saliency enhancement function is defined to quantify the saliency differences between the important features associated with the query and other features to achieve parameter-free cross-modal fusion. Secondly, we propose a novel modality-aware heterogeneous graph reasoning mechanism (MHGR). MHGR can effectively capture the global context information between sequences, enhance the local association relationship between sequences, and better deal with the complexity of multi-modal data through the organic combination of two key modules: parameter-free cross-modal statistical correlation interaction and heterogeneous graph reasoning. Thirdly, a lightweight solution for the joint task of video moment retrieval and highlight detection is designed based on the above two novel algorithm modules. Comprehensive experiments are conducted on publicly available benchmark data to validate the advantages of the new solution in comparison with a series of state-of-the-art peer methods. Quantitative results consistently demonstrate that the new solution is lightweight and has high inference performance, while achieving a remarkable improvement in accuracy with respect to peer methods. An extended ablation study is further conducted to show the usefulness of each module of the solution in acquiring its computational capabilities.

Abstract:
The metaverse, a 3D virtual world, requires efficient interactive avatar communication. To achieve this goal, we envision a new metaverse paradigm for virtual avatar faces and develop semantic face compression with compact 3D facial descriptors. The paradigm comprises a compression framework that transmits 3D face descriptors for semantic compression and applications based on the semantic descriptors. The fundamental principle is that the communication of virtual avatar faces primarily emphasizes the conveyance of semantic information. In light of this, the proposed scheme offers the advantages of being highly flexible, efficient, and semantically meaningful. The promise of the proposed paradigm is also demonstrated by performance comparisons with the state-of-the-art video coding standard, Versatile Video Coding. A significant improvement in terms of rate-accuracy performance has been achieved. The proposed scheme is expected to enable numerous applications especially for real-time communication in the metaverse, such as digital human communication based on machine analysis, and to form the cornerstone of interactions.

Abstract:
In this paper, we propose a self-supervised video representation learning (video SSL) method by taking inspiration from cognitive science and neuroscience on human visual perception. Different from previous methods that focus on the inherent properties of videos, we argue that humans learn to perceive the world through the self-awareness of the semantic changes or consistency in the input stimuli in the absence of labels, accompanied by representation reorganization during the post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic changes in a contrastive learning framework, mimicking self-awareness in human representation learning. The saccades are generated by alternating the fixations following the predicted scanpath. Second, we model the semantic consistency in eye fixation by minimizing the prediction error between the predicted and the true state of another time point. Finally, we incorporate prototypical contrastive learning to reorganize the learned representations to enhance the associations among perceptually similar ones. Compared to previous video SSL solutions, our method can capture finer-grained semantics from video instances and further associate similar ones together. Experiments show that the proposed bio-inspired video SSL method significantly improves the Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.
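
For readers unfamiliar with contrastive learning, the generic InfoNCE objective that such frameworks typically build on can be sketched as below. How positives and negatives are derived from saccades and fixations is the paper's contribution and is not reproduced here; the function names and the temperature are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log of the softmax probability
    that the positive is picked among the positive plus all negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


low = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])   # positive matches the anchor
high = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])  # positive is orthogonal
```

The loss is small when the anchor is closer to its positive than to any negative, and grows when a negative is the closer one.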

Abstract:
Anomaly detection in surveillance videos is challenging but important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, pose-based methods lack an alternative dynamic representation akin to the explicit motion features, such as optical flow, employed by pixel-based methods. In this paper, a novel Motion Embedder (ME), a label-efficient scheme without extra annotation efforts, is proposed to provide a pose motion representation for the structured posed data from a probability perspective. Furthermore, a novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction. These two modules are then integrated into a unified framework for pose regularity learning, which is referred to as Motion Prior Regularity Learner (MoPRL). MoPRL achieves competitive results on multiple challenging datasets while minimizing computational costs. Extensive experiments validate the versatility of the proposed modules and provide insights for future research.

Abstract:
The combination of infrared and visible videos aims to gather more comprehensive feature information from multiple sources and to achieve results superior to those of a single modality on various practical tasks, such as detection and segmentation. However, most existing dual-modality object detection algorithms ignore the modal differences and fail to consider the correlation between feature extraction and fusion, which leads to incomplete extraction and inadequate fusion of dual-modality features. This raises the issue of how to preserve the unique features of each modality and fully utilize the complementary infrared and visible information. To address the above challenges, we propose a novel Differential Feature Awareness Network (DFANet) within antagonistic learning for infrared and visible object detection. The proposed model consists of an Antagonistic Feature Extraction with Divergence (AFED) module used to extract the differential infrared and visible features with unique information, and an Attention-based Differential Feature Fusion (ADFF) module used to fully fuse the extracted differential features. We conduct performance comparisons with existing state-of-the-art models on two benchmark datasets to demonstrate the robustness and superiority of DFANet, and numerous ablation experiments to illustrate its effectiveness.

Abstract:
The evolution of natural life is guided by a perpetually adaptive set of rules, encompassing natural laws, human policies, and game mechanics. Automated game design, through the creation of simulated environments populated by AI agents, embodies these rules, aligning with the objectives of artificial life research that seeks to replicate the dynamics of biological life through computational models. This paper presents a comprehensive framework, the Rule Generation Networks (RGN), devised for automated rule design, evaluation, and evolution in line with controllable expectations. We refine and formalize three cardinal elements - rules, strategies, and evaluation - to elucidate the intricate relationships inherent in rule generation tasks. The RGN integrates generative neural networks for rule design and a suite of reinforcement learning models for rule evaluation. To exemplify rule evolution and adaptation across varying environments, we introduce a controllability metric to gauge game dynamics and evolve the rule designer accordingly. Furthermore, we develop two game environments, Maze Run and Trust Evolution, modelling human exploration and societal trade dynamics, to gamify and evaluate the generated rules.

Abstract:
Audio-visual deepfake detection is the process of identifying and detecting deepfakes that have been generated using both audio and visual content with AI algorithms. Most existing methods primarily focus on the overall authenticity while neglecting the position of forgeries in time. This can be particularly problematic, as even a small alteration in a clip can significantly impact its meaning. Such brand new attacks are dangerous and how to tackle such attacks remains an open question. In this paper, we present a novel neural network-based model to tackle the temporal forgery detection (TFD) problem. It consists of new audio and visual encoders with cross-modal attention for embedding extraction, and an embedding-level fusion mechanism with self-attention for forgery localization. Besides, a multi-dimensional contrastive loss is proposed which helps the model not only to capture audio-visual inconsistency for deepfake detection but also to exploit temporal inconsistency by coherently constraining the extracted embeddings. Extensive experiments on the LAV-DF dataset show that the presented method outperforms several state-of-the-art temporal forgery localization methods by up to 23.4% on AP@0.5 and 13.8% on AR@100. In addition, we also show the effectiveness of the proposed model on deepfake detection.
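
Metrics such as AP@0.5 count a predicted forged segment as correct when its temporal intersection-over-union (IoU) with a ground-truth segment reaches the threshold. A minimal sketch of that overlap test follows; segments are assumed to be `(start, end)` pairs in seconds, and the function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two time segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.5):
    """True if the predicted segment overlaps the ground truth enough to count
    as a correct localization at the given IoU threshold."""
    return temporal_iou(pred, truth) >= threshold


overlap = temporal_iou((0.0, 2.0), (1.0, 3.0))  # 1 s shared out of 3 s covered
```

Full AP and AR computation additionally ranks predictions by confidence and matches each ground-truth segment at most once, which is omitted here.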

Abstract:
Given a descriptive text query, text-based person search (TBPS) aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data. To better align the two modalities, most existing works focus on introducing sophisticated network structures and auxiliary tasks, which are complex and hard to implement. In this paper, we propose a simple yet effective dual Transformer model for text-based person search. By exploiting a hardness-aware contrastive learning strategy, our model achieves state-of-the-art performance without any special design for local feature alignment or side information. Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training. The PDG module first introduces an automatic generation algorithm based on a text-to-image diffusion model, which generates new text-image pair samples in the proximity space of the original ones. Then it combines approximate text generation and feature-level mixup during training to further strengthen the data diversity. The PDG module can largely guarantee the reasonability of the generated samples, which are directly used for training without any human inspection for noise rejection. It improves the performance of our model significantly, providing a feasible solution to the data insufficiency problem faced by such fine-grained visual-linguistic tasks. Extensive experiments on two popular datasets of the TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach clearly outperforms state-of-the-art approaches, e.g., improving by 3.88%, 4.02%, and 2.92% in terms of Top1, Top5, and Top10 on CUHK-PEDES.
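
Feature-level mixup, mentioned above as one ingredient of the PDG module, interpolates two feature vectors with a random coefficient. The sketch below shows the standard mixup recipe under the assumption that the coefficient is drawn from a symmetric Beta distribution; the exact variant used in the paper is not stated in the abstract, and the names here are hypothetical.

```python
import random


def mixup(x1, x2, lam):
    """Convex combination of two feature vectors: lam * x1 + (1 - lam) * x2."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def sample_lambda(alpha=0.4, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


feat_a = [1.0, 0.0]
feat_b = [0.0, 1.0]
mixed = mixup(feat_a, feat_b, lam=0.25)
random_mixed = mixup(feat_a, feat_b, sample_lambda())
```

Mixing at the feature level yields new training samples that lie between existing ones without requiring any additional annotation.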

Abstract:
Due to the high cost of pixel-level labels required for fully-supervised semantic segmentation, weakly-supervised segmentation has emerged as a more viable option recently. Existing weakly-supervised methods tried to generate pseudo-labels without pixel-level labels for semantic segmentation, but a common problem is that the generated pseudo-labels contain insufficient semantic information, resulting in poor accuracy. To address this challenge, a novel method is proposed, which generates class activation/attention maps (CAMs) containing sufficient semantic information as pseudo-labels for the semantic segmentation training without pixel-level labels. In this method, the attention-transfer module is designed to preserve salient regions on CAMs while avoiding the suppression of inconspicuous regions of the targets, which results in the generation of pseudo-labels with sufficient semantic information. A pixel relevance focused-unfocused module has also been developed for better integrating contextual information, with both attention mechanisms employed to extract focused relevant pixels and multi-scale atrous convolution employed to expand receptive field for establishing distant pixel connections. The proposed method has been experimentally demonstrated to achieve competitive performance in weakly-supervised segmentation, and even outperforms many saliency-joined methods.
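
Atrous (dilated) convolution enlarges the receptive field by spacing the kernel taps apart instead of adding parameters. A one-dimensional sketch, with no padding and unit stride, makes the effect visible; the function name is hypothetical, and, as is usual in deep learning, the kernel is applied without flipping.

```python
def atrous_conv1d(signal, kernel, dilation=1):
    """1-D dilated convolution (valid mode, stride 1).
    Kernel tap k reads signal[i + k * dilation], so the same number of weights
    covers a span of (len(kernel) - 1) * dilation + 1 samples."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5]
dense = atrous_conv1d(signal, [1, 1, 1], dilation=1)   # each output sees 3 neighbours
sparse = atrous_conv1d(signal, [1, 1, 1], dilation=2)  # each output spans 5 samples
```

Running several dilation rates in parallel and merging the results is the usual way to obtain the multi-scale context described in the abstract.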

Abstract:
Image Emotion Classification (IEC) is an essential research area, offering valuable insights into user emotional states for a wide range of applications, including opinion mining, recommendation systems, and mental health treatment. The challenges associated with IEC are mainly attributed to the complexity and ambiguity of human emotions, the lack of a universally accepted emotion model, and excessive dependence on prior knowledge. To address these challenges, we propose a novel Unified Generative framework for Image Emotion Classification (UGRIE), which is capable of simultaneously modeling various emotion models and capturing intricate semantic relationships between emotion labels. Our approach employs a flexible natural language template, converting the IEC task into a template-filling process that can be easily adapted to accommodate a diverse range of IEC tasks. To further enhance the performance, we devise a mapping mechanism to seamlessly integrate the multimodal pre-training model CLIP with the text generation pre-training model BART, thus leveraging the strengths of both models. A comprehensive set of experiments conducted on multiple public datasets demonstrates that our proposed method consistently outperforms existing approaches by a large margin in supervised settings, exhibits remarkable performance in low-resource scenarios, and unifies distinct emotion models within a single, versatile framework.

Abstract:
Few-shot object detection (FSOD), a formidable task centered around developing inclusive models from a constrained number of annotated samples, has attracted increasing interest in recent years. This discipline addresses unbalanced data distributions, which are particularly relevant to real-world scenarios. Although recent FSOD efforts have achieved considerable success in terms of localization, recognition remains a formidable obstacle. This stems from the fact that typical FSOD models evolve from general object detection frameworks predicated on extensive training data, and they underutilize the available data in scenarios with restricted samples, resulting in subpar performance. To address this deficiency, we introduce a methodology that is specifically tailored to overcome the inadequate sample challenge in FSOD tasks. Our approach incorporates a neighborhood information adaption (NIA) module that is designed to dynamically utilize information near the target, assisting in robustly performing object identification within the target domain. In addition, we propose an attention mechanism called all attention, which not only encapsulates the dependencies of each position within a single feature map but also leverages correlations with other feature maps. This methodology yields more refined feature representations, which are particularly advantageous in situations with limited data. Comprehensive experiments conducted on the PASCAL VOC and COCO datasets illustrate that our technique achieves a substantial improvement on the FSOD task.

Abstract:
Image retrieval systems help users to browse and search among extensive images in real time. With the rise of cloud computing, retrieval tasks are usually outsourced to cloud servers. However, the cloud scenario brings a daunting challenge of privacy protection as cloud servers cannot be fully trusted. To this end, image-encryption-based privacy-preserving image retrieval (PPIR) schemes have been developed, which first extract features from cipher-images, and then build retrieval models based on these features. Yet, most existing PPIR approaches extract shallow features and design trivial unsupervised retrieval models, resulting in insufficient expressiveness for the cipher-images. In this paper, we propose a novel paradigm named Encrypted Vision Transformer (EViT), which advances the discriminative representations capability of cipher-images. First, to capture comprehensive ruled information, we extract multi-level local length sequence and global Huffman-Code frequency features from the cipher-images which are encrypted by permutation encryption, sign encryption, and stream cipher during the JPEG compression process. Second, we design the modified self-supervised Vision Transformer with Huffman-embedding and propose two robust data augmentations on cipher-images to improve representation power of the retrieval model. Moreover, our proposal can be easily adapted to unsupervised or supervised settings. Extensive experiments reveal that EViT achieves both excellent encryption and retrieval performance, outperforming current schemes in terms of retrieval accuracy by large margins while protecting image privacy effectively. Code is publicly available at https://github.com/onlinehuazai/EViT.

Abstract:
Irregular hole face inpainting is a challenging task, since the appearance of faces varies greatly (e.g., different expressions and poses) and human vision is particularly sensitive to subtle blemishes in inpainted face images. Without external information, most existing methods struggle to generate new content containing semantic information for face components in the absence of sufficient contextual information. Text, by contrast, can describe the content of an image in most cases, and it is flexible and user-friendly. In this work, a concise and effective Multimodal Face Inpainting Network (MuFIN) is proposed, which simultaneously utilizes the information of the known regions and the descriptive text of the input image to address the problem of irregular hole face inpainting. To fully exploit the remaining parts of the corrupted face images, a plug-and-play Multi-scale Multi-level Skip Fusion Module (MMSFM), which extracts multi-scale features and fuses shallow features into deep features at multiple levels, is introduced. Moreover, to bridge the gap between textual and visual modalities and effectively fuse cross-modal features, a Multi-scale Text-Image Fusion Block (MTIFB), which incorporates text features into image features at both local and global scales, is developed. Extensive experiments conducted on two commonly used datasets, CelebA and Multi-Modal-CelebA-HQ, demonstrate that our method outperforms state-of-the-art methods both qualitatively and quantitatively, and can generate realistic and controllable results.

Abstract:
Although video deraining technology has achieved great success in recent years, extracting spatiotemporal feature representations from successive frames, modeling their spatial and temporal relationships, and restoring high-quality derained videos with rich details remain challenging tasks. In this paper, we make the first attempt to use a hybrid Transformer for video rain removal, and propose a novel video deraining network based on hybrid transformer (VDN-HT) that aggregates global and local representations to accomplish video deraining. In the feature extraction process, we use a U-shaped structure based on serial Transformer blocks to extract shallow local features, deep global features and global dependencies, and then adaptively aggregate them to obtain rainy video features with rain streaks of different directions and densities. To better model spatiotemporal relationships, the VDN-HT uses the Transformer’s long-range and relational modeling abilities to obtain spatial features and temporal correlations between consecutive video frames and thereby achieve multi-frame alignment. To ensure the global-local consistency of the reconstructed frames, we design a global-local reconstruction module composed of a Transformer and a convolutional neural network (CNN) in parallel, which aggregates global and local information to better reconstruct each frame. In addition, the proposed gating-based refinement module and color loss effectively retain the details and color information after removing rain streaks. Extensive experiments on the NTURain, RainSynLight25 and RainSynHeavy25 datasets show that the VDN-HT can handle many types of rainy videos and performs better than previous methods.

Abstract:
Visible-Infrared person Re-IDentification (VI-ReID) is a challenging cross-modality image retrieval task that aims to match pedestrians’ images across visible and infrared cameras. To solve the modality gap, existing mainstream methods adopt a learning paradigm converting the image retrieval task into an image classification task with cross-entropy loss and auxiliary metric learning losses. These losses follow the strategy of adjusting the distribution of extracted embeddings to reduce the intra-class distance and increase the inter-class distance. However, such objectives do not precisely correspond to the final test setting of the retrieval task, resulting in a new gap at the optimization level. By rethinking these keys of VI-ReID, we propose a simple and effective method, the Multi-level Cross-modality Joint Alignment (MCJA), bridging both the modality and objective-level gap. For the former, we design the Visible-Infrared Modality Coordinator in the image space and propose the Modality Distribution Adapter in the feature space, effectively reducing modality discrepancy of the feature extraction process. For the latter, we introduce a new Cross-Modality Retrieval loss. It is the first work to constrain from the perspective of the ranking list in the VI-ReID, aligning with the goal of the testing stage. Moreover, to strengthen the robustness and cross-modality retrieval ability, we further introduce a Multi-Spectral Enhanced Ranking strategy for the testing phase. Based on the global feature only, our method outperforms existing methods by a large margin, achieving the remarkable rank-1 of 89.51% and mAP of 87.58% on the most challenging single-shot setting and all-search mode of the SYSU-MM01 dataset.

Abstract:
Previous methods in salient object detection (SOD) mainly focused on favorable illumination circumstances while neglecting performance in low-light conditions, which significantly impedes the development of related downstream tasks. In this work, considering that it is impractical to annotate labels at large scale for this task, we present a framework (HDNet) to detect the salient objects in low-light images using synthetic images. Our HDNet consists of a foreground highlight sub-network (HNet) and an appearance-aware detection sub-network (DNet), both of which can be learned jointly in an end-to-end manner. Specifically, to highlight the foreground objects, we design the HNet to estimate the parameters that adaptively adjust the dynamic range for each pixel, which can be trained via the weak supervision signals of the salient object labels. In addition, we design a simple detection network (DNet) with a contextual feature fusion module and a multi-scale feature refine module for detailed feature fusion and refinement. Furthermore, we contribute the first annotated dataset for salient object detection in low-light images (SOD-LL), including 6,000 labeled synthetic images (SOD-LLS) and 2,000 labeled real images (SOD-LLR). Experimental results on SOD-LL and other low-light videos in the wild demonstrate the effectiveness and generalization ability of our method. Our dataset and code are available at https://github.com/Ylinyuan/HDNet.

Abstract:
Extended reality (XR) is one of the most important applications of beyond 5G and 6G networks. Real-time XR video transmission presents challenges in terms of data rate and delay. In particular, the frame-by-frame transmission mode of XR video makes real-time XR video very sensitive to dynamic network environments. To improve the users’ quality of experience (QoE), we design a cross-layer transmission framework for real-time XR video. The proposed framework allows the simple information exchange between the base station (BS) and the XR server, which assists in adaptive bitrate and wireless resource scheduling. We utilize the cross-layer information to formulate the problem of maximizing user QoE by finding the optimal scheduling and bitrate adjustment strategies. To address the issue of mismatched time scales between two strategies, we decouple the original problem and solve them individually using a multi-agent-based approach. Specifically, we propose the multi-step Deep Q-network (MS-DQN) algorithm to obtain a frame-priority-based wireless resource scheduling strategy and then propose the Transformer-based Proximal Policy Optimization (TPPO) algorithm for video bitrate adaptation. The experimental results show that the TPPO+MS-DQN algorithm proposed in this study can improve the QoE by 3.6% to 37.8%. More specifically, the proposed MS-DQN algorithm enhances the transmission quality by 49.9%-80.2%.
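
The abstract does not spell out how the multi-step target in MS-DQN is formed. As a minimal sketch, assuming the standard n-step bootstrapped return from the reinforcement learning literature, the target for one transition could be computed as below. The function name and arguments are illustrative and are not taken from the paper.

```python
def n_step_target(rewards, next_q_values, gamma):
    """Standard n-step TD target (an assumption, not the paper's exact rule).

    rewards:        the n rewards observed after the current state
    next_q_values:  Q-value estimates for every action in the state reached
                    after those n steps
    gamma:          discount factor in [0, 1]
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        # Discount each intermediate reward by gamma^k.
        target += (gamma ** k) * reward
    # Bootstrap from the best action value n steps ahead.
    return target + (gamma ** len(rewards)) * max(next_q_values)


example_target = n_step_target([1.0, 1.0, 1.0], [2.0, 4.0], gamma=0.5)
```

Using several observed rewards before bootstrapping propagates delayed feedback, such as a frame missing its deadline, back to earlier scheduling decisions faster than a one-step target would.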

Abstract:
Recent advanced trackers, composed of discriminative classification and dedicated bounding box estimation, have achieved remarkable advancements in visual object tracking performance. However, existing methods cannot satisfy the demands of tracking tasks in complex scenes, such as those involving occlusion and scale variations. To this end, we propose a novel tracker with online multi-scale classification and global feature modulation for robust visual tracking, named ATOM+, which builds on accurate tracking by overlap maximization (ATOM). First, coordinate attention (CA) is applied to enhance the target features in the channel dimension and spatial dimension, which can effectively optimize the feature representation ability of the backbone network. Second, an online multi-scale classification (OMC) module is designed. During the online tracking phase, more reliable matching responses are comprehensively generated by aggregating information from different scales related to the target. This new operation enables stable perception of the target by the tracker, particularly when severe changes in the appearance and posture of the target are encountered. Third, a global feature modulation (GFM) mechanism is constructed, which requires only a small amount of computational resources, to fuse the spatial contextual information of the template image into the search region. This integration refines the bounding box to obtain an accurate estimate of the target state. Finally, comprehensive experiments on the conventional tracking benchmarks OTB100, LaSOT, and VOT2018 show that our tracker can sufficiently address different challenging scenarios and achieves state-of-the-art performance. Our tracker runs in real time at an average speed of 37 FPS.

Abstract:
One-shot object detection (OSOD) without fine-tuning has recently garnered considerable attention and research focus. It aims to directly detect novel-class objects in the target image by providing merely one support image patch without undergoing the fine-tuning stage. However, most existing methods adopt image pair matching regardless of the scale inconsistency and spatial semantic mismatch of image pairs, which limits their ability to acquire high-quality target-support related features. This paper addresses these limitations by incorporating cross-scale contexts and semantic-consistent cues that are robust against the challenges of scarce and ambiguous matching. Specifically, we first introduce a simple yet effective Aggregation-Transformer-based Pyramid (ATP) module to explore the long-range cross-scale spatial interactions by employing the customized size-aware aggregation approach and the vanilla transformer encoder, so that the coarse-to-fine local image patterns are optimally utilized. Furthermore, we formulate the 4D contrastive cross-correlation tensor for instance-level feature matching and suggest a Geometric Consistent Correlation (GCC) module that utilizes the bidirectional spatial-aware convolutions to extract the long-range semantic correspondences for target-support pairs. Additionally, a Channel Contrastive Learning (CCL) branch is adopted to complement the inter-channel interactions between target-support pairs for the GCC module. Extensive experiments demonstrate that our approach significantly outperforms the previous state-of-the-art methods by 6.5% and 2.1% on the PASCAL VOC and COCO datasets for unseen classes, respectively.

Abstract:
Action learning is a research area that aims to recognize the action category of each frame in the video. Context information is crucial for learning actions, but most existing methods face two challenges in exploiting this information: 1) They apply global attention to aggregate global features for action representation, resulting in inefficiency and redundancy. 2) They impose implicit action constraints to regularize the action distribution, leading to subjectivity, interpretability issues, and optimization difficulties. To address these challenges, we propose an end-to-end weakly-supervised Action Learning framework with Process Knowledge Decomposition (AL-PKD), which leverages the intrinsic characteristics of procedural task videos. To enhance the effectiveness and adaptability of context aggregation, we first design the TEAL-Net action recognition network. Specifically, the TEAL-Net accounts for the diverse neighbor distributions of action nodes across categories and collects local neighborhood features with different receptive fields through feature pyramids, improving the accuracy and efficiency of action representation. Moreover, to overcome the drawbacks of implicit constraint strategies, we next employ process mining techniques to extract three types of explicit action pair constraints: sequentiality, concurrency, and selectivity. These constraints guide the model’s predictions and improve the interpretability of the learning process. Finally, we use the Viterbi algorithm to dynamically infer the optimal action boundaries based on the frame-level predictions, which helps to eliminate local misclassifications. Experiments on three datasets of Breakfast, CrossTask, and PEVD demonstrate that our method achieves state-of-the-art performance.
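
To illustrate how Viterbi decoding can remove isolated misclassifications from frame-level predictions, the following is a minimal sketch of a generic Viterbi smoother. It is not the AL-PKD decoder: the single `switch_penalty` is a hypothetical stand-in for whatever transition model the method derives from its mined constraints, and the probabilities are made-up toy values.

```python
import math


def viterbi_smooth(frame_log_probs, switch_penalty):
    """Return the label sequence maximizing the sum of frame log-probabilities
    minus a fixed penalty for every change of label between adjacent frames."""
    if not frame_log_probs:
        return []
    n_classes = len(frame_log_probs[0])
    score = list(frame_log_probs[0])  # best score of a path ending in each class
    backpointers = []
    for t in range(1, len(frame_log_probs)):
        new_score, pointers = [], []
        for c in range(n_classes):
            # Score of reaching class c from every possible previous class.
            candidates = [
                score[p] - (0.0 if p == c else switch_penalty)
                for p in range(n_classes)
            ]
            best_prev = max(range(n_classes), key=lambda p: candidates[p])
            new_score.append(candidates[best_prev] + frame_log_probs[t][c])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final class.
    label = max(range(n_classes), key=lambda c: score[c])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path


# Toy example: frame 2 is a noisy prediction inside a segment of class 0.
frame_probs = [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1],
               [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
log_probs = [[math.log(p) for p in frame] for frame in frame_probs]
smoothed = viterbi_smooth(log_probs, switch_penalty=2.0)
```

In this toy input a per-frame argmax would flip to class 1 at frame 2, whereas the decoded path keeps class 0 until the real boundary at frame 5.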

Abstract:
Human-Object Interaction (HOI) detection is a fertile research ground that merits further investigation in computer vision, and plays an important role in understanding the high-level semantic information of images. To achieve superior object detection performance, existing HOI models predominantly concentrate on the bounding box information of humans and objects and ignore their surrounding information, which results in imprecise inference of instance interactions. This problem is especially severe for indirectly-contact interaction images (Intersection-over-Union = 0). To address this, a novel Triple stream Enhanced encoder-decoder Dispersal Network (TED-Net), equipped with human, object, and instance interaction decoding streams, is proposed to decouple instances’ relationships. Meanwhile, we design a dispersal attention mechanism to capture indirectly-contact interaction information and an auxiliary discrimination mechanism to improve the ability of the instance interaction decoding stream to recognize action categories. Experimental results show that the proposed TED-Net achieves the best performance among HOI models using the ResNet-50 backbone on the (big) HICO-Det dataset and ranks third on the leaderboard of the (small) V-COCO dataset. Additionally, two indirectly-contact interaction datasets, namely, HICO-Det-IC and V-COCO-IC, are constructed to demonstrate the usefulness and effectiveness of our TED-Net in detecting interactions between indirectly-contact instances, with average gains of +3.80 mAP on HICO-Det-IC and +5.46 mAP on V-COCO-IC. Code is available at https://drliuqi.github.io/.
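
The "Intersection-over-Union = 0" condition above can be made concrete with a short sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates, which is a common convention but is not stated in the abstract.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_indirect_contact(human_box, object_box):
    """True when the human and object boxes do not overlap at all."""
    return iou(human_box, object_box) == 0.0
```

A person throwing a frisbee that is already in the air is a typical case: the two boxes are disjoint, so box-level features alone carry no shared region to reason about.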

Abstract:
GCN-based methods have achieved remarkable performance in skeleton-based action recognition. However, existing methods have not explicitly attempted to remove temporal and spatial redundancy that might introduce additional computational costs. Inspired by the fact that humans always tend to glimpse at overall motion and then zoom into the most important spatio-temporal regions, we propose a Spatio Temporal Focused Dynamic Network (STFD-Net) trained with reinforcement learning for skeleton-based action recognition. Specifically, we first propose a global extractor with a Skeleton Pooling Module (SPM) to enable the network to focus on overall motion information with a refined skeleton structure. Then, a local extractor, containing pair-wise part partition, a tubelet proposal network, and a Partition-Grouped Module (PGM), is proposed to extract local motion details as a complement to the overall motion information. Finally, the dynamic classifier utilizes a recurrent neural network to dynamically terminate the process once the network is adequately confident. Extensive experiments have demonstrated that the proposed network achieves SOTA-level performance with lower computational cost on the NTU 60 and NTU 120 datasets.
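
The early-termination idea in the dynamic classifier can be sketched as follows. This is a simplified illustration under stated assumptions: the per-step class probabilities are taken as given inputs, and a fixed confidence `threshold` replaces whatever stopping policy the network learns with reinforcement learning.

```python
def classify_with_early_exit(step_probabilities, threshold):
    """Stop at the first step whose top class probability reaches `threshold`.

    step_probabilities: one list of class probabilities per processing step
    Returns (predicted_class, number_of_steps_used).
    """
    if not step_probabilities:
        raise ValueError("at least one step of probabilities is required")
    for step, probs in enumerate(step_probabilities, start=1):
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= threshold:
            break  # confident enough, skip the remaining steps
    # If no step was confident, the last step's prediction is returned.
    return best, step
```

Easy samples exit after few steps while hard samples use the full computation, which is where the reduction in average cost comes from.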

Abstract:
Out-of-distribution (OOD) detection is essential when deploying neural networks in the real world. One main challenge is that neural networks often make overconfident predictions on OOD data. In this study, we propose an effective post-hoc OOD detection method, named HIMPLoS, based on a new feature masking strategy and a novel logit smoothing strategy. Feature masking determines the important features at the penultimate layer for each in-distribution (ID) class based on the weights of the ID class in the classifier head and masks the remaining features. Logit smoothing computes the cosine similarity between the feature vector of the test sample and the prototype of the predicted ID class at the penultimate layer and uses the similarity as an adaptive temperature factor on the logit to alleviate the network’s overconfident predictions on OOD data. With these strategies, we can reduce the feature activation of OOD data and enlarge the gap in OOD score between ID and OOD data. Extensive experiments on multiple standard OOD detection benchmarks demonstrate the effectiveness of our method and its compatibility with existing methods, achieving new state-of-the-art performance.
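
The two strategies described above can be sketched roughly as follows. This is an illustration under several assumptions that the abstract does not confirm: a bias-free linear classifier head, a top-k rule for choosing the important features of a class, multiplication of the logit by the cosine similarity as the "temperature" step, and the scaled logit itself as the OOD score. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def top_k_mask(class_weights, k):
    """Keep the k features with the largest head weights for one class."""
    order = sorted(range(len(class_weights)),
                   key=lambda i: class_weights[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(class_weights))]


def smoothed_score(feature, head_weights, prototypes, k):
    """Return (predicted_class, score); a higher score means more ID-like."""
    logits = [sum(w * f for w, f in zip(row, feature)) for row in head_weights]
    pred = max(range(len(logits)), key=lambda c: logits[c])
    # Feature masking: zero out features unimportant for the predicted class.
    mask = top_k_mask(head_weights[pred], k)
    masked = [f * m for f, m in zip(feature, mask)]
    masked_logit = sum(w * f for w, f in zip(head_weights[pred], masked))
    # Logit smoothing: scale by similarity to the predicted class prototype.
    similarity = cosine(masked, prototypes[pred])
    return pred, masked_logit * similarity
```

A sample whose features point in a different direction from the class prototype receives a lower score even when its raw logit is equally large, which is the intended effect on overconfident OOD inputs.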

Abstract:
In recent years, few-shot object detection (FSOD) in remote sensing images has attracted increasing attention. Numerous studies address the challenges posed by both intra-class and inter-class variance through strategies such as augmenting sample diversity and incorporating multi-scale features. However, these features still contain a considerable amount of noisy attributes due to the complex characteristics of satellite images, which persistently and adversely affect classification. In contrast, we argue that a limited yet refined set of features surpasses a multitude of coarse features. Accordingly, we tackle the above issues through the meticulous refinement of representative category features, enhancing performance by eliminating irrelevant attributes that interfere with classification. Specifically, two pivotal modules, a retentive compensation module (RCM) and a personality filtering module (PFM), are introduced. The RCM systematically scrutinizes features proximate to the category center, yielding prototypes that exhibit both intra-class compactness and inter-class distinctiveness. The PFM then utilizes the previously obtained prototypes to supervise the filtering process, diminishing the intra-class variance by excluding personality features that could impede the classification task. The integration of these two modules enables a holistic feature representation, capturing inherent similarities within individual classes while accentuating distinctions between classes. Experiments have been conducted on the DIOR and NWPU VHR-10.v2 datasets, and the results demonstrate that our proposed approach outperforms several state-of-the-art methods. Code is available at https://github.com/yomik-js/RP-FSOD.

Affiliations: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China; School of Software Technology, Dalian University of Technology, Dalian, China; School of Mechanical Engineering, Dalian University of Technology, Dalian, China; School of Mathematical Sciences, Dalian University of Technology, Dalian, China; DUT-RU International School of Information Science and Engineering and the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian University of Technology, Dalian, China

Abstract:
Image fusion is indispensable in a comprehensive medical imaging pipeline. By embracing deep learning technology, medical image fusion has achieved tremendous progress over the past few years. However, existing approaches focus on a specific type of medical image fusion task and may have difficulty generalizing well. Moreover, most of them devote great effort to designing architectures of increasing width or depth, which hampers running efficiency. To address the above problems, we propose an Auto-searching Light-weighted Multi-source Fusion network, namely ALMFnet, which aims to incorporate both software and hardware knowledge in a network architecture searching manner for medical image fusion. Specifically, the ALMFnet, consisting of two different feature-extracting modules and one fusion module, is developed to extract and refine multi-source features in a generalized model. Besides, motivated by the collaborative principle, we introduce hardware constraints to sufficiently search each particular component, further reducing the complexity of the obtained model. Furthermore, to preserve important details in pathological image areas, we introduce a segmentation mask into the developed method. Experimental results demonstrate that our generalized model outperforms previous methods not only in terms of quantitative scores but also in model complexity. Source code will be available at https://github.com/RollingPlain/ALMFnet.

Abstract:
Low-light image enhancement (LLIE) aims to improve the illuminance of images degraded by insufficient light exposure. Recently, various lightweight learning-based LLIE methods have been proposed to handle challenges such as low contrast and low brightness. In this paper, we streamline the architecture of the network to the utmost degree. By utilizing the effective structural re-parameterization technique, a single convolutional layer model (SCLM) is proposed that provides global low-light enhancement as a coarsely enhanced result. In addition, we introduce a local adaptation module that learns a set of shared parameters to accomplish local illumination correction, addressing the issue of varied exposure levels in different image regions. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art LLIE methods in both objective metrics and subjective visual effects. Additionally, our method has fewer parameters and lower inference complexity compared to other learning-based schemes. Code will be made publicly available at https://gitee.com/zhanghahaxixi/SCLM.
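
Structural re-parameterization relies on the linearity of convolution: parallel linear branches used during training can be folded into one kernel for inference. The sketch below shows this in 1D for a 3-tap branch, a 1-tap branch, and an identity shortcut. The choice of branches is an assumption made for illustration and is not the SCLM training topology, which the abstract does not describe.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def merge_branches(kernel3, kernel1, use_identity):
    """Fold a 3-tap branch, a 1-tap branch and an optional identity
    shortcut into a single equivalent 3-tap kernel."""
    merged = list(kernel3)
    merged[1] += kernel1[0]   # a 1-tap kernel only touches the center tap
    if use_identity:
        merged[1] += 1.0      # identity equals a kernel of [0, 1, 0]
    return merged


signal = [1.0, 2.0, 3.0, 4.0]
kernel3, kernel1 = [0.5, 1.0, -0.5], [2.0]

# Training-time view: three branches evaluated separately and summed.
multi_branch = [
    a + b + c
    for a, b, c in zip(conv1d_same(signal, kernel3),
                       conv1d_same(signal, kernel1),
                       signal)
]
# Inference-time view: one convolution with the merged kernel.
single_layer = conv1d_same(signal, merge_branches(kernel3, kernel1, True))
```

Both views produce the same output, so the extra branches add no cost once the model is deployed.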

Abstract:
Self-supervised learning (SSL) has been successfully applied to remote sensing image classification by designing pretext tasks to extract valuable feature representations of targets. However, existing SSL methodologies overlook the edge information integral to ground objects, culminating in frequent misclassifications at target boundaries. Additionally, the scarcity of training samples often restricts the full utilization of the knowledge encapsulated in the pre-training model. To address these issues, we propose a novel self-supervised edge perception learning framework (SEPLF) to improve the classification performance of high-resolution remote sensing images (HRSI). The framework comprises self-supervised edge perception learning (SEPL) and training sample augmentation (TSA) algorithms. On the one hand, the SEPL approach leverages morphological data enhancement strategies to render the extracted invariant features more robust. It also effectively mines the potential information concealed at target edges, augmenting the edge separability of ground objects. On the other hand, the TSA algorithm not only obtains a large number of training samples but also enhances the intra-class diversity of the samples by considering different spectral features of the same category of ground objects. Experimental results validate that our proposed method outperforms state-of-the-art algorithms, particularly with limited labeled samples.

Abstract:
One of the fundamental challenges in image restoration is denoising, where the objective is to estimate the clean image from its noisy measurements. Existing denoising approaches generally focus on exploiting effective natural image priors to remove the noise. However, the utilization and analysis of the noise model are often ignored, although the noise model can provide complementary information to the denoising algorithms. As a result, they are very sensitive to different noise distributions. To tackle this issue and hence towards a robust image denoiser in practice, in this paper, we propose a novel Flow-based joint Image and NOise model (FINO) that distinctly decouples the image and noise in the latent space and losslessly reconstructs them via a series of invertible transformations. We further present a variable swapping strategy to align structural information in images and a noise correlation matrix to constrain the noise based on spatially minimized correlation information. Experimental results demonstrate FINO’s capacity to remove both synthetic additive white Gaussian noise (AWGN) and real noise. Furthermore, the generalization of FINO to the removal of spatially variant noise and noise with inaccurate estimation surpasses that of the popular and state-of-the-art methods by large margins.

Abstract:
Model-free rectification methods are limited by poor rectification quality and low generalization. This paper introduces a novel framework for enhancing model-free distortion rectification by addressing the limitations of existing methods. Our proposed method incorporates a Cascaded Distortion Model (CDM) inspired by fisheye lenses, which combines multiple reversible distortion models to create a versatile and comprehensive framework. By utilizing backward warping instead of forward warping, our approach overcomes the limitations of non-integer pixel positions and grid artifacts. Furthermore, our data synthesis method facilitates the fusion of different distortion models, bridging the distribution gap and improving generalization. To improve flow prediction accuracy, we introduce a two-stream network that incorporates both forward and backward flow branches. This approach enhances the prediction of backward flow and improves overall distortion rectification performance. We evaluate our method on large-scale synthetic datasets and real distorted images, and the results demonstrate its superior performance in both qualitative and quantitative experiments.
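
The advantage of backward warping mentioned above can be seen in a small sketch: every output pixel looks up a (possibly non-integer) source position and interpolates, so no output pixel is left empty. The flow convention, border clamping, and bilinear interpolation used here are assumptions for illustration, not details taken from the paper.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a grayscale image (list of rows) at a real-valued position,
    clamping coordinates to the image border."""
    height, width = len(image), len(image[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
    bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def backward_warp(image, flow):
    """flow[y][x] = (dy, dx) is the offset from output pixel (y, x) to the
    source position it should read from."""
    return [
        [
            bilinear_sample(image, y + flow[y][x][0], x + flow[y][x][1])
            for x in range(len(image[0]))
        ]
        for y in range(len(image))
    ]


image = [[0.0, 10.0, 20.0], [30.0, 40.0, 50.0]]
half_pixel_shift = [[(0.0, 0.5)] * 3 for _ in range(2)]
warped = backward_warp(image, half_pixel_shift)
```

Forward warping would instead push each source pixel to a non-integer target location, which requires splatting and can leave holes or grid artifacts in the output.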

Abstract:
Banding, also known as staircase-like contours, frequently occurs in flat areas of images/videos processed by compression or quantization algorithms. As undesirable artifacts, banding destroys the original image structure, thus inevitably degrading users’ quality of experience (QoE). In this paper, we systematically investigate the banding image quality assessment (IQA) problem, aiming to detect the image banding artifacts and evaluate their perceptual visual quality. Considering that the existing image banding databases only contain limited content sources and banding generation methods, and lack perceptual quality labels (i.e. mean opinion scores), we first build the largest banding IQA database so far, named Banding Artifact Noticeable Database (BAND-2k), which consists of 2,000 banding images generated by 15 compression and quantization schemes. A total of 23 workers participated in the subjective IQA experiment, yielding over 214,000 patch-level banding class labels and 44,371 reliable image-level quality rating scores. Subsequently, we develop an effective no-reference (NR) banding evaluator for banding detection and quality assessment by leveraging frequency characteristics of banding artifacts. To be more specific, a dual convolutional neural network (CNN) is employed to concurrently learn the feature representation from the high-frequency and low-frequency maps, thereby enhancing the ability to discern banding artifacts. The quality score of a banding image is generated by pooling the banding detection maps masked by the spatial frequency filters. The experimental results demonstrate that our banding evaluator achieves remarkably high accuracy in banding detection and also exhibits high SRCC and PLCC results with the perceptual quality labels, even without directly learning a regression model for banding quality evaluation. These findings unveil the strong correlations between the intensity of banding artifacts and the perceptual visual quality, thus validating the necessity of banding quality assessment. The BAND-2k database and the proposed banding evaluator are available at https://github.com/zijianchen98/BAND-2k.
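
SRCC and PLCC are the Spearman rank-order and Pearson linear correlation coefficients between predicted quality scores and mean opinion scores. A minimal, dependency-free sketch of both is given below; it assumes the plain definitions, without the nonlinear score mapping that IQA studies often apply before computing PLCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

SRCC only measures whether the predicted ordering of images matches the subjective ordering, while PLCC also penalizes a nonlinear relationship between the two score scales.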

Abstract:
Recently, it has been shown that adversaries can reconstruct images from SIFT features through reverse attacks. However, the images reconstructed by existing reverse attack methods suffer from information loss and are unable to sufficiently reveal the private contents of the original images. In this paper, a two-stage deep reverse attack model called Coarse-to-Fine Generative Adversarial Network (CFGAN) is proposed to more deeply explore the information in SIFT features and further demonstrate the risk of privacy leakage associated with SIFT features. Specifically, the proposed model consists of two sub-networks, namely coarse net and fine net. The coarse net is developed to restore coarse images using SIFT features, while the fine net is responsible for refining the coarse images to obtain better reconstruction results. To effectively leverage the information contained in SIFT features, an efficient fusion strategy based on the AdaIN operation is designed in the fine net. Additionally, we introduce a new loss function called sift loss that enhances the color fidelity of reconstructed images. Extensive experiments conducted on various datasets verify that the proposed CFGAN performs favorably against state-of-the-art methods. The reconstructed images exhibit better visual quality, less texture distortion, and higher color fidelity. Source code is available at https://github.com/HITLiXincodes/CFGAN.
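
For readers unfamiliar with the AdaIN operation mentioned above, the following sketch applies the standard adaptive instance normalization formula to a single channel: the content values are normalized and then rescaled to the mean and standard deviation of the style values. Which tensors play the content and style roles inside CFGAN's fine net is not stated in the abstract, so the inputs here are purely illustrative.

```python
import math


def mean_std(values, eps=1e-5):
    """Mean and standard deviation, with eps added for numerical stability."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance + eps)


def adain_channel(content, style):
    """AdaIN for one channel: style_std * (content - content_mean) / content_std
    + style_mean, with statistics computed over the channel's values."""
    content_mean, content_std = mean_std(content)
    style_mean, style_std = mean_std(style)
    return [
        style_std * (v - content_mean) / content_std + style_mean
        for v in content
    ]


aligned = adain_channel([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

After the operation the channel keeps the spatial pattern of the content input but carries the first- and second-order statistics of the style input.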

Abstract:
Cross-modal retrieval aims at retrieving highly semantically relevant information across multiple modalities. Existing cross-modal retrieval methods mainly explore the semantic consistency between image and text but rarely consider the rankings of positive instances in the retrieval results. Moreover, these methods seldom take into account the cross-interaction between image and text, which leads to deficiencies in learning their semantic relations. In this paper, we propose a Unified framework with Ranking Learning (URL) for cross-modal retrieval. The unified framework consists of three sub-networks: a visual network, a textual network, and an interaction network. The visual network and textual network project the image feature and text feature into their corresponding hidden spaces, respectively. Then, the interaction network forces the target image-text representations to align in the common space. To unify both semantics and rankings, we propose a new optimization paradigm including pre-alignment for semantic knowledge transfer and ranking learning for final retrieval, which decouples semantic alignment and ranking learning. The former focuses on the semantic pre-alignment optimized by semantic classification and the latter revolves around the retrieval rankings. For the ranking learning, we introduce a cross-AP loss which can directly optimize the retrieval metric average precision for cross-modal retrieval. We conduct experiments on four widely used benchmarks: the Wikipedia, Pascal Sentence, NUS-WIDE-10k, and PKU XMediaNet datasets. Extensive experimental results show that the proposed method can obtain higher retrieval precision.
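
The retrieval metric targeted by the cross-AP loss, average precision, can be computed for one query as in the sketch below. This shows only the non-differentiable metric itself. How the cross-AP loss relaxes the ranking step so that it can be optimized by gradient descent is not described in the abstract and is not attempted here.

```python
def average_precision(scores, relevant):
    """Average precision of one ranked list.

    scores:   similarity of each gallery item to the query
    relevant: booleans marking which gallery items are true matches
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            # Precision at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Because the value depends on where every positive instance lands in the ranked list, optimizing it addresses the ranking of positives directly rather than only their semantic class.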

Abstract:
Remote photoplethysmography measurement (also called rPPG prediction) is a vision-based technique that allows for the non-contact monitoring of human physiological activity using facial video. However, precisely detecting subtle color changes on facial skin, especially in less-constrained real-life scenarios, remains a formidable challenge for rPPG prediction. In this work, we address an rPPG-based heart rate estimation task by proposing an end-to-end Channel-wise Interaction Network (CIN-rPPG), whose core consists of two specialized units: channel-temporal interactive learning (CIT) and channel-spatial interactive learning (CIS). The CIT unit captures the periodicity of the rPPG signal by using temporal-wise shifting and channel-wise scaling to measure the interaction between the channel and temporal dimensions. The CIS unit performs spatial-wise scaling and channel-wise scaling at the same time to realize channel-spatial interaction. This is intended to reveal how rPPG-related visual responses are detected on the human face. We recover the rPPG signal by alternating the CIT and CIS units. The CIN-rPPG is implemented entirely with convolutional operations on the sequential 2D feature maps of facial video in an end-to-end manner. Extensive experiments on three heart rate estimation datasets (UBFC-rPPG, PURE, and MMSE-HR) demonstrate that CIN-rPPG achieves state-of-the-art performance on both intra-dataset and cross-dataset testing.

Abstract:
Convolutional neural networks (CNNs) currently dominate guided depth map super-resolution (SR). However, inefficient receptive field growth and input-independent convolution limit the generalization of CNNs. Motivated by the vision transformer, this paper proposes an efficient transformer-based backbone, $\text{A}^2$GSTran, for guided depth map SR, which resolves the above intrinsic defects of CNNs. In addition, state-of-the-art (SOTA) models only refine depth features with guidance that is implicitly selected without supervision, so there is no explicit guarantee of mitigating the artifacts of texture copying and edge blurring. Accordingly, the proposed $\text{A}^2$GSTran simultaneously solves two sub-problems, i.e., guided monocular depth estimation and guided depth SR, in separate branches. Specifically, the explicit supervision on monocular depth estimation improves the efficiency of guidance selection. The feature fusion between branches is designed via bi-directional cross attention. Moreover, since the guidance domain is defined in high resolution (HR), we propose asymmetric cross attention to maintain the guidance information via pixel unshuffle instead of pooling, which results in a channel number unequal to that of the depth features. Based on the supervision of depth reconstruction and guidance selection, the final depth features are refined by fusing the output features of the corresponding branches via channel attention to generate the HR depth map. Extensive experimental results on synthetic and real datasets for multiple scales validate our contributions compared with SOTA models. The code and models are public via https://github.com/alex-cate/Depth_Map_Super-resolution_via_Asymmetric_Attention_with_Guidance_Selection.
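
The contrast between pixel unshuffle and pooling can be illustrated with a minimal sketch on a single-channel grid. The function name, the nested-list representation, and the toy 4x4 input are assumptions for illustration and are not taken from the authors' code: pixel unshuffle with factor r moves every r x r spatial block into r*r channels, so the resolution drops while no value is discarded, whereas pooling would keep one summary value per block.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W grid into r*r channels of size (H/r) x (W/r).

    Channel index dy * r + dx holds the pixels at offset (dy, dx) of each block.
    """
    h, w = len(image), len(image[0])
    if h % r != 0 or w % r != 0:
        raise ValueError("height and width must be divisible by r")
    channels = []
    for dy in range(r):
        for dx in range(r):
            channel = [
                [image[y * r + dy][x * r + dx] for x in range(w // r)]
                for y in range(h // r)
            ]
            channels.append(channel)
    return channels


# Toy 4x4 "guidance" map (hypothetical values).
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
channels = pixel_unshuffle(image, 2)
```

Here the 4x4 input becomes four 2x2 channels, so the channel count grows by a factor of r*r, which is consistent with the guidance and depth features having unequal channel numbers.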

Abstract:
Vision is an important source of information for underwater observations, but underwater images commonly suffer severe visual degradation due to the complexity of the underwater imaging environment and wavelength-dependent absorption effects. There is an urgent need for underwater image enhancement techniques to improve the visual quality of underwater images. Due to the scarcity of high-quality paired training samples, underwater image enhancement based on deep learning has never achieved success similar to other vision tasks. Instead of learning complicated distortion-to-clear mappings with deep networks, we design a template-free color transfer learning framework for predicting transfer parameters, which are more easily captured and described. In addition, we add attention-driven modules to learn differentiated transfer parameters for more flexible and robust enhancement. We verify the effectiveness of our method on multiple publicly available datasets and show its efficiency in enhancing high-resolution images. The source code and the trained models are available on the project homepage: https://trentqq.github.io/TCTL-Net.html.

Abstract:
CNNs are widely used in remote sensing image classification because of their outstanding feature extraction ability. However, the classification performance is limited by the complexity of remote sensing scenes and the large inter-class similarity. Furthermore, existing methods usually distinguish multiple classes of complex targets at the same time, which makes the task considerably harder for the classification model. To alleviate the above problems, we propose a coarse-to-fine cell division (CFCD) approach to improve HRSI classification. The algorithm divides the limited labeled samples into two subclasses through continuous decomposition, which reduces the similarity between the ground object classes at the data level. We employ the $\ell_{12}$-norm to depict the specific distribution of the target for only two subclasses rather than multiple classes of ground objects, so that the exclusive features of targets can be selected more accurately. Moreover, we propose a multi-level training optimization process, which not only significantly reduces the difficulty of distinguishing multi-class targets, but also improves the utilization of training samples. Experimental results show that the CFCD algorithm outperforms the state-of-the-art methods with limited training samples on three publicly available HRSI datasets.

Abstract:
Visible thermal person re-identification (VT-ReID) plays a vital role in intelligent surveillance systems, particularly in weak lighting environments. VT-ReID faces substantial challenges, including the cross-modality gap and intra-class variations. Existing methods address these challenges through either pixel-level image translation techniques or feature-level metric learning techniques. However, the former approaches require additional computational costs and often generate noisy images, making model training challenging. The latter methods focus on constraining the relations between individual instances or class centers, while often ignoring joint consideration of the relationship between the two aspects. In addition, these works do not fully investigate the mutual benefits at both the pixel level and the feature level. To address these limitations, we propose a unified Dual-level Smooth Gap (DSG) learning framework that simultaneously smooths the cross-modality gap at the pixel and feature levels. Specifically, on the one hand, we develop a parameter-free Class-aware Modality Mix (CMM) to smooth the cross-modality gap at the pixel level. CMM can capture and explore internal information between the two modalities by mixing images from different modalities belonging to the same class. On the other hand, we devise an efficient Center-guided Metric Learning (CML) to reduce the inter-modality discrepancy and intra-class variations at the feature level. CML enhances model discrimination and generalization by enforcing constraints on both class centers and instances. Experiments on two benchmark datasets demonstrate the mutual benefits of the proposed components and show the superior performance of our method over state-of-the-art methods.

Abstract:
There is a growing need to explore the potential of transformers in Unsupervised Domain Adaptation (UDA) due to their increasing success in various vision tasks. However, the application of transformers in UDA has yet to be thoroughly investigated and requires further research. In this study, our primary focus is to design a novel pipeline specifically tailored for transformer-based UDA, to address a crucial challenge: the overemphasis on the transfer of target-oriented information, mainly caused by the self-attention blocks in transformers and the cross-domain adversarial learning scheme. First, we show that non-target information, including semantic contextual information such as background features and non-target classes, must be addressed in the domain adaptation process. Recognizing the importance of incorporating non-target knowledge, we propose a decoupled non-target knowledge distillation method called DeNKD. DeNKD decouples non-target information across domains at both feature and logit levels. This decoupling is achieved through a bi-directional knowledge distillation approach that facilitates the interaction and exchange of non-target knowledge to facilitate an effective transformer-based cross-domain knowledge transfer. We perform extensive evaluations on several well-established UDA benchmark datasets. The results consistently show that DeNKD outperforms other methods, achieving the best performance across the board. For example, on the Office-Home dataset, DeNKD achieves an accuracy of 85.54%, while on the VisDA-2017 dataset, it achieves an accuracy of 89.95%. These results highlight the effectiveness of DeNKD in transformer-based UDA and its potential for improving cross-domain adaptation performance.

Abstract:
Video anomaly detection is an important task in the field of intelligent security. However, existing methods mainly detect and analyze videos from a single time direction, ignoring the semantic information of the video context, which adversely affects the detection accuracy. To address this issue, we design a multi-branch generative adversarial network with context learning (MGAN-CL) to detect abnormal events. In particular, we combine video context information to generate predicted frames, and determine whether an anomaly occurs by comparing the predicted frame with the actual frame. Different from the existing GAN-based methods, in the anomaly event detection stage, we use the discriminator to judge the video frames generated by the generator, which improves the accuracy of anomaly detection. In order to improve the ability of the discriminator, a pseudo-anomaly module is added to the discriminator for data augmentation to improve the robustness of the model. An extensive set of experiments performed on public datasets demonstrate the method’s superior performance.

Affiliations: National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology and the School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Key Laboratory of Big Data Storage and Management and the School of Computer Science, Northwestern Polytechnical University, Xi’an, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology and the School of Software Engineering, Northwestern Polytechnical University, Xi’an, China

Abstract:
To imitate the human ability to keep learning, continual learning, which learns from a never-ending data stream, has attracted increasing interest recently. Among all settings, online class incremental learning (OCIL), where incoming samples from the data stream can be used only once, is more challenging and is encountered more frequently in the real world. All continual learning models face a stability-plasticity dilemma, where stability means the ability to preserve old knowledge and plasticity denotes the ability to incorporate new knowledge. Although replay-based methods have shown exceptional promise, most of them concentrate on the strategy for updating and retrieving memory to keep stability at the expense of plasticity. To strike a preferable trade-off between stability and plasticity, we propose an Adaptive Focus Shifting algorithm (AFS), which dynamically adjusts focus to ambiguous samples and non-target logits in model learning. Through a deep analysis of the task-recency bias caused by class imbalance, we propose a revised focal loss, mainly to keep stability. By utilizing a new weight function, the revised focal loss pays more attention to current ambiguous samples, which are the potentially valuable samples for making the model progress quickly. To promote plasticity, we introduce a virtual knowledge distillation. By designing a virtual teacher, it assigns more attention to non-target classes, which can surmount overconfidence and encourage the model to focus on inter-class information. Extensive experiments on three popular datasets for OCIL have shown the effectiveness of AFS. The code will be available at https://github.com/czjghost/AFS.
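
As background for the revised focal loss, the sketch below shows the standard focal loss for a single sample, which re-weights cross-entropy by a factor that depends on the predicted probability of the true class. The revised weight function used by AFS is not specified in the abstract, so this is only the baseline form; the function names and the value gamma = 2 are assumptions for illustration.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one sample; p_true is the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)


# Relative weight given to a confident sample versus an uncertain one.
confident_weight = focal_loss(0.9) / cross_entropy(0.9)
uncertain_weight = focal_loss(0.5) / cross_entropy(0.5)
```

With gamma = 0 the focal loss reduces to plain cross-entropy; larger gamma shrinks the contribution of samples the model already classifies confidently. A revised weight function would change how this factor is distributed over samples.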

Abstract:
Multi-focus image fusion (MFIF) creates an image from different source images captured with various sensors or optical settings, since devices cannot focus on all objects at different distances at once. Most MFIF methods are limited in encoding sufficient features from the images, and their results are not robust. To overcome this issue, we present a robust fusion algorithm based on the Frequency mask and Hyperdimensional computing. We propose the Frequency Mask Filter (FMF) to obtain narrow-band signals by encoding the frequency domain vector through the mask filter in the frequency domain. The Hyperdimensional encoder uses monogenic mapping, in which the multi-modulation features (MMF) such as the frequency, phase and amplitude are dynamically selected to obtain robust focus maps. Generated by multiscale monogenic representations of each image, the narrow-band images are mapped to hypervector encodings. The Hyperdimensional encoder captures the energetic and structural information and leads to robust fusion results. Our proposed method is far superior to existing MFIF methods in terms of both objective evaluation metrics and visual effects on three publicly available datasets. Additionally, our proposed method requires only 0.88 seconds and has a parameter count of 0.13 million for multi-focus image fusion.

Abstract:
Recently, transform-based tensor nuclear norm (TNN) methods have received increasing attention as a powerful tool for multi-dimensional visual data (color images, videos, and multispectral images, etc.) recovery. Especially, the redundant transform-based TNN achieves satisfactory recovery results, where the redundant transform along spectral mode can remarkably enhance the low-rankness of tensors. However, it suffers from expensive computational cost induced by the redundant transform. In this paper, we propose a learnable spatial-spectral transform-based TNN model for multi-dimensional visual data recovery, which not only enjoys better low-rankness capability but also allows us to design fast algorithms accompanying it. More specifically, we first project the large-scale original tensor to the small-scale intrinsic tensor via the learnable semi-orthogonal transforms along the spatial modes. Here, the semi-orthogonal transforms, serving as the key building block, can boost the spatial low-rankness and lead to a small-scale problem, which paves the way for designing fast algorithms. Secondly, to further boost the low-rankness, we apply the learnable redundant transform along the spectral mode to the small-scale intrinsic tensor. To tackle the proposed model, we apply an efficient proximal alternating minimization-based algorithm, which enjoys a theoretical convergence guarantee. Extensive experimental results on real-world data (color images, videos, and multispectral images) demonstrate that the proposed method outperforms state-of-the-art competitors in terms of evaluation metrics and running time.

Abstract:
Unsupervised deep hashing has demonstrated significant advancements with the development of contrastive learning. However, most of previous methods have been hindered by insufficient similarity mining using global-only image representations. This has led to interference from background or non-interest objects during similarity reconstruction and contrastive learning. To address this limitation, we propose a novel unsupervised deep hashing framework named Fine-grained Similarity-preserving Contrastive learning Hashing (FSCH), which explores fine-grained semantic similarity among different images and their augmented views more comprehensively. It mainly comprises two modules: the global-local fine-grained similarity consistency preservation module and the local fine-grained similarity contrast preservation module. Specifically, we reconstruct local pairwise similarity structures by matching fine-grained patches, in conjunction with global similarity structures based on global hash codes cosine similarity, to generate hash codes with the ability to preserve global-local similarity consistency. Moreover, the preservation of local fine-grained similarity among augmented views is accomplished through the common regional features mutual representation between patches, then we enhance the discriminability of hash codes by mitigating the potential features difference during contrastive learning. Experimental results on four benchmark datasets demonstrate that our FSCH achieves an excellent retrieval performance compared to state-of-the-art unsupervised hashing methods.
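
The global similarity based on hash-code cosine similarity can be sketched as follows, assuming binary codes with entries in {-1, +1} (an assumption; the abstract does not state the code alphabet). For such codes every vector has norm sqrt(K), so the cosine similarity reduces to the dot product divided by the code length K, and it is tied to the Hamming distance by cos = 1 - 2 * hamming / K. Function names and example codes are illustrative.

```python
def cosine_similarity_codes(a, b):
    """Cosine similarity between two hash codes with entries in {-1, +1}."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    # Both codes have squared norm len(a), so the normaliser is len(a).
    return dot / len(a)


def hamming_distance(a, b):
    """Number of positions where the two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy 4-bit codes (hypothetical).
code_a = [1, -1, 1, 1]
code_b = [1, 1, 1, -1]
similarity = cosine_similarity_codes(code_a, code_b)
distance = hamming_distance(code_a, code_b)
```

Because of this identity, ranking by cosine similarity of such codes is equivalent to ranking by Hamming distance, which is what makes the codes cheap to compare at retrieval time.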

Abstract:
The objective of the occluded person re-identification (ReID) task is to capture the same person from different camera angles when the pedestrian’s body is partially occluded. In this task, there are two main challenges: 1) pedestrians are often occluded by other persons or objects, and 2) pedestrians change poses. Moreover, these two issues often simultaneously occur. Although many occluded person ReID algorithms have been proposed, many existing methods can often only solve one of these issues well, and the other issue is often ignored. In this work, a novel semantic perception and CNN-transformer hybrid network (abbreviated as SPH) is proposed for occluded person ReID, which consists of a CNN-based human semantic perception stream and a transformer-based pose perception stream. In the former, a human semantic auxiliary module and a human semantic perception module are designed to obtain human semantic information where multi-granularity region features of the human body are extracted to solve the issues of occlusion. In the latter, we propose a token-based pose integration module to obtain the corresponding patch for each pose key-point and the relative position information to solve the change in pedestrian pose. Moreover, these two streams are jointly optimized in a unified framework. In addition, to further solve the issue of occlusion, the human completion strategy is proposed for the query sample where the gallery samples are used to complete the missing parts of the query. Extensive experimental results on three public occluded person ReID datasets, Occluded-DukeMTMC, P-DukeMTMC-reID, and Occluded-REID, demonstrate that the proposed method can outperform all SOTA occluded person ReID methods in terms of the mAP and Rank-1. Compared with PAT (CVPR21) on the Occluded-DukeMTMC and Occluded-REID datasets, the improvements in mAP/Rank-1 reached 10.1%/7.4%, and 10%/1%, respectively. 
Moreover, when TransReID (ICCV21) was used, SPH achieved improvements of 4.5% (mAP) and 5.5% (Rank-1) on the Occluded-DukeMTMC dataset.

Abstract:
Falls are a major health threat for older people. A timely assistance can reduce the extent of physical injury caused by the falls. Currently, low-cost and convenient video surveillance systems based on ordinary RGB cameras are widely used for improving the safety of people. The fall detection is a research hotspot in intelligent video surveillance. In this work, we propose an unsupervised fall detection method. The proposed method first converts the RGB video frames into human pose images to eliminate the background interferences and focus on human motion and protect privacy. Afterwards, the future pose images are predicted by using the continuous historical human pose images based on a constrained generative adversarial network (GAN). Finally, the prediction errors of the human pose images and the anomaly scores of actual poses calculated by using the traditional hand-crafted features are used to realize the fall detection. As compared to the existing vision-based fall detection methods, the proposed method possesses strong generalization ability, and is robust to environmental interferences and small local occlusions, and effectively protects the privacy, and avoids time-consuming data annotations. In addition, in this work, a new large-scale and comprehensive fall dataset is created and is available for download. We perform extensive experiments on the public benchmark datasets and the proposed dataset. The results demonstrate the validity and superiority of the proposed method.

Abstract:
3D human pose and shape estimation from a single RGB image is an appealing yet challenging task. Due to the graph-like nature of human parametric models, a growing number of graph neural network-based approaches have been proposed and achieved promising results. However, existing methods build graphs for different instances based on the same template SMPL mesh, neglecting the geometric perception of individual properties. In this work, we propose an end-to-end method named Personalized Graph Generation (PGG) to construct the geometry-aware graph from an intermediate predicted human mesh. Specifically, a convolutional module initially regresses a coarse SMPL mesh tailored for each sample. Guided by the 3D structure of this personalized mesh, PGG extracts the local features from the 2D feature map. Then, these geometry-aware features are integrated with the specific coarse SMPL parameters as vertex features. Furthermore, a body-oriented adjacency matrix is adaptively generated according to the coarse mesh. It considers individual full-body relations between vertices, enhancing the perception of body geometry. Finally, a graph attentional module is utilized to predict the residuals to get the final results. Quantitative experiments across four benchmarks and qualitative comparisons on more datasets show that the proposed method outperforms state-of-the-art approaches for 3D human pose and shape estimation.

Abstract:
The quantification of image quality and that of image aesthetics have been regarded as two independent fields in computer vision. Generally, image quality assessment aims at measuring image distortions, while image aesthetics are judged by commonly established photography rules. However, measuring either image quality or aesthetics alone is not sufficient to rank images qualitatively. Therefore, this paper puts forward the synergetic assessment of quality and aesthetics to help understand the subjective human preferences for digital pictures more comprehensively. Specifically, considering that the images of existing benchmark datasets are labeled with only a single attribute, we first establish a new dataset that contains 9042 real-world images with corresponding human-rated pair-wise quality-aesthetic scores. These images were previously labeled only with aesthetic scores; we additionally evaluate their subjective quality scores, which addresses the lack of image datasets with both attributes. Moreover, since existing methods are mostly designed for individual attribute prediction, we propose a two-stream learning network to assess both the quality and the aesthetics of images in parallel. This network follows the top-down perception mechanism, which learns from both fine-grained details and holistic image layout simultaneously. Furthermore, we introduce a Channel-Diversity loss, which can be deployed in grouped convolution operations and can constrain channels to be mutually exclusive across the spatial dimensions. To some extent, this helps to spotlight different local discriminative regions with a finer granularity. Finally, experiments demonstrate that our method outperforms the state-of-the-art methods on our established benchmark dataset and other benchmark datasets in terms of image quality and aesthetic assessment. We hope this paper can serve as a useful reference for future research on image ranking. 
Both the benchmark dataset and the code will be publicly available to facilitate further research.

Abstract:
Self-supervised monocular depth estimation methods typically rely on the reprojection error to capture geometric relationships between successive frames in static environments. However, this static-scene assumption does not hold for scenes containing dynamic objects, leading to errors during the view synthesis stage, such as feature mismatch and occlusion, which can significantly reduce the accuracy of the generated depth maps. To address this problem, we propose a novel dynamic cost volume that exploits residual optical flow to describe moving objects, improving incorrectly occluded regions in the static cost volumes used in previous work. Nevertheless, the dynamic cost volume inevitably generates extra occlusions and noise, so we alleviate this by designing a fusion module that makes the static and dynamic cost volumes compensate for each other. In other words, occlusion from the static volume is refined by the dynamic volume, and incorrect information from the dynamic volume is eliminated by the static volume. Furthermore, we propose a pyramid distillation loss to reduce photometric error inaccuracy at low resolutions and an adaptive photometric error loss to alleviate the flow direction of the large gradient in the occlusion regions. We conducted extensive experiments on the KITTI and Cityscapes datasets, and the results demonstrate that our model outperforms previously published baselines for self-supervised monocular depth estimation.

Abstract:
Current talking face generation methods have achieved promising lip-synchronization results, while still struggling to generate talking face video that exhibits emotional expressions and head poses. Studies in psychology have demonstrated that people may manifest diverse facial animations that follow a time-varying distribution. This presents two stochastic challenges that make generating appropriate emotional expressions and head poses difficult: (1) modelling the time-varying distribution of facial deformations to synthesize the stochastic dynamics of emotional expressions and head poses, and (2) estimating the complex motion distribution with given audio features to capture ambiguous audio-related expressions and head poses. To address the above issues, we present a Stochastic Latent talkIng face Generation mOdel (SLIGO), which builds a deep state space model (SSM) for talking face generation. The SLIGO model captures diverse and stochastic facial dynamics via the latent motion distribution. Additionally, we devise a dynamic variational autoencoder (DVAE) method to optimize the deep SSM model. This method decomposes the Evidence Lower BOund (ELBO) of SSM into three components: a posterior for latent motion encoding, a prior for audio-driven motion prediction, and a likelihood for talking face decoding. Furthermore, we propose a novel mixer continuous normalizing flow (CNF) module to model the complex facial motion prior distribution. Experimental results demonstrate that SLIGO outperforms existing methods and achieves state-of-the-art performance.

Abstract:
In recent years, Transformers have been gradually applied in salient object detection tasks with good results. However, the Transformer’s global modeling capabilities can lead to the loss of local details that are important in salient object detection tasks. A feature extraction backbone based on a convolutional neural network (CNN) is good at extracting local detail features due to the gradual expansion of the receptive field but is limited by the size of the receptive field, resulting in an insufficient ability to extract global semantic features. Therefore, this paper combines the Transformer with a CNN and presents a dual-branch encoder to ensure that the features extracted contain rich global semantic information as well as local detail features. In addition, due to the different features extracted by the Transformer and CNN, noise may be introduced in the fusion of the two features, so different features need to be processed correspondingly during fusion. The fusion enhancement module (FEM) we propose fuses the features of the two branches step by step. A hybrid attention mechanism is used to carry out weighted fusion of different features. This progressive approach minimizes the differences between the features of the two branches so that the merged features retain the semantic and detail features extracted by the two branches to the greatest extent. Considering the loss of detailed information caused by repeated downsampling, we propose an edge refinement module (ERM) to address the need for accurate outline prediction. This module leverages salient features to obtain edge features and gradually refines the prediction results by incorporating these edge features. It makes full use of the connection between salient features and edge features and does not introduce additional edges to extract branches. Extensive experimental evaluations conducted on five benchmark tests demonstrate the superior performance of our method compared to other existing approaches. 
Code can be found at https://github.com/gfq1605694825/DSRNet-main.

Abstract:
Existing makeup transfer methods typically transfer simple makeup colors in a well-conditioned face image and fail to handle makeup style details (e.g., complicated colors and shapes) and facial occlusion. To address these problems, this paper proposes Hybrid Transformers with Attention-guided Spatial Embeddings (named HT-ASE) for makeup transfer and removal. Specifically, a makeup context extractor adopts makeup context global-local interactions to aggregate the high-level context and low-level detail features of the makeup styles, which obtains the context-aware makeup features that encode the complicated colors and shapes of the makeup styles. A face identity extractor adopts a face identity local interaction to aggregate the identity-relevant features of shallow layers into identity semantic features, which refines the identity features. A spatially similarity-aware fusion network introduces a spatially-adaptive layer-instance normalization with attention-guided spatial embeddings to perform semantic alignment and fusion between the makeup and identity features, yielding precise and robust transfer results even with large spatial misalignment and facial occlusion. Extensive experimental results demonstrate that the proposed method outperforms the state-of-the-art methods, especially in the preservation of makeup style details and handling facial occlusion.

Abstract:
The Composed Query-Based Image Retrieval (CQBIR) task aims to precisely obtain the preserved and modified parts, based on the multi-grained semantics learned from the composed query. Since the composed query includes a reference image and the modification text, not just a single modality, this task is more challenging than the general image retrieval tasks. Most previous methods attempt to learn preserved and modified parts via different attention modules and fuse them as a unified representation. However, these methods have two intrinsic drawbacks: 1) The different granular semantic information of the composed query is neglected, which results in the fact that learned preserved and modified parts are irrelevant to correct semantics. 2) The preserved and modified parts learned by previous methods have obvious overlaps, which may lead the model to obtain sub-optimal preserved and modified regions. To this end, we propose a novel method termed Multi-Grained Attention Network with Mutual Exclusion (MANME) to address the above problems. Our MANME method mainly consists of two components: 1) A multi-grained semantic construction for obtaining various textual and visual semantic information. 2) An attention with mutual exclusion constraint for reducing the degree of overlap between preserved and modified parts. It adequately utilizes the various granular semantic information and effectively refines the learned preserved and modified parts. Extensive experiments and further analyses on three widely used CQBIR datasets demonstrate that our proposed MANME method achieves new state-of-the-art performance on the CQBIR task.

Abstract:
Video semantic role grounding has gained substantial interest from both the academic and industrial communities. While existing methods have demonstrated considerable performance improvements, the influence of noisy and intra-object proposals, referring to proposals with the same object label, has yet to be explored in video semantic role grounding. In this study, we propose a semantic-aware contrastive learning network with proposal suppression to enhance the accuracy of grounding referenced objects. To fully exploit the semantic information in each semantic role, we introduce a novel semantic role encoding module that allows for precise representations of each semantic role. We also design a semantic-aware proposal suppression network to reduce the impact of noisy proposals on object representation learning. Additionally, we propose a proposal contrastive loss to improve cross-modal alignment and reduce the effect of irrelevant intra-object proposals. Extensive experiments on four datasets demonstrate that our model achieves significant improvements over state-of-the-art methods.

Abstract:
Image-text retrieval is a fundamental task in bridging the semantics between vision and language. The key challenge lies in accurately and efficiently learning the semantic alignment between two heterogeneous modalities. Existing image-text retrieval approaches can be roughly classified into two paradigms. The first independent-embedding paradigm is to learn the global embeddings of two modalities, which can achieve efficient retrieval while failing to effectively capture the cross-modal fine-grained interaction information between images and texts. The second interactive-embedding paradigm is to learn fine-grained alignment between regions and words, which can achieve accurate retrieval while sacrificing retrieval efficiency. In this paper, we propose a novel Independent Memory-Enhanced emBedding learning framework (IMEB), which introduces a lightweight middleware, i.e., memory network, into the independent-embedding approaches to simultaneously exploit the complementary of both paradigms. Specifically, first, in the training stage, we propose a novel cross-modal association graph to learn cross-modal fine-grained interaction information. Then, we delicately design a memory-assisted embedding learning network to store these prototypical features after interaction as agents, and effectively update the memory network via two learning strategies. Finally, in the inference stage, we directly interact with these agent-level prototypical features from the memory bank, thus efficiently obtaining cross-modal memory-enhanced embeddings. In this way, our model not only effectively learns cross-modal interaction information, but also maintains the retrieval efficiency. Extensive experimental results on two benchmarks, i.e., Flickr30K and MS-COCO, demonstrate that our IMEB performs favorably against state-of-the-art methods.

Abstract:
Limb movement recognition of preterm infants (PI-LMR) in neonatal intensive care units (NICUs) is important for infant health monitoring. However, little attention has been paid to intelligent PI-LMR. Due to the weak correlation among limb movements of preterm infants, the various limb movement combinations and imbalanced data distributions are the main challenges of PI-LMR. To address these issues, a novel multi-label limb movement recognition (MLLMR) algorithm with a dual-branch structure and multi-label fusion loss is proposed. The various movement combinations can be decomposed into limbs thanks to multi-label learning. Particularly, the multi-label fusion loss consisting of the binary cross entropy (BCE) and the pairwise ranking loss (PRL) is proposed to optimize the probabilities to the ground truth labels and the ranking between positive and negative labels, simultaneously. The weighted fusion loss is further developed to address the imbalanced label distributions. Subsequently, an auxiliary task for the classification of zero-, single- and multi-limb movements is constructed to constrain the feature space of primary task for better multi-label learning. Experiments on real clinical preterm infants video dataset from Jiaxing Maternity and Child Health Care Hospital are conducted and the results demonstrate the effectiveness of the proposed algorithm.
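
The fusion of binary cross entropy with a pairwise ranking loss can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hinge form of the ranking term, the `margin` value, the mixing weight `alpha`, and the four-label toy example are all hypothetical, and the class weighting for imbalanced labels is omitted.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross entropy over all labels of one sample."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def pairwise_ranking(probs, labels, margin=1.0):
    """Mean hinge penalty over every (positive, negative) label pair.

    A pair is penalized unless the positive label's probability exceeds
    the negative label's probability by at least `margin`.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # no pairs to rank
    pairs = [max(0.0, margin - (pp - pn)) for pp in pos for pn in neg]
    return sum(pairs) / len(pairs)


def fusion_loss(probs, labels, alpha=0.5):
    """Convex combination of the BCE term and the ranking term."""
    return alpha * bce(probs, labels) + (1 - alpha) * pairwise_ranking(probs, labels)


# Four hypothetical limb labels; 1 means the limb moved.
labels = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.2, 0.8, 0.1, 0.9]
print(fusion_loss(good, labels) < fusion_loss(bad, labels))  # True
```

The BCE term fits each label's probability independently, while the ranking term only looks at the ordering between positive and negative labels, which is why the two are complementary.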

Abstract:
Video coding is a video compression technique that compresses the original video sequence to produce a smaller archive file or reduce the transmission bandwidth under constraints on the visual quality loss. Rate control (RC) plays a critical role in video coding. It can achieve stable stream output in practical applications, especially real-time video applications such as video conferencing or game live streaming. Most RC algorithms either directly or indirectly characterise the relationship between the bit rate (R) and quantisation (Q) and then allocate bits to every coding unit so as to guarantee the global bit rate and video quality level. This paper comprehensively reviews the classic RC technologies used in international video standards of past generations, analyses the mathematical models and implementation mechanisms of various schemes, and compares the performance of recent state-of-the-art RC algorithms. Finally, we discuss future directions and new application areas for RC methods. We hope that this review can help support the development, implementation, and application of RC for new video coding standards.
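
The two steps described above (model the R-Q relationship, then allocate bits to each coding unit) can be illustrated with a toy sketch. It assumes a simple inverse model R = X / Q, where X is a hypothetical complexity constant, and a proportional allocation by weight; real codecs use more elaborate models, and none of the names below come from a specific standard.

```python
def allocate_bits(total_bits, weights):
    """Split a bit budget across coding units in proportion to their weights."""
    weight_sum = sum(weights)
    return [total_bits * w / weight_sum for w in weights]


def quantizer_for_target(target_bits, complexity):
    """Invert the assumed model R = complexity / Q to obtain Q for a target R."""
    return complexity / target_bits


budget = 12000             # bits available for three coding units
weights = [3.0, 2.0, 1.0]  # hypothetical per-unit importance weights
targets = allocate_bits(budget, weights)
print(targets)  # [6000.0, 4000.0, 2000.0]

# Quantisation step for the first unit under the assumed model.
print(quantizer_for_target(targets[0], complexity=120000.0))  # 20.0
```

Because the allocations sum to the budget, the global bit rate constraint holds by construction; the per-unit quantiser then follows from whichever R-Q model is assumed.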

Abstract:
Image dehazing is a representative low-level vision task that aims to restore haze-free images from hazy images. Recently, some methods have adopted deep learning techniques to reconstruct haze-free images. However, in real-world scenarios, the complex degradation of captured images and the non-uniform spatial distribution of haze significantly weaken the generalization ability of these models. Accordingly, we propose a novel Spatial Dual-Branch Attention Dehazing network (SDBAD-Net) based on the Meta-Former paradigm for end-to-end dehazing. Specifically, we first design a robust Spatial Dual-Branch Attention (SDBA) module to filter haze distribution features of different densities, which is suitable for both uniform and non-uniform haze. Second, we introduce a Structural Features Supplementary (SFS) module to dynamically fuse contextual structural features in a nonlinear manner, so as to correct the image distortion caused by the lack of structural details. Finally, quantitative and qualitative experiments are carried out on two challenging datasets, and the results show that our method outperforms most state-of-the-art algorithms with fewer parameters and faster speed, in particular surpassing FFA-Net with only 50% of its parameters and 7% of its computational cost. In addition, we further explore the performance of our model on object detection in foggy weather using the challenging Real-world Task-driven Testing Set (RTTS), and the surprising results further demonstrate the robustness and wide applicability of our method.

Abstract:
Due to enormous computing and storage overhead for well-trained Deep Neural Network (DNN) models, protecting the intellectual property of model owners is a pressing need. As the commercialization of deep models is becoming increasingly popular, the pre-trained models delivered to users may suffer from being illegally copied, redistributed, or abused. In this paper, we propose DeepDIST, the first end-to-end secure DNNs distribution framework in a black-box scenario. Specifically, our framework adopts a dual-level fingerprint (FP) mechanism to provide reliable ownership verification, and proposes two equivalent transformations that can resist collusion attacks, plus a newly designed similarity loss term to improve the security of the transformations. Unlike the existing passive defense schemes that detect colluding participants, we introduce an active defense strategy, namely damaging the performance of the model after the malicious collusion. The extensive experimental results show that DeepDIST can maintain the accuracy of the host DNN after embedding fingerprint conducted for true traitor tracing, and is robust against several popular model modifications. Furthermore, the anti-collusion effect is evaluated on two typical classification tasks (10-class and 100-class), and the proposed DeepDIST can drop the prediction accuracy of the collusion model to 10% and 1% (random guess), respectively.

Abstract:
Pairwise modification is one of the most effective ways to balance embedding capacity, image distortion, and file expansion in JPEG reversible data hiding (RDH). To design a satisfactory scheme based on pairwise modification, existing schemes focus on improving pairing rules, two-dimensional (2D) mappings, or ordering strategies separately while neglecting the connections between them. As a result, once the pairing rules are changed, the corresponding 2D mappings and ordering strategies are no longer applicable. To address this issue, this study proposes a framework that automatically generates optimal 2D mappings and efficient ordering strategies solely from carefully initialized pairing rules. To construct optimal 2D mappings, a mathematical model of 2D mapping is built to form a feasible solution space, in which optimal solutions are found so that efficient mappings are assigned to pairs with high probability. To design efficient ordering strategies, all frequencies are ranked according to an embedding evaluation model to determine the ACs that are more suitable for data embedding. To initialize better pairing rules, this study selects nonzero ACs within ±2 for interblock pairing to form a more centralized pairwise histogram. The experimental results show that the proposed scheme introduces minor file expansion and obtains better visual quality than existing JPEG RDH schemes at the same payload.

Abstract:
While Transformer has achieved remarkable performance in various high-level vision tasks, it is still challenging to exploit the full potential of Transformer in image restoration. The crux lies in the limited depth of applying Transformer in the typical encoder-decoder framework for image restoration, resulting from heavy self-attention computation load and inefficient communications across different depth (scales) of layers. In this paper, we present a deep and effective Transformer-based network for image restoration, termed as U2-Former, which is able to employ self-attention of Transformer as the core operation for feature learning to perform image restoration in a deep encoding and decoding space. Specifically, it leverages the nested U-shaped structure to facilitate the interactions across different layers with different scales of feature maps. Furthermore, we optimize the computational efficiency for the basic Transformer block by introducing a simple yet effective feature-filtering mechanism to compress the token representation. Apart from the typical supervision ways for image restoration, our U2-Former also performs multi-view contrastive learning, which constructs positive pairs in various aspects, to learn noise-sensitive but content-irrelevant features and further decouple the noise component from the background image. Extensive experiments on various image restoration tasks, including reflection removal, rain streak removal and dehazing respectively, demonstrate the effectiveness of the proposed U2-Former.

Abstract:
3D hand reconstruction is an important technique for human-computer interaction. Interactive experience depends on the accuracy, efficiency, and robustness of the algorithm. Therefore, in this paper, we first propose a balanced framework called spatial-aware regression (SAR) to achieve precise and fast reconstruction. SAR can bridge convolutional networks and graph-structure networks more effectively than existing frameworks to fully exploit extracted spatial information using a novel spatial-aware initial graph building module. In addition, SAR uses adaptive-GCN to make keypoints interact efficiently and effectively; and regresses 2.5D belief maps to characterize uncertainty. SAR is highly flexible because it can predict an arbitrary number of keypoints and apply pose-guided refinement for coarse to fine regression. To produce more rational results for challenging cases and mitigate 3D label reliance, we also propose a more robust model-based framework called spatial-guided model-based regression (SMR) that is based on SAR. There are two critical designs of SMR: 1) it uses SAR to enhance the features with pose information to help the regression of hand model parameters; and 2) it regresses parameters in a spatially aware manner that is similar to SAR. Experiments demonstrate that the proposed frameworks surpass existing fully-supervised approaches on the FreiHAND, HO-3D, RHD, and STB datasets. Also, the performances of the proposed frameworks under weakly/self-supervised settings outperform other competitors. Meanwhile, the proposed frameworks are accurate and efficient.

Abstract:
Self-supervised monocular depth estimation has been a challenging task in computer vision for a long time, and it relies on only monocular or stereo video for its supervision. To address the challenge, we propose a novel multi-frame monocular depth estimation method called IterDepth, which is based on an iterative residual refinement network. IterDepth extracts depth features from consecutive frames and computes a 3D cost volume measuring the difference between current and previous features transformed by PoseCNN (pose estimation convolutional neural network). We reformulate depth prediction as a residual learning problem, revamping the dominating depth regression to enable high-accuracy multi-frame monocular depth estimation. Specifically, we design a gated recurrent depth fusion unit that seamlessly blends depth features from the cost volume, image features, and the depth prediction. The unit updates the hidden states and refines the depth map through iterative refinement, achieving more accurate predictions than existing methods. Our experiments on the KITTI dataset demonstrate that IterDepth is 7× faster in terms of FPS (frames per second) than the recent state-of-the-art DepthFormer model with competitive performance. We also test IterDepth on the Cityscapes dataset to showcase its generalization capability in other real-world environments. Moreover, IterDepth can balance accuracy and computational efficiency by adjusting the number of refinement iterations and performs competitively with other CNN-based monocular depth estimation approaches. Source code is available at https://github.com/PCwenyue/IterDepth-TCSVT.

Abstract:
Knowledge Distillation transfers knowledge learned by a teacher network to a student network. A common mode of knowledge transfer is directly using the teacher network’s experience for all samples without differentiating whether the experience of teacher is successful or not. According to common sense, experience varies with its nature. Successful experience is used for guidance, and failed experience is used for correction. Inspired by that, this paper analyzes the failure of teacher and proposes a reflective learning paradigm, which additionally uses heuristic knowledge extracted from the teacher’s failure besides following the authority of teacher. Specifically, this paper defines Mutual Error Distance (MED) based on the teacher’s wrong predictions. MED measures the adequacy of the decision boundary learned by teacher, which concretizes the failure of teacher. Then, this paper proposes DCGD (divide-and-conquer grouping distillation) to critically transfer the teacher’s knowledge by grouping the target task into small-scale subtasks and designing multi-branch networks on the basis of MED. Finally, a switchable training mechanism is designed to integrate a regular student which provides an option of student network without parameter addition compared with the multi-branch student network. Extensive experiments on three image classification benchmarks (CIFAR-10, CIFAR-100 and TinyImageNet) show the effectiveness of the proposed paradigm. Especially on CIFAR-100 dataset, the average error of students using DCGD+DKD decreased by 4.28%. In addition, the experiment results show that the paradigm is also applicable to self-distillation.
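
For reference, the "common mode of knowledge transfer" mentioned above is often implemented as the classical soft-target loss, which applies the teacher's output to every sample whether or not the teacher is correct. The sketch below shows that classical loss only, not MED or DCGD; the temperature value and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    return temperature ** 2 * kl


teacher = [5.0, 1.0, 0.5]
matching_student = [5.0, 1.0, 0.5]
diverging_student = [0.5, 1.0, 5.0]
print(kd_loss(matching_student, teacher))       # 0.0
print(kd_loss(diverging_student, teacher) > 0)  # True
```

Note that nothing in this loss checks whether the teacher's top prediction matches the ground truth, which is the gap the reflective learning paradigm above sets out to address.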

Abstract:
Object detection with the capacity to incrementally adapt to new domains is a crucial yet relatively under-explored research topic. The catastrophic forgetting problem presents a significant challenge to achieve this goal, where the model’s performance improves quickly in new conditions but deteriorates sharply in old ones after several incremental learning sessions. Drawing on recent discoveries in visual memories of the human brain, we introduce the Topology-Preserving Domain Incremental Object Detection (TP-DIOD) approach, which aims to address the catastrophic forgetting problem by extracting the topological structure of the feature space learned by the Convolutional Neural Network (CNN) model and preserving this topology during the subsequent incremental learning sessions. Specifically, we model the feature space topology using the self-organizing map (SOM) and construct an anchor image set based on the centroid vectors of the SOM nodes to memorize the feature space topology. We then develop the anchor loss function to penalize the topological changes of the feature space during the subsequent incremental learning sessions. Experimental evaluations on two sets of datasets demonstrate the effectiveness of the proposed TP-DIOD method in mitigating the catastrophic forgetting problem and achieving high accuracy on both old and new domain datasets.

Abstract:
Object detection has developed rapidly in recent years with the help of deep learning technologies. However, object detection in drone views remains challenging for two main reasons: (1) It is difficult to detect small-scale objects that lack detailed information. (2) The diversity of drone camera angles brings dramatic differences in object scale. Although the feature pyramid network (FPN) alleviates the problem caused by scale differences to some extent, it also retains some worthless features, which wastes resources and reduces speed. In this work, we propose a novel High-Resolution Feature Pyramid Network (HR-FPN) to improve the detection accuracy of small-scale objects and avoid feature redundancy. The key components of HR-FPN include a high-resolution feature alignment module (HRFA), a high-resolution feature fusion module (HRFF) and a multi-scale decoupled head (MSDH). HRFA feeds multi-scale features from the backbone into parallel resampling channels to obtain high-resolution features at the same scale. HRFF establishes a bottom-up path to distribute context-rich low-level semantic information to all layers, which are then aggregated into a classification feature and a localization feature. MSDH copes with the scale differences of objects by separately predicting the categories and locations of objects at different scales. Moreover, we train the model with a scale-weighted loss to focus more on small-scale objects. Extensive experiments and comprehensive evaluations demonstrate the effectiveness and advancement of our approach.

Abstract:
For RGB-based temporal action segmentation (TAS), excellent methods that capture frame-level features have achieved remarkable performance. However, for motion-centered TAS, it is still challenging for existing methods that ignore the extraction of spatial features of joints. In addition, inaccurate action boundaries caused by the frames of similar motion destroy the integrity of the action segments. To alleviate the issues, an end-to-end Involving Distinguished Temporal Graph Convolutional Networks called IDT-GCN is proposed. First, we construct an enhanced spatial graph structure that adaptively captures the similar and differential dependencies between joints in a single topology through learning two independent correlation modeling functions. Then, the proposed Involving Distinguished Graph Convolutional (ID-GC) models the spatial correlations of different actions in a video by using multiple enhanced topologies on the corresponding channels. Furthermore, we design a generic modeling temporal action regression network, termed Temporal Segment Regression (TSR), to extract segmented encoding features and action boundary representations by modeling action sequences. Combining them with label smoothing modules, we develop powerful spatial-temporal graph convolutional networks (IDT-GCN) for fine-grained TAS, which notably outperforms state-of-the-art methods on the MCFS-22 and MCFS-130 datasets. Adding TSR to TCN-based baseline methods achieves competitive performance compared with the state-of-the-art transformer-based methods on RGB-based datasets, i.e., Breakfast and 50Salads. Further experimental results on the action recognition task verify the superiority of the enhanced spatial graph structure over the previous graph convolutional networks.

Abstract:
Thanks to the efficacy of Symmetric Positive Definite (SPD) manifold in characterizing video sequences (image sets), image set-based visual classification has made remarkable progress. However, the issue of large intra-class diversity and inter-class similarity is still an open challenge for the research community. Although several recent studies have alleviated the above issue by constructing Riemannian neural networks for SPD matrix nonlinear processing, the degradation of structural information during multi-stage feature transformation impedes them from going deeper. Besides, a single cross-entropy loss is insufficient for discriminative learning as it neglects the peculiarities of data distribution. To this end, this paper develops a novel framework for image set classification. Specifically, we first choose a mainstream neural network built on the SPD manifold (SPDNet) [25] as the backbone with a stacked SPD manifold autoencoder (SSMAE) built on the tail to enrich the structured representations. Due to the associated reconstruction error terms, the embedding mechanism of both SSMAE and each SPD manifold autoencoder (SMAE) forms an approximate identity mapping, simplifying the training of the suggested deeper network. Then, the ReCov layer is introduced with a nonlinear function for the constructed architecture to narrow the discrepancy of the intra-class distributions from the perspective of regularizing the local statistical information of the SPD data. Afterward, two progressive metric learning stages are coupled with the proposed SSMAE to explicitly capture, encode, and analyze the geometric distributions of the generated deep representations during training. In consequence, not only a more powerful Riemannian network embedding but also effective classifiers can be obtained. Finally, a simple maximum voting strategy is applied to the outputs of the learned multiple classifiers for classification. 
The proposed model is evaluated on three typical visual classification tasks using widely adopted benchmarking datasets. Extensive experiments show its superiority over state-of-the-art methods.
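
The final voting step over the outputs of multiple classifiers can be sketched as a plain majority vote. This is an illustrative reading of "maximum voting", not the authors' code; the label strings are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    With Python 3.7+, ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three learned classifiers for one image set.
votes = ["class_a", "class_b", "class_a"]
print(majority_vote(votes))  # class_a
```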

Abstract:
In this work, we focus on studying the differentiable relaxations of several linear regression problems, where the original formulations are usually both nonsmooth with one nonconvex term. Unfortunately, in most cases, the standard alternating direction method of multipliers (ADMM) cannot guarantee global convergence when addressing these kinds of problems. To address this issue, by smoothing the convex term and applying a linearization technique before designing the iteration procedures, we employ nonconvex ADMM to optimize challenging nonconvex-convex composite problems. In our theoretical analysis, we prove the boundedness of the generated variable sequence and then guarantee that it converges to a stationary point. Meanwhile, a potential function is derived from the augmented Lagrange function, and we further verify that the objective function is monotonically nonincreasing. Under the Kurdyka-Łojasiewicz (KŁ) property, the global convergence is analyzed step by step. Finally, experiments on face reconstruction, image classification, and subspace clustering tasks are conducted to show the superiority of our algorithms over several state-of-the-art ones.

Abstract:
Recently, deep learning has been widely employed across various domains. The Convolutional Neural Network (CNN), a popular deep learning algorithm, has been successfully utilized in object recognition tasks, such as face recognition, vehicle recognition, and license plate recognition. However, conventional methods for object recognition may not be appropriate for low-light image recognition due to information loss in the dark regions and unexpected noise that can impair object quality. Therefore, the development of specialized techniques for low-light image enhancement has become a major research focus for object detection. This paper proposes a gradient-based saliency map detection method with an improved ResNet architecture that outperforms previous works in detecting multiple or large objects. Additionally, the proposed method enhances images with the object as the center and emphasizes foreground-background differences. Compared with the original ResNet architecture, the proposed method achieves a 1.28× improvement in parameter count and 1.32× faster inference speed.

Abstract:
Infrared and visible image fusion aims to generate one image with comprehensive information. It can maintain rich texture characteristics and thermal information. However, for existing image fusion methods, the fused images either sacrifice the salience of thermal targets and the richness of textures or introduce the interference of useless information like artifacts. To alleviate these problems, an effective cross-modal coordinate attention network for infrared and visible image fusion called CCAFusion is proposed in this paper. To fully integrate complementary features, the cross-modal image fusion strategy based on coordinate attention is designed, which consists of the feature-awareness fusion module and the feature-enhancement fusion module. Moreover, a multiscale skip connection-based network is employed to obtain multiscale features in the infrared image and the visible image, which can fully utilize the multi-level information in the fusion process. To reduce the discrepancy between the fused image and the input images, a multiple constrained loss function including the base loss and the auxiliary loss is developed to adjust the gray-level distribution and ensure the harmonious coexistence of structure and intensity in fused images, thereby preventing the pollution of useless information like artifacts. Extensive experiments conducted on widely used datasets demonstrate that our CCAFusion achieves superior performance over state-of-the-art image fusion methods in both qualitative evaluation and quantitative measurement. Furthermore, the application to salient object detection reveals the potential of our CCAFusion for high-level vision tasks, which can effectively boost the detection performance.

Abstract:
This work introduces a new task, instance-incremental scene graph generation: given a point cloud of a scene, represent it as a graph and automatically add novel instances, so that a graph denoting the object layout of the scene is finally generated. It is an important task since it helps to guide the insertion of novel 3D objects into a real-world scene in vision-based applications like augmented reality. It is also challenging because the complexity of real-world point clouds makes it difficult to learn object layout experiences from the observation data (non-empty rooms with labeled semantics). We model this task as a conditional generation problem and propose a 3D autoregressive framework based on normalizing flows (3D-ANF) to address it. First, we represent the point cloud as a graph by extracting the label semantics and contextual relationships. Next, a model based on normalizing flows is introduced to map the conditional generation of graphic elements into the Gaussian process. The mapping is invertible. Thus, the real-world experiences represented in the observation data can be modeled in the training phase, and novel instances can be autoregressively generated based on the Gaussian process in the testing phase. To evaluate the performance of our method sufficiently, we implement this new task on the indoor benchmark dataset 3DSSG-O27R16 and our newly proposed graphical dataset of outdoor scenes, GPL3D. Experiments show that our method generates reliable novel graphs from real-world point clouds and achieves state-of-the-art performance on the datasets.

Abstract:
Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.

Abstract:
With the rapid development of deep learning models, great improvements have been achieved in the Visual Question Answering (VQA) field. However, modern VQA models are easily affected by language priors: they ignore image information and learn only the superficial relationship between questions and answers, even in the best pre-trained models. The main reason is that visual information is not fully extracted and utilized, which results in a domain gap between the vision and language modalities to a certain extent. To mitigate this problem, we propose to extract dense captions (auxiliary semantic information) from images to enhance the visual information for reasoning, and to utilize them to reduce the gap between vision and language, since the dense captions and the questions are from the same language modality (i.e., phrase or sentence). In this paper, we propose a novel dense caption-aware visual question answering model called DenseCapBert to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism to fuse dense captions, images, and questions in a unified framework, which makes the VQA models more robust. The experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions are beneficial to improving model generalization and that our model effectively mitigates the language bias problem.

Abstract:
Current semantic segmentation methods mainly focus on modeling the context of the global image to obtain high-quality segmentation results. However, they ignore the role of local image patches, which contain complementary and effective context information. In this paper, we propose an adaptive post-processing network (APPNet) for semantic segmentation based on the predictions of current methods in the global image and local image patches. The key point of APPNet is the global-local aggregation module, which models the context between global predictions and local predictions to generate accurate pixel-wise representation. Furthermore, we develop an adaptive points replacement module to compensate for the lack of fine detail in global prediction and the overconfidence in local predictions. Our method can be readily integrated into existing segmentation methods (i.e., ConvNeXt, HRNet, ViT-Adapter) with little memory and without extra modification in current models. We empirically demonstrate our method brings performance improvements across diverse datasets (i.e., Cityscapes, ADE20K, PASCAL-Context, COCO-Stuff). The code and models will be publicly available at https://github.com/zhu-gl-ux/APPN.

Abstract:
Light field (LF) depth estimation is a crucial basis for LF-related applications. Most existing methods are based on the Lambertian assumption and cannot deal with non-Lambertian surfaces such as transparent objects and mirrors. In this paper, we propose a novel Adaptive-Cross-Operator-based (ACO) depth estimation algorithm for non-Lambertian LF. By analyzing the imaging characteristics of non-Lambertian regions, we find that the difficulty of depth estimation lies in the photo-inconsistency of the center view. Combining with the two-branch structure, we propose ACO with an inter-branch cooperation strategy to adaptively separate depth information with different reflectance coefficients. We discover that the bimodal distribution feature of the operator filtering results can assist in the separation of multi-layer scene information. The first detection branch filters the EPI and implicitly records the severity of multi-layer scene aliasing. According to the identification of bimodal distribution features, the non-Lambertian regions are marked out and the depth of the foreground is estimated. The second branch receives guidance from the first to dynamically adjust the inner weight and infer the background’s depth after weakening the interference from the foreground. Finally, the depth information separation of multi-layer scenes is achieved by extracting the unique X-shaped linear structure. Without requiring the reflection coefficients of the non-Lambertian object, the proposed method can produce high-quality depth estimation under transparency ranging from 90% to 20%. Experimental results show that the proposed ACO outperforms state-of-the-art LF depth estimation methods in terms of accuracy and robustness.

Abstract:
The recent success of text-to-image generation diffusion models has also revolutionized semantic image editing, enabling the manipulation of images based on query/target texts. Despite these advancements, a significant challenge lies in the potential introduction of contextual prior bias in pre-trained models during image editing, e.g., making unexpected modifications to inappropriate regions. To address this issue, we present a novel approach called Dual-Cycle Diffusion, which generates an unbiased mask to guide image editing. The proposed model incorporates a Bias Elimination Cycle that consists of both a forward path and an inverted path, each featuring a Structural Consistency Cycle to ensure the preservation of image content during the editing process. The forward path utilizes the pre-trained model to produce the edited image, while the inverted path converts the result back to the source image. The unbiased mask is generated by comparing differences between the processed source image and the edited image to ensure that both conform to the same distribution. Our experiments demonstrate the effectiveness of the proposed method, as it significantly improves the D-CLIP score from 0.272 to 0.283. The code will be available at https://github.com/JohnDreamer/DualCycleDiffsion.

Abstract:
Synthesizing color images based on line art while considering the styles of reference photos is a flexible form of artistic creation that has recently attracted public attention. Previous approaches usually require large datasets for training, which greatly limits their practical application. Besides, the sparsity of line art pictures often leads to a failure in learning valid mappings. To this end, we present SDL, a self-driven dual-path framework for reference-based line art colorization under limited data. Given small training sets containing sketch-image pairs, SDL first utilizes a novel Dynamic Pseudo Sample Generator (DPSG) to produce large quantities of fake samples. Then, we introduce a dual-path network to achieve better visual effects, in which the Content-Generation Path reconstructs reliable content features to help establish multi-level correspondence in the Content-Color Aggregation Module (CCAM) of the Color-Transfer Path. Furthermore, we develop a Region-aware Contrastive Scheme (RCS) to focus on fine-grained details and a Style-augmented Contrastive Scheme (SCS) to encourage style consistency. Experiments verify the superiority of our model compared with existing works. We also demonstrate that SDL outperforms state-of-the-art self-driven methods even though they use much more data than ours (30× on the CelebA-HQ Dataset and 17× on the ASCP Dataset).

Abstract:
Classic unsupervised anomaly detection learns normative patterns from normal behavior and assumes that unforeseen anomalous behavior will result in significant prediction deviations. However, anomaly detection in specific situations faces challenges in detecting ambiguous behavior in which the abnormal representation is not particularly intuitive. Existing anomaly detection approaches perform poorly for ambiguous behavior due to limited normative representational capacity, resulting in a narrow normality gap. We observe that the ambiguity of behavior comes from the contradiction between the properties of appearance and motion modalities. In this paper, we propose a novel memory-guided autoencoder named appearance-motion synergy autoencoder to detect anomalous behavior by event prediction. To address the above challenge, we leverage the synergy of the normative appearance-motion modalities to strengthen the representation of normative patterns and improve the detection of ambiguous behavior. Specifically, we design the memory networks with dynamic fusion mechanisms to integrate the correlated appearance-motion information and to remember normal patterns. A consistency measurement unit is designed to optimize the consistency of normative appearance-motion features via a joint distribution measurement pool. A larger normality gap in detecting ambiguous behavior in our approach enhances the abnormal detection capability. Extensive experiments demonstrate our superiority in detecting anomalous behavior.

Abstract:
Hyperspectral image segmentation is an emerging area with numerous applications, including agriculture, forestry, environment monitoring, and remote sensing. This paper proposes a new neural architecture search algorithm, named AdaptorNAS, for hyperspectral image segmentation. AdaptorNAS aims to design the optimum decoder for any given encoder. In our approach, the search space of AdaptorNAS is a large deep neural network (DNN), and the optimal decoder is derived by pruning the large DNN via a perturbation-based pruning strategy. Verified on three popular encoders, i.e., ResNet-34, MobileNet-V2, and EfficientNet-B2, AdaptorNAS can design high-speed decoders that are significantly better than six common hand-crafted decoders. Additionally, with the EfficientNet-B2 encoder, AdaptorNAS (mIoU of 92.47% and mDice of 95.15%) outperforms the state-of-the-art NAS algorithms and hand-crafted network architectures on the hyperspectral image segmentation task. We also introduce a new hyperspectral image dataset of 4,625 images for objective evaluation in hyperspectral image segmentation research.
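
The mIoU and mDice figures quoted above are standard segmentation metrics. As a rough illustration of how such numbers are typically computed, the following minimal sketch averages per-class IoU and Dice over flat label lists. The function name `miou_and_mdice` and the choice to skip classes absent from both prediction and ground truth are assumptions made here, not details taken from the paper.

```python
def miou_and_mdice(pred, target, num_classes):
    """Mean IoU and mean Dice over classes, for flat lists of integer labels.

    Assumption: a class that appears in neither pred nor target is skipped,
    which is one common convention among several.
    """
    ious, dices = [], []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / (tp + fp + fn))            # IoU = TP / (TP + FP + FN)
        dices.append(2 * tp / (2 * tp + fp + fn))   # Dice = 2TP / (2TP + FP + FN)
    return sum(ious) / len(ious), sum(dices) / len(dices)


# Toy example: four pixels, two classes.
miou, mdice = miou_and_mdice([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(miou, 4), round(mdice, 4))
```

Per class, Dice is never lower than IoU, which is consistent with the reported mDice exceeding the reported mIoU.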

Abstract:
Cloud services are a natural choice for storing and managing the exponentially growing volume of images. Data privacy is one of the main concerns in cloud-based image services. Reversible data hiding over encrypted images (RDH-EI) is an effective technique to securely store and manage confidential images in the cloud. However, existing RDH-EI schemes have obvious weaknesses, such as dependence on a reliable key management system and a single point of failure. To securely store and manage confidential images in the cloud, in this study, we propose a new reversible data hiding strategy via image secret sharing. We first design a secure (r,n)-threshold preprocessing-free matrix secret sharing (PFMSS) technique. It can directly share m-bit data by matrix multiplication without preprocessing. Using the PFMSS, we further design a secure (r,n)-threshold reversible data hiding scheme over encrypted images. The content owner divides a confidential image into n shares without accessing a secret encryption key, and then sends the n shares to n cloud-based image servers from competing providers. For each share, some additional data, e.g., integrity and identification information of the image, can be embedded into it, and these data can also be losslessly extracted. An authorized receiver can recover the confidential image from r shares. By design, the content owner doesn’t need to access a secret key when encrypting the image and the scheme can withstand n-r points of failure. Simulation results show that our scheme can ensure image content confidentiality and has a much larger embedding capacity compared to state-of-the-art schemes.
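
To make the (r,n)-threshold property concrete, here is a minimal sketch of classical Shamir-style secret sharing for a single pixel value. This is explicitly not the paper's PFMSS construction, whose details the abstract does not give. The field size 257 and the names `make_shares` and `recover` are assumptions chosen for illustration. Share generation amounts to multiplying a Vandermonde row by a coefficient vector, which is one simple way sharing can be phrased as a matrix product.

```python
import random

PRIME = 257  # smallest prime above 255, so every 8-bit pixel value is a field element
_RNG = random.SystemRandom()  # OS-backed randomness for the hidden coefficients


def make_shares(secret, r, n):
    """Split one value in [0, PRIME) into n shares; any r of them recover it."""
    if not (1 <= r <= n < PRIME):
        raise ValueError("need 1 <= r <= n < PRIME")
    # Random polynomial of degree r-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [_RNG.randrange(PRIME) for _ in range(r - 1)]
    shares = []
    for x in range(1, n + 1):
        # Row (1, x, x^2, ...) of a Vandermonde matrix times the coefficient vector.
        y = sum(c * pow(x, k, PRIME) for k, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares


def recover(shares):
    """Lagrange interpolation at x = 0 from at least r distinct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret


# (3,5)-threshold example: any 3 of the 5 shares recover the pixel value,
# so up to n - r = 2 servers may fail.
shares = make_shares(200, r=3, n=5)
assert recover([shares[0], shares[2], shares[4]]) == 200
```

One limitation of this sketch is that a share value can equal 256, which does not fit in a single byte; a practical image scheme needs some way of handling that case.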

Abstract:
Image inpainting based on generative adversarial networks (GANs) has achieved great success in producing visually plausible images and plays an important role in many real tasks. However, the techniques of image inpainting might also be maliciously used, e.g., altering or removing objects of interest to report fake news. Despite the promising performance of recently developed inpainting detection algorithms, they are built on convolutional neural networks (CNNs) with limited receptive fields. Consequently, they fail to fully capture the disparity between the inpainted regions and untouched regions and thus are ineffective in obtaining fine-grained detection results. In this work, we develop a new image inpainting detection approach. First, we propose a locally enhanced transformer architecture tailored for image inpainting detection. Unlike previous CNN-based methods, our approach leverages both the short-range and long-range dependencies of pixels, enabling the learning of diverse statistical behaviors of inpainted and untouched regions. Second, to mitigate the distraction caused by near-edge pixels with a mixed nature during training, we propose decoupling the label into a body map and a soft-edge map, and then a cross-modality attention module is designed to propagate their information interactively. Experiments demonstrate that our decoupling strategy outperforms the conventional edge supervision in enhancing detection accuracy. Finally, we devise a constrained adversarial training methodology in consideration of the confrontational generation procedure of deep image inpainting methods. Experiments show that our constrained adversarial training further enhances the detection performance by adaptively introducing interference noise in the inpainted regions. Extensive experiments validate the superiority of our scheme compared to existing CNN-based methods, showcasing its desirable detection generalizability for both deep inpainting and traditional inpainting algorithms.

Abstract:
Current deep learning models often catastrophically forget the knowledge of old classes when continually learning new ones. State-of-the-art approaches to continual learning of image classes often require retaining a small subset of old data to partly alleviate the catastrophic forgetting issue, and their performance would be degraded sharply when no old data can be stored due to privacy or safety concerns. In this study, inspired by human learning of visual knowledge with the effective help of language, we propose a novel continual learning framework based on a pre-trained vision-language model (VLM) without retaining any old data. Rich prior knowledge of each new image class is effectively encoded by the frozen text encoder of the VLM, which is then used to guide the learning of new image classes. The output space of the frozen text encoder is unchanged over the whole process of continual learning, through which image representations of different classes become comparable during model inference even when the image classes are learned at different times. Extensive empirical evaluations on multiple image classification datasets under various settings confirm the superior performance of our method over existing ones. The source code is available at https://github.com/Fatflower/CIL_LG_VLM/.

Abstract:
Video Visual Relation Detection (VidVRD) is a pivotal task in the field of video analysis. It involves detecting object trajectories in videos, predicting potential dynamic relations between these trajectories, and ultimately representing these relationships in the form of triplets. Correct prediction of relations is vital for VidVRD. Existing methods mostly adopt the simple fusion of visual and language features of entity trajectories as the feature representation for relation predicates. However, these methods do not take into account the dependency information between the relation predicate and the subject and object within the triplet. To address this issue, we propose the entity dependency learning network (EDLN), which can capture the dependency information between relation predicates and subjects, objects, and subject-object pairs. It adaptively integrates this dependency information into the feature representation of relation predicates. Additionally, to effectively model the features of the relations existing between various object entity pairs, in the context encoding phase for relation predicate features, we introduce a fully convolutional encoding approach as a substitute for the self-attention mechanism in the Transformer. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed EDLN.

Abstract:
Recently, many Light Field Salient Object Detection (LF SOD) methods have been proposed. However, guaranteeing the completeness of the generated salient object map and recovering more of its high-frequency details remain challenging. To this end, we propose a spatial attention-guided LF SOD network with implicit neural representation to further improve LF SOD performance. We adopt an encoder-decoder structure for model construction. In order to ensure the completeness of the generated salient object map, a multi-modal and multi-scale feature fusion module is designed in the encoder part to refine the salient regions within the all-in-focus image and aggregate the focal stack and all-in-focus image in a spatial attention-guided manner. In order to recover more high-frequency details of the obtained salient object map, an implicit detail restoration module is proposed in the decoder part. By virtue of implicit neural representation, we convert the detail restoration problem into a functional mapping problem. By further integrating the self-attention mechanism, the derived saliency map can be depicted at a more refined level. Comprehensive experimental results demonstrate the superiority of the proposed method. Ablation studies and visual comparisons further validate that the proposed method can guarantee the completeness and recover more high-frequency detail information of the obtained saliency map. The code is publicly available at https://github.com/ldyorchid/LFSOD-Net.

Abstract:
Pose estimation plays a crucial role in human-centered vision applications. Some recent efforts achieved pose estimation by keypoints detection. Drawing inspiration from object detection, they treated keypoints as objects and achieved unbiased estimation through implementation of classification and regression heads. However, they still failed to achieve satisfactory performance for detecting heavily occluded keypoints and required elaborate and unavoidable post-processing steps. With a thorough exploration of keypoints’ characteristics, we have developed a novel Adaptive positive Sample selection and dynamic soft Label Assignment (ASLA) scheme tailored for keypoint detection. Specifically, we select positive samples for each keypoint according to the summation distance from the sample coordinates and their predicted coordinates to their corresponding ground truth (GT) in the training phase. For occluded keypoints, the positive samples defined by our method may fall in the semantically relevant regions of pedestrians, rather than the spatially adjacent regions of obstructions, significantly improving their localization performance. Meanwhile, we dynamically assign classification labels to these positive samples based on the distance between their predicted coordinates and their corresponding GT, which ensures that high quality positive samples are assigned with high classification labels. Benefiting from the practical design of our ASLA, the post-processing step is not essential; however, the simple vector-level post-processing would be the icing on the cake. Finally, we extensively evaluate our ASLA performance on two popular human pose estimation benchmarks, COCO and MPII, and comprehensive experiments show that our ASLA significantly outperforms state-of-the-art algorithms. Our code and models will be available at https://github.com/SCUT-BIP-Lab/ASLA.

Abstract:
No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black box attack against NR-IQA methods. We propose the concept of score boundary and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our method outperforms all compared state-of-the-art attack methods and is far ahead of previous black-box methods. The effective NR-IQA model DBCNN suffers a Spearman’s rank-order correlation coefficient (SROCC) decline of 0.6381 attacked by our method, revealing the vulnerability of NR-IQA models to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.
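
The SROCC decline reported above measures how much the rank agreement between predicted scores and human opinion scores drops under attack. The following minimal sketch computes Spearman's rank-order correlation as the Pearson correlation of average ranks. The function names and the toy score lists are hypothetical and are not data from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores: human opinion vs. model predictions before and after an attack.
human = [1, 2, 3, 4, 5]
clean_pred = [1.1, 2.3, 2.9, 4.2, 4.8]
attacked_pred = [3, 1, 5, 2, 4]
decline = srocc(human, clean_pred) - srocc(human, attacked_pred)
```

In this toy case the clean predictions preserve the human ranking exactly, so the decline equals one minus the post-attack correlation.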

Affiliations: Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing Key Laboratory for Cooperative Vehicle Infrastructure Systems and Safety Control, School of Transportation Science and Engineering, Beihang University, Beijing, China; Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing, China; Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Sydney, NSW, Australia

Abstract:
Traffic video question answering (TrafficVQA) constitutes a specialized VideoQA task designed to enhance the basic comprehension and intricate reasoning capacities of videos, specifically focusing on traffic events. Recent VideoQA models employ pretrained visual and textual encoder models to bridge the feature space gap between visual and textual data. However, in addressing the unique challenges inherent to the TrafficVQA task, three pivotal issues must be addressed: (i) Dimension Gap: Between the pretrained image (appearance feature) and video (motion feature) models, there exists a conspicuous dimension difference in static and dynamic visual data; (ii) Scene Gap: The common real-world datasets and the traffic event datasets differ in visual scene content; (iii) Modality Gap: A pronounced feature distribution discrepancy emerges between traffic video and text data. To alleviate these challenges, we introduce the coarse-fine multimodal contrastive alignment network (CFMMC-Align). This model leverages sequence-level and token-level multimodal features, grounded in an unsupervised visual multimodal contrastive loss to mitigate dimension and scene gaps and a supervised visual-textual contrastive loss to alleviate modality discrepancies. Finally, the model is validated on the challenging public TrafficVQA dataset SUTD-TrafficQA and outperforms the state-of-the-art method by a substantial margin (50.2% compared to 46.0%). The code is available at https://github.com/guokan987/CFMMC-Align.
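
The abstract does not spell out the form of its contrastive losses. As a generic illustration of supervised visual-textual contrastive alignment, the sketch below implements an InfoNCE-style loss in which the i-th video feature should match the i-th text feature and all other texts in the batch act as negatives. The function names, the temperature value, and the toy features are assumptions, and this is not claimed to be the loss used in CFMMC-Align.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def info_nce(video_feats, text_feats, temperature=0.1):
    """Mean InfoNCE loss, treating (video_feats[i], text_feats[i]) as the positive pair."""
    losses = []
    for i, v in enumerate(video_feats):
        logits = [cosine(v, t) / temperature for t in text_feats]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive pair
    return sum(losses) / len(losses)


# Toy features: aligned pairs should give a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(videos, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls matching video and text features together and pushes mismatched ones apart, which is the general idea behind reducing a modality gap.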

Abstract:
It has long been an ill-posed problem to predict absolute depth maps from single images in unseen scenes. We observe that this is due not only to the scale-ambiguous problem but, more importantly, to the focal-ambiguous problem, which decreases the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. Our model can be trained on either a single dataset or a mixed dataset with diverse focal lengths and scene scales by means of a dual-directional alignment strategy. In addition, a new pipeline is developed to augment the diversity of focal lengths of public datasets, which are often captured with cameras of the same or similar focal lengths. The experiments verify that our model trained on NYUDv2 significantly improves the generalization ability of monocular depth estimation by 32%/14% (RMSE) on three unseen datasets with/without data augmentation compared with state-of-the-art (SOTA) baselines, and alleviates the deformation problem of depth maps in 3D view. The generalization ability is further improved by 16% when the model is trained on a mixture of NYUDv2 and SUNRGBD. In addition, our model maintains SOTA accuracy, similar to existing models, when it is trained and tested on NYUDv2. The code is released at https://github.com/wcrwcrwcr/FS-Depth-v1.

Abstract:
The introduction of depth/thermal modality has significantly enhanced the performance of dual-modal salient object detection (SOD) methods. However, depth maps and thermal images are prone to environmental interference, making them insufficient for providing salient information. To address this challenge, triple-modal SOD methods have been proposed. However, these methods often overlook the detrimental effects of defective modalities during fusion, leading to subpar performance. To tackle this issue, we present a novel dynamic weighted fusion and progressive refinement network (DWFPRNet) for Visible-Depth-Thermal (V-D-T) SOD. Specifically, we first use the dual-modal fusion module (DFM) to fuse dual modalities, thereby obtaining fused features. Subsequently, the modality selective fusion module (MSFM) mines complementary information between fused features, considering both fusion features and the quality of feature maps, to achieve weighted fusion. Finally, we design a progressive refinement decoder (PRD) to realize interaction and multi-scale learning among different scale features and generate high-quality saliency maps. Extensive experiments conducted on the VDT-2048 public dataset demonstrate that our method outperforms existing state-of-the-art multi-modal methods.

Abstract:
To combine the advantages of deterministic and probabilistic 3D human pose estimation methods, we decompose pose estimation into two processes: hypotheses generation and hypotheses aggregation. For hypotheses generation, we propose a novel Diffusion-based 3D Pose generation (D3DP) method. D3DP generates a diversified group of plausible 3D pose hypotheses from a single 2D keypoint observation. Utilizing a diffusion process, it gradually transforms ground-truth 3D poses towards a random distribution, subsequently employing a conditioned denoiser guided by the observed keypoints to recover the uncorrupted 3D poses. Moreover, D3DP is compatible with existing deterministic 3D pose estimators and allows users to optimize the trade-off between computational efficiency and pose accuracy via two adjustable parameters. For hypotheses aggregation, we propose two alternative approaches: a Reprojection-Based Selection (RBS) method and a Hypotheses Selection Network (HSN). These methods adopt the joint-level strategy to assemble multiple hypotheses generated by D3DP into a single 3D pose for practical use. Specifically, RBS reprojects 3D pose hypotheses to the 2D camera plane, and selects the best hypothesis based on the reprojection errors. HSN evaluates each hypothesis and selects the hypothesis with the highest confidence score as the output. Then these selected joints are combined into the final pose. The proposed methods implement a joint-by-joint aggregation strategy that capitalizes on the 2D prior and temporal information, both of which have been ignored by previous pose-level methods. Extensive experiments on two benchmarks highlight that the proposed method outperforms the state-of-the-art deterministic and probabilistic approaches.
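
The joint-level, reprojection-based selection described above can be sketched as follows. For each joint, every 3D hypothesis is projected to the image plane and the hypothesis with the smallest 2D error is kept; the chosen joints are then assembled into one pose. The simple pinhole camera model, the function names, and the toy data are assumptions made for illustration and are not the paper's implementation.

```python
import math


def project(point3d, focal=1.0, center=(0.0, 0.0)):
    """Pinhole projection of a camera-space point (x, y, z) with z > 0."""
    x, y, z = point3d
    return (focal * x / z + center[0], focal * y / z + center[1])


def select_joints_by_reprojection(hypotheses, keypoints2d, focal=1.0, center=(0.0, 0.0)):
    """Build one pose by picking, per joint, the hypothesis with the lowest reprojection error.

    hypotheses:  list of poses, each a list of (x, y, z) joints
    keypoints2d: list of observed (u, v) keypoints, one per joint
    """
    final_pose = []
    for j, (u_obs, v_obs) in enumerate(keypoints2d):
        def reprojection_error(h):
            u, v = project(hypotheses[h][j], focal, center)
            return math.hypot(u - u_obs, v - v_obs)

        best = min(range(len(hypotheses)), key=reprojection_error)
        final_pose.append(hypotheses[best][j])
    return final_pose


# Two hypothetical two-joint hypotheses; each is better on a different joint.
hyp_a = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
hyp_b = [(0.2, 0.0, 2.0), (0.4, 0.4, 2.0)]
observed = [(0.1, 0.0), (0.5, 0.5)]
pose = select_joints_by_reprojection([hyp_a, hyp_b], observed)
```

Here the assembled pose takes joint 0 from the second hypothesis and joint 1 from the first, which a pose-level selection (choosing one whole hypothesis) could not do.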

Abstract:
Varying environmental conditions pose challenges to existing object detection methods, as they lead to changes in the overall feature distribution of images. Underwater images are particularly susceptible to changes in environmental conditions, resulting in phenomena like color deviation. This paper proposes an object detection model, FIOD-VUE, which focuses on invariant information across different underwater environments to enhance the model’s generalization capability. Inspired by frequency domain analysis, we design a Frequency-Invariant Attention (FIA) module. This module uses frequency filters to focus on specific frequency signals, i.e., cross-domain invariant information. Additionally, we design the Multi-scale Image-level Feature Alignment (MIFA) to adaptively adjust the frequency filters in the FIA and assist the backbone in extracting domain-confusion features. Through adversarial training, the distribution gap between the source domain and target domain is reduced. To enrich the domain shift database, we also provide an HD-Deepfish dataset. Numerous experiments on the S-UODAC2020 and the HD-Deepfish datasets were conducted and yielded impressive results, with average precision (AP) scores of 56.8% and 37.1%, respectively, surpassing the performance of the existing underwater object detection (UOD) models. The code is released at: https://github.com/JOU-UIP/FIOD-VUE.

Abstract:
Leg agility is a key indicator of bradykinesia, which in turn is a cardinal manifestation of Parkinson’s disease (PD). In fact, automated video assessment of the leg-agility task is critically required for improving the efficiency and objectivity of PD diagnosis. Therefore, we propose a causality-informed graph convolutional network to extract discriminative clinically-meaningful motion features from human skeletons in videos, finally achieving stable leg-agility 5-point scoring. The proposed scheme systematically mines causal features of each skeleton graph from graph node, structure, and representation levels. Specifically, we firstly developed a causality-informed node selection mechanism to mine the graph nodes representing the discriminative features, and thus identify nodes causally correlated to the clinical assessment aspects and suppress the interference from other nodes. Afterwards, a causality-informed structure generation mechanism was designed to generate a graph structure encoding the connections between the discriminative nodes, hence maintaining the discriminability of features associated with these causality-informed nodes. Finally, we employed a clinically-driven self-supervised learning scheme to embed clinical prior knowledge into the proposed model and hence boost the clinical significance of the causality-informed graph nodes, structures, and representations. The proposed method achieved a 71.11% accuracy and a 98.93% acceptable accuracy on a large clinical video dataset. Its effectiveness was also confirmed on an independent test set, and the obtained results exhibited interpretability from modeling and clinical perspectives. In conclusion, our method provides a highly stable scheme for objective video quantification of bradykinesia. Our source code will be released at https://github.com/SJTUBME-QianLab/PD-CIGCN.

Abstract:
Salient object detection and camouflaged object detection have attracted increasing attention due to their significant practical applications. While these two domains share similarities in recognition methods and object characteristics, they also exhibit distinctions. In this paper, we propose a novel multi-view guided network for camouflaged and salient object detection, utilizing the Transformer as the backbone network for feature extraction. Capitalizing on shared characteristics, we introduce a CNN-based multi-view encoder and a multi-view fusion module, enhancing the acquisition of multi-perspective information while minimizing the increase in computational cost. Moreover, recognizing domain differences, we incorporate an attention exploration module, seamlessly integrating multi-view features with globally extracted features from the backbone network. This integration involves simultaneous exploration from both positional and color perspectives, unearthing valuable information to identify salient and camouflaged objects. Our approach maximizes shared characteristics between the two tasks while effectively addressing their differences, leading to precise object identification—be it for camouflaged or salient objects. Extensive experiments on nine challenging benchmark datasets demonstrate the superior performance of our method across four widely used evaluation metrics, outperforming 34 state-of-the-art methods. Furthermore, we applied our method to other visually-related tasks, such as polyp segmentation and defect detection. The results further demonstrate the versatility of our model. The source code and results of our method are available at https://github.com/1900zpf/MVGNet.

Abstract:
Joint video moment retrieval and highlight detection is an emerging and challenging research task. It requires the generation of robust joint task features to satisfy the demands of video moment retrieval and video highlight detection. Moreover, it involves the interaction of multiple modalities. Presently, methods typically focus on the design of distinct enhancement modules and the addition of supplementary input data to improve the solution for joint video moment retrieval and highlight detection. However, they overlook subtask interference during joint training. Joint task learning leverages the correlations and complementarities between tasks, yet it also introduces task interference arising from the differences between tasks. In order to address task interference, we propose a subtask prior-driven optimized mechanism. The mechanism consists of two stages. In the free stage, we train subtask models to obtain subtask prior features. In the constrained stage, the joint task model is constrained by the subtask prior features. Besides, we propose a cross adaptive-gated mechanism. It addresses the issue of information loss in cross-modal fusion and filters out redundant information by conducting cross-modal interaction during feature compression and an adaptive gating process. Extensive experimental results demonstrate the effectiveness of the subtask prior-driven optimized mechanism and the cross adaptive-gated transformer in joint video moment retrieval and highlight detection.

Abstract:
To meet users’ demands for video retrieval, text-video cross-modal retrieval technology continues to evolve. Methods based on pre-trained models and transfer learning are widely employed in designing cross-modal retrieval models, significantly enhancing the accuracy of video retrieval. However, these methods exhibit shortcomings when it comes to studying the relationships between video frames, preventing the model from fully establishing the hidden semantic relationships within video features. To further deduce the implicit semantic relationships among video frames, we propose a cross-modal retrieval model based on graph convolutional networks (GCN) and visual semantic inference (GVSI). The GCN is utilized to establish relationships between video frame features, facilitating the mining of hidden semantic information across video frames. In order to use text semantic features to help the model to infer temporal and implicit semantic information between video frames, we introduce a semantic mining and temporal space (SM&TS) inference module. Additionally, we design semantic alignment modules (SA_M) to align explicit and implicit object features present in both video and text. Finally, we analyze and validate the effectiveness of the model using MSR-VTT, MSVD, and LSMDC datasets.

Abstract:
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where a backdoored model behaves normally with clean inputs but exhibits attacker-specified behaviors upon the inputs containing triggers. Most previous backdoor attacks mainly focus on either the all-to-one or all-to-all paradigm, allowing attackers to manipulate an input to attack a single target class. Besides, the two paradigms rely on a single trigger for backdoor activation, rendering attacks ineffective if the trigger is destroyed. In light of the above, we propose a new M-to-N attack paradigm that allows an attacker to manipulate any input to attack N target classes, and each backdoor of the N target classes can be activated by any one of its M triggers. Our attack selects M clean images from each target class as triggers and leverages our proposed poisoned image generation framework to inject the triggers into clean images invisibly. By using triggers with the same distribution as clean training images, the targeted DNN models can generalize to the triggers during training, thereby enhancing the effectiveness of our attack on multiple target classes. Extensive experimental results demonstrate that our new backdoor attack is highly effective in attacking multiple target classes and robust against pre-processing operations and existing defenses.

Abstract:
Motion capture technology is crucial in various applications like animation, virtual reality and sports analysis. With the development of deep learning methods, significant progress has been experienced in this field, producing cost-effective and user-friendly solutions for various applications. This paper provides a comprehensive review of deep learning-based human motion capture techniques. Our review aims to bridge the gap between academic research and practical applications, providing valuable insights and guidance for researchers and practitioners in deep learning-based human motion capture. Our study puts forth a new application-oriented taxonomy that comprehensively summarises five fundamental routes of motion capture technology. In addition to that, we also delve into the research priorities linked with each route, following the structure of “hardware requirements - technical routes - datasets - evaluation metrics” and extending the necessary criteria for transferring traditional motion capture systems to deep learning-based ones. Meanwhile, for the motion capture technology, the current state of the art is reviewed, the challenges are identified, and the future directions of the research are outlined.

Abstract:
Remote photoplethysmography (rPPG) has considerable significance in areas such as disease diagnosis and emotion analysis. Recent rPPG models have demonstrated excellent performance due to their powerful heart rate information extraction capabilities. However, these models often focus on limited regions of interest (ROI) in facial images, which makes them sensitive to interference. If the ROI is affected by muscle movement, lighting variation, or noise, the model’s performance degrades significantly. To address this limitation, we propose a two-stage model called MaskFusionNet. The model includes two stages: 1) During the pre-training stage, the mask-reconstruction mechanism drives MaskFusionNet to learn rPPG information from various facial regions by applying a tube masking strategy. This enhances the model’s ability to resist interference. Based on the periodicity and continuity of the heart rate signal, we also design a novel spatio-temporal reconstruction loss function that focuses on the data’s spatial features and temporal continuity. 2) In the fine-tuning stage, we propose the Multi-Scale Fusion Block (MFB) to combine multi-scale features from the dual-stream network. It allows the model to detect subtle heart rate variations in adjacent frames while minimizing the impact of interference by extracting features within longer segments. The transformer-based MaskFusionNet can extract multi-scale fused heart rate features from a wide range of skin regions while preserving the modeling capability of long-range sequence information. To validate its effectiveness, we extensively evaluate our model on three benchmark datasets (VIPL-HR, COHFACE, and PURE), demonstrating its superior performance in both intra-dataset and cross-dataset testing scenarios.
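
The tube masking strategy mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common meaning of the term, namely that one random set of spatial patches is hidden in every frame of a clip. The function name `tube_mask`, its parameters, and the patch-grid layout are hypothetical and are not taken from MaskFusionNet.

```python
import random

def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Build a boolean mask of shape [num_frames][grid_h][grid_w].

    True marks a masked patch. The same spatial patches are masked in
    every frame, so each masked patch forms a "tube" along the time axis.
    """
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Choose which spatial patch indices to hide (identical for all frames).
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [[(r * grid_w + c) in masked for c in range(grid_w)]
                  for r in range(grid_h)]
    # Copy the single-frame mask along the temporal axis.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]

mask = tube_mask(num_frames=4, grid_h=4, grid_w=4, mask_ratio=0.75,
                 rng=random.Random(0))
```

Because the mask is constant over time, a model cannot recover a hidden patch simply by copying it from a neighbouring frame.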

Abstract:
Most existing image deblurring methods construct statistical priors to describe the difference between blurry and clear images. They discard position information and ignore how pixel features change during deblurring, which results in inferior restoration performance for images that do not satisfy the corresponding assumptions. Intuitively, the fuzziness of pixels belonging to different image regions decreases as an image is deblurred. This phenomenon intrinsically describes the pixel characteristics. To this end, we analyze the fuzziness of pixels and objects in a blurry image, and utilize the similarity between two fuzzy objects on image pixels to depict the blur degree of an image, which is inspired by overlap functions and overlap indices. To minimize the similarity between fuzzy objects, we introduce a non-parametric model to construct an integer programming problem. Energy minimization can significantly reduce the similarity between two fuzzy objects. Experimental results show that the proposed method achieves better performance than state-of-the-art blind deblurring methods on benchmark datasets and natural images.

Abstract:
Class-agnostic binary segmentation identifies objects that are similar to or very different from the complex background, including salient object detection (SOD) and camouflage object detection (COD). Most existing models only focus on a specific type of foreground and background segmentation by employing the global modeling ability of transformers, without explicitly explaining or eliminating the discrepancy between these two different distributions. They also suffer from inefficient local feature learning and inadequate feature aggregation. To make binary segmentation research more accessible and easier to generalize, we introduce a novel unified uncertainty-aware paradigm, called uncertainty-aware feature reassembly (UAFer). Specifically, the Spatial Feature Reassembly (SFR) module is presented to formulate the uncertainty of the binary segmentation map from two perspectives: the variance of the generalized Bernoulli distribution and the entropy. Our transformer-based model is then trained to prioritize regions of higher certainty, obtaining more confident and accurate predictions during feature upsampling. Moreover, the Channel Feature Reassembly (CFR) with adjacent feature aggregation is designed to facilitate an iterative exploration of channel integrity. This iterative learning process enhances the interaction of neighboring channel features, thus improving the decoding efficiency of universal object information. Extensive quantitative and qualitative evaluations demonstrate that our proposed UAFer consistently outperforms state-of-the-art models across three challenging domains: SOD, COD, and polyp segmentation (POLYP). The implementation code for our approach will be publicly available at https://github.com/zihaodong/UAFR.
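
To make the two uncertainty measures concrete, the sketch below computes the variance and the entropy of a per-pixel foreground probability. It uses the plain two-outcome Bernoulli case as a simplification of the generalized Bernoulli distribution named in the abstract; the function names and the `eps` clamp are illustrative assumptions, not the SFR module itself.

```python
import math

def bernoulli_uncertainty(p, eps=1e-12):
    """Return (variance, entropy in nats) for a foreground probability p."""
    variance = p * (1.0 - p)
    # Clamp p away from 0 and 1 so that log() stays finite.
    q = min(max(p, eps), 1.0 - eps)
    entropy = -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))
    return variance, entropy

def uncertainty_maps(prob_map):
    """Apply bernoulli_uncertainty to every pixel of a 2-D probability map."""
    var_map = [[bernoulli_uncertainty(p)[0] for p in row] for row in prob_map]
    ent_map = [[bernoulli_uncertainty(p)[1] for p in row] for row in prob_map]
    return var_map, ent_map

var_map, ent_map = uncertainty_maps([[0.5, 0.9], [0.0, 1.0]])
```

Both measures peak at p = 0.5 and vanish as p approaches 0 or 1, which is why either can flag ambiguous pixels near object boundaries.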

Abstract:
In the realm of large-scale industrial manufacturing, the precise detection of defective parts stands as a critical imperative. While current unsupervised anomaly detection algorithms exhibit commendable accuracy when applied to clean training datasets, their susceptibility to contaminated training data limits their real-world efficacy. In response to this challenge, this paper proposes a novel Outlier-Probability-Based Feature Adaptation (OPFA) network to realize robust unsupervised anomaly detection on contaminated training data. This method distinguishes itself by maintaining both high accuracy and robustness in the face of contaminated training data, enabling effective learning of discriminative features for anomaly detection. Specifically, the model enhances feature representations through the contraction of normal features and the contrast between normal and outlier features. Our methodology employs an iterative mechanism, featuring three core designs. First, outlier detection evaluates the outlier probabilities of current feature embeddings, providing a basis for subsequent improvements. Second, Gaussian Mixture Model (GMM) is leveraged to model the distributions of normal feature embeddings. Third, the adaptive network refines feature representations based on the GMM models and outlier scores of feature embeddings. Ablation experiments underscore the effectiveness of each component within our model. Furthermore, our approach outperforms other state-of-the-art methods on three benchmark datasets, demonstrating a notable advantage especially in scenarios with contaminated training data.

Abstract:
Eyeglass reflection removal is of great importance to portrait image processing. However, it remains a challenge to eliminate the reflections on the glass and restore the textual contents of eyes without introducing visual artifacts. To address this problem, we propose an Eyeglass Reflection Removal Network (ER2Net) that learns reflection elimination and content inpainting jointly. The reflection elimination branch is effective in weak reflection regions, and the content inpainting branch is dedicated to content reasoning in strong reflection regions. We then propose a result fusion module (RFM), which adaptively fuses the elimination result and the inpainting result according to the reflection intensity of each pixel, to produce a high-quality result. We also design a memory module for improving the content inpainting result, and propose an eye-symmetry loss to avoid visual artifacts. Additionally, we construct the first Real-world eyeglass Reflection (ReyeR) dataset for eyeglass reflection removal. Extensive quantitative and qualitative experiments demonstrate the superiority of ER2Net over state-of-the-art methods for eyeglass reflection removal.
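
One simple way to fuse two results according to a per-pixel reflection intensity is a convex blend, sketched below. This is only an assumed form for illustration: the abstract does not specify how the RFM combines its inputs, and the function `fuse` and its arguments are hypothetical.

```python
def fuse(elimination, inpainting, intensity):
    """Blend two equally sized 2-D images pixel by pixel.

    intensity holds values in [0, 1]; 0 keeps the elimination result
    (weak reflection) and 1 keeps the inpainting result (strong reflection).
    """
    fused = []
    for e_row, i_row, w_row in zip(elimination, inpainting, intensity):
        fused.append([(1.0 - w) * e + w * i
                      for e, i, w in zip(e_row, i_row, w_row)])
    return fused

result = fuse([[10.0, 10.0]], [[20.0, 20.0]], [[0.0, 1.0]])
```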

Abstract:
Local collaborative representation (CR) has drawn much attention in exploring data relationships because it considers local knowledge in the global linear combination. Subsequently, local CR-based graph embedding methods have been applied to dimensionality reduction (DR) of hyperspectral images (HSI). However, HSI data with nonlinear distribution cannot be handled accurately with a purely linear combination. Furthermore, the existing local knowledge in terms of binary relations between pairwise neighbors makes it hard to learn the accurate local structure among neighborhood sets through local CR-based graph embedding. To this end, this paper proposes a novel multiple neighborhood-aware nonlinear collaborative analysis (MNNCA) method. Relying on the primary and secondary neighborhoods, a dual-level neighborhood reconstruction is designed to search for optimal neighbors and mine the common attributes within the neighborhood. With the reconstruction information, a nonlinear extend multiple neighborhood-aware collaborative representation (NE-MNACR) model is built on a nonlinear geodesic constraint and multi-neighborhood-aware items. It can explore the collaborative relationship among multiple neighborhood sets in the nonlinear space of HSI data. By preserving the multivariate local structure instead of pairwise local relations, a pair of collaborative structure preservation graphs is constructed to realize the final embedding of HSI data. Experimental results on several HSI data sets demonstrate the superior performance of the proposed MNNCA method and NE-MNACR model in comparison with some state-of-the-art DR methods and local CR models.

Abstract:
Model quantization is a prevalent method to compress and accelerate neural networks. Most existing quantization methods usually require access to real data to improve the performance of quantized models, which is often infeasible in some scenarios with privacy and security concerns. Recently, data-free quantization has been widely studied to solve the challenge of not having access to real data by generating synthetic data, among which generator-based data-free quantization is an important type. Previous generator-based methods focus on improving the performance of quantized models by optimizing the spatial distribution of synthetic data, while ignoring the study of changes in synthetic data from a temporal perspective. In this work, we reveal that generator-based data-free quantization methods usually suffer from the issue that synthetic data show homogeneity in the mid-to-late stages of the generation process due to the stagnation of the generator update, which hinders further improvement of the performance of quantized models. To solve the above issue, we propose introducing the discrepancy between the full-precision and quantized models as new supervision information to update the generator. Specifically, we propose a simple yet effective adversarial Gaussian-margin loss, which promotes continuous updating of the generator by adding more supervision information to the generator when the discrepancy between the full-precision and quantized models is small, thereby generating heterogeneous synthetic data. Moreover, to mitigate the homogeneity of the synthetic data further, we augment the synthetic data with linear interpolation. Our proposed method can also promote the performance of other generator-based data-free quantization methods. Extensive experimental results show that our proposed method achieves superior performances for various settings on data-free quantization, especially in ultra-low-bit settings, such as 3-bit.
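
The linear-interpolation augmentation mentioned in this abstract can be sketched as a mixup-style blend of pairs of synthetic samples. The sampling scheme shown here (a uniformly random partner and a uniformly random coefficient) is an assumption made for illustration, and the names `interpolate` and `augment_batch` are hypothetical.

```python
import random

def interpolate(x_a, x_b, lam):
    """Return lam * x_a + (1 - lam) * x_b for two flat samples."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]

def augment_batch(batch, rng):
    """Blend every sample with a randomly chosen partner from the batch."""
    augmented = []
    for x in batch:
        partner = rng.choice(batch)   # may be x itself
        lam = rng.random()            # mixing coefficient in [0, 1)
        augmented.append(interpolate(x, partner, lam))
    return augmented

batch = [[0.0, 2.0], [4.0, 6.0], [8.0, 10.0]]
mixed = augment_batch(batch, random.Random(0))
```

Each blended sample lies on the segment between two generated samples, which adds variety even when the generator itself has stopped changing.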

Abstract:
With the rapid expansion of image data and advancements in artificial intelligence, a significant portion of image analysis is performed by machines rather than humans. To enhance efficiency in data transmission and visual analysis, on-demand transmission becomes a preferable approach, which adaptively transmits the necessary information based on specific requirements. In this paper, we propose a novel joint feature and image compression scheme to facilitate flexible on-demand transmission. The bitstreams generated by the proposed scheme can be adapted to multiple machine vision tasks and image reconstruction based on specific needs. To achieve a good balance between the feature-based visual analysis performance and computational overhead at the receiver side, we adopt a reversible neural network as the feature extractor. The extracted features contain all information from the original image and necessitate a low-complexity analysis network. Additionally, we develop end-to-end compression models for multi-granularity features and image signals, where prediction models are incorporated in both feature/image space and latent space to improve the efficiency of joint compression. Furthermore, several feature transform blocks are designed to align the features with the requirements of different tasks. Experimental results on the COCO dataset show that the proposed compression method outperforms state-of-the-art image codecs on several machine vision tasks, and can also achieve comparable results in terms of image reconstruction.

Abstract:
To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model’s intrinsic ‘subitizing’ capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn’t have strict structural or loss constraints. In addition, we observe that the model trained with our framework shows strong contextual modeling capabilities, which allows it to make robust predictions even when some local details of patches are lost. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.

Abstract:
Extracting discriminative representations is the key step for correspondence-free point cloud registration. The extracted representations need to be discriminative with respect to transformation, which requires them to reduce the influence of redundant information irrelevant to transformation. However, recently proposed methods ignore this crucial property, resulting in a limited ability to represent point clouds. In addition, research on correspondence-free point cloud registration has stagnated in recent years. In this paper, we try to relieve the feature redundancy issue for correspondence-free point cloud registration from a new perspective. Specifically, our method comprises two stages: a feature extraction stage and a rigid body transformation stage. In the feature extraction stage, we aim to maximize multi-hierarchical mutual information between different hierarchical features, which can provide discriminative and less redundant representations for regressing transformation parameters in the next stage. In the rigid body transformation stage, we utilize a dual quaternion to estimate transformation parameters, which combines rotation and translation simultaneously within a unified framework and yields a compact representation of the rigid transformation. The proposed model is trained in an unsupervised manner on the ModelNet40 dataset. The experimental results illustrate that our method achieves higher accuracy and robustness compared with existing correspondence-free methods.
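
As background on why a dual quaternion is a compact rigid-transformation representation, the sketch below builds one from an axis-angle rotation and a translation and applies it to a point. It follows the standard construction (real part r, dual part 0.5 * t * r) and is not the paper's regression network; all function names are illustrative.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def qconj(q):
    w, x, y, z = q
    return (w, -x, -y, -z)

def to_dual_quaternion(axis, angle, translation):
    """Encode a rotation (axis, angle in radians) and a translation."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    real = (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)
    t = (0.0,) + tuple(translation)
    dual = tuple(0.5 * c for c in qmul(t, real))  # dual part = 0.5 * t * r
    return real, dual

def transform_point(dq, point):
    """Rotate the point by the real part, then add the encoded translation."""
    real, dual = dq
    rotated = qmul(qmul(real, (0.0,) + tuple(point)), qconj(real))
    t = tuple(2.0 * c for c in qmul(dual, qconj(real)))  # recovers t
    return tuple(rotated[i] + t[i] for i in (1, 2, 3))

dq = to_dual_quaternion(axis=(0.0, 0.0, 1.0), angle=math.pi / 2.0,
                        translation=(1.0, 2.0, 3.0))
moved = transform_point(dq, (1.0, 0.0, 0.0))  # approximately (1, 3, 3)
```

The eight numbers of the dual quaternion carry rotation and translation together, so a network can regress both in a single output vector.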

Abstract:
Current cloud detection methods have demonstrated effectiveness by utilizing the rich spectral features of multispectral images. Compared to multispectral images, single-band infrared images offer higher efficiency in terms of sampling and processing speed. However, single-band cloud detection methods have not been fully developed, and existing methods based on multispectral cloud detection have some limitations when applied directly to single-band images. First, they often blend shallow features containing spatial details with deep features providing high-level semantic information, yet struggle to disentangle strongly discriminative features representing cloud edges and bodies from limited information. Second, the correlation between features at different levels is not fully reasoned, resulting in blurred boundary segmentation. To address these issues, we introduce a Multi-level Information Fusion Network (MIFNet) with an integrated edge information injection strategy. Our method effectively decouples clouds into their fundamental components: body and edge (Low-Frequency (LF) and High-Frequency (HF) components), enabling the comprehensive acquisition of strong discriminative features. Specifically, we propose an Edge Feature Extraction Module (EFEM) that isolates the cloud body through low-pass filtering, while the cloud’s edge is extracted by subtracting lower-level features from LF components. Furthermore, we employ a Feature Refinement Module (FRM) to locate the cloud body’s position precisely. Building upon this foundation, we devise a Graph Reasoning Module (GRM) to facilitate the full inference of feature correlations at different levels and to model the global interdependence between edges and semantics. Through comprehensive evaluations on benchmark datasets comprising infrared band images from Landsat 8 and MODIS satellites, we demonstrate that our proposed MIFNet outperforms state-of-the-art methods, yielding promising results in cloud detection accuracy. Our code is publicly available at https://github.com/KwunYat/MIFNet.
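
The body/edge decoupling described in this abstract can be illustrated with a generic low-pass/residual split. The sketch assumes a mean (box) filter as the low-pass step and takes the high-frequency part as the residual between the input and its blurred version; the actual EFEM operates on learned feature maps and may differ. `box_blur` and `decouple` are hypothetical names.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1)^2 window, cropped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out

def decouple(img, radius=1):
    """Split a 2-D map into low-frequency (body) and high-frequency (edge)."""
    low = box_blur(img, radius)
    high = [[p - l for p, l in zip(p_row, l_row)]
            for p_row, l_row in zip(img, low)]
    return low, high

# A vertical step edge: the residual is non-zero only near the boundary.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
body, edge = decouple(step)
```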

Abstract:
Domain generalization in person re-identification (DG-ReID) stands out as the most challenging task and practically important branch in the ReID field, which enables the direct deployment of pre-trained models in unseen and real scenarios. Recent works have made significant efforts in this task via the image-matching paradigm, which searches for the local correspondences in the feature maps. A common practice of employing pixel-wise matching is typically used to ensure efficient matching. This, however, makes the matching susceptible to deviations caused by identity-irrelevant pixel features. On the other hand, patch-wise matching also demonstrates that it will disregard the spatial orientation of pedestrians and amplify the impact of noise. To address the mentioned issues, this paper proposes the Multi-Scale Query-Adaptive Convolution (QAConv-MS) framework, which encodes patches in the feature maps to pixels using template kernels of various scales. This enables the matching process to enjoy broader receptive fields and robustness to orientations and noises. To stabilize the matching process and facilitate the independent learning of each sub-kernel within the template kernels to capture diverse local patterns, we propose the OrthoGonal Norm (OGNorm), which consists of two orthogonal normalizations. We also present Mutual Subject Teacher Learning (MSTL) to address the potential issues of overconfidence and overfitting in the model. MSTL allows two models to individually select the most challenging data for training, resulting in more dependable soft labels that can provide mutual supervision. Extensive experiments conducted in both single-source and multi-source setups offer compelling evidence of our framework’s generalization and competitiveness.

Abstract:
In this study, we present WebCeph2k, an extensive and diverse cephalometric landmark localization dataset that surpasses previous benchmark datasets in terms of the number of landmark annotations. This diverse cephalometric landmark dataset has significant value in medical imaging research. Existing studies predominantly focus on datasets obtained from a single medical center and provider, which offer a limited number of landmarks and a limited diversity of cephalograms, resulting in models that exhibit low robustness and generalization when applied to more diverse datasets. The clinical application of cephalometry is hampered by significant localization errors in landmark localization models, in addition to the inadequacy of existing datasets’ landmarks for clinical cephalometric diagnosis. The limited generalization ability and the occurrence of “overfitting” in deep learning models are mainly caused by the small size of the dataset. In the medical field, the inclusion of large and diverse datasets can greatly improve the generalization and performance of landmark localization models. This paper presents our WebCeph2k dataset from 9 medical centers, covering 9 different imaging devices, which surpasses the only publicly available ISBI2015 dataset in terms of sample size and number of landmarks. In addition, this study employs a low computational cost methodology to achieve optimal landmark localization: 1) ROI regions of X-ray images are derived by exploiting the prior distribution of the data, 2) the model computational cost is reduced by adopting a spatial-depth transformation strategy, 3) the standard heatmap decoding method is optimized by integrating a compensation strategy. The results show that the proposed method not only achieves localization results competitive with other state-of-the-art approaches, but also reduces the model computational cost, resulting in faster inference. Consequently, this research offers valuable prospects in the field of general-purpose medical landmark localization methods. We also find that our proposed dataset is more complex and challenging than the ISBI dataset. The dataset and code are available at https://github.com/switch626/WebCeph2k.
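
The spatial-depth transformation in step 2 is read here as a space-to-depth rearrangement, which is an assumption about the paper's strategy rather than a confirmed detail. The sketch moves each block of pixels into the channel dimension, which shrinks the spatial size without discarding values; `space_to_depth` is a hypothetical helper.

```python
def space_to_depth(img, block):
    """Rearrange an H x W image into (H/block) x (W/block) cells.

    Each output cell is a list of block*block values taken from the
    corresponding block of the input, in row-major order.
    """
    h, w = len(img), len(img[0])
    if h % block or w % block:
        raise ValueError("image size must be divisible by block")
    out = []
    for y in range(0, h, block):
        row = []
        for x in range(0, w, block):
            row.append([img[y + dy][x + dx]
                        for dy in range(block) for dx in range(block)])
        out.append(row)
    return out

packed = space_to_depth([[1, 2, 3, 4],
                         [5, 6, 7, 8]], block=2)
# packed == [[[1, 2, 5, 6], [3, 4, 7, 8]]]
```

Later layers then process a map with a quarter of the spatial positions (for block=2), which is where the reduction in computation would come from under this reading.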

Abstract:
Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speaking activities, the mouth displays strong motions, while the other facial regions typically demonstrate comparatively weak activity levels. Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation, which overlooks the differences in facial activity intensity and leads to overly smoothed facial movements. In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A novel facial activity intensity prior is defined to distinguish between strong and weak facial activity, obtained by statistically analyzing facial animations. Based on the facial activity intensity prior, we propose a dual-branch decoding framework to synchronously synthesize strong and weak facial activity, which guarantees facial animation synthesis over a wider range of intensities. Furthermore, a weighted hierarchical feature encoder is proposed to establish temporal correlation between hierarchical speech features and facial activity at different intensities, which ensures lip-sync and plausible facial expressions. Extensive qualitative and quantitative experiments as well as a user study indicate that our CorrTalk outperforms existing state-of-the-art methods. The source code and supplementary video are publicly available at: https://zjchu.github.io/projects/CorrTalk/.

Abstract:
Self-supervised learning (SSL) has demonstrated its power in generalized model acquisition by leveraging the discriminative semantic and explicit positional information of unlabeled datasets. Unfortunately, mainstream contrastive learning-based methods focus excessively on semantic information and ignore that position is also a carrier of image content, resulting in inadequate data utilization and extensive computational consumption. To address these issues, we present an efficient SSL framework, learning What and Where to learn (W²SSL), to aggregate semantic and position features. Concretely, we devise a spatially-coupled sampling manner to process images through pre-defined rules, which integrates the advantages of semantic (What) and positional (Where) features into the framework to enrich the diversity of feature representation capabilities and improve data utilization. Besides, a spectrum of latent vectors is obtained by mapping the positional features, which implicitly explores the relationship between these vectors. Thereafter, the corresponding discriminative and contrastive optimization objectives are seamlessly embedded in the framework via a cascade paradigm to explore semantic and positional features. The proposed W²SSL is verified on different types of datasets, which demonstrates that it still outperforms state-of-the-art SSL methods even with half the computational consumption. Code will be available at https://github.com/WilyZhao8/W2SSL.

Abstract:
As a newly emerging task, video corpus moment retrieval (VCMR) aims to find the video segments relevant to a given natural language query from a large number of untrimmed videos. It mainly includes two subtasks: finding the most relevant video based on the query text (video retrieval), and locating the segment most relevant to a given query in a video (moment localization). At the same time, since videos often contain rich multi-modal information such as audio, text, and images, how to align and interact with the multi-modal information of videos and the text information of natural language queries across modalities is the core issue of this task. This article proposes a Deformable Multigranularity Feature Fusion with Adversarial Training Network (DMFAT). It first inputs the subtitle and frame multi-modal information of the video into our Multi-Scale Deformable Attention module and performs multi-granularity feature fusion on each modality through Deformable Attention. Then, guided by the query, adaptive weights are generated to fuse the two multi-granularity modality features of the video. Finally, the cross-modal representation of the query and video features is obtained through a bidirectional attention module, and an adversarial contrastive learning objective is introduced to enable more precise moment localization. Our model is evaluated on two representative video corpus moment retrieval benchmarks: TVR and DiDeMo. Extensive experiments demonstrate that our method outperforms existing work.

Abstract:
Spatial transcriptomics (ST) has become an important methodology in the analysis of the tumor microenvironment (TME) due to its ability to provide gene expression information with spatial resolution, enabling the identification and characterization of TME gene markers. Deep learning methods have been proposed for analyzing spatial transcriptomic data to cluster the spatial regions of the TME based on gene expression. However, deep learning methods are often affected by errors, which can impact the accuracy of gene expression quantification and TME gene identification. To address this issue, we propose a label-efficient method that utilizes curriculum learning and confidence learning to identify errors in graph deep learning when analyzing ST data. Our method explicitly incorporates the effect of noise in the learning process and employs probabilistic models or uncertainty estimates to represent the uncertainty in the data. We validated our method on human breast cancer ST data by studying spatial gene expression in HER2-positive breast tumors. The evaluation results suggest that the error quantification helps identify the noisy samples and select a subset of samples that results in more accurate gene expression quantification and TME gene identification. Additionally, biological insights are obtained from the new subset formed by error samples. This error-robust deep learning method offers promising avenues for the analysis of spatial transcriptomic data, enabling accurate and label-efficient quantification of gene expression and identification of TME gene markers.

Abstract:
Weakly-supervised temporal action localization (WTAL) is the problem of learning an action localization model with only video-level labels available. In recent years, many WTAL methods have been developed. However, hard-to-predict snippets near action boundaries are often not considered in these existing approaches, causing action incompleteness and action over-completeness issues. To solve these issues, in this work, an end-to-end snippets relation and hard-snippets mask network (SRHN) is proposed. Specifically, a hard-snippets mask module is applied to mask the hard-to-predict snippets adaptively, and in this way, the trained model focuses more on those snippets with low uncertainty. Then, a snippets relation module is designed to capture the relationship among snippets and can make hard-to-predict snippets easy to predict by aggregating the information of multiple temporal receptive fields. Finally, a snippet enhancement loss is further developed to reduce the probabilities of actions that are not present in videos for hard-to-predict snippets and other snippets, while enlarging the probabilities of actions that exist in videos. Extensive experiments on THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of the SRHN method.

Abstract:
Blind image quality assessment (BIQA) aims to predict the perceptual quality of an image without any reference information. However, known methods have considerable room for performance improvement due to limited efforts in distortion knowledge usage. This paper proposes a novel multitask learning based BIQA method termed KGANet, which takes image distortion classification as an auxiliary task and uses the knowledge learned from the auxiliary task to assist accurate quality prediction. Different from existing CNN-based methods, KGANet adopts a transformer as the backbone for feature extraction, which can learn more powerful and robust representations. Specifically, it comprises two essential components: a cross-layer information fusion (CIF) module and a knowledge-guided attention (KGA) module. Considering that both global and local distortions appear in an image, CIF fuses the features of the adjacent layers extracted by the backbone to obtain a multiscale feature representation. KGA incorporates the distortion probability estimated by the auxiliary task with the distortion embeddings, which are selected from subword unit embeddings based on a textual template, to form distortion knowledge. This knowledge further serves as guidance to enhance the features of each layer and strengthen the connection between the main and auxiliary tasks. We demonstrate the effectiveness of the proposed KGANet through extensive experiments on benchmark databases. Experimental results show that KGANet correlates well with subjective perceptual judgments and achieves superior performance over 12 state-of-the-art BIQA methods.

Abstract:
In multimodal land cover classification (MLCC), a common challenge is the redundancy in data distribution, where task-irrelevant information from multiple modalities can hinder the effective integration of their unique features. To tackle this, we introduce the Multimodal Informative Vit (MIVit), a system with an innovative information aggregate-distributing mechanism. This approach redefines redundancy levels and integrates performance-aware elements into the fused representation, facilitating the learning of semantics in both forward and backward directions. MIVit stands out by significantly reducing redundancy in the empirical distribution of each modality’s separate and fused features. It employs oriented attention fusion (OAF) for extracting shallow local shape features across modalities in horizontal and vertical dimensions, and a Transformer feature extractor for extracting deep global features through long-range attention. We also propose an information aggregation constraint (IAC) based on mutual information, designed to remove redundant information and preserve complementary information within embedded features. Additionally, the information distribution flow (IDF) in MIVit enhances performance-awareness by distributing global classification information across different modalities’ feature maps. This architecture also addresses missing modality challenges with lightweight independent modality classifiers, reducing the computational load typically associated with Transformers. Our results show that MIVit’s bidirectional aggregate-distributing mechanism between modalities is highly effective, achieving an average overall accuracy of 95.56% across three multimodal datasets. This performance surpasses current state-of-the-art methods in MLCC. The code for MIVit is accessible at https://github.com/icey-zhang/MIViT.

Abstract:
At present, deep learning has demonstrated outstanding performance in the area of underwater image enhancement. However, these approaches often demand substantial computational resources and extended training time. Knowledge distillation is a widely used technique for model compression, and nowadays it has delivered outstanding results across various fields. However, it has not been utilized in the field of underwater image enhancement. To tackle the aforementioned issues, this paper introduces a knowledge distillation technique for underwater image enhancement for the first time. It is a semi-supervised self-inter feature distillation and unsupervised self-domain adversarial distillation approach. It specifically includes adaptive local self-feature distillation technique, information lossless multi-scale inter-feature distillation technique, and self-domain adversarial distillation approach in LAB-RGB space. Self-feature distillation enhances the performance of the student network by correcting other lossy feature maps with the maximum effective feature map. Inter-feature distillation enables the student network to maximize the potential information learned from the teacher network. Furthermore, an information loss-free pooling approach is suggested to achieve multi-scale loss-free information extraction. Self-domain adversarial distillation boosts the performance of student networks through unsupervised adaptive enhancement in LAB space and unsupervised domain adversarial distillation in RGB space. Finally, a self-inter alternate knowledge distillation training measure is proposed, aiming to maximize the respective benefits of self-inter knowledge distillation. Through extensive comparative experiments, it can be found that student networks with dissimilar structures trained using the knowledge distillation technique designed in this paper achieve outstanding underwater image enhancement results.

Abstract:
Automatic tooth instance segmentation on 3D dental models is crucial for digitizing dental treatments and enabling computer-assisted treatment planning. However, it is challenging because of the tight arrangement of dental structures and the consequential impact of dental ailments on their morphological characteristics. To address these challenges, we propose a novel method called THISNet. Unlike existing methods, THISNet focuses on highlighting tooth regions rather than relying on bounding box detection, leading to improved accuracy in tooth segmentation and labeling. By incorporating the highlighted tooth regions with a tooth object affinity module, our method effectively integrates global contextual information, considering the relationships between neighboring teeth and their surrounding structures. THISNet adopts an end-to-end learning approach, reducing complexity and enhancing segmentation efficiency compared to multi-stage training methods. Experimental results demonstrate the superiority of THISNet over existing approaches, highlighting its potential in various dental clinical applications.

Abstract:
Feature pyramid representations have been widely adopted in the object detection literature for better handling of variations in scale, as they provide abundant information from various spatial levels for the classification and localization sub-tasks. We find that inter sub-task feature disentanglement and intra sub-task feature re-fusion are crucial for final prediction performance, but are hard to achieve simultaneously given the constraints of computational efficiency. We find this issue can be addressed by careful module design. In this paper, we propose an Efficient Task-specific Feature Re-fusion (ETFR) module to mitigate the dilemma. ETFR disentangles inter sub-task features, reduces the output channels of multi-scale features based on their importance, and re-fuses intra sub-task features via a concatenation operation. As a plug-and-play module, ETFR can remarkably and consistently improve well-established and highly-optimized object detection and instance segmentation methods, such as RetinaNet, FCOS, BlendMask and CondInst, with negligible extra computation cost. Extensive experiments demonstrate that ETFR has good generalization ability on various challenging datasets, including COCO, LVIS and Cityscapes.

Abstract:
Benefiting from high temporal resolution and high dynamic range, spike cameras have shown great potential in recognizing high-speed moving objects. However, the computer vision community has not explored this task due to the lack of spike data and annotations of high-speed moving objects. This paper contributes a novel dataset, named SpiReco (Spiking datasets for Recognition), by recording high-speed moving objects using a spike camera. To annotate the dataset, image labels from established datasets such as MNIST, CIFAR10, and CALTECH101 are utilized. Based on this new dataset, this paper proposes the first spike-based object recognition framework. The proposed framework includes a denoise module, which is designed to suppress spike noise by learning spatio-temporal correlation from neighbouring pixels. Additionally, a motion enhancement module is introduced to address high-speed and random motions. Afterwards, binarized neural networks are adopted to save computation costs. These efforts result in a fast and efficient processing framework for spiking data. Experimental results demonstrate the effectiveness of the proposed methods. For example, the proposed spike-based recognition framework achieves 80.2% accuracy in recognizing 101 classes of high-speed moving objects using only 2.2 ms of spike streams. SpiReco is available at https://github.com/Evin-X/SpiReco.

Abstract:
Multifocus image fusion is an effective method to overcome the limitations of optical lenses. The fused results can be obtained from some existing methods by generating decision maps. However, such methods assume that the focused areas of the two source images are complementary, making it impossible to achieve the simultaneous fusion of multiple images. Additionally, existing methods ignore the impact of hard pixels on the fusion performance, limiting the visual quality improvement of fusion images. To address these issues, a combined generation and recombination model called GRFusion is proposed. In GRFusion, the focus property detection of each source image can be independently implemented, enabling the simultaneous fusion of multiple source images and avoiding information loss caused by alternating fusion. It renders the GRFusion free from the limitation of the number of input images. Furthermore, GRFusion investigates the detection of hard pixels with ambiguous focus properties by analyzing the inconsistencies among the detection results of the focus areas in the source images. This allows the hard pixels to be distinguished from the source images. Besides, a multidirectional gradient embedding method is proposed for generating full-focus images. Subsequently, a hard-pixel-guided recombination mechanism for constructing the fused result is devised to integrate the complementary advantages of feature reconstruction-based and focused pixel recombination-based methods. Extensive experimental results demonstrate the effectiveness and superiority of the proposed method. The source code of the proposed method is available at: https://github.com/lhf12278/GRFusion.

Abstract:
Recently, semantic segmentation has made promising progress, but the high cost of processing still limits its application. By removing network parameters, filter pruning with an importance criterion is a straightforward and effective technique for obtaining a lightweight sub-network. However, we argue that the long-tail distribution in segmentation datasets poses two significant problems that are ignored by existing pruning algorithms: 1) The importance criterion is dominated by head classes, which contain numerous positive samples, so the knowledge of tail classes is easily degraded. 2) The degraded knowledge of tail classes is hard to recover, as their samples are also insufficient during fine-tuning. To address these issues, we propose a Distribution Calibrated Filter Pruning (DCFP) framework for segmentation. Firstly, a gradient-based Equalization Importance Criterion (EIC) is designed to generate a class-balanced pruning procedure. It avoids the bias toward head classes by discarding the imbalanced positive gradients. Secondly, we introduce a Geometric-Semantic Re-balanced Loss (GSRL) to emphasize the learning of tail classes during fine-tuning. The GSRL consists of two cooperative components that dynamically calibrate the imbalanced optimization in the geometric and semantic domains. Compared with previous methods, DCFP explores a novel distribution-aware pruning framework to obtain lightweight architectures with accurate results. Extensive experiments show that DCFP achieves impressive performance on four popular segmentation benchmarks.

Abstract:
Deep image hiding is a challenging image processing task that aims to hide a secret image perfectly in a cover image of equal size. A primary challenge is how to improve the imperceptibility of deep image hiding while ensuring high computational efficiency. Here, imperceptibility means that the hidden image is neither visually perceptible nor detectable by a steganalysis model. In this paper, we propose a novel deep image hiding framework called DIH-OAIN (Deep Image Hiding based on One-way Adversarial Invertible Networks) to address this challenge. Firstly, an image cascade framework is introduced to extract image semantics and details with dual-resolution branches and to reduce computation complexity by balancing image resolution and model complexity. Secondly, a hidden probability guided module is designed to constrain the secret image to be hidden in the texture region, utilizing the image texture complexity as prior knowledge. These two components effectively improve visual imperceptibility. Finally, a one-way adversarial training strategy is proposed to enhance the model imperceptibility. A series of experimental results show that the proposed method significantly improves imperceptibility compared to state-of-the-art deep image hiding algorithms, while maintaining a low computation complexity.

Abstract:
Convolutional neural networks (CNNs) have achieved remarkable performance in image denoising. However, most existing CNNs cannot accurately capture and remove fine-grained noise during the denoising process and easily lose edge detail information. In this paper, we propose a fine-grained residual network guided by wavelet and adaptive coordinate attention (WACAFRN) for image denoising. Firstly, we propose an adaptive coordinate attention mechanism and combine it with cascaded Res2Net residual blocks to form an encoder network for more accurate noise removal. Secondly, we propose a wavelet attention mechanism that combines global and local residual blocks to form a decoder network, aiming to address the problem of edge detail information loss. Finally, we complement the noise information through a noise estimation block to further enhance the model’s ability to adapt to noise. Extensive experimental results demonstrate that our proposed method outperforms existing denoising methods in both qualitative and quantitative aspects. Notably, our method significantly improves real-world noise removal on the CC dataset, with an average increase of 2.08 dB in PSNR and 0.0264 in SSIM over the state-of-the-art methods. Additionally, WACAFRN exhibits faster inference speeds, underscoring its efficiency in real-world applications.

Abstract:
Deep convolutional neural networks (CNNs) have increasingly become a prominent method for blind image quality assessment (BIQA). The process of quality assessment typically involves feature extraction, average-based pooling, and quality regression. Based on this process, as well as the consensus that the visual quality of an image mainly relies on its content and distortions, this work improves CNNs for BIQA in two ways. First, considering the content-awareness of visual quality perception, we incorporate content-awareness via a dynamic filtering module to extract content-adaptive features and a dynamic regression module to learn content-adaptive perception rules based on local content and global semantics. Second, considering distortion-sensitivity in visual quality perception, we introduce second-order global variance pooling and combine it with global average pooling (GAP). First-order pooling methods like GAP are limited in distinguishing complex distortions that cause local degradation while preserving global features. Thus, pooling with dual-order statistics enables a more distortion-sensitive and discriminative global representation. These two improvements result in a content-adaptive BIQA model with a dual-order global pooling mechanism, improving generalization on diverse images with varying contents and distortion types. Extensive experiments on synthetic and authentic distortion datasets demonstrate state-of-the-art performance of the proposed approach.
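
The difference between first-order and dual-order pooling can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists in place of tensors, the function name `dual_order_pool` is hypothetical, and the variance is computed per channel over all spatial positions, which is one natural reading of "global variance pooling".

```python
from statistics import fmean, pvariance


def dual_order_pool(feature_map):
    """Concatenate per-channel global mean and global variance.

    feature_map: list of channels, each a list of rows of floats.
    Returns [mean_0, ..., mean_C-1, var_0, ..., var_C-1].
    """
    means, variances = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        means.append(fmean(values))          # first-order statistic (GAP)
        variances.append(pvariance(values))  # second-order statistic
    return means + variances


# Two single-channel maps with the same global average.
uniform = [[[0.5, 0.5], [0.5, 0.5]]]
locally_degraded = [[[1.0, 0.0], [1.0, 0.0]]]

print(dual_order_pool(uniform))           # [0.5, 0.0]
print(dual_order_pool(locally_degraded))  # [0.5, 0.25]
```

Both maps are indistinguishable under global average pooling alone, while the variance term separates them, which is the intuition behind adding a second-order statistic for locally degraded images.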

Abstract:
The rate-distortion optimized quantization (RDOQ) provides significant coding gain in the third generation of Audio Video coding Standard (AVS3). However, the high computational complexity and strong data dependency in RDOQ impede the hardware implementation. To address these issues, we propose a zig-zag scanline-level parallelized RDOQ algorithm and its fully pipelined hardware architecture for AVS3 video coding. For algorithm optimization, we update the run-level context for rate estimation in the inner zig-zag scanline and propose an efficient RD cost calculation form in the optimal coefficient level (OCL) decision step. In the last significant coefficient (LSC) position decision step, a greedy strategy based algorithm is proposed to optimize the determination process in parallel. Moreover, the proposed parallelized RDOQ algorithm is accelerated by single instruction multiple data (SIMD) on the Intel X86 platform. For hardware architecture design, a fully pipelined hardware architecture is proposed with nine pipeline stages. This design can process multiple transform units in parallel when the height is less than 32. Experimental results show that the proposed algorithm achieves 31.37%, 28.58%, and 28.53% time-saving by 0.25%, 0.26%, and 0.27% Bjøntegaard delta rate (BD-Rate) increase on average under all intra (AI), random access (RA), and low delay B (LDB) configurations, respectively. The hardware implementation achieves 32 coefficients per cycle, and the area consumption is 1223.2-K logic gates when working at 471.2-MHz. It is proven that the proposed algorithm and hardware architecture design achieve a good trade-off between coding efficiency and hardware throughput.
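
For readers unfamiliar with RDOQ, the coefficient level decision can be sketched as a Lagrangian choice among a few candidate levels, minimizing J = D + λ·R. The sketch below is generic and sequential: it does not reproduce the AVS3 run-level context model, the paper's efficient RD cost form, or its scanline-level parallelization. The names `choose_level` and `toy_rate_bits` and the linear rate model are hypothetical.

```python
def toy_rate_bits(level):
    """Hypothetical rate model: larger levels cost more bits."""
    return 1 + 2 * level


def choose_level(coeff, qstep, lam, rate_bits=toy_rate_bits):
    """Pick the quantization level minimizing D + lam * R.

    coeff: transform coefficient, qstep: quantization step,
    lam: Lagrange multiplier trading distortion against rate.
    """
    magnitude = abs(coeff)
    base = int(magnitude / qstep)  # floor of the scaled magnitude
    # Candidates: zero, the floor level, and the next level up.
    candidates = sorted({0, base, base + 1})

    def rd_cost(level):
        distortion = (magnitude - level * qstep) ** 2
        return distortion + lam * rate_bits(level)

    best = min(candidates, key=rd_cost)
    return best if coeff >= 0 else -best


# With no rate penalty the nearest level wins; a larger lambda
# pushes the decision toward cheaper (smaller) levels.
print(choose_level(13.0, 8.0, lam=0.0))    # 2
print(choose_level(13.0, 8.0, lam=10.0))   # 1
print(choose_level(13.0, 8.0, lam=100.0))  # 0
```

In a real encoder the rate term depends on the context of neighboring coefficients, which is the source of the data dependency that the abstract says makes hardware implementation difficult.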

Abstract:
Video compression performance is closely related to the accuracy of inter prediction. It tends to be difficult to obtain accurate inter prediction for the local video regions with inconsistent motion and occlusion. Traditional video coding standards propose various technologies to handle motion inconsistency and occlusion, such as recursive partitions, geometric partitions, and long-term references. However, existing learned video compression schemes focus on obtaining an overall minimized prediction error averaged over all regions while ignoring the motion inconsistency and occlusion in local regions. In this paper, we propose a spatial decomposition and temporal fusion based inter prediction for learned video compression. To handle motion inconsistency, we propose to decompose the video into structure and detail (SDD) components first. Then we perform SDD-based motion estimation and SDD-based temporal context mining for the structure and detail components to generate short-term temporal contexts. To handle occlusion, we propose to propagate long-term temporal contexts by recurrently accumulating the temporal information of each historical reference feature and fuse them with short-term temporal contexts. With the SDD-based motion model and long short-term temporal contexts fusion, our proposed learned video codec can obtain more accurate inter prediction. Comprehensive experimental results demonstrate that our codec outperforms the reference software of H.266/VVC on all common test datasets for both PSNR and MS-SSIM.

Abstract:
Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition and have achieved promising results. However, current GCN-based methods are limited by their inability to refine semantic-guided joint relations and perform adaptive multi-scale analysis. These limitations impair their performance, particularly for analogical actions involving the interaction of the same body parts (e.g., drinking water and eating) as well as deficient actions with limited spatial-temporal information (e.g., subtle action writing and transient action sneezing). To solve these problems, we propose Part-level Refined Spatial Graph Convolution (PR-SGC) and Scale-aware Temporal Graph Convolution (Sa-TGC) for optimal action representation. The PR-SGC divides the skeleton into body parts and embeds this high-level semantics to refine the physical adjacency matrix. The Sa-TGC leverages the dynamic scale-aware mechanism to extract context-dependent multi-scale features. On this basis, we develop a novel Scale-aware Graph Convolutional Network with Part-level Refinement (SaPR-GCN), which is on par with state-of-the-art benchmarks on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets.

Abstract:
Existing monocular depth estimation methods have achieved satisfactory performance on wild datasets. However, these methods are usually trained and tested on a single dataset, which makes them difficult to generalize to other scenarios. To learn diverse scene priors from multiple datasets, we propose a hierarchical framework with adaptive bins for robust monocular depth estimation, which consists of two critical components: a group-wise query generator to assign hierarchical bins and a correlation-aware transformer decoder to generate adaptive bin features. The proposed HA-Bins enjoys several merits. First, the group-wise query generator progressively increases the number of bin queries for multi-scale image features, resulting in a hierarchical bin distribution robust to diverse scenarios. Second, the correlation-aware transformer decoder refines the correlation of bin queries and image features, effectively improving adaptive image feature aggregation. We visualize the query activation maps on NYUDepthv2 dataset, showing that the proposed network effectively suppresses the depth-irrelevant regions. Experiments on KITTI, Sintel, and RabbitAI benchmarks show that without any fine-tuning, our model jointly trained on multiple datasets achieves competitive performance with the state-of-the-art and solid robustness toward diverse scenarios. In addition, our method wins second place in Robust Vision Challenge 2022 towards challenging scenarios with different characteristics.

Abstract:
Semantic segmentation on 3D point clouds is an important task for 3D scene understanding. While dense labeling on 3D data is expensive and time-consuming, only a few works address weakly supervised semantic point cloud segmentation methods to relieve the labeling cost by learning from simpler and cheaper labels. Meanwhile, there are still huge performance gaps between existing weakly supervised methods and state-of-the-art fully supervised methods. In this paper, we propose Dense Supervision Propagation (DSP) to train a semantic point cloud segmentation network with only a small portion of points being labeled. We argue that we can better utilize the limited supervision information as we densely propagate the supervision signal from the labeled points to other points within and across the input samples. Specifically, we propose a cross-sample feature reallocating module to transfer similar features and therefore re-route the gradients across two samples with common classes and an intra-sample feature redistribution module to propagate supervision signals on unlabeled points across and within point cloud samples. We conduct extensive experiments on public datasets S3DIS and ScanNet. Our weakly supervised method with only 10% and 1% of labels can produce competitive results with the fully supervised counterpart.

Abstract:
Skeleton-based methods have recently achieved good performance in deep learning-based gait emotion recognition (DL-GER). However, the current methods have two drawbacks that limit the ability to learn discriminative emotional features from gait. First, these methods do not exclude the effect of the subject’s walking orientation on emotion classification. Second, they do not sufficiently learn the implicit connections between the joints during human walking. In this paper, an augmented spatial-temporal graph convolutional neural network (AST-GCN) is introduced to solve these two problems. The interframe shift encoding (ISE) module acquires interframe shifts of joints to make the network sensitive to changes in emotion-related joint movements regardless of the subject’s walking orientation. A multichannel implicit connection inference method learns more implicit connection relations related to emotions. Notably, we unify current skeleton-based methods into a common framework that validates the most powerful feature representation capability of our AST-GCN from a theoretical perspective. In addition, we extend the skeleton-based gait dataset using posture estimation software. Experiments demonstrate that our AST-GCN outperforms state-of-the-art methods on three datasets on two tasks.

Abstract:
Person re-identification (Re-ID) has played an extremely crucial role in ensuring social safety and has attracted considerable research attention. 3D shape information is an important clue for understanding the posture and shape of pedestrians. However, most existing person Re-ID methods learn pedestrian feature representations from images, ignoring the real 3D human body structure and the spatial relationship between the pedestrians and interferents. To address this problem, we devise a new point cloud Re-ID network (PointReIDNet), designed to obtain 3D shape representations of pedestrians from point clouds of 3D scenes. The model consists of two modules, namely a global semantic guidance module and a local feature extraction module. The global semantic guidance module is designed to enhance the point cloud feature representation in similar feature neighborhoods and to reduce the interference caused by 3D shape reconstruction or noise. Further, to provide an efficient representation of point clouds, we propose space cover convolution (SC-Conv), which efficiently encodes information on human shapes in local point clouds by constructing anisotropic geometries in the coordinate neighborhoods. Extensive experiments are conducted on four holistic person Re-ID datasets, one occlusion person Re-ID dataset and one point cloud classification dataset. The results exhibit significant improvements over point-cloud-based person Re-ID methods. In particular, the proposed efficient PointReIDNet decreases the number of parameters from 2.30M to 0.35M with an insignificant drop in performance. The source code is available at: https://github.com/changshuowang/PointReIDNet.

Affiliations: Department of Automation, Nanjing University of Science and Technology, Nanjing, China; Department of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu, China; State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, Henan, China; Shandong Provincial Key Laboratory of Computer Networks, Jinan, Shandong, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen University, Guangzhou, China

Abstract:
Detection of aligned double Joint Photographic Experts Group (JPEG) compressed images is a crucial area of research within the field of digital image forensics. The detection tasks for aligned double JPEG compression can be categorized into two sub-tasks, namely detecting double JPEG images with the same quantization matrix (DJSQM) or double JPEG images with different quantization matrices (DJDQM). Existing methods for one of these sub-tasks may not be effective for the other. To address this issue, a novel approach is proposed by recompressing both DJDQM and DJSQM using modified quantization coefficients. The perturbation in the recompression process results in a perturbed error image, which is valid for both DJDQM and DJSQM. Subsequently, the relative change rate is used to combine the perturbed error image, the original error image, and the quantization error to derive the interference error and the interference quantization error. The interference error and interference quantization error further expand the difference between single and double compressed images by preserving the general validity of the original image information. Furthermore, the recompression process of DJDQM and DJSQM results in the conversion of truncation and rounding errors at the pixel level, which can be represented by the pixel state map. The pixel state map characterizes the differing transformation relationships between single and double compressed images and provides additional valid features, thereby enhancing the performance of the proposed method. The empirical results demonstrate that the proposed method outperforms existing methods on detecting aligned double JPEG compressed images.

Abstract:
Textured meshes are widely used in computer graphics to represent 3D scenes, with UV mapping playing a crucial role in establishing a bijective mapping between the 3D mesh surface and a 2D texture. This mapping not only allows for the enhancement of rendering quality but also enables the compression of mesh textures using standard 2D image or video codecs. However, when reconstructing meshes from real-world multiview images, the resulting UV texture maps often suffer from fragmentation due to geometric inaccuracies and excessive tessellation of the reconstructed surfaces, leading to decreased compression performance. In this paper, we propose a novel and effective preprocessing approach for UV texture map compression based on rate-rendering distortion (R-RD) optimization. Unlike existing methods that rely on padding or smoothing, our method iteratively updates the texture map using the gradient of a joint cost of bitrate and rendering distortion. This cost is estimated through a differentiable image encoder and a differentiable texture sampling. Experimental results with lossless compressed mesh geometry demonstrate that our preprocessing method outperforms existing texture padding methods, achieving BD-rate reductions of at least 10.23%, 15.24%, and 12.10% when combined with JPEG, HEVC, and VVC, respectively. We also validate the effectiveness of our method with lossy compressed meshes using Google Draco, showing improved compression efficiency compared to the lossless geometry scenario. Subjective evaluations further confirm that our method enhances both color and structural continuities in the texture map by automatically eliminating high-frequency components unfavorable to compression. The paper provides comprehensive experiments and analyses, including rate estimation with different choices of differentiable image encoders, texture map distortion vs. rendering distortion, and complexity comparison with existing methods.

Abstract:
Lane detection, one of the crucial foundations of the autonomous driving of Rubber-Tired Gantries (RTGs), plays a vital role in automating manual container terminals. Deep-learning-based lane detection methods have robust and generalized global feature extraction capabilities to deal with complex scenarios well. However, the high preparation cost of large-scale labeled data has limited their application in RTG lane detection. Therefore, this paper presents a cost-effective, scalable incremental learning-based detection method. Specifically, some lane images are collected online, with reliable segmentation labels generated by an image-processing-based lane detection method. Next, a semi-supervised clustering approach is employed to construct a dynamically expanding sample pool, ensuring that samples are representative and diverse. Finally, a lane detection network model is self-trained by using all labeled and unlabeled samples. Extensive experimental results show that our proposed method outperforms existing methods and achieves a lane detection accuracy of 94.87% and a detection success rate of 99.06%, with the potential for further performance improvement as data size increases.

Abstract:
Dynamic expression recognition in the wild is a challenging task due to various obstacles, including low-light conditions, non-frontal faces, and face occlusion. Purely vision-based approaches may not suffice to accurately capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework capable of effectively extracting multimodal information and achieving significant augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is carefully designed to maximize its adaptation to the corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representation among these modalities. The unique combination of self-attention and cross-attention in this module enhances the robustness of output-integrated features in encoding emotion. By mapping the information from audio and textual features to the latent space of visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method on three popular datasets (MAFW, DFEW, and AFEW) through extensive experiments, which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting the power of multimodal features.

Abstract:
Collocated clothing synthesis using generative networks has become an emerging topic in the field of fashion intelligence, as it has significant potential economic value to increase revenue in the fashion industry. In previous studies, several works have attempted to synthesize visually-collocated clothing based on a given clothing item using generative adversarial networks (GANs) with promising results. These works, however, can only accomplish the synthesis of one collocated clothing item each time. Nevertheless, users may require different clothing items to meet their multiple choices due to their personal tastes and different dressing scenarios. To address this limitation, we introduce a novel batch clothing generation framework, named BC-GAN, which is able to synthesize multiple visually-collocated clothing images simultaneously. In particular, to further improve the fashion compatibility of synthetic results, BC-GAN proposes a new fashion compatibility discriminator in a contrastive learning perspective by fully exploiting the collocation relationship among all clothing items. Our model was examined in a large-scale dataset with compatible outfits constructed by ourselves. Extensive experiment results confirmed the effectiveness of our proposed BC-GAN in comparison to state-of-the-art methods in terms of diversity, visual authenticity, and fashion compatibility.

Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, China; Department of Internet of Things Engineering, Hohai University, Changzhou, China; School of Artificial Intelligence, China University of Mining and Technology, Beijing, China; School of Computer Science and Engineering, Northeastern University, Shenyang, China; School of Computer Science, Shenyang Aerospace University, Shenyang, China

Abstract:
Although semantic segmentation methods have made remarkable progress so far, their long inference process limits their use in practical applications. Recently, some two-branch and three-branch real-time segmentation networks have been proposed to improve segmentation accuracy by adding branches to extract spatial or border information. For the design of extracting spatial information branches, preserving high-resolution features or adding segmentation loss to guide spatial branches are commonly used methods to extract spatial information. However, these approaches are not the most efficient. To solve the problem, we design the spatial information extraction branch as an AutoEncoder structure, which allows us to extract the spatial structure and features of the image during the encoding and decoding process of the AutoEncoder. Border, semantic and spatial information are all helpful for segmentation tasks, and efficiently fusing these three kinds of information can obtain better feature representation compared to the fusion of two types of information in the dual-branch network. However, existing three-branch networks have yet to explore this aspect deeply. Therefore, this paper designs a new three-branch network based on this starting point. In addition, we also propose a feature fusion module called the Unified Multi-Feature Fusion module (UMF), which can fuse multiple features efficiently. Our method achieves a state-of-the-art trade-off between inference speed and accuracy on the Cityscapes, CamVid, and NightCity datasets. Specifically, BSSNet-T achieves 78.8% mIoU at 115.8 FPS on the Cityscapes dataset, 79.5% mIoU at 170.8 FPS on the CamVid dataset, and 52.6% mIoU at 172.3 FPS on the NightCity dataset. Code is available at https://github.com/SXQ-STUDY/BSSNet.

Abstract:
Face enhancement aims to improve low-quality face images to a higher-quality level. However, in real-world nighttime scenes, complex degradation factors often affect these images, making it challenging to preserve important facial details. Existing image enhancement algorithms typically focus on independently conducting image super-resolution and brightness enhancement, assuming a fixed degradation level based on simulated training datasets. Nonetheless, real nighttime scenes involve complex degradation processes, where degradation factors dynamically and variably manifest. Therefore, achieving effective face enhancement in such scenarios is particularly daunting. This work analyzes and unveils the multiple factors of low resolution and low illumination during degradation. Based on this analysis, we propose a Bi-factor Degradation Decoupling network. Our method leverages a decoupling network to generate qualitative and quantitative features corresponding to each factor’s degradation degree in the low-quality environment. These features are then combined with robust facial feature constraints to recover the details of low-quality faces. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches in both enhancement and face super-resolution.

Abstract:
In recent years, many incomplete multi-view clustering methods have been proposed to address the new and challenging task of clustering incomplete multi-view data, in which some view representations are missing for some samples. Although extensive experiments have validated the effectiveness of these methods for handling the incomplete learning issue, they share a common limitation: they all ignore the difference in importance between discriminative features and noisy features. In this paper, to address this issue, a new incomplete multi-view clustering model, called Graph Regularized and fEature Aware maTrix Factorization (GreatF), is proposed. Different from the existing methods, we introduce an adaptive feature weighting constraint to the matrix factorization-based multi-view representation learning model. With this weighting constraint, the effect of the discriminative features can be enhanced while the negative effect caused by the redundant and noisy features can be eliminated for the model optimization; thus, the robustness of the model can be enhanced. In addition, we design a new graph-embedded consensus representation learning term in which consensus representation learning and structure information preservation are integrated into a joint model with one term. In particular, this term provides a more concise approach to obtain the structured consensus representation from incomplete multi-view data. Experimental results on four well-known datasets demonstrate that GreatF performs better than the state-of-the-art incomplete multi-view clustering methods.

Abstract:
Deep learning techniques have largely solved the problem of rail surface defect detection (SDD); however, two aspects have yet to be addressed. In most existing approaches, the two red–green–blue and depth (RGB-D) streams are indiscriminately fused across modalities, ignoring the fact that RGB and depth images produce different feature qualities in different scenes. Additionally, in their focus on performance, previous studies have overlooked the fact that the models carry a large number of parameters, which makes practical deployment unrealistic. To address these challenges, we designed a modal evaluation network (MENet) via knowledge distillation (KD) (MENet-S) for no-service rail SDD to adaptively manage information in each scenario and achieve model compression. First, to dynamically adjust the feature distribution and quality, dynamic and static feature coding ideas are introduced. Second, modal evaluation distillation is introduced, which allows a compact model (MENet-S) to learn the feature evaluation process of a complex model (MENet-T). Third, to enable MENet-S to learn the dynamic encoding process of MENet-T and to improve the feature representation of MENet-S, we propose accessible knowledge distillation. Furthermore, multitiered KD is introduced to facilitate the learning of MENet-S. Based on extensive experiments using the industrial RGB-D dataset NEU RSDDS-AUG, we observed that MENet-S (MENet-S with KD) outperformed 16 state-of-the-art methods. In addition, to demonstrate the generalization capability of MENet-S, we evaluated the proposed network on three additional public datasets, and MENet-S achieved competitive results. The source codes and results are available at https://github.com/hjklearn/MENet-KD.

Abstract:
Implicit scene representations have recently shown promising results in photo-realistic 3D reconstruction and view synthesis based on calibrated views. However, their applications face several challenges, including unknown camera pose, boundary ambiguity, and observation noise. This paper proposes a novel online scene representation method that simultaneously learns to represent the target scene and estimates the camera poses from an RGB-D stream. An implicit scene representation function built with scale-encoded cascaded grids is proposed to represent scenes online from incremental observations. This implicit function is optimized in a reparameterized domain that provides defined boundaries. In this reparameterized domain, the cascaded grids are progressively distilled under geometric and photometric supervision to improve their model capacity and geometry accuracy. A radiance field deblurring module based on the physical imaging process is further proposed to restore a photo-realistic reconstruction against camera motion blur, which is the main component of the observation noise. The proposed method can produce sharp and photo-realistic representations of scenes under various shooting conditions without known camera poses. Experiments on multiple datasets have demonstrated the effectiveness of the proposed method in improving view synthesis and camera tracking results for online scene representation tasks.

Abstract:
Depth estimation from light field (LF) images is a fundamental step for numerous applications. Recently, learning-based methods have achieved higher accuracy and efficiency than the traditional methods. However, it is costly to obtain sufficient depth labels for supervised training. In this paper, we propose an unsupervised framework to estimate depth from LF images. First, we design a disparity estimation network (DispNet) with a coarse-to-fine structure to predict disparity maps from different view combinations. It explicitly performs multi-view feature matching to learn the correspondences effectively. As occlusions may cause the violation of photo-consistency, we introduce an occlusion prediction network (OccNet) to predict the occlusion maps, which are used as the element-wise weights of photometric loss to solve the occlusion issue and assist the disparity learning. With the disparity maps estimated by multiple input combinations, we then propose a disparity fusion strategy based on the estimated errors with effective occlusion handling to obtain the final disparity map with higher accuracy. Experimental results demonstrate that our method achieves superior performance on both the dense and sparse LF images, and also shows better robustness and generalization on the real-world LF images compared to the other methods.
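As a rough illustration of using an occlusion map as element-wise weights of a photometric loss, the sketch below down-weights pixels predicted as occluded. The choices here are assumptions for illustration only: an L1 photometric error, a weight of `1 - occlusion`, normalisation by the total weight, and the function name itself. The abstract does not specify these details.

```python
def occlusion_weighted_photometric_loss(warped, reference, occlusion, eps=1e-8):
    """Weighted mean absolute photometric error.

    warped:    2D list, source view warped to the reference view using
               the predicted disparity.
    reference: 2D list, the reference view.
    occlusion: 2D list with values in [0, 1]; 1 means fully occluded
               (assumed convention for this sketch).
    """
    num = 0.0
    den = 0.0
    for w_row, r_row, o_row in zip(warped, reference, occlusion):
        for w, r, o in zip(w_row, r_row, o_row):
            weight = 1.0 - o  # occluded pixels contribute less
            num += weight * abs(w - r)
            den += weight
    return num / (den + eps)


warped = [[1.0, 5.0]]
reference = [[1.0, 0.0]]
# The second pixel violates photo-consistency; marking it as occluded
# removes its contribution from the loss.
print(occlusion_weighted_photometric_loss(warped, reference, [[0.0, 1.0]]))
print(occlusion_weighted_photometric_loss(warped, reference, [[0.0, 0.0]]))
```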

Abstract:
Imbalanced label distribution is usually the case for real-world data, which poses a challenge for training an unbiased recognition model. In this paper, we study two underlying mismatches, i.e., distribution mismatch and probability space mismatch, present in class-imbalanced learning. Firstly, we analyze the label distribution mismatch between imbalanced training data and balanced test data, and introduce a distribution unified framework to unify the two distributions through probability conversion. Secondly, we show that the use of cross-entropy loss under the proposed framework may lead to probability space mismatch, where the conversion of the predictive probability is implemented in softmax probability space while the comparison with the one-hot label is implemented in true probability space. To alleviate this dilemma, we introduce a teacher model and formulate a teacher-student learning strategy, which contains two novel techniques. The Teacher Guided Label Smoothing (TGLS) is first proposed to relax the one-hot label to a smoother pseudo softmax probability, which is more aligned with the softmax probability space. Additionally, we propose Distribution Unified Knowledge Distillation (DU-KD) under the proposed framework to further reduce both mismatches. Experiments on several benchmarks confirm the top-level performance of the proposed method.
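One common way to convert predictive probabilities between two label distributions is the Bayes prior-shift correction, sketched below. This is an assumption about what "probability conversion" could look like, not the formulation used in the paper; `convert_probs` and its parameters are hypothetical names.

```python
import math


def convert_probs(logits, train_prior, test_prior=None):
    """Re-weight softmax probabilities from the training label prior to a
    test label prior (uniform by default), then renormalise."""
    k = len(logits)
    if test_prior is None:
        test_prior = [1.0 / k] * k
    # p_test(y|x) is proportional to p_train(y|x) * test_prior(y) / train_prior(y);
    # the computation is done in log space for numerical stability.
    adjusted = [z - math.log(p_tr) + math.log(p_te)
                for z, p_tr, p_te in zip(logits, train_prior, test_prior)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


# A prediction that merely reproduces a 9:1 training prior carries no
# evidence once the prior is removed.
logits = [math.log(0.9), math.log(0.1)]
print(convert_probs(logits, train_prior=[0.9, 0.1]))
```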

Abstract:
We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people always pay attention to several “significant words” when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to promote the pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state-of-the-art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase and pre-trained models are available at https://github.com/dongxingning/SNPS3.

Abstract:
HEVC (High Efficiency Video Coding) provides abundant embedding carriers for video steganography, leading to rapid development in the field of video steganography while increasing the urgent demand for video steganalysis. However, existing steganalysis methods against PU (prediction unit) based steganography primarily use the extraction of video statistical features, which ignore the potential information of each frame and fail to effectively detect different PU-based steganography methods. In this paper, a video steganalysis method based on PU maps and multi-scale convolutional residual network is proposed. Firstly, the effects of PU-based steganography on the spatial domain and the compressed domain are analyzed. It is observed that steganography has less impact on the spatial domain, whereas it significantly disrupts the connection between PU blocks in the compressed domain, leaving distinct steganographic traces. Consequently, the PU partition modes containing local connections are introduced to generate PU maps for steganalysis. Secondly, a video steganalysis network called PUSN (Prediction Unit Steganalysis Network) is constructed. The network takes PU maps as input and consists of three parts: feature extraction, feature representation, and binary classification. Additionally, a multi-scale module is proposed to enhance the detection performance. Finally, the detection result of the steganographic video is obtained by the voting mechanism. The experimental results show that compared with the existing steganalysis methods, the proposed method could effectively detect multiple PU-based steganography methods and achieve higher detection accuracy across various embedding rates.
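The final voting step can be sketched as a majority vote over per-PU-map predictions. The abstract does not describe the exact voting rule, so the decision threshold, the strict-majority rule, and the handling of ties (a tie is treated as "cover") are assumptions, and `video_decision` is a hypothetical name.

```python
def video_decision(map_scores, threshold=0.5):
    """Majority vote over per-PU-map stego probabilities.

    map_scores: list of probabilities in [0, 1], one per PU map.
    Returns True when the video is judged to be steganographic.
    """
    if not map_scores:
        raise ValueError("need at least one PU-map score")
    votes = sum(1 for s in map_scores if s >= threshold)
    # Strict majority; a tie is resolved as "cover" in this sketch.
    return 2 * votes > len(map_scores)


print(video_decision([0.9, 0.8, 0.2]))
print(video_decision([0.9, 0.1, 0.2]))
```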

Abstract:
Domain generalization in person re-identification (ReID) aims to design a generalizable model, which is trained under the supervision of a set of labeled source domains and can be directly deployed on unknown domains. Existing approaches simply treat each identity as a distinct class and ignore the differences among cameras. We argue that the camera information is crucial for learning discriminative representations, as people’s behavior usually varies between cameras. In this paper, we present Multi-Centroid Memory (MCM) to capture different camera information for each identity and Soft Triple Hard (ST-Hard) loss to align the information of the same identity across cameras. Furthermore, in contrast to the traditional approaches of training a single model using a parallel training mechanism, we propose the Recurrent Implicit Lifelong Learning (RILL) that feeds the source domains into the model in a continuous loop to train an expert for each domain. To make each expert further generalized to other source domains, during the training on the current domain, RILL adopts a style replay-based method to simulate the training of the previous domain, encouraging each domain’s expert to extract generalizable features. We also present Earth Mover’s Test-time Adaption (EMTA) to be used in conjunction with RILL, which enables source domains that are more similar to the test domain to play a more significant role in the test. This is achieved by our proposed Earth Mover’s Similarity (EMS), which helps model the similarities between the source and test domains. Extensive experiments on two evaluation protocols fully demonstrate our framework’s generalization and competitiveness.

Abstract:
Document key information extraction (DKIE) is a crucial topic that aims at automatically comprehending documents with complex formats and layouts (invoices, business insurance, etc.). While pre-trained approaches have shown high performance on many DKIE tasks, they suffer from three major challenges. First, these approaches ignore the ambiguity resulting from similar text representations before cross-modal interaction. Second, they do not consider cross-modal representation alignment before cross-modal interaction. Finally, self-attention layers in cross-modal interaction incur significant computing costs, making it hard to perform joint representation learning from all negative samples. To address these issues, we propose a Dynamical Cross-Modal Alignment Interaction framework (DCMAI). Specifically, 1) a prior knowledge-guided module is designed to adaptively mine fine-grained visual information to disambiguate similar text representations; 2) a crossover alignment loss is formulated to align cross-modal representations before cross-modal interaction; 3) a hierarchical interaction sampling scheme is introduced to obtain a small but efficient subset of cross-modal negative samples, and a contrastive loss is employed to improve joint representation learning. Comprehensive experiments show that the proposed DCMAI achieves state-of-the-art performance compared with competitive baselines on several public downstream benchmarks. Code will be open to the public.

Abstract:
In natural language processing, relation extraction (RE) is to detect and classify the semantic relationship of two given entities within a sentence. Previous RE methods consider only the textual contents and suffer performance decline in social media when texts lack contexts. Incorporating text-related visual information can supplement the missing semantics for relation extraction in social media posts. However, textual relations are usually abstract and of high-level semantics, which causes the semantic gap between visual contents and textual expressions. In this paper, we propose RECK - a neural network for relation extraction with cross-modal knowledge representations. Different from previous multimodal methods training a common subspace for all modalities, we bridge the semantic gaps by explicitly selecting knowledge paths from external knowledge through the cross-modal object-entity pairs. We further extend the paths into a knowledge graph, and adopt a graph attention network to capture the multi-grained relevant concepts which can provide higher level and key semantics information from external knowledge. Besides, we employ a cross-modal attention mechanism to align and fuse the multimodal information. Experimental results on a multimodal RE dataset show that our model achieves new state-of-the-art performance with knowledge evidence.

Abstract:
Pixel-value-ordering (PVO) is one of the most popular methods in reversible data hiding (RDH). In PVO-based methods, pixels are processed block by block, so the local similarities of the image are considered but the global statistical characteristics are ignored. To better utilize the correlations among pixels, this paper proposes a global sorting strategy that combines the local and global characteristics of the image. For each pixel, its prediction value and local complexity are first calculated based on its local characteristics. Then the image pixels are sorted globally according to their prediction values to generate a single sorted pixel sequence, in which pixels with equal prediction values are sorted again by their local complexities. In this way, the constraint of spatial distance between image pixels is broken so that the global statistical characteristics can be well exploited. With the proposed sorting strategy, we can obtain a more regular 2D histogram by segmenting the sorted sequence for the location-based PVO (LPVO) predictor. Owing to the regular 2D histogram, we have designed an efficient 2D mapping to achieve perfect performance for all the tested images. With the proposed RDH scheme, the PSNR of the image Lena is as high as 61.86 dB and the average PSNR of the Kodak dataset reaches 63.55 dB after embedding 10,000 bits. The superiority of the proposed method has been verified by comparing with recent state-of-the-art RDH methods.
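The global sorting step can be sketched as a two-key sort. The predictor (rounded mean of the four neighbours) and the complexity measure (range of the four neighbours) are assumptions chosen for brevity, and `global_sort` is a hypothetical name. A real reversible scheme must also compute both keys from pixels that stay unmodified during embedding so that the decoder can reproduce the same order; this sketch ignores that requirement.

```python
def global_sort(image):
    """Return interior pixel coordinates ordered by prediction value,
    with ties broken by local complexity.

    image: 2D list of integer gray values.
    """
    h, w = len(image), len(image[0])
    keyed = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nbrs = [image[y - 1][x], image[y + 1][x],
                    image[y][x - 1], image[y][x + 1]]
            prediction = (sum(nbrs) + 2) // 4   # rounded mean (assumed predictor)
            complexity = max(nbrs) - min(nbrs)  # assumed complexity measure
            keyed.append((prediction, complexity, y, x))
    # Primary key: prediction value; secondary key: local complexity.
    keyed.sort()
    return [(y, x) for _, _, y, x in keyed]


img = [[0, 0, 10, 0],
       [10, 10, 10, 10],
       [0, 20, 10, 0]]
# Both interior pixels are predicted as 10; the one in the smoother
# neighbourhood comes first.
print(global_sort(img))
```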

Abstract:
We observe that a natural image tends to exhibit similar histograms for color channels in the RGB color space and consistent statistical estimates for color channels in the Lab color space. We refer to these observations as natural color consistencies. In contrast, we discover that an underwater image does not always follow the natural color consistencies. Different color channels in an underwater image tend to give rise to very different distributions, regardless of whether the channels are in the RGB color space or Lab color space. We refer to these observations as underwater color disparities. To enhance an underwater image to make it appear more natural, it is necessary to correct its underwater color disparities to align with the natural color consistencies. To this end, we develop an adaptive attenuated channel compensation method based on optimal channel precorrection and a salient absorption map-guided fusion method for eliminating the color deviation in the RGB color space. We then develop a method to enhance the contrast of channel L and an adaptive color distribution specification method for improving the contrast and matching the color distribution in the Lab color space. Additionally, we develop an edge-enhanced mask fusion method for correcting blurry details. Our method is not a deep learning method but can effectively be applied to a single underwater image. The qualitative and quantitative empirical results validate that our method outperforms state-of-the-art underwater image enhancement methods. We release the reproducible code at https://gitee.com/wanghaoupc/Underwater_Color_Disparities for public evaluation.

Abstract:
Facial expression recognition (FER) in the wild is challenging due to various unconstrained conditions, e.g., occlusions and head pose variations. Previous methods tend to improve the performance of facial expression recognition by resorting to holistic methods or coarse local-based methods, while ignoring the local fine-grained feature structure knowledge and the correlation between features. In this paper, we propose a Fine-Grained Association Graph Representation (FG-AGR) framework that can capture the local fine-grained facial expression representation. Firstly, an Adaptive Salient Region Induction (ASRI) is designed to adaptively highlight the locally salient regions of facial expressions in combination with spatial location information. Secondly, building on this, a Local Fine-grained Feature Extraction (LFFE) based on Visual Transformers is introduced to further extract subtle but discriminative fine-grained features of the salient regions. Thirdly, an Adaptive Graph Association Reasoning (AGAR) based on Graph Convolutional Network is constructed to learn associated fine-grained feature combinations. Extensive experiments demonstrate that our FG-AGR achieves superior performance compared to the state-of-the-art methods with 90.81% on RAF-DB, 64.91% on AffectNet-7, 60.69% on AffectNet-8 and 91.09% on FERPlus.

Abstract:
Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for 3D human pose estimation (3D-HPE) in the video. Recent studies have shown the promising potential of modeling the pose relation with feature-mixing operations on the temporal domain. However, they seldom consider the interaction across poses in the frequency domain. This paper studies a Frequency-Temporal Collaborative Module (FTCM) to explore the feasibility of encoding the cross-pose correlations in both frequency and temporal domains. FTCM aims to jointly capture the global and local cross-pose correlations with a more lightweight network model. Specifically, FTCM splits the pose features into two groups along the channel dimension and separately models the frequency and temporal interactions across poses with different feature-mixing operations in parallel. To achieve this goal, we purposely design two pose-mixing units, i.e., the frequency pose-mixing (FPM) and the temporal pose-mixing (TPM). Particularly, FPM is designed to reap the global correlations among different pose frequencies with the representation obtained by converting the original pose signals with Fast Fourier transform (FFT). Unlike the pose-mixing used by previous methods like Transformers that influences an individual pose with all other poses, TPM locally calibrates the pose with dynamics aggregated within several adjacent poses in the temporal domain, explicitly weighting neighboring poses more with respect to the far-away ones so as to enforce a strict locality constraint. Besides, the group strategy significantly reduces the model complexity. To verify the effectiveness of FTCM, we conduct extensive experiments on two benchmarks (i.e., Human3.6M and MPI-INF-3DHP). Experimental results not only exhibit favorable accuracy/complexity trade-offs of our FTCM but also show superior or comparable performance to state-of-the-art methods on both datasets. 
The code and model are publicly available at: https://github.com/zhenhuat/FTCM.
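The channel split with parallel frequency and temporal mixing can be illustrated with a dependency-free sketch. Several parts are assumptions: the frequency branch is modelled as per-frequency gains applied between a naive DFT and its inverse, the temporal branch as a normalised local average whose weights decay with temporal distance, and all function and parameter names (`frequency_mix`, `temporal_mix`, `ftcm_block`, `gains`, `radius`, `decay`) are hypothetical. The published code at the URL above is the authoritative implementation.

```python
import cmath
import math


def dft(seq):
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_mix(seq, gains):
    """Global mixing: scale each frequency bin, then return to the time
    domain (real part only; gains would be learned in practice)."""
    spec = dft(seq)
    return [v.real for v in idft([g * s for g, s in zip(gains, spec)])]


def temporal_mix(seq, radius=1, decay=0.5):
    """Local mixing: weighted average over adjacent frames, with weights
    that shrink as the temporal distance grows."""
    out = []
    for t in range(len(seq)):
        num = den = 0.0
        for k in range(-radius, radius + 1):
            if 0 <= t + k < len(seq):
                w = decay ** abs(k)
                num += w * seq[t + k]
                den += w
        out.append(num / den)
    return out


def ftcm_block(features, gains, radius=1, decay=0.5):
    """features: [T][C] pose features. The first half of the channels use
    the frequency branch, the second half the temporal branch."""
    T, C = len(features), len(features[0])
    half = C // 2
    out = [[0.0] * C for _ in range(T)]
    for c in range(C):
        seq = [features[t][c] for t in range(T)]
        mixed = (frequency_mix(seq, gains) if c < half
                 else temporal_mix(seq, radius, decay))
        for t in range(T):
            out[t][c] = mixed[t]
    return out


feats = [[1.0, 0.0], [3.0, 0.0], [5.0, 4.0], [7.0, 0.0], [2.0, 0.0]]
print(ftcm_block(feats, gains=[1.0] * 5))
```

Because each branch processes only half of the channels, the per-branch cost is roughly halved compared with applying either operation to all channels, which is in line with the complexity reduction the abstract attributes to the group strategy.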

Abstract:
Rich and complex events in sports have led to the development of a wide variety of techniques for interpreting the content of sports videos in terms of players’ actions, poses, gait, performance, etc. This is driven by the requirements of coaches, trainers, and players who expect to analyze actions in top sports events, as well as sports fans who practice imitating professional playing skills, e.g., dribbling and shooting. However, this poses two key challenges for the automated sports analysis community. Firstly, there are extremely limited public sports datasets. Secondly, recent advances in the interpretation of sports activities, e.g., soccer, are predominantly made by analyzing coarse-grained content, while the analysis of players’ fine-grained skills remains under-explored. To alleviate these problems, this paper (a) collects a dataset of highlight videos of soccer players, covering two coarse-grained action types and six fine-grained actions, with detailed annotations in terms of action classes, bounding boxes, segmentation maps, and body keypoints of soccer players, as well as the positions of the soccer ball in a game; and (b) advances the understanding of complex highlight videos by proposing an energy-motion feature aggregation network (EMANet), which fully exploits the energy-based representation of soccer players’ movements in video sequences and the explicit motion dynamics of soccer players in videos for fine-grained action analysis. Experimental results and ablation studies validate the proposed approach in recognizing soccer players’ actions on the collected soccer highlight video dataset.

Abstract:
Existing cross-domain keypoint detection methods always require accessing the source data during adaptation, which may violate the data privacy law and pose serious security concerns. Instead, this paper considers a realistic problem setting called source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain. For the challenging problem, we first construct a teacher-student learning baseline by stabilizing the predictions under data augmentation and network ensembles. Built on this, we further propose a unified approach, Mixup Augmentation and Progressive Selection (MAPS), to fully exploit the noisy pseudo labels of unlabeled target data during training. On the one hand, MAPS regularizes the model to favor simple linear behavior in-between the target samples via self-mixup augmentation, preventing the model from over-fitting to noisy predictions. On the other hand, MAPS employs the self-paced learning paradigm and progressively selects pseudo-labeled samples from ‘easy’ to ‘hard’ into the training process to reduce noise accumulation. Results on four keypoint detection datasets show that MAPS outperforms the baseline and achieves comparable or even better results in comparison to previous non-source-free counterparts. The code is available at https://github.com/YuheD/MAPS.
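The two ingredients named above, self-mixup and easy-to-hard selection, can be sketched as follows. The linear growth schedule, the use of a scalar confidence per sample, and the names `self_mixup` and `select_easy` are assumptions for illustration; the released code at the URL above defines the actual procedure.

```python
def self_mixup(x_a, x_b, y_a, y_b, lam):
    """Convex combination of two target samples and of their pseudo
    labels (flattened to lists), encouraging linear behaviour in between."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def select_easy(confidences, epoch, total_epochs, start=0.2, end=1.0):
    """Indices of the most confident pseudo-labelled samples. The kept
    fraction grows linearly from `start` to `end` over training, so
    harder samples enter later (assumed schedule)."""
    progress = epoch / max(total_epochs - 1, 1)
    frac = start + (end - start) * progress
    keep = max(1, int(round(frac * len(confidences))))
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    return sorted(order[:keep])


conf = [0.9, 0.1, 0.8, 0.3, 0.7]
for epoch in (0, 2, 4):
    print(epoch, select_easy(conf, epoch, total_epochs=5))
print(self_mixup([0.0, 4.0], [4.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.25))
```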

Abstract:
With the increasing popularity of intelligent surveillance systems, abnormal behavior detection of human beings based on computer vision is attracting more attention. It aims to classify and locate the abnormal behaviors and coordinates of human beings, respectively, and is a fundamental technology for intelligent security. Existing approaches mainly focus on exploring abnormal behavior features through object detectors. However, in office scenarios, almost all abnormal behaviors are closely associated with the fine-grained feature around the nose, wrist, elbow, and other human joint points regions. Detectors for generic objects cannot adequately capture such differences between abnormal behaviors, resulting in sub-optimal performance. In this paper, we focus on human joints and take one step further to enable effective behavior characteristics learning in office scenarios. In particular, we propose a novel Adaptive Joints Enhancement Network (AJENet), which includes two closely-related components, Joints Predict block (JP) and Adaptive Key Joints Enhancement block (AKJE). JP block is used to predict the human joints and facilitates the feature learning around them implicitly. By inputting the features around joints, the AKJE block enhances the feature representations of key joints according to the abnormal behavior characteristics adaptively. Experimental results demonstrate that our method outperforms other state-of-the-art methods on the collected real office scenario Office Behavior Dataset. Besides, to verify the generalization capabilities and potential of AJENet, we construct comparisons on another generic dataset PASCAL VOC 2012 Action.

Abstract:
Partial domain adaptation (PDA) assumes that the class label set of the target domain is a subset of that of the source domain, a setting that is close to real-world scenarios. At present, there are mainly two methods for mitigating overfitting to the source domain in PDA, namely entropy minimization and weighted self-training. However, entropy minimization may make predictions sharp but inaccurate for samples whose prediction distribution is relatively uniform, causing the model to learn more erroneous information. Weighted self-training, in turn, introduces erroneous noise information into the self-training process because of noisy weights. We address these issues by proposing a self-training contrastive partial domain adaptation method (STCPDA). We present two modules to mine domain information in STCPDA. First, we design a self-training module based on simple samples in the target domain to address overfitting to the source domain. We divide the target domain samples into simple samples with high reliability and difficult samples with low reliability, and select the pseudo-labels of the simple samples for self-training. Then we construct a contrastive learning module for the source and target domains, embedding contrastive learning into the feature space of the two domains. With this module, we can fully explore the hidden information in all domain samples and make the class boundaries more salient. Extensive experiments on five datasets show the effectiveness and excellent classification performance of our method.
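
To make the two notions in this abstract concrete, the sketch below computes the Shannon entropy of a prediction distribution and splits target samples into reliable and unreliable groups. The maximum-probability criterion and the `threshold` parameter are assumptions for illustration; the abstract does not state how reliability is measured.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_reliability(predictions, threshold):
    """Split samples into 'simple' and 'difficult' groups.

    A sample counts as simple when its top class probability reaches the
    threshold (assumed criterion). Simple samples are returned together
    with their pseudo-label; difficult samples are returned by index only.
    """
    simple, difficult = [], []
    for index, probs in enumerate(predictions):
        confidence = max(probs)
        if confidence >= threshold:
            simple.append((index, probs.index(confidence)))
        else:
            difficult.append(index)
    return simple, difficult
```

A near-uniform prediction has high entropy, which is exactly the case in which pushing the entropy down can sharpen a wrong class.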

Abstract:
In recent years, stereo image super-resolution based on convolutional neural networks has been extensively researched and has achieved impressive performance by introducing complementary information from another view. However, most existing methods still cannot fully capture both intra- and cross-view information because they neglect multi-scale information perception, multi-scale binocular alignment, and the excitation of large scale to small scale in the human visual system. They also generate blurry results because irrelevant information is considered when searching for cross-view information. To address these issues, we propose a multi-scale visual perception based progressive feature interaction network (MS-PFINet) for stereo image super-resolution. Specifically, to exploit comprehensive intra- and cross-view information for image reconstruction, we design a two-stream network with a multi-branch structure to extract multi-scale features and progressively use cross-view interaction at larger scales to guide that at smaller scales. Moreover, to obtain more suitable and accurate cross-view information, we propose a feature transformer module (FTM) that searches for and transfers the most relevant features from the other view using hard attention maps and soft attention maps, which are calculated by patch-wise rather than pixel-wise similarity. In addition, to encourage more effective transfer of texture features to the target view, we propose a perceptual texture matching loss to supervise the accuracy of the feature transformer modules. Experimental results show that our proposed method is superior to the state-of-the-art methods in most cases.

Abstract:
Light field images (LFIs) are becoming increasingly popular in immersive media applications. Unlike traditional 2D and 3D images, images taken by light field cameras capture both angular and spatial information. However, the spatial and angular information of an LFI is highly intertwined with varying disparities, which makes quality assessment of LFIs more challenging. To address this issue, this paper proposes a full-reference light field image quality assessment (LFIQA) index that attempts to disentangle the coupled information in the macro-pixel image (MacPI) to accurately evaluate the quality of the entire LFI. The proposed framework can be divided into three steps. First, the LFIs are converted into MacPIs, and the spatial and angular feature maps are disentangled using spatial, angular and epipolar plane image (EPI) convolutions in the MacPI mode. Second, structural similarity (SSIM) maps are calculated between the disentangled feature maps of the original and distorted LFIs, and quality-aware features of the LFIs are extracted from the SSIM maps using local binary patterns (LBP) and natural scene statistics (NSS). Finally, support vector regression (SVR) is used to predict the quality of the LFIs. Extensive experiments show that the proposed model outperforms multiple classical and state-of-the-art methods.
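
As a small illustration of the local binary pattern (LBP) descriptor mentioned in the second step, the sketch below computes the basic 8-neighbor LBP code and its histogram on a 2D map. It shows the generic textbook operator only; the neighbor ordering, the `>=` comparison, and the function names are assumptions, and the paper may use a different LBP variant.

```python
def lbp_code(image, row, col):
    """Basic 8-neighbor LBP code of the pixel at (row, col).

    Each neighbor that is >= the center contributes one bit. Neighbors are
    visited clockwise starting at the top-left (an assumed convention).
    """
    center = image[row][col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (d_row, d_col) in enumerate(offsets):
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```

A histogram of this kind, computed on each similarity map, could then serve as one feature vector fed to a regressor.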

Abstract:
Because of their huge computational cost, high inference latency and large memory consumption, existing 3D point cloud object detection methods are hard to deploy on Internet of Things (IoT) edge devices. To handle this challenge, we present an extremely tiny framework termed TinyPillarNet. This framework leverages an innovative pillar encoder to represent the point cloud as very small pseudo-maps, drastically shrinking the input 3D sensing data. Moreover, a compact dual-stream feature extraction network is put forward to extract the intrinsic feature and the distributional saliency map, respectively, which jointly boost the detection precision at the lowest hardware cost. Extensive experiments on the KITTI benchmark demonstrate that TinyPillarNet yields applicable precision with a record tiny weight size of 1.69 MB and an inference speed 1.67 times faster than the current record. Furthermore, the specially designed prototype verification system achieves superior energy efficiency, outperforming similar deep learning based point cloud processing solutions on FPGA by a large margin.

Abstract:
This paper presents a Unified Facial image and video Restoration method based on the Diffusion probabilistic model (UniFRD), designed to effectively address both single- and multi-type image degradation. The noise predictor in UniFRD consists of a ViT-based encoder and a novel Separation Fusion Decoding Module (SFDM). The flexible feature optimization strategy allows for decoding complex conditional noise without being limited by degradation patterns. Specifically, SFDM adjusts and refines the channel correlation and expressive power of high-dimensional features step by step, enabling the network to more accurately perceive and enhance the interaction between posterior probabilities and conditional inputs. This process is crucial for improving the visual quality and stability of the restoration results. Extensive experiments demonstrate that even when facial images suffer from both pixel-level and image-level degradation, UniFRD can still guarantee the restoration of rich details and maintain attribute consistency. In summary, compared to existing methods, the solution proposed in this study for facial restoration tasks offers greater generality and adaptability. Moreover, it has high practical value for applications involving faces in complex and unconstrained outdoor scenarios.

Abstract:
Video-based facial expression recognition (FER) is a challenging task due to the dynamic emotional changes across frames in video sequences. This paper proposes a novel coarse-fine aware network with static-dynamic adaptation (CFAN-SDA) for in-the-wild video-based FER. From coarse to fine, our method leverages a cross-domain static FER database to boost video-based FER performance, and then explores hierarchical spatial-temporal feature learning. Specifically, different from existing methods, we design a static-dynamic adaptation learning scheme to explore knowledge transfer from labeled static images to unlabeled video frames, which captures coarse-grained emotion features to find the important expression-related frames. Furthermore, we present hierarchical spatial-temporal transformers to better learn fine-grained expression features, which consist of a multi-view spatial transformer and a frame-clip temporal transformer. The former captures multi-view spatial region information from global to local, and the latter achieves cross-frame and cross-clip temporal interaction to select the key frame-level and clip-level multi-scale temporal information for fusion. Extensive experimental results on dynamic FER databases indicate that CFAN-SDA achieves superior performance compared to the state-of-the-art models.

Abstract:
3D Masked Point Modeling (MPM) typically involves discarding points or patches, either at random or in blocks, and then reconstructing them, offering a promising avenue for exploring geometric representation. By surveying current masking strategies, we have found that random-masked regions are provided with excessive context, which reduces modeling difficulty but impedes knowledge transfer. In contrast, block-masked regions lack sufficient guidance, resulting in significant generated noise. To address these issues, we propose PTM, a novel Transformer-style 3D MPM method employing a torus masking strategy. Specifically, a high-density area is chosen as the masked region, forming a torus by retaining small-radius neighborhoods around the center point. To mitigate torus modeling noise, the designed robust teacher model captures density scale to construct noise embedding, utilizing a reverse fit function for reconstruction assistance. Furthermore, the proposed trusted teacher model defines the multi-modal global descriptor as subjective evidence. On a semantic level, we form semi-subjective trusted evidence to guide reconstruction by evaluating the contribution of each subjective evidence to 3D representation. Downstream fine-tuning tasks validate the state-of-the-art performance of PTM in multi-scale point cloud classification and segmentation.

Abstract:
The Depth-Image-Based Rendering (DIBR) algorithm is pivotal in the advancement of Virtual Reality (VR) and Augmented Reality (AR) technologies due to its capacity to generate virtual views from arbitrary perspectives. Nonetheless, the generation process is often marred by the occurrence of holes due to sharp depth transitions, significantly degrading the quality of the synthesized view. To mitigate this issue, this study introduces a layered hole-filling method to enhance the quality of virtual views. The effectiveness of our proposed method is ensured through three key techniques: Firstly, a depth-aware decomposition method is employed to precisely segregate foreground objects from the background within a reference view. This is achieved by leveraging both the reference image and its corresponding depth map, facilitating accurate instance-level separation of foreground objects. Secondly, a Generative Adversarial Network (GAN)-enhanced background reconstruction method is proposed to generate hole-free target views devoid of foreground objects. Lastly, the integration of Masked 3D Image Warping (M3DIW) and Layered Mergence (LM) algorithms facilitates filling holes with foreground or background textures in a layered manner. Comprehensive experimental results demonstrate the superiority of our proposed method compared to state-of-the-art methods. Notably, our method demonstrates an improvement of 7.5% in mean average Peak Signal-to-Noise Ratio (PSNR) and 1.8% in mean average Structural Similarity Index Measure (SSIM) compared to existing techniques. Additionally, it impressively lowers mean average Learned Perceptual Image Patch Similarity (LPIPS) by 28.8% and significantly reduces mean average Fréchet Inception Distance (FID) by 28.9% for all sequences tested. These results affirm the effectiveness of our approach in enhancing the quality of virtual view synthesis within DIBR applications. Source code is available at https://github.com/threedteam/dibr.
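
For readers unfamiliar with the PSNR figure quoted above, the following sketch shows the standard definition, PSNR = 10·log10(MAX² / MSE), on flattened pixel lists. The function name and the 8-bit default `max_value=255.0` are assumptions for illustration; this is not taken from the linked repository.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images.

    Both images are given as flat lists of pixel intensities. Identical
    images have zero error, for which PSNR is reported as infinity.
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the synthesized view is closer to the reference, whereas for LPIPS and FID lower values are better.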

Abstract:
Traces of double compression can serve as crucial evidence of image manipulation in forensic investigations. With the ever-increasing popularity of the WebP format, a new type of double compression case, WebP-JPEG transcoding, has emerged. However, distinguishing it from two common compression cases, single JPEG (SJPEG) and double JPEG (DJPEG), has not yet been studied. In this paper, we propose a specialized method for this new task. First, a detailed analysis is conducted to reveal the differences in compression artifacts between WebP-JPEG and SJPEG/DJPEG, which manifest in the distributions of 4×4 / 8×8 DCT coefficients and in the high-frequency portions of the image spectrum. Then, multi-modality DCT histograms (MMDH) and high-pass-filtered image residuals (HPFIR) are proposed as front-end dual-domain forensic features to expose these differences. An indispensable part of these features is extracted through a novel frequency-isolation module (FIM), which offers additional information based on the derived relationship between 4×4 and 8×8 DCT coefficients. Finally, a CNN-ViT (Convolutional Neural Network-Vision Transformer) dual-stream network is designed to learn back-end deep features for reliable detection, where a CNN stream processes statistical features in MMDH while a ViT stream learns spatial correlations in HPFIR. Extensive experimental results demonstrate that the proposed method significantly outperforms state-of-the-art double compression detection methods in distinguishing WebP-JPEG from SJPEG/DJPEG and is more effective in tampering localization. Specifically, the proposed method achieves an average detection accuracy of 0.942 for small images of size 128×128.
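
The block DCT coefficient distributions referred to above can be illustrated with a generic sketch: an orthonormal 2D DCT-II applied to a square block (4×4 or 8×8), followed by a histogram of one coefficient position across blocks. This is plain textbook DCT, not the paper's MMDH or FIM features; the function names and the `bin_width` parameter are assumptions.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def coefficient_histogram(blocks, u, v, bin_width=1.0):
    """Histogram of the (u, v) DCT coefficient over a list of blocks.

    Coefficients are rounded to the nearest multiple of bin_width; the
    result maps the integer bin index to its count.
    """
    histogram = {}
    for block in blocks:
        bin_index = int(round(dct2(block)[u][v] / bin_width))
        histogram[bin_index] = histogram.get(bin_index, 0) + 1
    return histogram
```

Comparing such histograms computed with 4×4 and with 8×8 blocks is one simple way to look for the block-size-dependent artifacts the abstract describes.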

Abstract:
Mobile platforms equipped with a multi-camera system and an inertial measurement unit (IMU), such as self-driving cars, are widely used nowadays. Relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation for multi-camera systems, we propose a globally optimal solver that uses affine correspondences to estimate the generalized relative pose with a known vertical direction. First, after decoupling the rotation matrix and the translation vector, a cost function in the relative rotation angle is established, which minimizes the algebraic error of the geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials in two unknowns based on the characteristic equation and the condition that its first derivative is zero. Finally, the relative rotation angle is solved using a polynomial eigenvalue solver, and the translation vector is obtained from the eigenvector. In addition, a new linear solution is proposed for the case where the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.

Abstract:
When data labels are scarce, contrastive learning is often used to learn representations in a weakly-supervised or unsupervised way. In contrastive learning, not only the learning mechanism, but also the designs of positive and negative sets are critical. While most previous works of Temporal Action Segmentation (TAS) focus on designing new segmentation methods, we investigate the importance of positive and negative set designs in contrastive learning and verify that better representations can be learned to enhance performance of existing TAS methods. Specific to timestamp-supervised TAS and unsupervised TAS, respectively, we propose positive/negative set designs, associated with the ideas of ambiguous frames and the set expansion process to make learned representations more effective. In the evaluation, we demonstrate that performance of timestamp-supervised TAS can be boosted by 8% to 15% in terms of F1@10 across three different datasets, and the performance of unsupervised TAS can be boosted by 3% to 5% in terms of F1 scores, achieving new state-of-the-art TAS results.

Affiliations: National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; School of Rail Transportation, Soochow University, Suzhou, China; Centre for Frontier AI Research (CFAR) and the Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore; Computer Science and Engineering Department, University at Buffalo (The State University of New York), Buffalo, NY, USA

Abstract:
In this work, we make the first effort to address one-shot 3D action recognition in point cloud sequences, without skeleton information. The main contribution is twofold. First, a novel one-shot classification approach that considers the feature distribution of 3D actions is proposed. We find that, for different 3D actions, the dimension-wise feature distributions are generally Gaussian in form, and similar action categories have similar feature distributions. Accordingly, the mean and covariance matrix of the K nearest base classes help to form the pseudo feature distribution of a one-shot novel class. To alleviate potential ambiguity in the nearest neighbor search, we divide the base classes into subsets via C-means clustering to facilitate the similarity measure to the novel class. Meanwhile, the feature distributions of each base class's whole set and subsets are jointly considered when generating the novel class's pseudo feature distribution. Multi-dimensional Gaussian sampling is conducted on the acquired pseudo feature distribution for feature-level data augmentation, so that the one-shot novel class “never walks alone” during classifier training. Second, to better characterize fine-grained 3D actions, a temporal attention method is proposed, which introduces a vision Transformer (ViT) to capture an action's discriminative short-term motion patterns from densely sampled short-term 3DV (3D dynamic voxel) features along the temporal dimension. Experiments on NTU RGB+D 120 and 60 verify the superiority of our approach. It outperforms state-of-the-art skeleton-based methods by up to 13.9%. The source code is available at https://github.com/Tong-XY/YNWA.
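
The idea of borrowing statistics from the K nearest base classes and then sampling extra features can be sketched as follows. This is a simplified illustration, not the released YNWA code: it uses per-dimension variances instead of a full covariance matrix, omits the C-means subsets, and the way the novel feature is averaged with the base-class means is an assumption.

```python
import random


def pseudo_distribution(novel_feature, base_means, base_vars, k):
    """Estimate a per-dimension Gaussian for a one-shot novel class.

    The k base classes whose means are closest to the novel feature are
    selected; the pseudo mean averages the novel feature with their means,
    and the pseudo variance averages their variances (assumed rule).
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(
        range(len(base_means)),
        key=lambda i: squared_distance(novel_feature, base_means[i]),
    )[:k]
    dim = len(novel_feature)
    mean = [
        (novel_feature[d] + sum(base_means[i][d] for i in nearest))
        / (len(nearest) + 1)
        for d in range(dim)
    ]
    var = [sum(base_vars[i][d] for i in nearest) / len(nearest)
           for d in range(dim)]
    return mean, var


def sample_features(mean, var, count, seed=0):
    """Draw `count` synthetic feature vectors from the pseudo distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
            for _ in range(count)]
```

The sampled vectors would be added to the single real feature of the novel class before the classifier is trained.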

Abstract:
In this paper, we address the unexplored question of temporal sentence localization in human motions (TSLM), aiming to locate a target moment from a 3D human motion that semantically corresponds to a text query. Because 3D human motions are captured using specialized motion capture devices, motions with only a few joints lack complex scene information such as objects and lighting. Because of this characteristic, motion data has low contextual richness and semantic ambiguity between frames, which limits the predictions made by current video localization frameworks extended to TSLM to only a rough level of accuracy. To refine this, we devise two novel label-prior-assisted training schemes: one embeds prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. In our constructed TSLM benchmark, our model, termed MLP, achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.
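
The recall at IoU@0.7 reported above rests on the temporal intersection-over-union between a predicted and a ground-truth moment. A minimal sketch of that metric, assuming one top-ranked prediction per query and moments given as `(start, end)` pairs, is shown here; the function names are illustrative and not taken from the linked repository.

```python
def temporal_iou(predicted, ground_truth):
    """Intersection over union of two (start, end) time intervals."""
    intersection = max(
        0.0,
        min(predicted[1], ground_truth[1]) - max(predicted[0], ground_truth[0]),
    )
    union = ((predicted[1] - predicted[0])
             + (ground_truth[1] - ground_truth[0])
             - intersection)
    return intersection / union if union > 0 else 0.0


def recall_at_iou(predictions, ground_truths, threshold):
    """Percentage of queries whose prediction reaches the IoU threshold."""
    hits = sum(
        1 for p, g in zip(predictions, ground_truths)
        if temporal_iou(p, g) >= threshold
    )
    return 100.0 * hits / len(ground_truths)
```

A high threshold such as 0.7 only rewards predictions whose boundaries are close to the annotated ones, which is why rough localization scores poorly there.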

Abstract:
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer-GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, to take advantage of both the parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. This hybrid architecture allows for better performance with fewer parameters and computations. Thirdly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Extensive experiments on two large-scale first-person view datasets and two third-person datasets validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin. The code will be released after acceptance at https://github.com/sunze992/VS-TransGRU.

Abstract:
The visual commonsense reasoning (VCR) task is to choose an answer and provide a justifying rationale based on a given image and textual question. Representative works first recognize objects in images and then associate them with key words in texts. However, existing approaches do not consider the exact positions of objects in a human-like three-dimensional (3D) manner, making them unable to accurately distinguish objects and understand visual relations. Recently, multi-modal large language models (MLLMs) have been used as powerful tools for several multi-modal tasks but not yet for VCR, which requires elaborate reasoning on specific visual objects referred to by texts. In light of the above, an MLLM enhanced pseudo 3D perception framework is designed for VCR. Specifically, we first demonstrate that the relation between objects is relevant to object depths in images, and hence introduce object depth into VCR frameworks to infer 3D positions of objects in images. Then, a depth-aware Transformer is proposed to encode depth differences between objects into the attention mechanism of the Transformer, so as to discriminatively associate objects with visual scenes guided by depth. To further associate the answer with the depth of the visual scene, each word in the answer is tagged with a pseudo depth to realize depth-aware association between answer words and objects. In addition, BLIP-2 is employed as an MLLM to process images and texts, and the referring expressions in texts involving specific visual objects are modified with linguistic object labels to serve as comprehensible MLLM inputs. Finally, a parameter optimization technique is devised to fully consider the quality of data batches based on multi-level reasoning confidence. Experiments on the VCR dataset demonstrate the superiority of the proposed framework over state-of-the-art approaches. The source code of this work can be found at https://mic.tongji.edu.cn.

Abstract:
Predicting the saliency map of a 360-degree video is key to various downstream tasks, such as saliency-based compression and tile-based adaptive streaming. Besides static salient objects, moving targets also contribute to the saliency map. Therefore, the joint exploitation of spherical spatio-temporal information is necessary for accurate saliency prediction. Spherical spatial feature extraction, however, is hindered by the non-Euclidean geometric nature of spherical data, which makes it difficult to extract spatial features directly with traditional convolutional neural networks (CNNs). Meanwhile, the efficient exploitation of temporal correlation between these spherical spatial features remains another challenge, which requires the extraction of spherical optical flows for explicit motion information. To address these challenges, we first propose a spherical graph-based Farneback algorithm that extracts the spherical optical flows directly in the sphere domain by leveraging the GICOPix uniform sampling scheme. We then design a 3D separable graph convolutional network-based saliency prediction framework, named 360Spred, which takes both the spherical frames and spherical optical flows as input. The proposed 360Spred framework is based on the U-Net structure, with a 3D separable graph convolution (3DSGC) operator that directly extracts the visual and motion features in the sphere domain and exploits the temporal correlation of both the high-level and low-level spatial features. Experimental results on two public datasets show that 360Spred achieves better saliency prediction accuracy for 360-degree videos than other baseline models.

Abstract:
In the field of object detection, detecting small objects is an important and challenging task. However, most existing methods tend to focus on designing complex network structures, lack attention to global representation, and ignore redundant noise and dense distribution of small objects in complex networks. To address the above problems, this paper proposes a small object detection method based on global multi-level perception and dynamic region aggregation. The method achieves accurate detection by dynamically aggregating effective features within a region while fully perceiving the features. This method mainly consists of two modules: global multi-level perception module and dynamic region aggregation module. In the global multi-level perception module, self-attention is used to perceive the global region, and its linear transformation is mapped through a convolutional network to increase the local details of global perception, thereby obtaining more refined global information. The dynamic region aggregation module, devised with a sparse strategy in mind, selectively interacts with relevant features. This design allows aggregation of key features of individual instances, effectively mitigating noise interference. Consequently, this approach addresses the challenges associated with densely distributed targets and enhances the model’s ability to discriminate on a fine-grained level. This proposed method was evaluated on two popular datasets. Experimental results show that this method outperforms state-of-the-art methods in small object detection tasks, demonstrating good performance and potential applications.

Abstract:
Removing reflection from a single image is an ill-posed problem, while exploiting physics priors can ease this inverse problem. In this paper, we integrate a physics prior of reflection-free images derived from flash illumination into deep learning. The algorithm first estimates an approximation of the transmission scene (i.e., flash-only image) from a pair of images captured with and without flash illumination. We design two collaborative neural networks to make recurrent recovery of the transmission and reflection scenes at increasing resolutions under the guidance of the physics prior. The neural networks learn the cues for scene separation and reconstruction by embedding multi-scale feature extraction components into a nested topology. We also propose a focal perceptual loss for penalizing the artifacts in output images, where the perceptual distances computed in different feature spaces are weighted adaptively to emphasize the hard-to-restore visual attributes. The comparative experiments demonstrate that the proposed algorithm performs better than the state-of-the-art methods in real-world scenarios, and it outperforms the currently best-performing algorithm by 1.5 dB in PSNR. To understand the mechanism behind the performance enhancement brought by the physics prior, we use the attribution-based model interpretation approach to quantify the pixel-to-pixel influence of the flash-only image on the results of reflection removal. The results of model interpretation reveal that the physics prior plays a significant role in dealing with non-uniform and strong reflections.
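
One common way to approximate a flash-only image from a flash/no-flash pair is a per-pixel subtraction, sketched below. This is an assumption made for illustration: the abstract does not state how the approximation is computed, and the sketch presumes aligned images with linear intensities given as lists of rows. The function name `flash_only` is hypothetical.

```python
def flash_only(flash_image, ambient_image):
    """Approximate the flash-only image as flash minus ambient, per pixel.

    Both inputs are 2D lists of linear intensities with identical shape.
    Negative differences (noise or misalignment) are clamped to zero.
    """
    return [
        [max(f - a, 0.0) for f, a in zip(flash_row, ambient_row)]
        for flash_row, ambient_row in zip(flash_image, ambient_image)
    ]
```

Under this assumption, reflections lit only by ambient light appear in both captures and largely cancel, leaving mostly the flash-lit transmission scene.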

Abstract:
Lifelong person re-identification (LReID) is developed for dynamic domains where the domain distribution is constantly changing due to climate changes, scene changes, etc., and the data can only be collected for a specific scenario over a period of time. With the development of ReID, the issue of clothing changes has also attracted attention. Clothing change itself should be addressed from the perspective of lifelong learning, because pedestrians may wear new clothes and the time span of their appearance can be long, which can also cause domain changes. Meanwhile, it is difficult to know in advance whether a pedestrian is cloth-changing or cloth-consistent. However, current LReID tasks overlook these issues. To overcome these limitations, we introduce a more practical LReID task, denoted as L4C-ReID (Lifelong Person Re-Identification in Cloth-Changing and Cloth-Consistent Scenarios). This novel task enables ReID models to adapt to incrementally encountered cloth-changing and cloth-consistent domains without prior knowledge of the scenario type and to generalize to unseen domains. A key challenge to be resolved in LReID is the stability-plasticity dilemma. Unlike current LReID methods, which implement plasticity and stability through two contradictory loss terms and achieve only a sub-optimal balance, we propose an effective scheme termed Unified Stability and Plasticity (USP) that unifies these seemingly disparate concepts to achieve both harmoniously. Taking inspiration from the cognitive processes in the human brain, we decompose them into two independent processes: knowledge representation and knowledge operation. We then design a Knowledge Representation and Operation (KRO) framework that represents and operates on knowledge like the human brain, which can better learn new knowledge and consolidate old knowledge to coordinate plasticity and stability. Additionally, we introduce Plasticizing with Stability (PWS) to generalize and optimize the learned knowledge, which integrates the implementation of plasticity and stability into one common objective term to achieve both simultaneously. To simulate the L4C-ReID setup, we gather existing cloth-changing and cloth-consistent datasets to provide a new benchmark. Extensive experiments on this new benchmark and on benchmarks established for the previous LReID setup demonstrate the superiority of our method.

Abstract:
Adversarial example-based steganographic methods that utilize the gradients of target steganalyzer to update symmetric costs are emerging. The existing adversarial adjustment strategies for costs still have limited improvements in steganographic security. The existing gradient selection scheme, which sets a fixed gradient selection ratio for all images, is not delicate enough. To address the above problems, this paper proposes an iterative two-stage probability adjustment strategy with a progressive incremental searching mechanism (ITPA-PIS) to further improve the security of updated asymmetric distortions. Unlike previous works that adopted the cost as the adjustment object, we explore a new adjustment object, i.e., probability, and then design an iterative two-stage probability adjustment strategy (ITPA) to obtain a more secure asymmetric distortion, thereby improving the anti-detection performance of the traditional symmetric distortion algorithms against deep learning-based steganalyzers. In addition, we specifically design a progressive incremental searching mechanism (PIS) to select partially efficient gradients to guide the probability adjustment. Unlike existing gradient selection schemes that manually set a fixed selection ratio, PIS adopts a progressive searching method to dynamically determine the gradient selection ratio suitable for each image, thereby enhancing the overall performance of the proposed ITPA again. The experimental results show that our proposed ITPA-PIS achieves outstanding security performance on the CNN-based steganalysis models XuNet, YedroujNet, SRNet, and EfficientNet and hand-crafted feature-based steganalysis models SRM and MaxSRMd2 under the adversary unawareness and adversary awareness scenarios.

Abstract:
Learning-based image compression has achieved impressive rate-distortion performance in recent years. However, due to the one-off learning strategy and rigid network architecture, existing methods perform poorly when compressing images of different domains (such as natural images, oil paintings, and medical images) as these domains emerge with expanding real-world applications. To cope with this open-world challenge, this paper proposes a continual cross-domain image compression method based on entropy prior guided knowledge distillation and a scalable decoding network, which performs well in balancing plasticity, stability, and compatibility. Firstly, we generate pseudo-samples of old domains by reusing their entropy priors. These pseudo-samples serve as guides for knowledge distillation in the old domains, ensuring that the bit rate and reconstruction of the new model align with those of the old model. This approach assists the updated model in retaining its capability to compress and reconstruct old images. Secondly, we develop a scalable decoding network via dynamic pruning and masked recovery, which can effectively infer an old entropy decoder from the most recently updated model. It ensures that the updated model can decode image features from binary strings encoded by old entropy encoders. Experiments on five image datasets from different domains demonstrate the effectiveness of the proposed method and its superiority over representative continual learning methods. Code of the proposed method is available at https://github.com/wuchenhaoo/Continual_Cross-domain_Image_Compression/.

Abstract:
Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition. However, previous GCN-based methods rely on elaborate human priors excessively and construct complex feature aggregation mechanisms, which limits the generalizability and effectiveness of networks. To solve these problems, we propose a novel Spatial Topology Gating Unit (STGU), an MLP-based variant without extra priors, to capture the co-occurrence topology features that encode the spatial dependency across all joints. In STGU, to learn the point-wise topology features, a new gate-based feature interaction mechanism is introduced to activate the features point-to-point by the attention map generated from the input sample. Based on the STGU, we propose the first MLP-based model, SiT-MLP, for skeleton-based action recognition in this work. Compared with previous methods on three large-scale datasets, SiT-MLP achieves competitive performance. In addition, SiT-MLP reduces the parameters significantly with favorable results. The code will be available at https://github.com/BUPTSJZhang/SiT-MLP.

Abstract:
Blurring and noise degrade the performance of image processing. To mitigate this effect, various regularization-based deblurring methods have been proposed. Total variation regularization is widely used owing to its excellent ability in preserving the salient edges, but it also tends to smooth the image details. In this paper, we propose a local extremum-constrained total variation (LECTV) framework for image deblurring. In the developed deblurring framework, we integrate prior knowledge of the dark channel with the structural features of the image into a single regularization term. Furthermore, unlike most existing methods that focus on the overall sparsity of the dark channel, the defined regularization term allows for a pixel-wise adaptive description of the image to restore its inherent spatial texture structure. Finally, a majorization-minimization-based method is designed to solve the developed LECTV framework. Experimental results on natural and hyperspectral images show that the designed framework exhibits excellent performance in removing multiple types and degrees of blurring. Extensive evaluations also further show its superiority compared to other advanced methods.

Abstract:
The Just Noticeable Difference (JND) refers to the smallest distortion in an image or video that can be perceived by Human Visual System (HVS), and is widely used in optimizing image/video compression. However, accurate JND modeling is very challenging due to its content dependence, and the complex nature of the HVS. Recent solutions train deep learning based JND prediction models, mainly based on a Quantization Parameter (QP) value, representing a single JND level, and train separate models to predict each JND level. We point out that a single QP-distance is insufficient to properly train a network with millions of parameters, for a complex content-dependent task. Inspired by recent advances in learned compression and multitask learning, we propose to address this problem by 1) learning to reconstruct the JND-quality frames, jointly with the QP prediction; and 2) jointly learning several JND levels to augment the learning performance. We propose a novel solution where first, an effective feature backbone is trained by learning to reconstruct JND-quality frames from the raw frames. Second, JND prediction models are trained based on features extracted from latent space (i.e., compressed domain), or reconstructed JND-quality frames. Third, a multi-JND model is designed, which jointly learns three JND levels, further reducing the prediction error. Extensive experimental results demonstrate that our multi-JND method outperforms the state-of-the-art and achieves an average JND1 prediction error of only 1.57 in QP, and 0.72 dB in PSNR. Moreover, the multitask learning approach, and compressed domain prediction facilitate light-weight inference by significantly reducing the complexity and the number of parameters.

Abstract:
Currently, the success of image processing relies heavily on large well-annotated datasets. However, collecting and labeling video data are significantly more labor-intensive, posing major challenges for training video algorithms and limiting their practical applications. While label-efficient techniques for image data have advanced, solutions for video data are still emerging. Unlabeled video data, with their inherent structured nature, offer valuable assets for label-efficient learning. Unlike image data, video data naturally capture realistic transformations, providing rich samples for learning. Moreover, from a broader perspective, video tasks hold great potential for applications like autonomous driving and video surveillance but present unique challenges due to the need to understand both spatial and temporal aspects. Leveraging label-efficient learning is essential for comprehensively understanding visual content and enabling a wide range of real-world video applications. This Special Issue on “Label-Efficient Learning for Video Data” seeks to advance research in this area, offering new insights and solutions to benefit both researchers and practitioners.

Abstract:
Contrastive learning has been widely embraced for its notable success in skeleton action recognition, together with two kinds of augmentation methods (normal and strong augmentations). Existing methods gain performance largely by customizing normal augmentations while bypassing strong augmentations that are rich in motion patterns. To fill this gap, we propose a novel framework, called CStrCRL, that acquires view-invariant and discriminative features from strong augmentations by leveraging contrastive learning. Specifically, to prevent the fragility of skeleton data from adversely affecting the model after strong augmentations are applied, we use consistency learning to maximize the similarity between strongly and normally augmented views. Furthermore, we employ cross-view learning on strong and normal augmentations to eliminate the uncertain feature boundaries learned by the model. Moreover, we design a new backbone, termed GatedStrNet, for discriminating between valid and invalid features contained in strongly augmented views. Finally, extensive experiments on NTU 60/120 and PKUMMD II demonstrate that the proposed method bridges the performance gap between normal and strong augmentations in contrastive learning for skeleton recognition. Notably, with a single-stream input, CStrCRL achieves accuracies of 78.93% and 84.04% on the NTU60 Xsub and Xview datasets. Our source code can be found at: https://github.com/RHu-main/CStrCRL.

Abstract:
Point cloud registration is a critical research area in computer vision with extensive applications. Recent studies have unveiled the significant potential of graph neural networks (GNNs) for point cloud registration. One key approach is to leverage the smoothness of graph convolutions to extract similarity information between points. However, as the number of convolution layers increases, the features of different points tend to become uniform and their distinctiveness is neglected, which works against point cloud registration. To this end, this paper presents a new GNN framework with 3D graph smoothing-sharpening convolution (GNN-GSSC) for point cloud registration. It includes two new convolutional strategies: graph smoothing convolution (SmoothGConv) and graph sharpening convolution (SharpGConv). The former utilizes Laplacian smoothing to aggregate similar information from neighbouring nodes, whereas the latter encourages each node to move away from its neighbours to obtain more discriminative information. Specifically, we calculate the difference information between the central node and neighbouring nodes to supplement the node feature information while aggregating the similarity information of the nodes. In addition, we devise a Transformer-based overlapping point scoring module, which scores each point to strengthen the emphasis on overlapping areas while weakening the focus on non-overlapping areas. Experiments show that the proposed method outperforms other existing methods. More importantly, SharpGConv is a plug-and-play graph convolution module that is particularly advantageous for extracting distinctive information in point cloud registration.
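
The contrast between smoothing and sharpening described in this abstract can be illustrated with a minimal sketch. It is not the paper's SmoothGConv or SharpGConv: the function names, the step size `alpha`, and the toy graph are assumptions made here for illustration. A smoothing step moves each node feature toward the mean of its neighbours, and a sharpening step moves it away from that mean.

```python
def neighbor_mean(features, neighbors, i):
    """Average the feature vectors of node i's neighbours."""
    dim = len(features[i])
    nbrs = neighbors[i]
    return [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]


def smooth_step(features, neighbors, alpha=0.5):
    """Laplacian-style smoothing: pull each node toward its neighbour mean."""
    out = []
    for i, x in enumerate(features):
        m = neighbor_mean(features, neighbors, i)
        out.append([xi + alpha * (mi - xi) for xi, mi in zip(x, m)])
    return out


def sharpen_step(features, neighbors, alpha=0.5):
    """Sharpening (hypothetical form): push each node away from its neighbour mean."""
    out = []
    for i, x in enumerate(features):
        m = neighbor_mean(features, neighbors, i)
        out.append([xi - alpha * (mi - xi) for xi, mi in zip(x, m)])
    return out


# Toy path graph 0 - 1 - 2 with one-dimensional node features.
features = [[0.0], [1.0], [4.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

smoothed = smooth_step(features, neighbors)    # features become more alike
sharpened = sharpen_step(features, neighbors)  # features become more distinct
```

On this toy graph the spread between the largest and smallest feature shrinks after smoothing and grows after sharpening, which is the over-smoothing versus distinctiveness trade-off the abstract refers to.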

Abstract:
Entropy estimation is essential for the performance of learned image compression. It has been demonstrated that a transformer-based entropy model is of critical importance for achieving a high compression ratio, however, at the expense of a significant computational effort. In this work, we introduce the Efficient Contextformer (eContextformer), a computationally efficient transformer-based autoregressive context model for learned image compression. The eContextformer efficiently fuses the patch-wise, checkered, and channel-wise grouping techniques for parallel context modeling, and introduces a shifted window spatio-channel attention mechanism. We explore better training strategies and architectural designs and introduce additional complexity optimizations. During decoding, the proposed optimization techniques dynamically scale the attention span and cache the previous attention computations, drastically reducing the model and runtime complexity. Compared to the non-parallel approach, our proposal has approximately 145× lower model complexity and approximately 210× faster decoding speed, and achieves higher average bit savings on Kodak, CLIC2020, and Tecnick datasets. Additionally, the low complexity of our context model enables online rate-distortion algorithms, which further improve the compression performance. We achieve up to 17% bitrate savings over the intra coding of Versatile Video Coding (VVC) Test Model (VTM) 16.2 and surpass various learning-based compression models.

Abstract:
Image and point cloud registration (2D-3D registration) is an essential prerequisite for multi-modal feature fusion. However, due to the significant feature difference between point clouds and images, it is challenging to establish 2D-3D correspondences. Targeting autonomous driving scenarios, we propose a 2D-3D registration method with object-level correspondence (OL-Reg) in this paper. Object-level correspondence consists of the object bounding box and the object contour in the 2D image and 3D space. The first step is to match 2D-3D objects. Due to differences in sensor pose and field of view (FoV), object shape and occlusion differ between the image and the point cloud, which makes object matching difficult. To solve this issue, we represent each object as a 3D bounding box and design 2D-3D object matching with a 3D box projection (Box-Proj) constraint, which aligns the 3D bounding boxes of objects in the image and the point cloud. The next step is to build 2D-3D correspondences from the matched objects. To extract correspondences from objects with irregular shapes, we exploit the distance constraint between the object surface and the rays back-projected from the object contour, and present projection-based iterative closest point (Proj-ICP). To stabilize Proj-ICP, an object-level regularization term is designed. Experiments are conducted on the KITTI object and odometry datasets. With a pre-trained 3D object detector, the results suggest that OL-Reg performs better than current approaches in the tasks of re-localization and extrinsic calibration. Source code will be released at https://github.com/anpei96/ol-reg-demo.

Abstract:
Generalized zero-shot learning (GZSL) is a challenging topic in both computer vision and machine learning. Recently, generative models (e.g., GAN and VAE) have attracted much attention for handling the GZSL task, however, they are sometimes prone to either model collapse or ambiguous distribution modeling. Inspired by the feature generation ability of denoising diffusion models in other visual tasks, we propose an Adaptive Conditional Denoising Diffusion Model to synthesize unseen-class visual features for GZSL on condition of a set of semantic features in this paper, called AC-DDM. Unlike traditional denoising diffusion models whose reverse process has both a fixed time interval and a fixed number of total denoising time steps, the proposed AC-DDM has a learnable distribution-constrained predictor which could adaptively learn the time interval and the number of total denoising time steps for each unseen class, so that it could synthesize more discriminative features for sample classification. In order to improve the discrimination ability of the synthesized visual features further, we also explore a hybrid affinity regularizer under the proposed AC-DDM, which forces the differences among the affinity matrices of the real and synthesized visual features to be small. Extensive experimental results on four public benchmark datasets demonstrate the superiority of the proposed model over 20 state-of-the-art models in both the ZSL and GZSL tasks.

Abstract:
In recent years, the dance entertainment industry has experienced significant growth, driven by the desire of consumers to learn and improve their dancing skills. To effectively improve their skills, dancers require evaluation and feedback, which traditionally relies heavily on professional dancers. To address this challenge, researchers have proposed objective assessment methods for dance performance via kinematic data captured by sensors. However, these existing methods primarily focus on assessing the rhythmic accuracy of movements synchronized to music. In this paper, we propose a Dance Quality Assessment (DanceQA) framework to evaluate dance performance, considering choreographic factors that are important criteria in subjective DanceQA. We find that kinematic diversity and rhythmic alignment are significant choreographic factors from the perspective of human perception. Based on these factors, we design two metrics: kinematic information entropy (KIE) and kinematic-music beat similarity (BSIM). Our study demonstrates that these metrics are closely related to specific body parts in each choreography. To validate the effectiveness of our metrics, we capture dance performances using an OptiTrack system, which provides precise three-dimensional data at a very high sampling rate. We then label their dance quality via a subjective test. The metrics correlate strongly with subjective opinion, but it is difficult to tell which body part is the most correlated. To comprehensively understand dance quality, we propose choreographic quality transformers (CQTs), which learn the aforementioned choreographic factors by embedding KIE and BSIM into attention matrices. In numerous experiments, the CQTs outperform previous methods, namely graph convolutional networks and multimodal transformers, by up to 0.146 in correlation coefficient.

Abstract:
With the booming development of smart devices, mobile videos have drawn broad interest as people browse social media. Unlike traditional long-form videos, mobile videos are characterized by human attention behavior that remains poorly understood owing to their specific display mode, which motivates research on saliency prediction for mobile videos. Unfortunately, current eye-tracking experiments are not applicable to mobile videos, since the stationary eye-tracker and eye fixation acquisition are dedicated to videos presented on computers. To tackle this issue, we propose using a wearable eye-tracker to record viewers’ egocentric fixations and then devising a fixation mapping technique to project the eye fixations from egocentric videos onto mobile videos. Using this technique, a large-scale mobile video saliency (MVS) dataset is established, including 1,007 mobile videos and 5,935,927 fixations. Given this dataset, we exhaustively analyze the characteristics of subjects’ fixations and obtain two findings. Based on the MVS dataset and these findings, we propose a saliency prediction approach for mobile videos built upon the Video Swin Transformer (MVFormer), wherein long-range spatio-temporal dependency is captured to model the human attention mechanism on mobile videos. In MVFormer, we develop a selective feature fusion module to balance multi-scale features, and a progressive saliency prediction module to generate saliency maps via progressive aggregation of multi-scale features. Extensive experiments show that our MVFormer approach significantly outperforms other state-of-the-art saliency prediction approaches. Finally, we demonstrate the potential application of our MVFormer approach in the H.265 video coding standard by embedding it into the rate control scheme, such that the perceptual quality of compressed mobile videos can be significantly improved. The dataset and code are available at https://github.com/wenshijie110/MVFormer.

Abstract:
Conditional coding is a new video coding paradigm enabled by neural-network-based compression. It can be shown that conditional coding is in theory better than the traditional residual coding, which is widely used in video compression standards like HEVC or VVC. However, on closer inspection, it becomes clear that conditional coders can suffer from information bottlenecks in the prediction path, i.e., that due to the data processing inequality not all information from the prediction signal can be passed to the reconstructed signal, thereby impairing the coder performance. In this paper we propose the conditional residual coding concept, which we derive from information theoretical properties of the conditional coder. This coder significantly reduces the influence of bottlenecks, while maintaining the theoretical performance of the conditional coder. We provide a theoretical analysis of the coding paradigm and demonstrate the performance of the conditional residual coder in a practical example. We show that conditional residual coders alleviate the disadvantages of conditional coders while being able to maintain their advantages over residual coders. In the spectrum of residual and conditional coding, we can therefore consider them as “the best from both worlds.”
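
The claim that conditional coding is in theory at least as good as residual coding follows from a standard entropy argument, sketched below. The notation is an assumption introduced here rather than taken from the paper: $x$ denotes the (discrete) signal to be coded and $\tilde{x}$ the prediction signal available at both encoder and decoder.

```latex
\begin{align}
H(x - \tilde{x})
  &\ge H(x - \tilde{x} \mid \tilde{x})
     && \text{(conditioning cannot increase entropy)} \\
  &=   H(x \mid \tilde{x})
     && \text{(given } \tilde{x}\text{, the map } x \mapsto x - \tilde{x} \text{ is invertible)}
\end{align}
```

The left-hand side is the ideal rate of a residual coder that transmits $x - \tilde{x}$ without further reference to $\tilde{x}$, and the right-hand side is the ideal rate of a conditional coder, so the conditional coder never needs more bits under these idealized assumptions.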

Abstract:
With the boom in streaming media platforms, viewers have become used to watching dramas and movies via online platforms with more intelligent services. Character relationships often evolve dynamically as stories progress in long videos. Therefore, automatic tools to capture the evolution of social relations among characters are urgently required to enrich the viewing experience. However, most existing works mainly focus on shorter isolated video clips. As the plot develops, they may fail to effectively summarize relationships as holistic semantic representations for the whole video. To deal with these challenges, in this paper, we propose a novel Dynamic-Evolutionary Graph Attention Network (DE-GAT) framework to generate the evolving social relation graph among characters and capture the evolutionary trajectory of the characters’ relations throughout the entire video. DE-GAT first integrates the multimodal cues, including visual and textual information in each video clip, via the graph attention network (GAT). By expanding the temporal receptive field from clip level to scenario level, the factors most relevant to the evolution of social relationships can be explored. Eventually, all the scenario-level social graphs are merged to obtain the evolving global social graph for the entire movie. Extensive evaluations on the real-world MovieGraphs dataset have validated the positive impact of temporal receptive field expansion and multimodal cues on capturing evolving social relations.

Abstract:
Visual data coding is an enabling technology for various applications and is now ubiquitously adopted in modern image processing, communications, and computer vision systems. To enable interoperability between devices manufactured and services provided by different enterprises, a series of standards targeting visual data coding have been crafted in the past three decades. Several standardization organizations, such as ISO/IEC JTC 1/SC 29, consisting of the Joint Photographic Experts Group (JPEG) and the Moving Picture Experts Group (MPEG), the ITU-T SG 16 Video Coding Experts Group (VCEG), the IEEE Data Compression Standards Committee Audio Video Coding Working Group (1857 WG), and the MPAI Community, have been creating these standards from many contributions of academia and industry. While most of these visual coding standards have been successfully deployed in many applications, there are more challenges nowadays, especially to accommodate the large volume of visual data in limited storage and limited bandwidth transmission links. Compression efficiency improvements are still needed, especially considering emerging data representation formats ranging from 8K/HDR image/video to rich plenoptic data.

Abstract:
Parallel branches with independently optimized classification and localization capabilities are widely used in single-stage object detection. Defects such as feature conflicts, a low level of information interaction, and empirical sample allocation schemes lead to weak spatial consistency of the outputs from different branches. In this work, we propose Progressive Decoupled Task Alignment (PDTA), which enhances the information interaction between tasks while reducing the degree of feature coupling, and adopts a strategy based on sample screening and learning to achieve task alignment. First, we design the Discrepant Feature Decoupling Module (DFDM) embedded with the novel Oriented Decoupling Convolution (ODC) for the coupled features of the shared input, and the features extracted by ODC are utilized for disentanglement through the feed-in scheme with differences. Second, the Probabilistic Mapping Interaction Head (PMI-Head) utilizes the probabilistic mapping method to enhance task-specific semantics by information interaction. Finally, the network’s common attention to the content and position of the target is enhanced through the metric in the proposed Relevance-Guided Adaptive Task Alignment (RATA), in which an exponentially decaying schedule is used to preserve the training samples that are more efficient for both tasks. During training, task-aligned learning is performed by Relevance-Guided Loss. Experiments on the MS COCO and DIOR datasets demonstrate the effectiveness of our method, with PDTA achieving better performance for object detection.

Abstract:
Recently, many effective methods have emerged to address the robustness problem of Deep Neural Networks (DNNs) trained with noisy labels. However, existing work on learning with noisy labels (LNL) mainly focuses on balanced datasets, while real-world scenarios usually also exhibit a long-tailed distribution (LTD). In this paper, we propose an online category-aware approach to mitigate the impact of noisy labels and LTD on the robustness of DNNs. First, the category frequency of clean samples used to rebalance the feature space cannot be obtained directly in the presence of noisy samples. We design a novel category-aware Online Joint Distribution to dynamically estimate the category frequency of clean samples. Second, previous LNL methods were category-agnostic. These methods would easily be confused with noisy samples and tail categories’ samples under LTD. Based on this observation, we propose a Harmonizing Factor strategy to exploit more information from the category-aware online joint distribution. This strategy provides more accurate estimates of clean samples between noisy samples and samples with tail categories. Finally, we propose Dynamic Cost-sensitive Learning, which utilizes the loss and category frequency of the estimated clean samples to address both LNL and LTD. Compared to extensive state-of-the-art methods, our strategy consistently improves the generalization performance of DNNs on several synthetic datasets and two real-world datasets.

Abstract:
Weakly Supervised Video Salient Object Detection (WSVSOD) only requires coarse-grained manual annotations, which can achieve a good trade-off between labeling efficiency and detection performance. In this paper, a Multiple Pseudo Label Aggregation Network (MPLA-Net) is proposed for WSVSOD. Firstly, the video frames from which high-quality pseudo labels can be obtained are selected to generate multiple pseudo labels, so as to avoid the bias of a single label. Moreover, the pseudo label with fine edge information is used to generate the Edge Information Map (EIM). Secondly, MPLA-Net is designed to adequately excavate and utilize the comprehensive saliency cues in multiple pseudo labels to improve the detection accuracy, in which ResNet-50 is adopted as the backbone network. Edge loss, pseudo label loss, self-supervised loss and fusion loss are exploited to jointly supervise and optimize the network training to obtain a robust detection model. Experimental results on five benchmark datasets demonstrate that, compared with existing weakly supervised methods, the proposed method can achieve state-of-the-art detection accuracy with fewer model parameters and higher detection speed, and that the detected salient objects have fine boundaries.

Abstract:
In this letter, we propose, for the first time, a binarization embedded weakly-supervised video anomaly detection (BE-WSVAD) method by constructing a binarized GCN-based anomaly detection module. Compared to the existing weakly-supervised video anomaly detection (WS-VAD) methods, BE-WSVAD focuses on the detection efficiency, which is ignored by the existing literature yet vital in real applications. Specifically, to improve the detection performance of the binary anomaly detection module, we propose a binary network augmentation strategy in the training process. Due to the weak supervision mechanism, the videos employed in the training process are usually lengthy, and dependencies on lengthy inputs tend to be exploited to improve the detection performance at the cost of extra memory consumption. We therefore propose short-input inference modes, which can largely reduce the required length of the input video. Experimental results demonstrate the superiority of our BE-WSVAD in terms of memory and computational consumption while giving comparable accuracies.

Abstract:
Nowadays the application of AR is expanding from small or medium environments to large-scale environments, where visual-based localization becomes a critical demand. Current visual-based localization techniques face robustness challenges in complex large-scale environments, requiring a tremendous amount of data with groundtruth localization for algorithm benchmarking or model training. Previous groundtruth solutions can only be used outdoors or require high equipment/labor costs, so they can neither scale to large environments both indoors and outdoors nor produce large amounts of data at a feasible cost. In this work, we propose LSFB, a novel low-cost and scalable framework to build localization benchmarks in large-scale indoor and outdoor environments. The key is to reconstruct an accurate HD map of the environment. For each visual-inertial sequence captured in the environment, the groundtruth poses are obtained by joint optimization using both the HD map and visual-inertial constraints. The experiments demonstrate that the obtained groundtruth poses have cm-level accuracy. We use the proposed method to collect a localization dataset with mobile phones and AR glasses in various environments with various motions, and release the dataset as the first large-scale localization benchmark for AR.

Abstract:
Cross-modality visible-infrared person re-identification (cm-ReID) is extremely challenging due to the huge modality discrepancy between RGB and IR modalities. Existing methods focus on the sample features themselves, trying to learn modality-invariant features and perform alignment to reduce the modality discrepancy at the dataset level, while the negative impact of specific features and the identity optimization are not specifically addressed. Moreover, most methods that only extract modality-invariant appearance features cannot acquire enough discriminative matching information for identifying different persons, since the information in invariant features is limited compared with the original features. Accordingly, in this paper, we propose an Enhanced Invariant Feature Joint Learning Framework (EIFJLF) for cm-ReID to handle the above problems. First, we propose a specific feature confusion baseline with a novel channel-blended transformation, which confuses the visible color and infrared spectrum to alleviate the influence of specific features, so that the model pays more attention to other discriminative invariant features. Second, we present an adaptive heterogeneous center loss for better identity optimization. The adaptive margin of the loss keeps samples from getting too close to the center, which avoids premature loss of effectiveness and overfitting while further boosting performance. Finally, we design a novel similarity feature refinement module to utilize intra-modality relations and achieve invariant information compensation. Intra-modality relations are valuable built-in invariant features; we model these relations as affinities based on the similarity between samples and then update the original features to achieve information compensation. EIFJLF enables more informative invariant feature learning and more stable alignment. For cm-ReID, our work is a brand new attempt. Extensive experimental results on two standard benchmarks have demonstrated the superiority of the proposed method compared with state-of-the-art methods.

Abstract:
LiDAR and camera are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of these heterogeneous modalities. Currently, many methods achieve feature alignment through projection calibration without accounting for the impact of sensor misalignment errors, resulting in sub-optimal performance. In this paper, we present GraphAlign++, a more accurate feature alignment framework for 3D object detection by graph matching. Specifically, we construct the nearest neighbor relationship by calculating Euclidean distances of point cloud features within the subspaces. Through the projection calibration between the image and point cloud pairs, we project the nearest neighbors of point cloud features onto the corresponding image. Then, by matching the nearest neighbors of a single point-feature of the point cloud with multiple pixel-features of the image, we search for a more appropriate feature alignment. In addition, we provide a self-attention module to enhance the weights of significant relations to fine-tune the feature alignment between these two heterogeneous modalities. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and efficiency of GraphAlign++. Notably, owing to the more accurate feature alignment, which contributes to an mAP increase of 3.10% on the KITTI test hard level, our method is remarkably beneficial for long-range object detection.

Abstract:
Domain adaptation mitigates the decline in performance that occurs when models are utilized in a target domain. Models designed for a limited range of categories struggle to handle real-world scenarios where unknown classes, absent from the original domain, exist. Furthermore, it is probable that multiple source domains are annotated asynchronously by distinct agencies, each with its own data distributions. The practical challenges of multi-source open-set domain adaptation (MSOSDA) have not been thoroughly investigated, despite their relevance in real-world scenarios. The main difficulty in MSOSDA lies in developing a shared discriminative feature space across all domains, while effectively separating source classes from target-specific ones. In this study, we propose a method for MSOSDA using a self-supervised vision Transformer (ViT) combined with nearest neighbor classification. Our key insight is to leverage the powerful nearest neighbor classification property of self-supervised ViT, along with supervised contrastive learning. To explicitly align the domains and accurately identify unknown classes in the target domain, we employ straightforward strategies and an adaptive data-driven threshold. Our approach has been extensively evaluated on five multi-source domain adaptation benchmarks, showcasing its effectiveness. Among these benchmarks, two are fine-grained, and it is worth noting that one of them has been introduced for the first time in this paper. Through these experiments, we provide compelling evidence of the performance and efficacy of our proposed approach.

Abstract:
To address the failure of most Robust Reversible Data Hiding (RRDH) schemes to resist geometric deformation attacks, a new RRDH algorithm based on Polar Harmonic Fourier Moments (PHFMs) is presented in this paper, enhancing both the robustness of the embedded data and the perceptual quality of the data-embedded image. Firstly, by leveraging the anti-geometric transformation and high-fidelity features of PHFMs, the image is transformed into its frequency domain for RRDH. Then, a quantization index modulation (QIM) algorithm is designed to embed secret data into the integer part of PHFMs coefficients. By minimizing the differences between the secret-data-embedded image and the original image, the amount of compensation data is reduced. Meanwhile, a two-dimensional RDH scheme is further adopted to embed the compensation data, thus reducing the distortion of the full data-embedded image. Finally, the robustness of the embedded data and the fidelity of the full data-embedded image are both improved. The combination of PHFMs transformation and two-dimensional RDH enables the proposed RRDH algorithm to achieve high visual quality and strong resistance capability against geometric transformation attacks. Extensive experimental results demonstrate that the proposed RRDH algorithm outperforms other state-of-the-art techniques.
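
As a rough illustration of quantization index modulation, the sketch below shows the generic scalar, binary form of QIM, not the specific scheme designed in the paper (which embeds into the integer part of PHFM coefficients). The function names and the step size `delta` are assumptions chosen for the example.

```python
def qim_embed(value, bit, delta):
    """Embed one bit by snapping `value` to the nearest point of the
    lattice for that bit: multiples of 2*delta, shifted by bit*delta."""
    return 2 * delta * round((value - bit * delta) / (2 * delta)) + bit * delta

def qim_extract(value, delta):
    """Recover the bit as the parity of the nearest multiple of delta."""
    return int(round(value / delta)) % 2

# Hypothetical coefficient value and step size.
marked_one = qim_embed(13.7, 1, 4)   # nearest odd multiple of 4
marked_zero = qim_embed(13.7, 0, 4)  # nearest even multiple of 4
```

The extracted bit survives any perturbation smaller than `delta / 2`, which is the usual trade-off in QIM: a larger step gives more robustness but more embedding distortion.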

Abstract:
The classification of style data generally depends on both physical features of data and distinct styles originating from their own homogeneities. As a first attempt, a novel style Takagi-Sugeno-Kang (TSK) fuzzy classifier called STSK is developed in this study. STSK as an interpretable classifier consists of interpretable fuzzy rules with powerful representation ability on explicit physical features, implicit style features and the proposed style matrix of style data. By separating all features including the implicit style features in the antecedent parts of all fuzzy rules into fixed fuzzy partitions, followed by randomly assigned Gaussian membership functions with interpretable linguistic terms, the interpretable antecedents and hence the interpretable fuzzy rules can be assured for STSK. The training of STSK can be accomplished by solving the designed objective function through iteratively augmenting style features and updating the consequent parameters (including the style matrices) of all fuzzy rules in the form of their analytical solutions. Furthermore, by means of the Sherman-Morrison and Schur complement formulas, a fast learning algorithm F-STSK is derived to speed up the training of STSK to ensure at least comparable classification performance and simultaneously avoid successive yet time-consuming matrix inversion calculations caused by iterative style feature augmentation. Extensive experimental results on benchmark datasets validate the effectiveness of the proposed fuzzy classifier STSK and its fast learning algorithm. Moreover, six case studies about style data demonstrate the superiority of STSK over the comparison methods. The code of this study can be downloaded from https://github.com/gusuhang10/STSK.

Abstract:
Finger vein recognition is an emerging biometric technology with high security and various application scenarios. Most finger vein recognition methods are based on a single view. However, the inherent problems in single-view finger vein recognition, such as limited feature, sensitivity to finger translation and rotation, and the ambiguity issue in 2D projections, hinder the improvement of the system performance. To address these problems and enhance finger vein verification performance, we employ multi-view finger vein images that are capable of providing a more comprehensive feature of 3D finger vein. Specifically, we design a novel low-cost full-view finger vein imaging device that enables full-view capture of finger veins with only a single camera and establish a multi-view finger vein dataset, named THU-MVFV. In addition, we propose a Multi-view Finger Vein Feature Encoding and Selection Network (MFV-FESNet), which is based on an improved Transformer encoder that can learn the dependencies between different views. By fusing the extracted global context feature and local dominant feature, the network can generate a feature descriptor with high discrimination. Extensive experiments are conducted on THU-MVFV and demonstrate the superior performance of the proposed model. The THU-MVFV dataset will be publicly available at https://github.com/Finger-Vein-Dataset/THU-MVFV.

Abstract:
Visual object tracking has witnessed continuous improvements in performance, thanks to deep CNN learning that recently emerged. More complex CNN models invariably offer better accuracy. However, there is a conflict between the tracking efficiency and model complexity, which poses a challenge in balancing speed against accuracy. To optimize the trade-off between these two performance criteria, a distillation-ensemble-selection framework is proposed in this paper. Without any modification to the baseline network architecture, the proposed approach enables the construction of a Siamese-based tracker with improved capacity and efficiency. Specifically, multiple student trackers are designed by means of knowledge distillation from a given teacher tracking model. To manage the varying granularity of unknown targets, an ensemble module combines the outputs of the student trackers with the help of a learnable fine-grained attention module. Besides, in the online tracking stage, a selection module adaptively controls the complexity of the tracker by identifying an appropriate subset of the candidate tracker models. We verify the effectiveness of the proposed method in both anchor-based and anchor-free paradigms. The experimental results obtained on standard benchmarking datasets demonstrate the effectiveness of the proposed method, with an outstanding and balanced performance in both accuracy and speed.

Abstract:
Tiny object detection (TOD) remains a challenging problem due to the extremely small size and weak feature representations of tiny objects. Many effective methods have improved the detection of small objects below 32×32 pixels to some extent, but the performance is still poor for tiny objects below 16×16 pixels. In this paper, we find that the aliasing between the features and object scales, namely feature-scale-aliasing, leads to the misalignment between feature subspaces and detection subspaces, and thus results in the interference of features, especially for tiny objects. To alleviate this, we propose a Hierarchical Activation (HA) method to obtain scale-specific feature subspaces by activating object features at different scales hierarchically. To this end, we design a Scale-Guided Feature Activation (SGFA) to decompose the original object-aliasing feature spaces into a group of scale-specific feature subspaces by scale-guided activation maps. Then, Scale-Specific Feature re-Coupling (SSFC) is used to enhance the feature subspaces by adaptively aggregating the feature subspaces from different groups. In addition, we propose to complement the scale-specific detailed information by a designed Detailed Information Compensation (DIC) method. Implementing HA, a multi-scale keypoint-based detector is constructed to improve tiny object detection, referred to as Hierarchical Activation Network (HANet). Extensive experiments are carried out on three tiny object detection datasets, i.e., TinyPerson, AI-TOD, and TinyCOCO. Our HANet achieves 58.45% $AP_{50}^{all}$, 22.1% AP, and 15.76% AP on TinyPerson, AI-TOD, and TinyCOCO, respectively, showing a significant performance gain over the competitors.

Abstract:
Fully supervised salient object detection (SOD) methods have made considerable progress in performance, yet these models rely heavily on expensive pixel-wise labels. Recently, to achieve a trade-off between labeling burden and performance, scribble-based SOD methods have attracted increasing attention. Previous scribble-based models perform the SOD task directly, based only on SOD training data with limited information, so it is extremely difficult for them to understand the image and achieve superior SOD performance. In this paper, we propose a simple yet effective framework guided by general visual representations with rich contextual semantic knowledge for scribble-based SOD. These general visual representations are generated by self-supervised learning based on large-scale unlabeled datasets. Our framework consists of a task-related encoder, a general visual module, and an information integration module to efficiently combine the general visual representations with task-related features to perform the SOD task based on understanding the contextual connections of images. Meanwhile, we propose a novel global semantic affinity loss to guide the model to perceive the global structure of the salient objects. Experimental results on five public benchmark datasets demonstrate that our method, which only utilizes scribble annotations without introducing any extra label, outperforms the state-of-the-art weakly supervised SOD methods. Specifically, it outperforms the previous best scribble-based method on all datasets with an average gain of 5.5% for max f-measure, 5.8% for mean f-measure, 24% for MAE, and 3.1% for E-measure. Moreover, our method achieves comparable or even superior performance to the state-of-the-art fully supervised models.

Abstract:
Through truncating the weights and activations of a deep neural network, conventional binary quantization imposes limitations on the representation capability of the network parameters, which hence deteriorates the detection performance of the network. In this paper, a joint-guided distillation binary neural network via dynamic channel-wise diversity enhancement for object detection (JDBNet) is proposed to mitigate the gap of quantization errors. Our JDBNet includes a dynamic channel-wise diversity scheme and real-valued joint-guided teacher assistance to enhance the representation capability of the binary neural network in the object detection tasks. In the dynamic diversity scheme, the learning channel-wise bias (LCB) layer supports adjusting the magnitude of the parameters in which the sensitivity of the model parameters to the arbitrary quantization method is reduced, thereby improving the diversity expression ability of the feature parameters. In the joint-guided strategy, the single-precision implicit knowledge from the guiding teacher in the multilevel layer is utilized to supervise and penalize the quantitative model, enhancing the fitting performance of parameters in the binary quantized model. Extensive experiments on the PASCAL VOC, MS COCO, and VisDrone-DET datasets demonstrate that our JDBNet outperforms the state-of-the-art binary object detection networks in terms of mean Average Precision.

Abstract:
This paper focuses on point-based single-stage 3D object detection from point clouds and proposes a novel, elegant detector, CPC-3Det. Pyramid and confidence-guided backbones are widely used in point-based methods. However, the limitation of neighborhood points and negative sample construction hinder discriminative feature learning and raise cost. Scene-level spatial information loss should also be noted. This paper presents a repository-based backbone consisting of a feature repository and partial knowledge to address these issues. Additionally, explicit class-aware statistics are designed to produce robust features. Moreover, statistics-embedded detection heads enhance CPC-3Det performance through feature modulation and parameter control. Furthermore, the misalignment in IoU optimization caused by center offset is explored in this paper. The paper proposes a center-weighted IoU and designs hybrid losses to drive network parameter optimization. Extensive experiments on both the KITTI and Waymo Open datasets demonstrate the superiority of CPC-3Det over state-of-the-art methods.

Abstract:
Anomaly segmentation is a critical task for safety-critical applications, such as autonomous driving in urban environments. Its objective is to detect out-of-distribution (OOD) samples with unseen categories, given a pre-trained segmentation model. The core challenge of this task is how to distinguish hard in-distribution samples from OOD samples, which has not been explicitly discussed in previous research. In this paper, we propose a simple yet effective approach named CosMe (Consensus Synergizes with Memory) to address this challenge. CosMe consists of two key components: 1) building a memory bank comprising seen prototypes extracted from multiple layers of the given segmentation model, and 2) training an auxiliary model that mimics the behavior of the given model and using the consensus of their mid-level features as complementary cues that synergize with the memory bank. The former serves as a baseline that can detect all potential outliers, including both OOD and hard in-distribution samples; the latter assists in distinguishing between these two types of outliers. Experimental results on several urban scene anomaly segmentation datasets demonstrate that CosMe outperforms previous approaches by a significant margin.
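
To make the memory-bank idea concrete, the sketch below scores a feature by its distance to the closest stored prototype. This is only one plausible reading: the abstract does not state the scoring rule, so the cosine-based score, the function names, and the toy prototypes are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def anomaly_score(feature, memory_bank):
    """Assumed scoring rule: one minus the highest similarity to any
    stored prototype, so features far from all prototypes score high."""
    return 1.0 - max(cosine(feature, p) for p in memory_bank)

# Hypothetical prototypes of seen classes.
bank = [(1.0, 0.0), (0.0, 1.0)]
score_seen = anomaly_score((1.0, 0.0), bank)
score_far = anomaly_score((-1.0, 0.0), bank)
```

A score of this kind flags hard in-distribution samples as well as true outliers, which is why the abstract pairs the memory bank with a second, consensus-based cue.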

Abstract:
An always-on intelligent system comprising an image sensor requires continuous functioning of each pixel. This includes sensing the illumination content of the scene and also the conversion of the analog values into their digital representations. Therefore, power consumption during analog-to-digital conversion and computational cost at the image sensor module become critical while designing a system that is always-on and incorporates intelligence near the sensor module. This work focuses on the inherent property of the analog-to-digital converter (ADC) of converting the analog pixel values to digital values by taking a defined number of ADC cycles. The design factors considered are 1) power saving due to reduced ADC conversion cycles for each pixel; 2) the reduced bit-precision of the processing unit to reduce hardware cost; 3) the dataflow design through hls4ml, which produces parallel computational modes for low-latency CNN architectures. The proposed work implements two lightweight CNN models with reduced parameters as compared to the original architectural models of VGG16 (like) and SqueezeNet (like), which are trained in QKeras and deployed on a Zynq UltraScale+ MPSoC board. In addition, the design pipeline is validated on the MobileNetV2 and GhostNet architectures to demonstrate its generalization ability. A detailed analysis shows that limiting the number of ADC bits from 8 to 4 reduces the mean accuracy merely from 50.3 to 49.17 for the VGG16 (like) model and from 67.83 to 67.80 for the SqueezeNet (like) model; however, the readout power is significantly reduced from 140.45 mW to 7.7 mW for the STL-10 dataset with 96×96 image resolution. Additional experiments are conducted with the CIFAR-10 and mini-ImageNet datasets for classification and with the Oxford-IIIT Pet Dataset for segmentation. The proposed work, thus, provides empirical evidence that a reasonable performance for intelligent vision tasks with power saving can be achieved by tuning CNN models to work with reduced ADC bit precision.
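
The effect of reducing ADC precision from 8 to 4 bits can be simulated in software with a minimal sketch. It assumes that a lower-precision readout is equivalent to keeping only the most significant bits of an 8-bit code; the function names and sample pixel values are hypothetical, and the sketch models only the quantization effect, not the power saving or the hardware conversion cycles.

```python
def reduce_bit_depth(pixels, src_bits=8, dst_bits=4):
    """Keep only the top `dst_bits` of each `src_bits`-wide pixel code."""
    shift = src_bits - dst_bits
    return [p >> shift for p in pixels]

def rescale(codes, src_bits=8, dst_bits=4):
    """Map reduced codes back onto the original range for comparison."""
    shift = src_bits - dst_bits
    return [c << shift for c in codes]

# Hypothetical 8-bit pixel values.
raw = [0, 15, 16, 200, 255]
coarse = reduce_bit_depth(raw)
restored = rescale(coarse)
```

Feeding `restored`-style inputs to a model during training is one simple way to check how much accuracy a given bit depth costs before committing to hardware.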

Abstract:
As a fundamental technology in autonomous driving and robotic sensing system, 3D point cloud object detection has received increasing attention. In this paper, a novel 3D detection method that harnesses perspective information and proposal correlation (PIPC-3Ddet) is proposed for detecting 3D objects from point clouds. Specifically, a perspective information embedding module is designed to enhance the voxel features by capturing and embedding the perspective information of range images, so as to effectively distinguish the objects and backgrounds. Besides, by revealing the correlation among 3D proposals, a proposal correlation reasoning module is presented to learn high-quality proposal features for better 3D proposal refinement. With the designed perspective information embedding and proposal correlation reasoning modules, the proposed PIPC-3Ddet is able to better perceive the objects in the 3D scene, thus boosting the 3D object detection performance. Extensive experiments on the KITTI and Waymo benchmarks have demonstrated the superiority of the proposed PIPC-3Ddet.

Abstract:
Image quality assessment (IQA) has always been a popular research topic. There have been many methods proposed for predicting image quality, also known as the mean opinion score (MOS). However, it is worth noting that different people may assign different opinion scores to the same image. Image quality described by all subjective opinion scores can express rich subjective information about the image, such as diversity and uncertainty, which cannot be accurately described by a single MOS. Therefore, this paper proposes a fuzzy neural network to predict the opinion score distribution (OSD) of image quality. The fuzzy neural network includes three sub-networks: a feature extraction network, a feature fuzzification network, and a fuzzy learning network. First, a novel network is designed to extract image features. The extracted features are then fuzzified by fuzzy theory to model the epistemic uncertainty in the feature extraction process. Finally, the OSD of image quality is predicted using the fuzzy learning network by learning the mapping from fuzzy features to fuzzy uncertainty when rating image quality. In addition, to train the proposed fuzzy neural network, we employ a new loss function based on the quantile and the cumulative density function. We experimentally validate the feasibility and superiority of the proposed method in two aspects. On the one hand, we demonstrate the performance of the proposed method in predicting the OSD of image quality on the SJTU IQSD and KonIQ-10K databases. On the other hand, we also prove the feasibility of the proposed method in predicting the MOS of image quality on several popular IQA databases, including CSIQ, TID2013, LIVE MD, and LIVE Challenge.

Abstract:
Multi-grained cross-modal image-text retrieval models have demonstrated promising outcomes through the alignment of local and global features. However, this advancement often results in larger model sizes and higher computational requirements, which raises concerns regarding the balance between performance and efficiency. To address this challenge, we introduce a novel lightweight multi-grained (LMG) image-text retrieval paradigm aimed at enhancing model efficiency. Specifically, in our approach, we first re-frame the retrieval problem as a cascaded representation learning task. This involves leveraging only fine-grained features to capture coarse-grained constraints, thereby reducing computational burden while maintaining accuracy. Furthermore, we replace computationally expensive parametric feature aggregation methods with three efficient parameter-free alternatives: auto-correlation matrix, discrete linear convolution, and discrete Fourier transform. The proposed LMG model is extensively compared with state-of-the-art approaches on two benchmark datasets, i.e., Flickr30K and MSCOCO, and the experimental results highlight the superior performance of LMG. Additionally, we explore the impact of different feature aggregation methods on LMG and conduct a sensitivity analysis on the coarse and fine-grained constraints ratio hyper-parameter.

Abstract:
Automatic vision-based sewer inspection plays a vital role in the sewage system of a modern city. Recent advances focus on deep learning-based methods for sewer inspection, benefiting from the capability of data-driven feature extraction. Although acceptable sewer defect classification performance has been achieved, there is still a gap between the emerging methods and actual application scenarios. First, multi-focus complementarity is ignored when representing sewer defects, so the multi-scale information of sewer defects is captured inefficiently. Second, the inherent uncertainty of sewer defects is not considered, so serious unknown sewer defect categories may be missed, resulting in untrustworthy sewer inspection. In this paper, we focus on quick-view (QV)-based sewer inspection and propose a trustworthy multi-focus fusion framework (TMFF) that jointly combines multi-label classification and uncertainty estimation. Specifically, a focal segment module (FSM) is designed based on optical flow to split the QV sewer video into long-focus and short-focus segments, where the multi-focus segments can be modeled to represent the multi-scale information of sewer defects. Then, evidential deep learning (EDL) is introduced to quantify the uncertainty, while a joint expert scheme (JES) is designed to aggregate the expert opinions of multi-focus segments. Moreover, an evidential disambiguating strategy (EDS) is proposed to alleviate the ambiguity of uncertainty estimation. Extensive experiments are conducted on VideoPipe, in which the superiority of TMFF is demonstrated compared with the state-of-the-art methods. Furthermore, we validate the potential capability of TMFF against the unknown cases of sewer defects.

Abstract:
Due to the overarching similarities of ships, subtle information is imperative for fine-grained ship detection. However, this information is easily lost in adverse weather (e.g., fog, rain, snow, and cloud) or occlusion scenarios. Experts can quickly and accurately recognize fine-grained objects because they have the domain knowledge to help them find the most discriminative information (e.g., edge, structure, texture, and class semantics); thus, they do not need a lot of information to make an identification. Motivated by this, we propose a discriminative information enhancement method with cross-modal domain knowledge (DIE-CDK) for fine-grained ship detection. The core idea behind DIE-CDK is to enhance the discriminative information about fine-grained ships by fusing cross-modal domain knowledge. The introduced cross-modal domain knowledge comprises local and global knowledge: 1) local knowledge is the knowledge of visual shape (e.g., edge contour), which is extracted from the image domain; and 2) global knowledge is the knowledge of the class semantics, which is obtained from the common sense domain. In addition, to further study fine-grained ship detection, we introduce a fine-grained ship dataset (called FgShips). Experiments show that our proposed DIE-CDK method achieves impressive gains in detection performance and outperforms state-of-the-art methods on fine-grained ship and public datasets.

Abstract:
Neural networks for synthetic aperture radar (SAR) automatic target recognition often encounter overfitting challenges owing to limited training samples. Moreover, the azimuth angle of SAR, a vital parameter for improving network generalization, is frequently disregarded in most models. In response, we propose MIGA-Net, a classification neural network that effectively perceives azimuthal information using multi-view images to improve classification performance. Specifically, we quantize low-dimensional azimuthal values for sample-limited scenarios. Then, we utilize encoded image sequences as training data because they encompass spatial context information compared to individual images. After extracting features of the sequence samples through convolutional layers, we design a two-layer output module. One layer converts these sequence features into graph data. Then the dense graph attention network (GAT) extracts contextual features from the graph data for angle estimation. Simultaneously, another layer combines these features for target classification. During the network training, the GAT module can extract image azimuth features with powerful information aggregation capabilities. It supervises the convolutional layers to learn azimuth features, which are fused with class features from another layer to obtain a more structured feature domain. This feature domain significantly enhances the classification performance of the network. Experiments conducted on the moving and stationary target acquisition and recognition (MSTAR) dataset have proven the superior performance of the proposed method, achieving at least 1% higher accuracy compared to other state-of-the-art algorithms.

Abstract:
Robust tensor completion, which aims to recover a tensor from partial observations corrupted by Gaussian noise and sparse noise simultaneously, has a wide range of applications in visual data recovery. The existing approaches make use of convex or nonconvex relaxation based on the transformed tensor nuclear norm, which may be challenged since only the global low-rankness of the underlying tensor data is utilized. In order to explore the global and local patterns simultaneously, in this paper, we propose a nonconvex model for robust tensor completion by combining dictionary learning and nonconvex regularization. To explore the global low-rankness, a family of nonconvex functions is applied to the singular values of all frontal slices of the underlying tensor in the transformed domain. Dictionary learning is utilized to capture the local patterns of the underlying tensor data. Moreover, a family of nonconvex functions is applied to each entry of the sparse noise, which can obtain sparser solutions compared with the tensor $\ell_1$ norm. A proximal alternating linearized minimization algorithm is designed to solve the proposed model, whose convergence is established under very mild conditions. Extensive numerical experiments on multispectral images, video, and magnetic resonance imaging datasets show that the proposed model outperforms other state-of-the-art approaches.

Abstract:
The promotion of the HEVC standard has significantly alleviated the burden of network transmission and video storage. However, its inherent complexity and data dependencies pose a significant challenge to achieving a hardware encoder with high compression efficiency. To tackle this challenge, we propose several hardware-oriented algorithms and achieve a hardware encoder supporting both intra and inter coding. In terms of algorithms, our optimizations focus on intra mode decision, motion estimation (ME), rate estimation, and merge mode estimation. These optimizations reduce the computational complexity and address the data dependencies within and between encoder modules while maintaining an acceptable compression efficiency. As for hardware, we propose an encoder architecture that supports not only 35 intra prediction modes but also ME with an extensive search range of [±64, ±64]. The uniform 4×4 engine, 2-D data reuse, and timing schedule for intra and inter coding are presented in this architecture to optimize the hardware resource consumption and throughput. Compared with HM 15.0, the proposed hardware-oriented algorithms lead to a 1.88% and 14.57% increase in BD-Rate under the configurations of all intra and low delay P, respectively. Notably, the BD-Rate outperforms all existing hardware encoders supporting 4K resolution. In a GF 28 nm fabrication process, the hardware design achieves a clock frequency of 550 MHz, supporting 4K@30fps throughput with a hardware gate count of 3154K and memory usage of 1.02 MB, and the proposed architecture demonstrates substantial advantages in terms of area, throughput, and power compared to other studies.

Abstract:
Traffic scene perception has a significant impact on driving safety. Inexperienced or distracted drivers usually do not allocate enough attention to the objects closely related to the driving task, which causes potential road hazards. In contrast, experienced drivers pay close attention to the objects highly relevant to the driving task under the guidance of visual selective attention, thus achieving driving safety. However, apart from traffic saliency prediction, few existing works have integrated human drivers' perception with computer models to detect the objects attracting the attention of experienced drivers in traffic videos. In this work, we aim to detect these objects, specifically referred to as traffic fixated objects. To achieve this goal, a new eye-tracking-based video fixated object detection dataset (ET-VFOD) is first built, which can serve as a benchmark for researchers interested in attention-inspired fixated object detection. Then, we propose a traffic video fixated object detection network named VFOD-Net. VFOD-Net decodes the information closely related to the driving task from the reference frames. The information is used as a top-down prior to modulate the model’s encoding process of the current frame, thus improving the detection performance. Considering the high cost of manual annotation, a weakly supervised traffic video fixated object detection pipeline is developed. Experimental results on the ET-VFOD dataset show that our proposed weakly supervised method achieves detection performance close to that of the fully supervised model, which verifies the effectiveness of the proposed method. Our work combines bottom-up and top-down attention to detect the vital objects in traffic videos from the perspective of human drivers, showing potential applications in intelligent driving, such as driver monitoring and warning systems. The dataset and code are available at https://github.com/YiShi701/VFOD_Net.

Abstract:
The proper use of medical personal protective equipment (MPPE) is critical for frontline healthcare workers (HCWs) to handle highly contagious diseases. Due to the complexity of PPE donning and doffing protocols, public health organizations typically recommend having trained observers monitor the entire PPE donning and doffing process, preventing self-contamination and transmission. However, the high costs of manual monitoring impede the implementation of this practice, which makes AI-assisted PPE monitoring highly valuable. Some studies have applied computer vision techniques to PPE monitoring, but they have only focused on limited integrity checks at donning completion, which is unable to provide real-time warnings for abnormal actions during the doffing process. Furthermore, model practicality and user-friendliness are constrained by the lack of explainability. To address this, we propose an explainable and fine-grained dataset for MPPE doffing monitoring called the XFMP dataset. The dataset contains 3596 expert-annotated samples over three sub-tasks: doffing stage classification (DSC), abnormal action recognition (AAR), and critical region localization (CRL). Accordingly, we introduce multi-dimensional evaluation metrics for XFMP and a multitask human behavior semantic attention network (MHBSAN). Experiments demonstrate that MHBSAN outperforms alternative approaches, achieving 0.968/0.855 accuracy for stage/action classification and 0.791 Top-1 Loc@0.5 for CRL sub-task. Moreover, it demonstrates exceptional adaptability across different healthcare environments. Ablation studies and case analyses further validate the contributions and efficacy of the proposed model regarding classification and explainability.

Abstract:
In recent years, the smart healthcare mode has gradually matured with the evolution of Internet cloud computing technology. However, due to the open and shared nature of the Internet, there are substantial concerns about how to protect patient privacy and security in the smart healthcare mode. Hence, this paper proposes a medical image encryption algorithm using Josephus scrambling and dynamic cross-diffusion techniques to protect patient privacy. Firstly, we design a novel hyperchaotic system with a broader chaotic interval that can generate chaotic sequences with strong pseudo-randomness. Then, a new dynamic Josephus scrambling methodology based on chaotic sequences is proposed. The scrambling scheme can effectively scramble the image pixel positions, thereby destroying the correlation between adjacent pixels of the target image. Finally, we devise a dynamic parallel cross-diffusion scheme combined with chaotic sequences to further encrypt the scrambled image. Meanwhile, the SHA-256 hash value of the target image is used to update the initial key of the algorithm. Simulation experiments reveal that the devised encryption algorithm can convert the target image into a meaningless snowflake image. Security analysis results show that the presented algorithm is robust to common cryptanalysis approaches. In addition, we compare the corresponding performance indicators with those of some state-of-the-art image encryption algorithms, and the results demonstrate that our algorithm exhibits superior encryption performance.
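The scrambling stage can be illustrated with a minimal sketch under stated assumptions. The paper's hyperchaotic system is not given in the abstract, so a logistic map stands in for it here; the way the SHA-256 digest perturbs the key, and the names `logistic_sequence`, `key_from_image`, `josephus_permutation`, `scramble`, and `unscramble`, are hypothetical. The cross-diffusion stage is omitted.

```python
import hashlib


def logistic_sequence(x0, n, r=3.99, burn_in=100):
    """Stand-in chaotic generator (logistic map), NOT the paper's hyperchaotic system."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    out = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        out.append(x)
    return out


def key_from_image(pixels, base_key):
    """Perturb the initial key with the SHA-256 digest of the plain image (assumed scheme)."""
    digest = hashlib.sha256(bytes(pixels)).digest()
    frac = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    # Keep the seed strictly inside (0, 1) so the logistic map does not collapse.
    return 0.1 + 0.8 * ((base_key + frac) % 1.0)


def josephus_permutation(n, steps):
    """Josephus elimination order where round i advances by steps[i] positions."""
    remaining = list(range(n))
    order = []
    pos = 0
    for i in range(n):
        pos = (pos + steps[i]) % len(remaining)
        order.append(remaining.pop(pos))
    return order


def scramble(pixels, base_key):
    """Permute a flat list of 8-bit pixel values; returns (scrambled, permutation)."""
    x0 = key_from_image(pixels, base_key)
    chaos = logistic_sequence(x0, len(pixels))
    steps = [int(c * 1e6) % len(pixels) + 1 for c in chaos]
    perm = josephus_permutation(len(pixels), steps)
    return [pixels[p] for p in perm], perm


def unscramble(scrambled, perm):
    """Invert the permutation produced by scramble()."""
    restored = [0] * len(scrambled)
    for dst, src in enumerate(perm):
        restored[src] = scrambled[dst]
    return restored
```

Because the step sizes depend on a key derived from the image hash, two images that differ in a single pixel yield different permutations. In this sketch the receiver is simply handed the permutation; a real scheme would need the digest-derived key to regenerate it.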

Abstract:
Cross-view geo-localization aims to associate geographical location with different view images shot from different platforms. One of the critical challenges is how to effectively emphasize architectural features and reduce background interference to achieve robust cross-view matching. Most of the existing methods fail to adequately address the features of buildings, treating foreground and background equally. Leveraging prior knowledge to enhance the features of the crucial architectural foreground yields greater benefits in geo-localization. A comprehensive three-stage approach (TirSA) is proposed in this paper, which consists of three components: Pre-processing, Generate Feature Embedding, and Post-processing. In the Pre-processing stage, we employ a self-supervised feature enhancement method (SFEM) to obtain the building-aware mask. Without adding additional auxiliary information, the model is guided to learn from discriminative building regions. Besides, in the Generate Feature Embedding stage, we propose an adaptive feature integration module (AFIM) to enhance feature representation capability. We also train the Siamese network using a novel improved cross-domain triplet loss to reduce the impact of the inter-view domain gap. Finally, in the Post-processing stage, we employ a re-ranking method to optimize the initial retrieval list, further enhancing the matching accuracy. Remarkably, extensive experiments show that our proposed TirSA exceeds the state of the art by a large margin and achieves the best results in both drone-view target localization and drone navigation. Especially in the drone navigation task, our method is superior to the existing methods, achieving an improvement of approximately 5%. Code will be released at https://github.com/SunJ1025/TirSA.
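As background for the loss mentioned above, here is a minimal sketch of the standard triplet margin loss on embedding vectors. It is the textbook form, not the improved cross-domain variant the paper proposes, whose details are not given in the abstract. The margin value and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss: pull the positive closer than the negative by `margin`.

    In a cross-view setting the anchor could be a drone-view embedding, the
    positive a satellite-view embedding of the same location, and the
    negative a satellite-view embedding of a different location.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the matching view is closer to the anchor than the non-matching view by at least the margin, so only hard or violated triplets contribute gradient.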

Abstract:
Binocular autostereoscopic display requires a real-time eye localization system with high accuracy and robustness. However, despite a decade of development in deep learning, few models are suitable for real-time operation on an embedded device that can be assembled into the display. In this work, we propose a system-level design for real-time eye localization on a single ARM CPU, named Infrared Guiding Modal Multiuser Eye Localization v2 (IGM-MELv2). Aiming for effective solutions within a constrained computational budget, IGM-MELv2 deconstructs the complex multiuser eye localization challenge into two manageable tasks: thermal-infrared (TIR) face detection and face-region-based visible light eye localization. For TIR face detection, we search for a suitable accuracy-latency tradeoff and propose a tiny-lightweight detector named Yolo-Fastest-TF. For visible light eye localization, we transform a general object detector through a series of design choices, resulting in the tiny-lightweight Yolo-Fastest-VE. Additionally, an RGB-thermal Multiuser Low-resolution Eye Localization (RGBT-MLEL) dataset is collected to build the system. Experimental results reveal that IGM-MELv2 has a frame rate of 44.86 frames per second on a Raspberry Pi 4B and can successfully locate 97.4% of the eyes in the RGBT-MLEL test set. On the two separate tasks, the tiny-lightweight networks have accuracy and robustness comparable to complex models while maintaining high computational efficiency. Yolo-Fastest-TF achieves 91.46 AP50 on TIR face detection with 21MFLOPs computation. Yolo-Fastest-VE achieves a success rate of 100% under an error threshold of 0.15 on BioID with 37MFLOPs computation.
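The "success rate under an error threshold of 0.15" can be sketched as follows, assuming the widely used normalized eye-localization error for BioID: the larger of the two eye-center errors divided by the ground-truth inter-ocular distance. The abstract does not spell out the metric, so this definition and the names `normalized_error` and `success_rate` are assumptions.

```python
import math


def normalized_error(pred_left, pred_right, gt_left, gt_right):
    """Worst-case eye-center error, normalized by the true inter-ocular distance."""
    d_left = math.dist(pred_left, gt_left)
    d_right = math.dist(pred_right, gt_right)
    return max(d_left, d_right) / math.dist(gt_left, gt_right)


def success_rate(samples, threshold=0.15):
    """Fraction of faces whose normalized error is at most `threshold`.

    Each sample is (pred_left, pred_right, gt_left, gt_right) with (x, y) points.
    """
    hits = sum(1 for s in samples if normalized_error(*s) <= threshold)
    return hits / len(samples)
```

Normalizing by the inter-ocular distance makes the score independent of face size in the image, which matters for a low-resolution multiuser setting.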

Abstract:
Cervical cancer is the fourth most common cancer in women and its subtyping requires examining histopathological slides or digital images, such as whole slide images (WSIs). However, manually inspecting WSIs with gigapixel sizes can be laborious and prone to errors for pathologists. To address this issue, computer-aided approaches based on weakly-supervised learning techniques have been proposed. These methods can predict disease types directly from WSIs and highlight diagnosis-relevant regions, which can help pathologists achieve faster and more accurate diagnoses. WSIs are divided into overlapping patches using a sliding window approach, and these patches are subsequently screened in a sequential zig-zag pattern to identify spatiotemporal dependencies. These dependencies are further analyzed to generate predictions at the WSI level. Therefore, effective patch feature learning and spatiotemporal aggregation are two key issues in the weakly-supervised WSI classification (WSWC) task. In this paper, we present a label-efficient WSWC method called spatiotemporal aggregation for cervical WSIs (SAC-Net), which jointly performs online feature extraction and feature aggregation to infer the WSI-level prediction in an end-to-end manner. The online feature extractor helps to learn cervical-cancer-specific features and obtain more accurate patch representations. The feature aggregator uses an online instance clustering method to learn proper weight parameters for each cluster, which generates the WSI embedding with enhanced spatiotemporal aggregation. SAC-Net is developed and evaluated on a public cervical WSI dataset (TissueNet) containing 1015 WSIs, and is also externally tested on three independent cervical WSI datasets. Our results demonstrate that SAC-Net achieves state-of-the-art classification performance and is robust. SAC-Net has the potential to be a useful tool for clinical cervical cancer detection.
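The patch extraction order described above can be sketched as follows: a sliding window with a stride smaller than the patch size yields overlapping patches, and alternating the scan direction on each row gives a zig-zag sequence. The function name `zigzag_patches` and the concrete sizes are assumptions; the abstract does not give the patch size or stride.

```python
def zigzag_patches(width, height, patch, stride):
    """Top-left (x, y) corners of sliding-window patches in zig-zag row order.

    Even rows are scanned left to right, odd rows right to left, so
    consecutive patches in the sequence are always spatial neighbours.
    Border pixels that do not fit a full patch are skipped in this sketch.
    """
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    coords = []
    for row, y in enumerate(ys):
        row_xs = xs if row % 2 == 0 else xs[::-1]
        coords.extend((x, y) for x in row_xs)
    return coords
```

For example, `zigzag_patches(8, 8, 4, 2)` returns nine corners, and the last patch of each row sits directly above the first patch of the next row, which is what lets a sequence model treat the scan as one continuous path.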

Abstract:
As a fundamental task of 3D perception, point cloud recognition has shown significant progress in recent years. However, existing methods still face challenges when dealing with geometry differences, resulting in performance degradation when a distribution gap exists between the training and testing data, also known as domain generalization. In this work, we focus on this problem and propose a general training framework, named Push-and-Pull, aimed at effectively improving the generalization ability of models on unseen target domains. Specifically, our framework first introduces a learnable 3D data augmentor to generate new training point clouds, which helps to reduce the domain bias and enrich the source training set. Also, an adversarial training strategy is proposed to push the augmented samples away from the original ones in the latent space and meanwhile keep the geometric structure. Second, based on the original and augmented samples, a dual-level consistency regularization strategy on logits and feature spaces is designed to pull the deviated representations back to their original space as close as possible, and promote discriminative and domain-agnostic representations. These two steps are iteratively optimized to enhance the overall performance. Extensive experiments on the PointDA-10 and Sim2Real benchmarks consistently demonstrate the effectiveness of our proposed framework.
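The "pull" step can be illustrated with a minimal sketch of a dual-level consistency term, assuming a KL divergence between the class distributions of the original and augmented samples plus a mean squared error between their features. The abstract does not specify the distance functions or weights, so these choices and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_loss(logits_orig, logits_aug, feat_orig, feat_aug,
                     w_logit=1.0, w_feat=1.0):
    """Logit-level plus feature-level consistency between original and augmented samples."""
    logit_term = kl_divergence(softmax(logits_orig), softmax(logits_aug))
    feat_term = mse(feat_orig, feat_aug)
    return w_logit * logit_term + w_feat * feat_term
```

The loss is zero when the augmented sample produces the same prediction and features as the original, and grows as the augmentor pushes the sample away, which is the tension the two alternating steps exploit.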

Abstract:
The current mainstream studies on Scene Graph Generation (SGG) are devoted to the long-tailed predicate distribution problem in order to generate unbiased scene graphs. The long-tailed predicate distribution exists in the VG dataset and becomes more severe during the SGG network training process. Most existing de-biasing methods solve the problem by applying re-sampling or re-weighting in a mini-batch, with the main idea being to provide unbiased attention to different predicate categories based on prior predicate distributions. During the training of SGG models, the existing training mode samples several images into a mini-batch to obtain training data, thus providing sparse and scattered predicate instances for training. However, sampling predicate instances from a set of predicate samples that is limited in both quantity and category makes it difficult to train unbiased SGG models. In order to provide a wider range for sampling predicate instances, this paper reorganizes the images in the VG training set into a new form, i.e., object-pairs, and constructs the VG-OP (VG Object-Pair) training set to store the object-pairs. Meanwhile, this paper introduces a new SGG network training mode, which can realize unbiased SGG without re-sampling or re-weighting. In particular, a Predicate-balanced Sampling Network (PS-Net) is designed to validate the new training mode. Extensive experiments on the VG test set demonstrate that our method achieves competitive or state-of-the-art unbiased SGG performance.
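The benefit of drawing training instances from a dataset-wide pool of object-pairs, rather than from the few images in a mini-batch, can be sketched as follows. This is an illustration of predicate-balanced sampling in general, not PS-Net itself; the `(subject, predicate, object)` tuple layout and the names `build_index` and `balanced_batch` are assumptions.

```python
import random
from collections import defaultdict


def build_index(object_pairs):
    """Group (subject, predicate, object) tuples by predicate category."""
    by_predicate = defaultdict(list)
    for pair in object_pairs:
        by_predicate[pair[1]].append(pair)
    return by_predicate


def balanced_batch(by_predicate, batch_size, rng):
    """Pick a predicate uniformly at random, then one of its object-pairs."""
    predicates = sorted(by_predicate)
    batch = []
    for _ in range(batch_size):
        predicate = rng.choice(predicates)
        batch.append(rng.choice(by_predicate[predicate]))
    return batch
```

With a pool of 95 "on" pairs and 5 "riding" pairs, calling `balanced_batch(build_index(pool), 1000, random.Random(0))` returns the two predicates in roughly equal numbers, whereas sampling whole images would reproduce the 95:5 skew.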

Abstract:
Camouflaged Object Detection (COD) is a challenging visual task because camouflaged objects have complex contours, diverse scales, and high similarity to the background. Existing COD methods encounter two predicaments: first, they are prone to falling into local perception, resulting in inaccurate object localization; second, they struggle to achieve precise object segmentation due to a lack of detailed information. In addition, most COD methods typically require a larger number of parameters and higher computational complexity in pursuit of better performance. To this end, we propose a global localization perception and local guidance refinement network (PRNet) that simultaneously addresses performance and computational costs. Through effective aggregation and use of semantic and detail information, PRNet can achieve accurate localization and refined segmentation of camouflaged objects. Specifically, with the help of a purpose-designed Cascaded Attention Perceptron (CAP), we can effectively integrate and perceive multi-scale information to localize camouflaged objects. We also design a Guided Refinement Decoder (GRD) in a top-down manner to extract context information and aggregate details to further refine camouflaged prediction results. Extensive experimental results demonstrate that our PRNet outperforms 12 state-of-the-art models on 4 challenging datasets. Meanwhile, PRNet has a smaller number of parameters (12.74M), lower computational complexity (10.24G), and real-time inference speed (105FPS). Source codes are available at https://github.com/hu-xh/PRNet.

Abstract:
LiDAR and camera are two critical sensors that can provide complementary information for accurate 3D object detection. Most works are devoted to improving the detection performance of fusion models on the clean and well-collected datasets. However, the collected point clouds and images in real scenarios may be corrupted to various degrees due to potential sensor malfunctions, which greatly affects the robustness of the fusion model and poses a threat to safe deployment. In this paper, we first analyze the shortcomings of most fusion detectors, which rely mainly on the LiDAR branch, and the potential of the bird’s eye-view (BEV) paradigm in dealing with partial sensor failures. Based on that, we present a robust LiDAR-camera fusion pipeline in unified BEV space with two novel designs under four typical LiDAR-camera malfunction cases. Specifically, a mutual deformable attention is proposed to dynamically model the spatial feature relationship and reduce the interference caused by the corrupted modality, and a temporal aggregation module is devised to fully utilize the rich information in the temporal domain. Together with the decoupled feature extraction for each modality and holistic BEV space fusion, the proposed detector, termed RobBEV, can work stably regardless of single-modality data corruption. Extensive experiments on the large-scale nuScenes dataset under robust settings demonstrate the effectiveness of our approach.

Abstract:
Point-cloud-based 3D deep models are widely used in applications such as autonomous driving and household robots. Inspired by recent prompt learning in natural language processing, this work proposes a novel Multi-view Vision Fusion Network (MvNet) for few-shot 3D point cloud classification. MvNet investigates the possibility of leveraging off-the-shelf 2D pre-trained models to achieve few-shot classification, which can alleviate the over-dependence of existing baseline models on large-scale annotated 3D point cloud data. Specifically, MvNet first encodes a 3D point cloud into multi-view image features for a number of different views. Then, a novel multi-view prompt fusion module is developed to fuse information from different views effectively to bridge the gap between 3D point cloud data and 2D pre-trained models. A set of 2D image prompts can then be derived to better describe the suitable prior knowledge for a large-scale pre-trained image model for few-shot 3D point cloud classification. Extensive experiments on ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that MvNet achieves new state-of-the-art performance for 3D few-shot point cloud image classification. The source code of this work is available at https://github.com/invictus717/MetaTransformer.

Abstract:
In reviewing the research progress in Point Cloud Quality Assessment (PCQA), two main pathways have emerged, i.e., 2D projections and 3D point descriptors. The former primarily focuses on visual information, while the latter concentrates on crucial geometrical information in three-dimensional space. However, the current studies lack a thorough investigation of the impact of visual components and seldom pay special attention to plane-point fusion strategies. To comprehensively represent features and effectively tackle various types of impairments, we propose an end-to-end learning paradigm called Plain-PCQA, which considers only plain visual and geometrical factors, for quantitatively evaluating objective metrics of 3D dense point clouds associated with human perception. Firstly, we explore a sophisticated preprocessing technique. The entire point clouds are packaged into six projections by moving virtual cameras, which can conveniently increase the visual samples during the training stage. Given the high resolution of the projected image, we have opted for a relatively lightweight network, namely ResNet-18, as the backbone to enable higher resolution input data. Five cropped patches from the projected image are collectively fed into this network. In light of the presence of some invalid information in the projections, a mask weight is devised to calculate the significance of each patch based on its effective informational content. Secondly, dual neural networks, comprising a No-Reference (NR) branch and a Degraded-Reference (DR) branch, are designed with fundamental visual components to provide quantitative quality metrics. Specifically, the NR branch utilizes the feature output of each block in the Vision Transformer (ViT) model to obtain long-range low-level and high-level visual NR quality. The DR branch employs KLT (Karhunen-Loève Transform) to acquire the principal component information of an image as the macro-structural image, and then feeds the difference between input images and macro-structural images into a network for DR quality extraction. Thirdly, a Plane-Point Interaction Transformer (P2IT) is presented by incorporating texture and semantic features in 2D projections and geometrical features in 3D spaces to characterize the complete features with a connected 2D-3D feature representation. With these elaborately designed deep features, the proposed model can achieve competitive performance relying solely on plain visual and geometrical components. The experimental results on multiple representative databases demonstrate the potential of the proposed approach, which surpasses existing state-of-the-art methods significantly.
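The mask weight mentioned for the cropped patches can be illustrated with a minimal sketch, assuming that "effective informational content" means the fraction of pixels in a patch that are not empty background, and that the weights are normalized to sum to one. Both assumptions, the background value of 255, and the names `mask_weights` and `weighted_score` are hypothetical; the abstract does not define the weight.

```python
def mask_weights(patches, background=255):
    """Weight each patch by its share of non-background pixels (assumed definition).

    Each patch is a 2D list of grayscale values; `background` marks empty
    projection area where no point was rendered.
    """
    valid = []
    for patch in patches:
        pixels = [p for row in patch for p in row]
        valid.append(sum(1 for p in pixels if p != background) / len(pixels))
    total = sum(valid)
    if total == 0:
        # No patch carries content: fall back to uniform weights.
        return [1.0 / len(patches)] * len(patches)
    return [v / total for v in valid]


def weighted_score(scores, weights):
    """Combine per-patch quality scores using the mask weights."""
    return sum(s * w for s, w in zip(scores, weights))
```

Under this reading, a patch cropped mostly from the empty border of a projection contributes little to the final score, so background area does not dilute the quality estimate.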

Abstract:
Point clouds offer a novel 3D data representation that has proven pivotal in immersive visual media applications involving human perception. Developing objective point cloud quality assessment (PCQA) methods is imperative, as they can substantially reduce human evaluation costs and drive advancements for visual perceptual experiences in point cloud related applications. Point cloud quality assessment without reference remains challenging. Previous PCQA methods predominantly employ a fixed perceptual distance and often overlook the variability in quality perceived from different viewpoints, which impedes their effectiveness in multiscale or multi-granularity feature extraction and learning, particularly for deep neural networks. The single fixed observation distance fails to capture the multi-resolution characteristics intrinsic to human perception. Addressing this gap, in this paper, we introduce a novel no-reference PCQA method (MOD-PCQA) that integrates multiscale features to enhance point cloud quality perception across diverse scales and granularities. MOD-PCQA pioneers a viewpoint-aware feature learning framework, capable of capturing visual features across various granularity levels, from fine to coarse. Specifically, we process and project point clouds into images from different viewpoints. Then, we extract multi-scale features under corresponding perspectives through three branch networks. Finally, we design an alternate learning strategy to optimize the feature extraction network to continuously refine the learned feature information from both inter-scale and intra-scale perspectives. Comprehensive experiments conducted on the SJTU-PCQA and WPC databases validate the superiority of our proposed model over state-of-the-art PCQA methods. Our method achieves optimal performance on both benchmarks by a significant margin, which comprehensively validates its effectiveness for the challenging PCQA task. 
The source code will be available at https://openi.pcl.ac.cn/OpenPointCloud/MOD-PCQA Zhang et al.

Abstract:
Monocular object 6D pose estimation is a fundamental yet challenging task in computer vision. Recently, deep learning has proven capable of achieving remarkable results in this task. Existing works often adopt a two-stage pipeline that establishes 2D-3D correspondences and then uses a PnP/RANSAC or differentiable PnP algorithm to recover the 6 degrees-of-freedom (6DoF) pose parameters. However, most of them hardly consider the geometric features in 3D space, and ignore the topological cues when performing differentiable PnP algorithms. To this end, we present an improved end-to-end monocular 6D pose estimation method (DGECN++) that incorporates depth estimation and a geometric-aware learnable PnP network. Our method is based on keypoints. First, we detect the 2D keypoints that correspond to the 3D model. We then integrate a differentiable PnP/RANSAC algorithm to create an end-to-end pipeline for 6D pose estimation. We focus on the following three key aspects: 1) We utilize the estimated depth information to guide the process of extracting 2D-3D correspondences and refine the results using a cascaded differentiable PnP/RANSAC algorithm that incorporates geometric information. 2) We leverage the uncertainty of the estimated depth map to enhance the accuracy and robustness of the predicted 6D pose. 3) We propose a differentiable Perspective-n-Point (PnP) algorithm based on edge convolution and self-attention to explore the topological relationships between 2D-3D correspondences. Experimental results demonstrate that our proposed network surpasses existing methods in terms of both effectiveness and efficiency.

Abstract:
3D panoptic perception is essential for understanding real-world environments and plays an increasingly important role in the field of robotics. However, most existing methods heavily rely on image panoptic segmentation networks to acquire panoptic information of the environment, which is time-consuming and susceptible to interference. In this paper, we propose a novel and efficient panoptic mapping method based on multi-source information. Specifically, to improve the real-time performance of the system, we first apply lightweight object detection and semantic segmentation to extract 2D semantic and instance information from images. Second, a panoptic inference algorithm is designed that fully utilizes multi-source information, including geometry-based and learning-based information, to simultaneously reason about background and foreground objects in the environment. Finally, we take advantage of the scalability of the framework by introducing a multi-object tracking algorithm into it, thus providing the temporal information among consecutive frames to the data association module. Based on two popular datasets, extensive comparison experiments are conducted to illustrate the effectiveness of the proposed method. Experimental results show that compared with state-of-the-art panoptic mapping methods, the proposed method achieves superior accuracy, real-time performance, and stability. Furthermore, we also evaluate our method in real-world scenarios and on a CPU-only device to demonstrate the feasibility of its practical deployment.

Abstract:
Blind Super-Resolution (BlindSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) images without prior knowledge of the image degradation process. This is a challenging problem in real-world applications, where the degradation can be complex and unknown. Recent unsupervised learning-based BlindSR methods can estimate the image degradation in an unsupervised manner, but they suffer from limited adaptability to different types and intensities of degradation. They tend to capture the average level of degradation across all training samples, resulting in over-smoothing or over-sharpening effects for some images. As a result, the final reconstruction may exhibit the mean effect. Moreover, existing synthetic datasets do not reflect the real-world degradation scenarios, making it difficult to evaluate the performance of BlindSR methods. To address these issues, we propose a novel Degradation Intensity Estimation Module (DIEM) method, which can estimate the pixel-level degradation information of the input image more specifically and use it to guide image reconstruction. Furthermore, we construct a benchmark dataset under real scenarios, which is closer to the real-world BlindSR problem than existing synthetic datasets, and can provide a more reasonable evaluation of BlindSR methods. Extensive experimental results demonstrate that our DIEM-guided BlindSR method can achieve state-of-the-art image reconstruction results. Our code and pre-trained models have been uploaded to GitHub for validation.

Abstract:
Active Task Cognition (ATC) requires the robot to comprehend the current scene using the image within its field of view, enabling it to reason about appropriate and executable tasks and thus achieve a service task scene discovery capability similar to that of humans. This capability is paramount for robots to provide comfortable and intelligent service while performing their tasks. To enhance home service robots’ ATC capability, a multi-graph fusion mechanism based on Graph Attention Network (GAT) is proposed in this paper to model the semantic features related to the task. First, a multi-graph fusion encoder is proposed to maximally capture the integrated features of objects, tasks, and scenes from the images, thereby obtaining a semantic representation related to the home service task from the robot’s perspective. Next, to enhance the interpretability of the model, we propose a multi-task scene understanding decoder based on the attention mechanism to utilize the integrated features of multi-graph fusion efficiently. Lastly, we present a loss function for multi-task scene understanding in the proposed Encoder-Decoder network model for scene comprehension. Furthermore, a new dataset comprising various daily household tasks is constructed in the experiments. Extensive experimental results indicate that the proposed method significantly enhances the robot’s active cognitive abilities in service tasks, empowering it with advanced levels of intelligence.

Abstract:
Traditional affordance learning tasks, such as affordance recognition and affordance detection, aim to understand an object’s interactive functions in an image. However, these tasks cannot determine whether the object is currently interacting, which is crucial for many follow-up tasks, including robotic manipulation and task planning. To fill this gap, this paper proposes a novel object affordance state (OAS) recognition task, i.e., simultaneously recognizing an object’s affordances and the partner objects that are interacting with it. Accordingly, to facilitate the application of deep learning technology, an OAS recognition dataset, OAS10k, is constructed by collecting and labeling over 10k images. In the dataset, a sample is defined as an image together with its set of OAS labels, and each label is represented as a triplet $\langle \textit{subject}, \textit{subject's affordance}, \textit{interacted object} \rangle$. These triplet labels have rich relational semantic information, which can improve OAS recognition performance. We hence construct a directed OAS knowledge graph of affordance states, and extract an OAS matrix from it for modelling the semantic relationships of the triplets. Based on the matrix, we propose an OAS recognition network (OASNet), which utilizes GCN to capture the relational semantic embeddings, and uses a transformer to fuse them with the visual features from an image to recognize the affordance states of objects in the image. Experimental results on the OAS10k dataset and other triplet label recognition datasets demonstrate that the proposed OASNet achieves the best performance compared to the state-of-the-art methods. The dataset and codes will be released on https://github.com/mxmdpc/OAS.

Abstract:
In the clinically widely used rating scale (MDS-UPDRS), the pronation-supination movement task of hands is required for assessment of bradykinesia, which is a typical clinical symptom of Parkinson’s disease (PD). Due to inter-rater variability in the task rating process, objective automated rating models are critically needed. Still, the performance of such models would be limited if prior knowledge of the clinical rating principles is not adequately accounted for. Therefore, we propose a clinically guided method which fully exploits the MDS-UPDRS rating principles to achieve consistent and accurate automated rating. First, a multi-scale framework which employs two graph convolutional networks (GCNs) as two streams is developed to extract transient and persistent features related to these rating principles. In particular, abnormal transient features are detected through a specialized multiple-instance-learning GCN. Moreover, the multiple-instance-learning GCN is equipped with an accumulation-aware ordered multiple-instance pooling module, which estimates sample-level severity by accounting for both the intra-instance intrinsic severity order and the inter-instance accumulation effects. Besides, an instance context encoding module is designed to combine the phase information in the pronation-supination movement cycles with instances’ motion features. This facilitates the differentiation between PD-induced halts and natural periodic halts. Our method demonstrated excellent performance on both a large clinical dataset and an additional independent test dataset. Our proposed scheme only requires consumer-level cameras, and therefore exhibits high potential for large-scale applications in PD telemedicine.

Abstract:
Correlation filter (CF)-based approaches have been widely applied in online object tracking tasks for unmanned aerial vehicles (UAVs) due to their high computational efficiency and low memory consumption. One of the key steps is to perform correlation operations between the appearance model (AM) and the filter. However, owing to the difficulty of controlling the learning rate of the AM, most existing trackers are prone to model degradation. In this paper, we propose a novel complementary AM (CAM) consisting of a primary model (PM) and a secondary model (SM). Specifically, the learning rates of the PM and SM are approximately complementary, allowing the CAM to consider both past and current information. Moreover, in order to take full advantage of historical information, a CAM-based reversibility reasoning approach is proposed for CF training. It can robustly handle the variations in object appearance. We then create a deep tracker by fusing convolutional features, which further improves performance. We also embed the CAM into two advanced trackers to validate the scalability of the CAM. Comprehensive experiments on six challenging UAV tracking benchmarks indicate the superiority of our method over 36 other state-of-the-art CPU- and GPU-based trackers, with a speed of 45 FPS running on a cheap CPU.
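The idea of complementary learning rates can be sketched with the linear-interpolation update commonly used for correlation-filter appearance models. The sketch assumes the two rates sum exactly to one (the abstract only says "approximately complementary") and fuses the two models with a simple weighted average; the fusion rule and the class name `ComplementaryAppearanceModel` are assumptions.

```python
def update_model(model, observation, rate):
    """Linear-interpolation update: new = (1 - rate) * old + rate * observation."""
    return [(1.0 - rate) * m + rate * o for m, o in zip(model, observation)]


class ComplementaryAppearanceModel:
    """Primary model changes slowly (keeps the past); secondary changes fast (tracks the present)."""

    def __init__(self, init_features, rate=0.02):
        self.rate = rate
        self.primary = list(init_features)
        self.secondary = list(init_features)

    def update(self, features):
        # The secondary rate is the complement of the primary rate (assumed exact here).
        self.primary = update_model(self.primary, features, self.rate)
        self.secondary = update_model(self.secondary, features, 1.0 - self.rate)

    def fused(self, weight=0.5):
        """Hypothetical fusion: weighted average of the two models."""
        return [weight * p + (1.0 - weight) * s
                for p, s in zip(self.primary, self.secondary)]
```

A small primary rate protects the model from corruption by occlusion or drift, while the large secondary rate lets it follow rapid appearance changes, so the pair covers both failure modes of a single fixed learning rate.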

Abstract:
In recent years, vision transformers have been introduced into face recognition and analysis and have achieved performance breakthroughs. However, most previous methods generally train a single model or an ensemble of models to perform the desired task, which ignores the synergy among different tasks and fails to achieve improved prediction accuracy, increased data efficiency, and reduced training time. This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer. Our design, SwinFace, consists of a single shared backbone together with a subnet for each set of related tasks. To address the conflicts among multiple tasks and meet the different demands of tasks, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis subnet, which can adaptively select the features from optimal levels and channels to perform the desired tasks. Extensive experiments show that the proposed model has a better understanding of the face and achieves excellent performance for all tasks. In particular, it achieves 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results on facial expression recognition and age estimation, respectively.

Abstract:
Deep incremental hashing methods require a large number of original training samples to preserve old knowledge. However, the old training samples are not always available. This “data-free” setting poses great challenges for learning discriminative codes for new classes (plasticity) and maintaining the code invariance of old ones (stability). On the one hand, the presence of ambiguous data in new-emerging classes, which is highly similar to that in old classes, further aggravates catastrophic forgetting. On the other hand, although well-separated hash codes of new classes can be learned by forcing them towards fixed hash centers, it may significantly change the learned parameters of the old model, leading to severe forgetting on old classes. To alleviate the stability-plasticity dilemma in data-free situations, this paper presents a novel deep incremental hashing method called Data-Free Deep Incremental Hashing (DFIH) from the data to the optimization aspect. We start from the data aspect and propose a data disambiguation module to reveal and discard ambiguous data, especially pixels to alleviate the forgetting issues. Subsequently, we introduce a set of trainable hash proxies during the optimization process. These proxies are optimized adaptively as well as the hash codes, not only guiding the model to learn discriminative hash codes for new classes but also avoiding the dramatic modification of the model’s parameters, thus improving plasticity and maintaining stability. Extensive experiments on six widely-used image retrieval benchmarks and sixteen incremental learning situations show the superiority of DFIH. Ablation analysis further confirms the effectiveness of the components in DFIH. The code of this work is released at https://github.com/SuQinghang/DFIH.

Abstract:
Over recent years, deep learning has significantly boosted scene text detection performance, and current segmentation-based scene text detectors can achieve compact bounding boxes for irregular texts. However, it remains challenging for these existing methods to handle crowded or overlapping texts due to conglutination between adjacent text instances in segmentation results. To address these issues, we propose a more accurate scene text detector, Text Position-Aware Pixel Aggregation Network, termed TPPAN. Specifically, a Gaussian threshold representation is adaptively learned instead of a constant setting in the Adaptively Text Kernel Thresholding (ATKT) module to obtain more accurate text kernels. Then the Text Position-Aware Region Pixel Aggregation (TPAR-PA) module predicts the text regions in relative positions and generates more accurate text contours. Extensive experiments demonstrate that the resulting detector achieves state-of-the-art performance on multi-oriented and curved scene text benchmarks.

Abstract:
The quantization step is a crucial parameter in JPEG compression that can reveal the compression history of a JPEG image. Estimating the quantization steps for single compressed and recompressed images is attracting considerable interest in the field of image forensics and steganalysis. Several effective methods have been proposed, but the performance of these methods still needs to be improved on small-sized and low-quality images. To solve the above problems, feature enrichment is performed on images in the frequency domain, resulting in clustering discrete cosine transform (DCT) coefficients of the same frequency. Then, we construct a hierarchical connection within the residual blocks of the network to represent multi-scale features, enabling the network to learn deep features of the image. At the same time, we use multiple small-sized convolution kernels instead of one large-sized convolution kernel to minimize the impact of block artifacts. Based on the above two ideas, we construct a network model, Res2Net-C, to discover information about the quantization steps in the frequency domain. The integration of multi-channel information of color images is achieved by multi-channel convolution, and the quantization steps of the chrominance and luminance channels of the color images are estimated. The experimental results show that the accuracy of the proposed method for estimating the quantization steps is 29.97% better than that of the existing algorithm on a single compressed dataset and 4.87% better than that of the existing algorithm on a recompressed image dataset. In addition, the method performs well on mixed datasets that contain both single compressed and recompressed images.
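
To illustrate why the quantization step leaves a trace in the frequency domain, the following minimal sketch quantizes and dequantizes a handful of DCT coefficients of one frequency and then recovers the step as their greatest common divisor. This is an idealized toy case, not the Res2Net-C method: the function names and sample values are assumptions for illustration, and on real decompressed images pixel-domain rounding and truncation blur this structure, which is one reason learned estimators are used.

```python
from functools import reduce
from math import gcd


def quantize(coeffs, step):
    """Map real-valued DCT coefficients to integer quantization levels."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct coefficients; every value is an integer multiple of step."""
    return [level * step for level in levels]


def estimate_step(dequantized):
    """Idealized estimate: the GCD of the dequantized coefficients.

    This returns a multiple of the true step if all levels happen to share
    a common factor, so it is only a toy estimator.
    """
    return reduce(gcd, (abs(v) for v in dequantized), 0)


# Hypothetical coefficients of a single DCT frequency across several blocks.
coeffs = [37.2, -15.8, 8.1, 0.4, 24.0]
levels = quantize(coeffs, 8)
recon = dequantize(levels, 8)
estimated = estimate_step(recon)
```

In this noise-free setting the dequantized values cluster exactly on multiples of the step, which is the kind of regularity a frequency-domain estimator tries to detect.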

Abstract:
State-of-the-art Active Learning (AL) methods often encounter challenges associated with a hysteretic learning process and an expensive data sampling mechanism. The former implies that data selection in the (i+1)-th round is solely based on the learned model’s results in the i-th round. The latter involves using model inference to calculate data value (e.g., uncertainty estimation based on model inference), which can be cumbersome, particularly when working with large datasets or Deep Neural Networks (DNNs). To address these challenges, we propose FastAL, an efficient and dynamic deep AL framework. Our approach includes an efficient method for calculating data value from the frequency domain perspective, generating multiple candidates. Then, we introduce the Fast Evaluation Module, which directly calculates each candidate’s contribution to future model training and selects the best options. In addition, current AL methods, particularly those based on uncertainty, are susceptible to data bias, which implies that selected data may not represent the original unlabeled data adequately. To alleviate this issue, we propose the De-similar Module, which removes partially similar data. The above three modules are model-agnostic and thus can be seamlessly integrated into any Active Learning framework. We conducted rigorous experiments on various benchmark datasets to validate our approach’s effectiveness. Our results demonstrate that FastAL outperforms other state-of-the-art methods by a significant margin, including those based on uncertainty, diversity, and expected model change.

Abstract:
4D light field data record the scene from multiple views, thus implicitly providing a beneficial depth cue for salient object detection in challenging scenes. Existing light field salient object detection (LF SOD) methods usually use a large number of views to improve the detection accuracy. However, using so many views for LF SOD makes practical application difficult. Considering that adjacent views in a light field have very similar contents, in this work, we propose defining a more efficient pattern of input views, i.e., key sparse views, and design a network to effectively explore the depth cue from sparse views for LF SOD. Specifically, we first apply a low rank-based statistical analysis to the existing LF SOD datasets, which allows us to derive a fixed yet universal pattern for our key sparse views, including the number and positions of views. These views maintain a sufficient depth cue, but greatly lower the number of views to be captured and processed, facilitating practical applications. Then, we propose an effective solution with a key Complementary and Discriminative Interaction Module (CDIM) for LF SOD from key sparse views, named CDINet. The CDINet follows a two-stream structure to extract the depth cue from the light field stream (i.e., sparse views) and the appearance cue from the RGB stream (i.e., center view), generating features and initial saliency maps for each stream. The CDIM is tailored for inter-stream interaction of both these features and saliency maps, using the depth cue to complement the missing salient regions in the RGB stream and discriminate the background distraction, to enhance the final saliency map further. Extensive experiments on three LF multi-view datasets demonstrate that our CDINet not only outperforms the state-of-the-art 2D methods, but also achieves competitive performance as compared with the state-of-the-art 3D and 4D methods.
The code and results of our method are available at https://github.com/GilbertRC/LFSOD-CDINet.

Abstract:
Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning, as well as complexities in data collection and processing of 3D point cloud sources. Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field, which hinders its progress. In this paper, we provide a comprehensive review of 3D dense captioning, covering task definition, architecture classification, dataset analysis, evaluation metrics, and in-depth prosperity discussions. Based on a synthesis of previous literature, we refine a standard pipeline that serves as a common paradigm for existing methods. We also introduce a clear taxonomy of existing models, summarize technologies involved in different modules, and conduct detailed experiment analysis. Instead of a chronological order introduction, we categorize the methods into different classes to facilitate exploration and analysis of the differences and connections among existing techniques. We also provide a reading guideline to assist readers with different backgrounds and purposes in reading efficiently. Furthermore, we propose a series of promising future directions for 3D dense captioning by identifying challenges and aligning them with the development of related tasks, offering valuable insights and inspiring future research in this field. Our aim is to provide a comprehensive understanding of 3D dense captioning, foster further investigations, and contribute to the development of novel applications in multimedia and related domains.

Abstract:
To make the video more attractive, original video materials usually need postprocessing by video editors, especially to eliminate low-quality abnormal clips, which seriously affect the visual effect. One of the main reasons for the low-quality abnormal clips is that there are occluders that accidentally break into the shot to occlude the protagonist, resulting in the loss of the video protagonist’s information. However, it is time-consuming and laborious to manually find shot occlusion clips, so computer vision technology can be used to assist editors in completing this work. The previous solutions directly utilize neural networks to detect shot occlusion, so their performance is affected by the size and quality of the dataset. In contrast, inspired by the change of depth information in the frame caused by the occluder breaking into the shot, we propose an algorithm for video shot occlusion detection based on the fluctuation of depth information. This algorithm does not need occlusion data training and can detect shot occlusion well only by capturing the abnormal fluctuations of the frame depth information. Additionally, to overcome the defect in that the first video shot occlusion detection (VSOD) dataset released in our conference publication can only verify the sensitivity of detection methods, we expand the VSOD dataset to evaluate the comprehensive performance of detection algorithms. The plentiful experimental results show that, compared with state-of-the-art occlusion detection methods and self-designed baseline methods, our algorithm significantly improves the comprehensive performance of video shot occlusion detection. Furthermore, through verification on datasets with different data types and distributions, our shot occlusion detection algorithm can maintain an occlusion event recall of over 95%, while the false positive rate does not exceed 3%, demonstrating good generalization ability. 
To promote reproducible research, the code and dataset are available at https://github.com/Junhua-Liao/VSOD.

Abstract:
Increasing artwork plagiarism incidents stresses the urgent need for proper copyright protection on behalf of the creators. The latest development in this context focuses on embedding watermarks via deep encoder-decoder networks. However, we find that deep watermarking has a serious vulnerability on its robustness when facing deliberate plagiarism. To manifest it, we construct an attack that misuses watermarking encoder as a plagiarism lookout for bypassing copyright detection. As a remedy, we propose a patch-level deep watermarking framework (DIPW) to retain copyright evidence in essential patches with plagiarism resistance, inspired by a user study observation that subject elements in artworks are the principal plagiarism entities. Technically, DIPW adaptively finds the embedding patches by identifying a subset of non-overlapping and feature-rich objects; and tailors the model with dual-distortion losses and adversarial plagiarism noise injection for robustness. Experimental results demonstrate the superiority of DIPW in facilitating better robustness, secrecy, and imperceptibility with acceptable time burden.

Abstract:
The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. Codes are available at: https://github.com/pcl3dv/OV-NeRF.

Abstract:
Obtaining clear sonar images is crucial for ocean exploration applications, such as marine resource detection and underwater target searches. Traditional filtering methods cannot effectively eliminate the noise generated by the complex underwater environment in sonar images and can potentially result in problems such as image blurring. Existing methods that effectively filter sonar image noise often lack real-time performance, making them impractical for ocean exploration. To address these limitations, this study proposes a real-time denoising technique for forward-looking multi-beam sonar images based on a non-local means filtering algorithm. The integral image is used to calculate the mean square error (MSE), which improves algorithm efficiency and ensures that the runtime remains unaffected by the neighbourhood window size. To further improve real-time performance, the algorithm is migrated to a graphics processing unit (GPU) and a block-wise computation method is proposed to calculate the integral image. Simultaneously, to enhance GPU thread utilisation, the three-dimensional thread structure from the compute unified device architecture (CUDA) programming model is utilised and additional threads are allocated to enhance computation. The captured images are filtered using an M1200d sonar device manufactured by Oculus. Extensive experiments demonstrate that the proposed method achieves excellent performance regarding both denoising accuracy and efficiency. Specifically, the proposed method achieves a peak signal-to-noise ratio higher than 25 dB and a structural similarity index of more than 0.85 at 50 frames per second, thus demonstrating its significant potential for real-time sonar image denoising.
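
The integral-image trick mentioned above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the authors' GPU implementation: the function names, the replicate-border handling, and the truncated border windows are choices made here. For a fixed displacement `(dy, dx)`, one integral image of the squared differences yields the patch MSE at every pixel in constant time per pixel, regardless of the window radius.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) summed-area table with a zero border."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in O(1)."""
    return sat[bottom][right] - sat[top][right] - sat[bottom][left] + sat[top][left]


def shifted_sq_diff(img, dy, dx):
    """Squared difference between img and img displaced by (dy, dx).

    Out-of-range coordinates are clamped to the border (an assumption).
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        yy = min(max(y + dy, 0), h - 1)
        row = []
        for x in range(w):
            xx = min(max(x + dx, 0), w - 1)
            d = img[y][x] - img[yy][xx]
            row.append(d * d)
        out.append(row)
    return out


def patch_mse_map(img, dy, dx, radius):
    """MSE between the patch at each pixel and the patch displaced by (dy, dx)."""
    h, w = len(img), len(img[0])
    sat = integral_image(shifted_sq_diff(img, dy, dx))
    mse = [[0.0] * w for _ in range(h)]
    for y in range(h):
        top, bottom = max(y - radius, 0), min(y + radius + 1, h)
        for x in range(w):
            left, right = max(x - radius, 0), min(x + radius + 1, w)
            area = (bottom - top) * (right - left)
            mse[y][x] = box_sum(sat, top, left, bottom, right) / area
    return mse


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sat = integral_image(img)
mse_shift = patch_mse_map(img, 0, 1, 1)
```

In a non-local means filter these per-displacement MSE maps would then be converted into similarity weights. Each displacement is independent of the others, which is what makes the computation a natural fit for parallel GPU threads.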

Abstract:
The Vision Transformer (ViT) models have demonstrated excellent performance in computer vision tasks, but a large amount of computation and memory access for massive matrix multiplications lead to degraded hardware performance compared to convolutional neural network (CNN). In this paper, we propose a ViT accelerator with a novel “Weight-Loop” dataflow and its computing unit, for efficient matrix multiplication computation. By data partitioning and rearrangement, the number of memory accesses and the number of registers are greatly reduced, and the adder trees are eliminated. A computation pipeline with the proposed dataflow scheduling method is constructed to maintain a high utilization rate through zero bubble switching. Moreover, a novel accurate dual INT8 multiply-accumulate (DI8MAC) method for DSP optimization is introduced to eliminate the additional correction circuits by weight encoding. Verified in the Xilinx XCZU9EG FPGA, the proposed ViT accelerator achieves the lowest inference latencies of 3.91 ms and 13.98 ms for ViT-S and ViT-B, respectively. The throughput of the accelerator can reach up to 2330.2 GOPs with an energy efficiency of 109 GOPs/W, showing a significant improvement compared to the state-of-the-art works.

Abstract:
Invertible secret image sharing with authentication (ISISA) distributes comprehensible stego images generated from secret images and cover images to involved participants. The secret image and cover image can be correctly recovered after authentication. However, existing ISISA schemes suffer from issues such as a single kind of image, limited embedding capacity, poor visual quality and a lack of authentication capability. To address these issues, this paper provides a novel invertible secret image sharing scheme with authentication for embedding color palette images into true color images. In this scheme, the pixels of the palette secret image and the bits of the cover pixel are used as coefficients of the polynomial. Share is embedded into a true color cover image to generate an intermediate stego image. Authentication information is then derived from the intermediate stego image and hidden in the cover image. The final stego images that resemble the cover image are obtained and sent to authorized participants. At the receiver end, once k stego images are verified, the secret image and cover image can be losslessly recovered for a (k, n)-threshold scheme. The experimental results and theoretical analysis demonstrated the superiority and practicality of the scheme.
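
The (k, n)-threshold recovery described above rests on polynomial interpolation over a finite field. The sketch below is a generic illustration rather than the paper's scheme: the prime modulus 257, the function names, and the sample coefficient values are assumptions, and the authentication and embedding steps are omitted. The k values to be shared serve as the coefficients of a degree k-1 polynomial, each participant receives one evaluation, and any k shares recover all coefficients exactly.

```python
P = 257  # assumed prime modulus, just large enough to hold 8-bit values


def make_shares(coeffs, n):
    """Evaluate f(x) = sum(coeffs[i] * x**i) mod P at x = 1..n."""
    return [
        (x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
        for x in range(1, n + 1)
    ]


def recover_coeffs(shares):
    """Recover all k coefficients from k shares by Lagrange interpolation mod P."""
    k = len(shares)
    result = [0] * k
    for j, (xj, yj) in enumerate(shares):
        basis = [1]  # coefficients of the j-th basis polynomial, lowest degree first
        denom = 1
        for m, (xm, _) in enumerate(shares):
            if m == j:
                continue
            # Multiply the basis polynomial by (x - xm).
            nxt = [0] * (len(basis) + 1)
            for i, b in enumerate(basis):
                nxt[i] = (nxt[i] - xm * b) % P
                nxt[i + 1] = (nxt[i + 1] + b) % P
            basis = nxt
            denom = denom * (xj - xm) % P
        # Divide by the denominator via Fermat's little theorem.
        scale = yj * pow(denom, P - 2, P) % P
        for i, b in enumerate(basis):
            result[i] = (result[i] + scale * b) % P
    return result


# Hypothetical values standing in for secret pixels and cover bits.
coeffs = [200, 17, 3]
shares = make_shares(coeffs, 5)      # a (3, 5)-threshold example
recovered = recover_coeffs(shares[:3])
```

With modulus 257 a share value can equal 256, which does not fit in one byte. Practical schemes must handle this case, and the abstract does not say how the proposed scheme does so.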

Abstract:
Significant advancements in RGB-D semantic segmentation have been made owing to the increasing availability of robust depth information. Most researchers have combined depth with RGB data to capture complementary information in images. Although this approach improves segmentation performance, it requires excessive model parameters. To address this problem, we propose DGPINet-KD, a deep-guided and progressive integration network with knowledge distillation (KD) for RGB-D indoor scene analysis. First, we used branching attention and depth guidance to capture coordinated, precise location information and extract more complete spatial information from the depth map to complement the semantic information for the encoded features. Second, we trained the student network (DGPINet-S) with a well-trained teacher network (DGPINet-T) using a multilevel KD. Third, an integration unit was developed to explore the contextual dependencies of the decoding features and to enhance relational KD. Comprehensive experiments on two challenging indoor benchmark datasets, NYUDv2 and SUN RGB-D, demonstrated that DGPINet-KD achieved improved performance in indoor scene analysis tasks compared with existing methods. Notably, on the NYUDv2 dataset, DGPINet-KD (DGPINet-S with KD) achieves a pixel accuracy gain of 1.7% and a class accuracy gain of 2.3% compared with DGPINet-S. In addition, compared with DGPINet-T, the proposed DGPINet-KD (DGPINet-S with KD) utilizes significantly fewer parameters (29.3M) while maintaining accuracy. The source code is available at https://github.com/XUEXIKUAIL/DGPINet.

Abstract:
The Contrastive Language-Image Pre-training (CLIP) model achieves strong generalization by using a large number of text-image pairs for contrastive learning. However, when it is transferred to action recognition, the following two questions remain to be solved: 1) how to guide the model to focus more on human-body-related regions to better align actions and text, and 2) how to make the model strengthen itself in a targeted manner to deal with difficult-to-classify categories. To solve these problems, a Guided alignment and adaptive Boosting CLIP (GBC) is proposed, which employs visual prior knowledge and benefits from both feature and decision aggregation in a boosting manner. During early training, visual prior knowledge related to the human body is adopted, which enables the model to better align human actions with category text and to be robust to distribution shift. At the later stage of training, the CLIP encoder is frozen, and multiple downstream feature and decision aggregation modules are sequentially generated and trained. In this way, the model is able to boost performance from different perspectives in a boosting manner and at a linearly increasing cost. Moreover, a class-adaptive re-weighting strategy is proposed to make the model focus more on optimizing categories that are difficult to classify. The effectiveness of our model is validated on six action recognition datasets (Kinetics-600, Kinetics-400, Jester, HMDB-51, UCF-101, and Mini-Kinetics-200), including both fully supervised and zero-shot experiments. Our model achieves superior results compared to state-of-the-art methods on all datasets.

Abstract:
First-person view attention has been widely studied in the computer science domain since the 1990s, while third-person view attention in natural scenarios did not begin to attract intensive interest until 2015. This paper focuses on the problem of third-person view attention prediction in natural scenarios where a human freely performs daily activities without constraints. To handle the two insufficiencies of existing methods, namely 1) assuming that some extra information (beyond the input images) is given in advance, and 2) ignoring the importance of human-scene interaction, this paper proposes a model with weak information dependency, which helps to alleviate annotation costs. In addition, a transformer-based human-scene interaction mechanism is proposed to explore the global and long-dependency contexts between the human and scene. The proposed model first extracts human and scene features, then infers a human attention probability map by fusing the human and scene features via a transformer-based network, and finally predicts the human attention object based on the human attention probability map and object detection. The experiments on two public datasets validate the effectiveness of our model.

Abstract:
Thermal infrared image super-resolution technology successfully solves the problems of low resolution and blurred texture details in infrared images. However, the problem of background thermal noise and streak interference in thermal infrared images has not been effectively solved. Therefore, in this paper, we analyze and model the generation of background thermal noise and streak interference, and propose a real-world super-resolution algorithm based on generative adversarial network with multi-structure fusion. We first statistically analyze the imaging principle and dataset of the thermal imager to better model the phenomenon of background thermal noise and streak interference present in thermal infrared images. Meanwhile, in order to better recover the details, we use grayed-out visible images to guide the network training and propose a novel generator with multi-structural fusion. In the generator, we design a dynamic dense-attention module that dynamically assigns weights to the attention branch and the densely connected branch to take full advantage of both branches. Compared to other state-of-the-art methods, our proposed method exhibits excellent visual effects, effectively eliminating the effects of noise and streaks while enhancing image texture information.

Abstract:
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.

Affiliations: Qiushi Academy for Advanced Studies and the College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Software Technology, the College of Computer Science and Technology, and the Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, China; School of Brain Science and Brain Medicine, Zhejiang University, Hangzhou, China; CCAI by MOE and Zhejiang Provincial Government, Hangzhou, China; Department of Neurobiology, and the Department of Neurology of the Fourth Affiliated Hospital, Affiliated Mental Health Center, Zhejiang University School of Medicine, Hangzhou, China

Abstract:
Accurately parsing (i.e., segmenting and recognizing) muscles of freely-moving animals such as Drosophila larva in light-sheet fluorescence microscopy images is necessary to study the relationship between muscle activity and animal motions. However, this task is challenging due to the large inter-class similarity and intra-class variance of muscles, as well as the in-homogeneous intensity and blurred boundaries of neighboring muscles. Existing semantic and instance segmentation methods cannot effectively overcome these challenges, resulting in poor segmentation and unreliable classification. In this work, we propose a novel framework named MuscleParseNet that explicitly utilizes sequential and spatial contexts to address these challenges. MuscleParseNet contains a deformable muscle candidate detector (D-CMD) to detect candidate muscles, and a sequential and spatial context-based fine muscle parser (SS-FMP) to refine the candidates. D-CMD boosts Mask RCNN with deformable convolutions to capture shape variations for more accurate muscle segmentation. Moreover, SS-FMP re-classifies the detected candidates by establishing a global spatial context to explicitly reflect spatial relative location, then optimizes the classification using the sequential associations of candidates in adjacent frames, which significantly improves muscle recognition accuracy. Experiments on the synchronized muscle-motion dataset of nearly freely-moving larvae show that MuscleParseNet produces promising results, outperforming state-of-the-art semantic and instance segmentation methods.

Abstract:
For autonomous ground vehicles, global localization with 3D LiDAR is an indispensable part of tasks such as navigation. Usually, global localization using LiDAR is subdivided into two sub-problems, place recognition and global registration. For place recognition, the recent emerging schemes based on deep learning either rely on 3D convolution with high complexity or need to learn features from various forward perspectives. To mitigate this, we propose a model with roll-pitch-yaw invariance that represents point clouds as probabilistic voxels and generates occupancy grids from a bird’s-eye view, fulfilling robust place recognition by learning aggregated embeddings from a fixed perspective. For low-overlap global registration, the traditional handcraft feature-based methods are mostly limited to dense object-level point clouds, while the state-of-the-art learning-based approaches often rely on complex 3D convolution and additional feature association learning. To fill this gap to some extent, we propose to estimate the relative roll-pitch angles and vertical translation by fitting and aligning the ground plane of the point clouds and to determine the horizontal translations and yaw angle by matching their projected occupancy grids. Extensive experiments corroborate the superior recall and generalization ability of our place recognition model, as well as the advanced success rate and accuracy of our 3D registration approach. Especially in the recognition and registration of hard samples, our results far exceed those of our counterparts by large margins. To ensure full reproducibility, the relevant codes and data are made available online at https://cslinzhang.github.io/GLoc/GLoc.html.

Abstract:
In practice, radar measurements are hindered by unavoidable noise, which lowers the signal-to-noise ratio (SNR) and raises the problem of radar signal denoising. Thanks to the development of deep learning techniques, recently proposed denoisers are progressively capable of blind denoising. On the other hand, due to the great fitting capacity of deep neural networks, a deep-learning-based denoising model tends to overfit the training set, hence diminishing the generalization of the denoiser and impeding its use in broader situations. This article focuses on this “blind universal denoising” problem for the first time and introduces a novel generative-adversarial-network-based (GAN-based) denoiser for radar spectrograms. The core idea of the proposed model lies in minimizing the generalization error during the model’s training, and to this end, our model incorporates a proposed identical dual learning (IDL) scheme and a reciprocal adversarial training (RAT) strategy to avoid the overfitting risk in the denoiser’s training. We perform the radar simulation using a motion capture database, and verify our model’s effectiveness under three different setups of training and testing datasets. For each setup, the noise level in the training and testing sets is configured to be different so as to simulate unknown measurement situations. Eleven algorithms are selected for comparison, and the experimental results on two criteria illustrate that our method outperforms the others with a significant improvement.

Abstract:
The visual quality of an image mainly relies on its content and its distortions. However, the adaptability between their contributions to the image quality has not been well investigated yet. Besides, despite many promising efforts, the lack of sufficient labeled data still hinders the robust representation of quality-related information. In this work, we first design a self-supervised architecture, named collaborative autoencoder (COAE), to separately represent the content and the distortion information, and then develop a Self-Adaptive Weighting based quAlity predictoR (SAWAR) to balance the individual representations of the content and the distortions in the prediction of image quality. Specifically, the COAE is trained with large-scale unlabeled data, consisting of a content autoencoder (CAE) and a distortion autoencoder (DAE) that work collaboratively and individually. While the CAE is a standard autoencoder for the content representation, the design of the DAE is unique. We introduce the CAE-encoded content representation as an extra input to the decoder of the DAE to learn to reconstruct distorted images, thus effectively forcing it to extract the distortion representation. The SAWAR, whose parameter number is much smaller than that of the COAE, is trained with labeled data in existing IQA datasets. It takes advantage of the interaction between the image content and the distortions to adaptively balance their contributions. Extensive experiments show that the COAE effectively extracts quality-related representations and the SAWAR achieves state-of-the-art performance.

Abstract:
Although object detection has achieved significant progress in the past decade, detecting small objects is still far from satisfactory due to the high variability of object scales and complex backgrounds. The common way to enhance small object detection is to use high-resolution (HR) images. However, this method incurs a huge computational cost that grows quadratically with the image resolution. To achieve both accuracy and efficiency, we propose a novel reinforcement learning framework that employs an efficient policy network consisting of a Spatial Transformation Network to enhance the state representation learning and a Transformer model with early convolution to improve feature extraction. Our method has two main steps: (1) coarse location query (CLQ), where an RL agent is trained to predict the locations of small objects on low-resolution (LR) images (down-sampled versions of the HR images); (2) context-sensitive object detection, where HR image patches are used to detect objects at the selected coarse locations and LR image patches are used on background areas (containing no small objects). In this way, we can obtain high detection performance on small objects while avoiding unnecessary computation on background areas. The proposed method has been tested and benchmarked on various datasets. On the Caltech Pedestrians Detection and Web Pedestrians datasets, the proposed method improves the detection accuracy by 2%, while reducing the number of processed pixels. On the Vision meets Drone object detection dataset and the Oil and Gas Storage Tank dataset, the proposed method outperforms the state-of-the-art (SotA) methods. On the MS COCO mini-val set, our method outperforms SotA methods on small object detection, while also achieving comparable performance on medium and large objects.

Abstract:
The latest video coding standard, versatile video coding (VVC), was developed to achieve higher video compression efficiency and support more media applications than its predecessor, high-efficiency video coding (HEVC). To address nontranslational motion, such as rotation and zooming, affine motion compensation has been employed in VVC during interframe prediction. However, the complexity increases significantly due to a large number of linear equation solving steps during affine motion estimation (AME). To address the problem, this paper proposes a fast linear equation solving algorithm and an accompanying pipelined hardware architecture design. To the best of our knowledge, our work is the first attempt to address the hardware architecture design of the linear equation solving algorithm in affine mode. First, an integer-based division-free algorithm (I-DFA) is proposed to achieve fast equation solving. Then, a novel dynamic scaling algorithm is proposed to compensate for integer computation errors due to overflow problems. Finally, a pipelined and interleaved hardware architecture is proposed to minimize the number of iteration clock cycles and improve the throughput. The proposed algorithm achieves average time savings of 5.3% and 5.7% with only 0.03% and 0.07% increase in the Bjøntegaard delta bit rate (BD-BR) under low-delay P (LDP) and random access (RA) configurations, respectively. The proposed hardware architecture can solve 16.7M six-parameter affine systems of linear equations per second under a working frequency of 100MHz, which represents a 21x improvement compared to the existing methods.
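
To make the idea of integer, division-free equation solving concrete, the following is a minimal sketch and not the paper's I-DFA: it uses plain cross-multiplication Gauss-Jordan elimination on integers, deferring every division to the final step. The function name `solve_division_free` is hypothetical. Because the intermediate integers grow quickly, a fixed-width hardware version of such a scheme would run into the kind of overflow problem that the abstract's dynamic scaling step addresses; Python integers are unbounded, so no scaling is shown here.

```python
from fractions import Fraction


def solve_division_free(a, b):
    """Solve a @ x = b for an integer matrix a and integer vector b.

    Elimination uses only multiplications and subtractions; the single
    division per unknown happens at the very end (illustrative sketch).
    """
    n = len(a)
    # Augmented matrix [a | b] with integer entries.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick any row at or below the diagonal with a nonzero entry.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col and m[r][col] != 0:
                p, q = m[col][col], m[r][col]
                # Cross-multiply so that column `col` of row r becomes zero.
                m[r] = [p * x - q * y for x, y in zip(m[r], m[col])]
    # The matrix is now diagonal: m[i][i] * x_i = m[i][n].
    return [Fraction(m[i][n], m[i][i]) for i in range(n)]


solution = solve_division_free([[2, 1], [1, 3]], [3, 5])
```

The same routine applies unchanged to a 6x6 system such as the six-parameter affine case mentioned above.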

Abstract:
The concept of videowise just noticeable difference (JND) was recently proposed for determining the lowest bitrate at which a source video can be compressed without perceptible quality loss with a given probability. This bitrate is usually obtained from estimates of the satisfied user ratio (SUR) at different encoding quality parameters. The SUR is the probability that the distortion corresponding to the quality parameter is not noticeable. Commonly, the SUR is computed experimentally by estimating the subjective JND threshold of each subject using a binary search, fitting a distribution model to the collected data, and creating the complementary cumulative distribution function of the distribution. The subjective tests consist of paired comparisons between the source video and compressed versions. However, as shown in this paper, this approach typically overestimates or underestimates the SUR. To address this shortcoming, we directly estimate the SUR function by considering the entire population as a collective observer. In our method, the subject for each paired comparison is randomly chosen, and a state-of-the-art Bayesian adaptive psychometric method (QUEST+) is used to select the compressed video in the paired comparison. Our simulations show that this collective method yields more accurate SUR results using fewer comparisons than traditional methods. We also perform a subjective experiment to assess the JND and SUR for compressed video. In the paired comparisons, we apply a flicker test that compares a video interleaving the source video and its compressed version with the source video. Analysis of the subjective data reveals that the flicker test provides, on average, greater sensitivity and precision in the assessment of the JND threshold than does the usual test, which compares compressed versions with the source video.
Using crowdsourcing and the proposed approach, we build a JND dataset for 45 source video sequences that are encoded with both advanced video coding (AVC) and versatile video coding (VVC) at all available quantization parameters. Our dataset and the source code have been made publicly available at https://database.mmsp-kn.de/flickervidset-database.html.
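
The conventional SUR computation described above (fit a distribution to per-subject JND thresholds, then take its complementary CDF) can be sketched as follows. This illustrates only the traditional pipeline, not the collective-observer method with QUEST+ proposed in the paper. The normal model, the sample thresholds, the 0.75 target, and the names `sur_curve` and `highest_qp_meeting` are all assumptions made for illustration; the sketch also assumes that a larger quantization parameter (QP) means stronger distortion.

```python
from statistics import NormalDist


def sur_curve(jnd_thresholds, qps):
    """Estimate SUR(qp) = P(JND threshold > qp) under a fitted normal model."""
    dist = NormalDist.from_samples(jnd_thresholds)
    # Complementary CDF: the share of subjects who do not yet notice distortion.
    return {qp: 1.0 - dist.cdf(qp) for qp in qps}


def highest_qp_meeting(sur, target=0.75):
    """Return the largest QP whose SUR is at least `target`, or None."""
    satisfied = [qp for qp, value in sur.items() if value >= target]
    return max(satisfied) if satisfied else None


# Hypothetical per-subject JND thresholds, expressed as QP values.
thresholds = [30, 32, 34, 36, 38]
sur = sur_curve(thresholds, range(26, 43))
jnd_qp = highest_qp_meeting(sur, target=0.75)
```

Here `jnd_qp` is the strongest compression level at which at least 75% of the modeled population would still see no difference.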

Abstract:
Lithography is a critical step in the manufacturing of integrated circuits, where the precise control of focus and exposure dose parameters is vital for optimal results. The conventional methodologies for defining lithography process windows often face difficulties with managing measurement errors, detecting printed defects, and exploiting visual features from Scanning Electron Microscope (SEM) images. This paper proposes LithoPW, a novel framework that utilizes visual features of SEM images for the determination of process windows. This approach comprises a denoising module, a Transformer-based visual memory encoder, and a defect-aware process window optimization module. The denoising module incorporates a Transformer architecture to mitigate the impact of noise, thereby enhancing the efficiency of downstream tasks in leveraging information embedded within SEM images. The Transformer-based visual memory encoder treats each SEM image as a Query and maintains neighbouring SEM images in memory as Key and Value elements, thereby facilitating precise lithography quality classification for the query image. The defect-aware process window optimization module heightens the reliability of the results by adjusting the process window according to the defects identified within the SEM images. Experimental results confirm the efficacy of our framework, highlighting its promising application in lithography production for accurate process window determination.

Abstract:
The structural complexity, material diversity, and defect concealment in industrial detection scenes pose challenges of robustness, multi-information, and effectiveness to optical imaging systems. Partially blurred images due to the limited depth of field (DoF) of industrial imaging systems, shadow occlusions due to simple illumination conditions, and material and texture interference due to multiple compositions have become key issues affecting imaging quality in complex scenes. This paper proposes a systematic scheme fusing the DoF expansion approach, light source optimization, and polarization information (DLP-Fusion) to comprehensively improve imaging quality. Herein, a DoF fusion algorithm and a liquid zoom lens are used to increase the DoF from 2.5 mm to 40 mm. Moreover, a combination of ring light and freely rotatable strip light sources is introduced to improve the uniformity and robustness of the illumination, resulting in an average enhancement of 56.46% in the contrast of the target features. Furthermore, a polarization selection fusion network (PSFNet) is constructed to achieve flare suppression and complex material characterization, with the image naturalness improving by 32.05%. The experimental results with diverse scenes demonstrate that DLP-Fusion considerably improves the DoF range, image uniformity, and target feature contrast. DLP-Fusion exhibits remarkable robustness in various environments and was seamlessly deployed in real-world industrial settings with good performance. This paradigm may open a path toward intelligent imaging systems for sophisticated applications, including multimaterial detection and target recognition under harsh conditions.

Abstract:
Physiological studies have confirmed that there are differences in facial activities between depressed and healthy individuals. Therefore, while protecting the privacy of subjects, substantial efforts have been made to predict the depression severity of individuals by analyzing Facial Keypoints Representation Sequences (FKRS) and Action Units Representation Sequences (AURS). However, those works have struggled to examine the spatial distribution and temporal changes of Facial Keypoints (FKs) and Action Units (AUs) simultaneously, which limits their ability to extract the facial dynamics that characterize depressive cues. Besides, those works do not exploit the complementarity of the effective information extracted from FKRS and AURS, which reduces the prediction accuracy. To this end, we intend to use the recently proposed Multi-Layer Perceptrons with gating (gMLP) architecture to process FKRS and AURS for predicting depression levels. However, the channel projection in the gMLP disrupts the spatial distribution of FKs and AUs, so that the input and output sequences do not have the same spatiotemporal attributes. This discrepancy hinders the additivity of residual connections in a physical sense. Therefore, we construct a novel MLP architecture named DepressionMLP. In this model, we propose the Dual Gating (DG) and Mutual Guidance (MG) modules. The DG module embeds cross-location and cross-frame gating results into the input sequence to maintain the physical properties of the data, making up for the shortcomings of gMLP. The MG module takes the global information of FKRS (AURS) as a guidance mask to filter the AURS (FKRS), achieving interaction between FKRS and AURS. Experimental results on several benchmark datasets show the effectiveness of our method.

Abstract:
In the context of label-efficient learning on video data, the distillation method and the structural design of the teacher-student architecture have a significant impact on knowledge distillation. However, the relationship between these factors has been overlooked in previous research. To address this gap, we propose a new weakly supervised learning framework for knowledge distillation in video classification that is designed to improve the efficiency and accuracy of the student model. Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages. We also employ the progressive cascade training method to address the accuracy loss caused by the large capacity gap between the teacher and the student. Additionally, we propose a pseudo-label optimization strategy to improve the initial data label. To optimize the loss functions of different distillation substages during the training process, we introduce a new loss method based on feature distribution. We conduct extensive experiments on both real and simulated data sets, demonstrating that our proposed approach outperforms existing distillation methods in terms of knowledge distillation for video classification tasks. Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.

Abstract:
Rehearsal methods based on knowledge distillation (KD) have been widely used in continual learning (CL). However, given memory constraints, the few stored exemplars contain limited variations of previously learned tasks, impeding the effectiveness of KD in retaining long-term knowledge. The decision boundaries learned by the typical KD strategy overfit the limited exemplars, leading to “shrunk boundaries” of the old classes. To tackle this problem, we propose a novel KD strategy, called One-to-Many Information Matching method (O2MIM), which generates interpolated data by mixing samples between old and new classes, disentangles the supervision information from them, and assigns supervision information to them in favor of the old classes. By doing so, the supervision information from a single exemplar can be matched with multiple pieces of information from different interpolated images. Moreover, O2MIM utilizes one trainable parameter to create an adaptive KD loss, thereby facilitating a flexible matching process with the designated supervision information. Consequently, O2MIM exploits the exemplar coreset more effectively, expanding the shrunk decision boundaries towards the new classes. Next, to incorporate new classes into our classification model, we apply an effective classification training strategy to train a debiased classifier. Combining it with O2MIM, we propose the method of Expanding the Shrinking Decision Boundaries (ESDB), which simultaneously transfers knowledge from the old model via O2MIM and learns new classes by the classification training strategy. Extensive experiments demonstrate that ESDB achieves state-of-the-art performance on diverse CL benchmarks. We also confirm that O2MIM can be used with various label-mixing methods to improve overall performance in CL. The code is available at: https://github.com/CSTiger77/ESDB.

Abstract:
In this paper, a new four-dimensional chaotic system derived from the continuous Hopfield neural network (CHNN) model is designed, and the weight parameters are optimized to achieve superior dynamics. Furthermore, we verify the superior performance of the system through an analysis of its dissipation and other aspects. Meanwhile, to address the issues of low reconstruction quality and unsatisfactory security performance of the current compressed sensing (CS)-based image encryption algorithm, this paper introduces a compression encryption algorithm based on the chaotic system. Specifically, this algorithm designs a new fractal curve based on the Hilbert curve by incorporating a unique rotation and connection in the iterative process, which allows for effective displacement of the image. Additionally, a new measurement matrix with low spectral norm is constructed utilizing the QR decomposition based on the Householder transform to improve the compression performance. Finally, this paper introduces a bidirectional Z-shaped diffusion method based on chaotic sequences and optimized multiple logical operations (BZCM). By leveraging the optimized logic operation rules and logic key matrix proposed in this paper, this method enhances the diffusion effect. Experimental analyses demonstrate that the proposed algorithm achieves high security and reconstruction performance.
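
As background for the curve-based pixel rearrangement mentioned above, here is a minimal sketch of scanning a square image along the standard Hilbert curve. It does not reproduce the paper's new fractal curve, whose rotation and connection rules are not given in the abstract; the function names `hilbert_d2xy` and `hilbert_scan` are hypothetical, and the image side is assumed to be a power of two.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the current quadrant so sub-curves connect.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(image):
    """Flatten a square image (list of rows) in Hilbert-curve order."""
    n = len(image)
    order = n.bit_length() - 1
    if n == 0 or (1 << order) != n or any(len(row) != n for row in image):
        raise ValueError("image must be square with a power-of-two side")
    points = (hilbert_d2xy(order, d) for d in range(n * n))
    return [image[y][x] for x, y in points]


scanned = hilbert_scan([[1, 2], [3, 4]])
```

Consecutive positions on the curve are always neighbouring pixels, which is the locality property that curve-based scrambling schemes build on.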

Abstract:
Two-stage detectors, which consist of multi-scale feature representations and the prediction of region proposal boxes, have been recognized as an effective paradigm for tiny object detection in Unmanned Aerial Vehicle (UAV) images. Although most previous methods concentrated on developing efficient feature fusion strategies within the feature pyramid network (FPN), few studies addressed improving the performance of the region proposal network (RPN). Conventional RPNs exhibit two key weaknesses in the majority of existing two-stage object detection approaches. Firstly, the quality of proposal boxes generated by the RPN is heavily reliant on rich feature representations extracted from the FPN backbone. Secondly, the fixed number of generated proposal boxes limits adaptability to the distribution of tiny person objects. To mitigate these problems, in this paper we propose a novel adaptive region proposal network (ARPN) to improve the quality of the proposal boxes and generate particularly compact yet accurate proposal boxes. On the one hand, a progressive attention mechanism is devised to make the ARPN focus more on prospective object regions, where a series of multi-scale front attention modules (FAM) are applied to coarsely filter out most of the irrelevant background areas and a group of top-to-bottom back attention modules (BAM) help the ARPN finely pinpoint tiny objects of interest in a coarse-to-fine manner. On the other hand, a mini-density map, which is inspired by the philosophy of crowd counting, is elaborately designed to adaptively determine the number of region proposal boxes. This approach significantly reduces redundancy while maintaining high-quality proposal boxes. Extensive experiments verify the superiority of the proposed ARPN and show obvious improvement over other competitors in terms of two performance indicators, average precision (AP) and average recall (AR). The code will be available at https://github.com/kbzhang0505/ARPN.

Abstract:
The Object goal Navigation (ObjectNav) task requires an agent to navigate through a previously unknown domestic scenario using spatial and semantic contextual information, where the goal is specified by a semantic label (e.g., find a TV). Such a task is especially challenging as it requires formulating and understanding the complex co-occurrence relations among objects in diverse settings, which is critical for long-sequence navigational decision-making. Existing methods learn to either explicitly represent co-occurrence relationships as discrete semantic priors or implicitly encode them from raw observations, and thus cannot benefit from the rich environmental semantics. In this work, we propose a novel Deep Reinforcement Learning (DRL) based ObjectNav strategy by actively imagining spatial and semantic clues outside the agent’s Field of View (FoV) and further mining Continuous Environmental Representations (CER) using self-supervised learning. Additionally, the illusion of spatial and semantic patterns allows the agent to perform Multi-Step Forward-Looking Planning (MSFLP) by considering the temporal evolution of egocentric local observations. Our approach is thoroughly evaluated and ablated in the visually realistic environments of the Matterport3D (MP3D) dataset. The experimental results show that our method, combining CER and imagination-based MSFLP, facilitates learning complicated semantic priors and navigation skills, thus achieving state-of-the-art performance on the ObjectNav task. In addition, quantitative and qualitative analyses validate the generalization ability and superiority of our method.

Abstract:
Large-scale flapping wing robots (FWRs) with airborne vision have important applications in visual navigation, aerial surveying, fire warning and power-line inspection. However, airborne vision and its videos suffer from strong jitters due to periodic wing flapping, which lowers the success rate of detection and measurement precision. In this paper, a robust digital video stabilization (DVS) method based on periodic jitters is proposed to provide continuous stable monitoring video without pan-tilt camera assistance. First, the periodic motion model of the FWR is established for video jitter analysis. Second, jitter frequencies in different flight states are estimated by continuous jitter acceleration. Then, feature trajectories generated from the video are adjusted adaptively for jitter frequency consistency and smoothed individually by the sampling-interpolation-averaging strategy, including the short trajectories. The stabilized video is generated by guidance from the original and smoothed trajectories. Finally, the proposed method is tested in outdoor flights with a 2.2-meter wingspan FWR and is found to outperform traditional, commercial, and deep learning DVS methods in terms of stability and robustness in various scenes and flight states.

Abstract:
Weakly-supervised fine-grained visual categorization (FGVC) aims to achieve subclass classification within the same large class using only label information. Compared to general images, fine-grained images have similar appearances and features, and are often affected by disturbances such as viewpoint, lighting, and occlusion during data collection, resulting in significant intra-class variance and small inter-class variance. To achieve FGVC, carefully designed models are often needed to explore the locally discriminative regions of the image. This paper revisits high-quality FGVC publications based on deep learning and analyzes them from two new perspectives: fine-grained image data and backbone. We address two ignored but interesting problems in FGVC. First, we argue that the reasons for exacerbated intra-class variance are not the same in data of animal, plant, and commodity types, and that it is necessary to consider the effects of posture, covariate shift, and structural changes. Additionally, the “soft boundary” between subclasses intensifies the difficulty of classification. Second, we highlight that convolutional networks and self-attention networks have different receptive fields and shape biases, leading to performance differences when processing different types of fine-grained data. Overall, our analysis provides new insights into recent advances, challenges, and future directions for FGVC based on deep learning, which can help researchers develop more effective models for FGVC.

Abstract:
Cross-modal retrieval with noisy labels has attracted much attention. The state-of-the-art method trains a network to increase the weights of clean labels in the loss. However, we have found that the network eventually overfits to the remaining noisy labels as training progresses. Motivated by this finding, this paper proposes a method called Label Correction using Network prediction based on Memorization Effects (LCNME) to correct noisy labels. This is unlike the state-of-the-art method, which leaves noisy labels in the training data. We assume that noisy labels are irrelevant to data features and realize label correction using predicted labels (obtained by network prediction) instead of given labels. However, because of memorization effects (the property whereby the network first learns clean labeled data and then learns noisy labeled data), predicted labels are contaminated by noisy labels from a certain epoch onward, called the change epoch. Although the change epoch is unknown in advance, we find that it can be identified by observing the loss on the noisy validation set. Using the change epoch, predicted labels can be generated without being affected by noisy labels. Extensive experiments show that LCNME accurately corrects noisy labels and achieves better cross-modal retrieval than existing methods.
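
The abstract does not state the exact criterion used to read the change epoch off the noisy-validation loss. One plausible reading, offered purely as an assumption, is that the loss falls while the network learns clean patterns and rises once it starts memorizing noisy labels, so the epoch of minimum loss marks the change. The sketch below encodes only that assumption; `find_change_epoch` and the loss values are hypothetical.

```python
def find_change_epoch(val_losses):
    """Return the 0-based epoch at which the noisy-validation loss is lowest.

    Assumption (not confirmed by the abstract): the loss minimum marks the
    point where the network switches from learning clean data to memorizing
    noisy labels.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Hypothetical per-epoch losses on a noisy validation set.
losses = [0.9, 0.6, 0.4, 0.5, 0.7]
change_epoch = find_change_epoch(losses)
```

Under this assumption, predictions taken from the checkpoint at `change_epoch` would serve as the corrected labels.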

Abstract:
Compressed sensing (CS) has become a widely employed technique in the field of image encryption. Although CS-based schemes achieve a high level of data encryption security, the resulting decrypted images often lack satisfactory quality. In this paper, we develop an adaptive enhanced approximate message passing (AMP) block CS (AE-AMP-BCS) algorithm for image BCS and apply it to image encryption. First, we present an adaptive energy-based anti-aliasing filtering strategy (AAFS) for preprocessing the input image, mitigating the noise effect during reconstruction. The filtered image is then transformed into the Haar domain for enhanced security, followed by a twofold perturbation operation and a piece-wise linear chaotic system-based measurement generation approach for ratio allocation and security consideration. Finally, the cipher measurements are decrypted using an existing AMP-based algorithm. It is noteworthy that the keys for the perturbation operations and the chaotic system are derived using the SHA-512 hash. Comprehensive experimental results demonstrate that the proposed AE-AMP-BCS significantly outperforms state-of-the-art BCS methods in terms of image reconstruction quality. Its application in image encryption showcases competitive encryption capabilities, strong robustness, and outstanding image decryption quality compared to other CS-based image encryption algorithms.
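
The combination of a SHA-512-derived key and a piece-wise linear chaotic map (PLCM) can be sketched as follows. The PLCM formula is the standard one; how the paper maps hash bytes to the initial state and control parameter is not stated in the abstract, so the byte slicing in `derive_key`, the burn-in length, and all function names here are assumptions for illustration only.

```python
import hashlib


def plcm_step(x, p):
    """One iteration of the standard PLCM with control parameter 0 < p < 0.5."""
    if x >= 0.5:
        x = 1.0 - x  # the map is symmetric about x = 0.5
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)


def derive_key(data):
    """Derive (x0, p) from SHA-512 of the input bytes (hypothetical mapping)."""
    digest = hashlib.sha512(data).digest()
    scale = 2**48 + 1
    x0 = (int.from_bytes(digest[:6], "big") + 1) / scale          # in (0, 1)
    p = 0.5 * (int.from_bytes(digest[6:12], "big") + 1) / scale   # in (0, 0.5)
    return x0, p


def chaotic_sequence(data, length, burn_in=100):
    """Generate `length` PLCM values after discarding `burn_in` iterations."""
    x, p = derive_key(data)
    for _ in range(burn_in):
        x = plcm_step(x, p)
    seq = []
    for _ in range(length):
        x = plcm_step(x, p)
        seq.append(x)
    return seq


def chaotic_permutation(data, n):
    """Index permutation obtained by sorting a chaotic sequence."""
    seq = chaotic_sequence(data, n)
    return sorted(range(n), key=seq.__getitem__)


perm = chaotic_permutation(b"example image bytes", 16)
```

Sorting the chaotic values to obtain `perm` is one common way to turn such a sequence into a key-dependent perturbation of positions.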

Abstract:
Weakly supervised whole slide image classification is usually formulated as a multiple instance learning (MIL) problem, where each slide is treated as a bag, and the patches cut out of it are treated as instances. Existing methods either train an instance classifier through pseudo-labeling or aggregate instance features into a bag feature through attention mechanisms and then train a bag classifier, where the attention scores can be used for instance-level classification. However, the pseudo instance labels constructed by the former usually contain a lot of noise, and the attention scores constructed by the latter are not accurate enough, both of which affect their performance. In this paper, we propose an instance-level MIL framework based on contrastive learning and prototype learning to effectively accomplish both instance classification and bag classification tasks. To this end, we propose, for the first time under the MIL setting, an instance-level weakly supervised contrastive learning algorithm to effectively learn instance feature representations. We also propose an accurate pseudo label generation method through prototype learning. We then develop a joint training strategy for weakly supervised contrastive learning, prototype learning, and instance classifier training. Extensive experiments and visualizations on four datasets demonstrate the strong performance of our method. Code is available at https://github.com/miccaiif/INS.

Abstract:
Face forgery detection has become a new research hotspot. Though existing detection works have achieved impressive performance, they struggle to achieve a proper trade-off between detection accuracy and model complexity. To solve this problem, we design several low-complexity modules and construct a lightweight dynamic fusion network (LDFnet) to achieve accurate and lightweight face forgery detection. Firstly, we regard significant local visual artifacts as a correct semantic feature needed for detection. A spatial group-wise enhance (SGE) module is introduced as supervision to suppress possible noise and capture local artifacts. Secondly, we design a manipulation trace extraction block (TraceBlock), which can replace vanilla convolution to achieve global inference, thus capturing texture information in the global scope. Based on TraceBlock, we construct a global texture representation (GTR) network to extract global manipulation features hierarchically. Finally, we design a dynamic fusion mechanism (DFM) to fully fuse local and global clues and dynamically generate a more discriminative feature representation. Extensive experimental results show that the proposed LDFnet is significantly superior to previous detection works on several popular face forgery datasets, such as FF++, DFDC, CelebDF and HFF. In particular, LDFnet uses only 963k model parameters and 801M FLOPs, far below the computational cost of face forgery detection based on large models, while achieving better detection results.

Abstract:
With the rapid development of streaming media technology, the Quality of Experience (QoE) of streaming videos becomes crucial to optimize the video compression and transmission algorithms, such as adaptive bitrate (ABR). However, the complexity of human perceptual mechanisms, particularly in relation to temporal distortions, poses substantial challenges to effective QoE monitoring. In recent years, many efforts in video quality assessment (VQA) and video QoE evaluation have highlighted the influence of a broad spectrum of features—from Quality of Service (QoS) metrics to video content understanding—on viewer experience. On this basis, we believe that there is also a dynamic relationship among these features varying with the broadcasting content. Furthermore, research indicates a significant correlation between real-time and retrospective assessments of QoE for individual videos. In response to these insights, we introduce a novel approach leveraging a unified learnable network that incorporates dual-stage attention, the temporal and cross-feature attention, to accurately predict both continuous and overall QoE for streaming videos. The results of experiments conducted on several publicly available databases demonstrate the superiority of our proposed method over the state-of-the-art metrics.

Abstract:
Video watermarking in the frequency domain has been shown to offer good invisibility and robustness. However, most existing video watermarking schemes embed watermarks in the frequency domain based on subblock segmentation, while ignoring how video spatial-domain and frequency-domain features vary in relation to each other. Therefore, it is difficult to achieve robust authentication in complex application scenarios. In this paper, a ring subband is constructed in the DT-CWT domain as the watermark embedding region by analyzing the relationship between video spatial-domain and frequency-domain characteristics under multiple attacks. Subsequently, a double watermark is embedded by modifying the DCT coefficients of the ring subband, with the copyright watermark alternately and repeatedly embedded within the ring subband and the synchronous watermark embedded in the outermost concentric circle. In addition, visual cryptography (VC) and piecewise linear chaotic mapping (PLCM) methods are used to encrypt the watermark before it is embedded in the concentric rings, and two shared images are generated, one for the watermark embedding stage and the other for the watermark extraction stage. Experimental results demonstrate that the proposed scheme can resist common attacks, such as noise, JPEG compression, rotation, scaling, and time synchronization attacks, and that its robustness surpasses existing discrete wavelet transform (DWT) and DT-CWT based video watermarking schemes under complex attack scenarios.

Abstract:
Volumetric medical images are extensively employed in medical diagnosis, treatment, and research, creating a significant demand for coding. Currently, JP3D and HEVC are the prevailing coding standards in practical applications. In recent years, volumetric medical image coding has been extensively studied, with research falling into categories such as video-based, learning-based, and learned wavelet-like transform-based methods. However, these methods are plagued with either inadequate performance or excessive complexity. As such, the pursuit of a more efficient method of coding volumetric medical images remains an urgent and critical issue. Recognizing these requirements, the Audio Video coding Standard Workgroup of China (AVS) initiated a volumetric medical image coding standard and issued a Call for Evidence (CFE) in 2022. In response to this CFE, this paper presents an end-to-end volumetric medical image coding framework, aiWave-Lite, which is an upgraded version of aiWave. More specifically, aiWave-Lite integrates a three-dimensional (3-D) context model with enhanced parallelism and an optimized post-processing module. Leveraging the fully reversible 3-D wavelet-like transform, aiWave-Lite supports high-bit-rate lossy and lossless coding simultaneously. Extensive experimental results reveal that aiWave-Lite exhibits outstanding performance in both lossy and lossless coding and satisfies the multiple technical requirements for volumetric medical image coding. Consequently, it is a highly competitive solution within the CFE.

Abstract:
Block-based video codecs such as Versatile Video Coding (VVC)/H.266, High Efficiency Video Coding (HEVC)/H.265, and Advanced Video Coding (AVC)/H.264 inherently introduce compression artifacts. Although these codecs have in-loop filters to correct these distortions, they are not always effective due to the complexity of the noise. Recently, deep-learning approaches have emerged as a promising solution for in-loop filtering. However, most of the previous approaches were designed solely for learning from images and neglected the high-frequency signals present in the reconstructed video frames. Furthermore, some previous methods employed a multi-level feature-extraction and feature-fusion strategy to enhance performance, but they utilized complex feature extractors while relying on naive feature-fusion methods. In this article, we propose a novel framework called TSF-Net, which jointly learns from both the pixel (spatial) and frequency-decomposed information and, through the powerful capability of a channel-wise transformer, fuses both kinds of information to improve performance. Our approach deviates from previous approaches by employing a simple feature extractor coupled with an advanced transformer-based feature-fusion module. In addition, TSF-Net introduces a few fundamental modifications in the multi-head self-attention module of the channel-wise transformer to make it computationally efficient. Our experimental results show that the proposed TSF-Net achieves a Bjøntegaard delta bitrate (BD-BR) saving of up to 10.258% for the luma (Y) component under the all-intra (AI) profile, outperforming the VVC baseline and other state-of-the-art methods. Moreover, the proposed TSF-Net with an efficient channel-wise transformer is twice as efficient as TSF-Net with a vanilla channel-wise transformer.