TMM2025

Abstract:
Solving the Multi-View Stereo (MVS) problem is a cornerstone in computer vision, with depth map estimation and fusion being one of the most critical approaches. The depth confidence map is pivotal in ensuring the precision and completeness of the reconstruction outcomes. These algorithms frequently encounter a trade-off between completeness and accuracy in the confidence map, which can significantly impair the final reconstruction results. This paper analyzes the causes and phenomena of these issues, namely Confidence Jitter, Confidence Gap, and Confidence Disappearance. From these insights, a multi-view stereo network named CF-MVSNet is introduced, comprising three essential components. Firstly, the method mitigates the Confidence Jitter problem through two confidence fusion strategies. Secondly, it narrows the depth sampling space to near sub-pixel levels, addressing the Confidence Gap through neighborhood-average pooling. Lastly, the algorithm tackles the Confidence Disappearance problem resulting from multi-scale classification and regression with a loss function named CL. Our proposed method demonstrates superior performance across two critical metrics: the completeness of the depth map and the accuracy of the reconstructed point cloud, outperforming current state-of-the-art MVS methods.

Abstract:
In this article, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. In addition, we recognize that the ingredients in food can be regarded as independent individuals, which motivated us to perform instance segmentation on food images. Furthermore, FoodSAM extends its zero-shot capability to encompass panoptic segmentation by incorporating an object detector, which enables FoodSAM to effectively capture non-food object information. Drawing inspiration from the recent success of promptable segmentation, we also extend FoodSAM to promptable segmentation, supporting various prompt variants. Consequently, FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. Remarkably, this framework is the first work to achieve instance, panoptic, and promptable segmentation on food images. Extensive experiments demonstrate the feasibility and impressive performance of FoodSAM, validating SAM's potential as a prominent and influential tool within the domain of food image segmentation.
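One plausible way to combine a coarse semantic mask with class-agnostic masks is a per-mask majority vote. The sketch below is illustrative only: the abstract does not specify the merging rule, so the voting scheme, the function name `label_masks_by_majority`, and the toy data are assumptions rather than the authors' implementation.

```python
from collections import Counter


def label_masks_by_majority(semantic_map, instance_masks):
    """Give each class-agnostic mask the most frequent label it covers.

    semantic_map: 2D list of integer category ids (the coarse semantic mask).
    instance_masks: list of 2D lists of 0/1 with the same shape as semantic_map.
    Returns one category id per mask, or None for a mask with no pixels.
    """
    labels = []
    for mask in instance_masks:
        # Count the coarse labels of every pixel that lies inside this mask.
        votes = Counter(
            semantic_map[r][c]
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        )
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Hypothetical toy input: a 2x3 coarse semantic map and two binary masks.
semantic_map = [
    [1, 1, 2],
    [1, 2, 2],
]
masks = [
    [[1, 1, 0], [1, 0, 0]],  # covers three pixels labelled 1
    [[0, 0, 1], [0, 1, 1]],  # covers three pixels labelled 2
]
print(label_masks_by_majority(semantic_map, masks))
```

Voting per mask keeps the sharp boundaries of the class-agnostic masks while borrowing category information from the coarse prediction, which is the kind of complementarity the abstract describes.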

Abstract:
This paper introduces a novel local fine-grained visual tracking task, aiming to precisely locate arbitrary local parts of objects. This task is motivated by our observation that in many realistic scenarios, the user demands to track a local part instead of a holistic object. However, the absence of an evaluation dataset and the distinctive characteristics of local fine-grained targets present extra challenges in conducting this research. To tackle these issues, first, this paper constructs a local fine-grained tracking (LFT) dataset to evaluate the tracking performance for local fine-grained targets. Second, this paper designs a cutting-edge solution to handle the challenges posed by properties of local objects, including ambiguity and high-proportion backgrounds. It consists of a hierarchical adaptive mask mechanism and foreground-background differentiated learning. The former adaptively searches for and masks ambiguity, which drives the network to concentrate on the local target instead of the holistic objects. The latter is constructed to distinguish foreground and background in an unsupervised manner, which is beneficial to mitigate the impacts of high-proportion backgrounds. Extensive analytic experiments are performed to verify the effectiveness of each submodule in the proposed fine-grained tracker.

Abstract:
Recently, multi-view subspace clustering has attracted extensive attention due to the rapid increase of multi-view data in many real-world applications. The main goal of this task is to learn a common representation of multiple subspaces from the given multi-view data, and most existing methods usually directly merge multiple groups of features by the single-step integration. However, there may exist large disparities among different views of the data, and thus the conventional single-step practice can hardly obtain a generally consistent feature representation for the multi-view data. To overcome this challenge, we present a novel approach dubbed “Asymptotics-Aware Multi-view Subspace Clustering (A^2MSC)” to pursue a consistent feature representation in a multi-step way, which iteratively conducts the data recovery to gradually reduce the differences between pairwise views. Specifically, we construct an asymptotic learning rule to update the feature representation, and the iteration result converges to a consistent feature vector for characterizing each instance of the original multi-view data. After that, we utilize such a new feature representation to learn a clustering-oriented similarity matrix via minimizing a self-expressive objective, and we also design the corresponding optimization algorithm to solve it with convergence guarantees. Theoretically, we prove that the learned asymptotic representation effectively integrates multiple views, thereby ensuring the effective handling of multi-view data. Empirically, extensive experimental results demonstrate the superiority of our proposed A^2MSC over the state-of-the-art multi-view subspace clustering approaches.

Abstract:
The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist in point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), enriching semantic concepts through large language model-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors by statistical-based and training-free significant feature selection. Furthermore, we also delve into an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks.

Abstract:
Multi-view hashing is a crucial technology for multimedia retrieval because it transforms heterogeneous data from many viewpoints into binary hash codes. However, existing approaches focus mostly on the complementarity among multiple views and lack confidence fusion. Furthermore, redundant noise is present in the single-view data in real-world application contexts. We present an innovative Adaptive Confidence Multi-View Learning (ACMVL) method to perform confidence fusion and remove extraneous noise. First, a confidence network is constructed to eliminate noisy data and extract useful information from various single-view features. Moreover, an adaptive confidence multi-view network is utilized to quantify the confidence of each view and further fuse multiple view features using a weighted summation. Here, we propose an Automatic View Confidence Metric (AVCM) as a score for evaluating the confidence of views. Finally, to improve the semantic representation of the fused feature, a dilation network is created. Based on ACMVL, we introduce a novel Adaptive Confidence Multi-View Hashing (ACMVH) method. To our knowledge, we are the first to use confidence learning for multimedia retrieval. Comprehensive experiments on three publicly available datasets demonstrate that our ACMVH outperforms the state-of-the-art methods (maximum improvement of 3.24% on mAP).
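The weighted summation of view features can be illustrated with a minimal sketch. The abstract does not define how AVCM scores are computed or normalised, so the scores below are hypothetical inputs and the softmax normalisation is an assumption chosen only so that the weights sum to one.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, confidence_scores):
    """Fuse per-view feature vectors by a confidence-weighted sum.

    view_features: list of equal-length feature vectors, one per view.
    confidence_scores: one hypothetical confidence score per view.
    Returns (fused_vector, weights).
    """
    weights = softmax(confidence_scores)
    fused = [0.0] * len(view_features[0])
    for weight, feature in zip(weights, view_features):
        for i, value in enumerate(feature):
            fused[i] += weight * value
    return fused, weights


# Hypothetical example: the second view is judged more reliable than the first.
fused, weights = fuse_views([[1.0, 0.0], [0.0, 1.0]], [0.2, 1.5])
print(weights, fused)
```

A view with a higher score contributes more to the fused feature, so a noisy view is down-weighted rather than discarded.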

Affiliations: School of Computer Science and Technology, Tongji University, Shanghai, China; Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
Unsupervised Cross-Modal Hashing (UCMH) models the intrinsic semantic correlations across different modalities to generate binary hash codes, facilitating efficient cross-modal retrieval. This technology offers notable advantages, such as independence from labeled data and superior generalization capabilities compared to supervised methods. However, most UCMH methods are designed for closed-set retrieval scenarios and have difficulty generalizing to open multi-modal data, which is common in real-world retrieval settings. This limitation hampers their performance in open retrieval tasks, particularly when these tasks involve novel categories. To address the above issue, we propose an Open-set Cross-Modal Hashing (OCMH) method, which enhances the generalization capability of trained UCMH models in an efficient plug-in manner for open cross-modal retrieval. Our method enables the model to learn from novel categories in open-set scenarios by increasing the pre-defined hash code length, while simultaneously preventing the catastrophic forgetting of trained knowledge from the closed-set domain using basic hash codes. Additionally, we introduce a historical-category detection module and an asymmetric optimization strategy to support the joint learning of basic and increased hash codes by replaying detected samples related to historical categories. By plugging our proposed method into several representative UCMH methods on three widely used datasets, experimental results show that the enhanced UCMH methods achieve superior retrieval performance in both open-set and closed-set scenarios.
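As a rough illustration of growing the hash code length while keeping the basic codes intact, the sketch below appends extra bits to an existing code and compares Hamming distances. It assumes that the increased bits are simply concatenated after the basic bits, which the abstract does not state; `extend_code` and the example codes are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def extend_code(basic_code, increased_bits):
    """Append extra bits to a basic code; the basic prefix is left unchanged."""
    return tuple(basic_code) + tuple(increased_bits)


# Hypothetical 4-bit basic codes extended by 2 additional bits each.
BASIC_LENGTH = 4
query = extend_code((1, 0, 1, 1), (0, 1))
item = extend_code((1, 0, 0, 1), (1, 1))

# Distance over the basic prefix is unaffected by the appended bits.
print(hamming_distance(query[:BASIC_LENGTH], item[:BASIC_LENGTH]))
# Distance over the full code also reflects the additional bits.
print(hamming_distance(query, item))
```

Under this assumption, retrieval over the basic prefix behaves exactly as before the extension, while the added bits provide extra capacity for novel categories.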

Abstract:
Most cross-modal methods assume that training and testing data come from the same domain, which is often not the case in real-world scenarios due to cross-modal domain shifts and potential unknown concepts. Moreover, cross-modal shifts hinder the capture of unknown concepts, and the presence of unknown concepts can in turn exacerbate the cross-modal shifts. To address these challenges, this paper proposes a new paradigm called Active Cross-Modal Domain Adaptation (ACM-DA), wherein only cross-modal data from the source domain and uni-modal data from the target domain are utilized. To concurrently mitigate the adverse effects of both cross-modal domain shifts and unknown concepts, we propose a Curiosity-Driven Active Adaptation Network (CD-A2N), selectively annotating samples to maximize performance gain. First, we present Curiosity Arousal within Cross-modal Domain Adaptation (CA-CDA) to explore the complexity and novelty characteristics of target samples, while reducing cross-modal discrepancy and aligning source and target domains. Second, Curiosity-driven Active Learning (CAL) is devised to strategically select a subset of target samples for annotation, aiming to achieve more valuable data selection at a small labeling cost. Finally, we jointly train CA-CDA and CAL with the newly labeled target domain sub-dataset to alleviate the above issues. Extensive experiments demonstrate that CD-A2N provides an effective solution for achieving ACM-DA.

Abstract:
Multimedia recommendations aim to use rich multimedia content to enhance historical user-item interaction information, which can not only indicate the content relatedness among items but also reveal finer-grained preferences of users. In this paper, we propose a Knowledge-aware Diffusion-Enhanced architecture using contrastive learning paradigms (KDiffE) for multimedia recommendations. Specifically, we first utilize original user-item graphs to build an attention-aware matrix into graph neural networks, which can learn the importance between users and items for main view construction. The attention-aware matrix is constructed by adopting a random walk with a restart strategy, which can preserve the importance between users and items to generate aggregation of attention-aware node features. Then, we propose a guided diffusion model to generate strongly task-relevant knowledge graphs with less noise for constructing a knowledge-aware contrastive view, which utilizes user embeddings with an edge connected to an item to guide the generation of strongly task-relevant knowledge graphs for enhancing the item’s semantic information. We perform comprehensive experiments on three multimedia datasets that reveal the effectiveness of our KDiffE and its components on various state-of-the-art methods. Our source code is available.
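Random walk with restart is a standard way to score how important every node is relative to a chosen seed node. The sketch below runs it by power iteration on a small user-item graph; the graph, the restart probability of 0.15, and the handling of isolated nodes are illustrative assumptions, and it does not reproduce how KDiffE turns these scores into its attention-aware matrix.

```python
def random_walk_with_restart(adjacency, seed, restart=0.15,
                             max_iters=1000, tol=1e-12):
    """Relevance of every node to `seed` under a random walk with restart.

    adjacency: square list of lists of non-negative edge weights.
    At each step the walker returns to `seed` with probability `restart`,
    otherwise it follows an outgoing edge chosen in proportion to its weight.
    Nodes without outgoing edges send their mass back to the seed (assumption).
    """
    n = len(adjacency)
    out_weight = [sum(row) for row in adjacency]
    scores = [0.0] * n
    scores[seed] = 1.0
    for _ in range(max_iters):
        updated = [0.0] * n
        updated[seed] += restart  # restart mass; scores always sum to 1
        for j in range(n):
            moving = (1.0 - restart) * scores[j]
            if out_weight[j] == 0:
                updated[seed] += moving
                continue
            for i in range(n):
                if adjacency[j][i]:
                    updated[i] += moving * adjacency[j][i] / out_weight[j]
        change = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical graph: nodes 0-1 are users, nodes 2-4 are items.
# Edges: user0-item2, user0-item3, user1-item3, user1-item4.
graph = [
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
rwr_scores = random_walk_with_restart(graph, seed=0)
print([round(s, 4) for s in rwr_scores])
```

Items that user 0 interacted with directly score higher than item 4, which is reachable only through user 1, so the scores preserve a graded notion of user-item importance.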

Abstract:
Person search is a challenging task in computer vision and multimedia understanding, which aims at localizing and identifying target individuals in realistic scenes. State-of-the-art models achieve remarkable success but suffer from overloaded computation and inefficient inference, making them impractical in most real-world applications. A promising approach to tackle this dilemma is to compress person search models with knowledge distillation (KD). Previous KD-based person search methods typically distill the knowledge from the re-identification (re-id) branch, completely overlooking the useful knowledge from the detection branch. In addition, we elucidate that the imbalance between person and background regions in feature maps has a negative impact on the distillation process. To this end, we propose a novel KD-based approach, namely Disaggregation Distillation for Person Search (DDPS), which disaggregates the distillation process and feature maps, respectively. Firstly, the distillation process is disaggregated into two task-oriented sub-processes, i.e., detection distillation and re-id distillation, to help the student learn both accurate localization capability and discriminative person embeddings. Secondly, we disaggregate each feature map into person and background regions, and distill these two regions independently to alleviate the imbalance problem. More concretely, three types of distillation modules, i.e., logit distillation (LD), correlation distillation (CD), and disaggregation feature distillation (DFD), are particularly designed to transfer comprehensive information from the teacher to the student. Note that such a simple yet effective distillation scheme can be readily applied to both homogeneous and heterogeneous teacher-student combinations. 
We conduct extensive experiments on two person search benchmarks, where the results demonstrate that, surprisingly, our DDPS enables the student model to surpass the performance of the corresponding teacher model, even achieving comparable results with general person search models.
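The effect of distilling person and background regions independently can be shown with a small sketch. It assumes a squared-error feature loss averaged separately inside and outside a binary person mask; the abstract does not give the actual loss, so the function `disaggregated_feature_loss`, its weights, and the toy feature maps are hypothetical.

```python
def disaggregated_feature_loss(student, teacher, person_mask,
                               w_person=1.0, w_background=1.0):
    """Squared-error distillation loss averaged separately per region.

    student, teacher: 2D lists of feature values with the same shape.
    person_mask: 2D list of 0/1, where 1 marks a person location.
    Each region is normalised by its own size, so a large background
    cannot swamp the contribution of a small person region.
    """
    sums = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}
    for s_row, t_row, m_row in zip(student, teacher, person_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            region = bool(m)
            sums[region] += (s - t) ** 2
            counts[region] += 1
    person = sums[True] / counts[True] if counts[True] else 0.0
    background = sums[False] / counts[False] if counts[False] else 0.0
    return w_person * person + w_background * background


# Hypothetical 2x3 feature maps: the student is wrong only at the person pixel.
teacher_map = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student_map = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[1, 0, 0], [0, 0, 0]]

plain_mse = sum(
    (s - t) ** 2
    for s_row, t_row in zip(student_map, teacher_map)
    for s, t in zip(s_row, t_row)
) / 6
print(plain_mse)
print(disaggregated_feature_loss(student_map, teacher_map, mask))
```

In this toy case the plain mean squared error dilutes the person-region error across five background pixels, whereas the region-wise average keeps it at full strength.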

Abstract:
This article develops a Scalable Point Cloud Attribute Compression solution, termed ScalablePCAC. In a two-layer example, ScalablePCAC uses the standard G-PCC at the base layer to directly encode the thumbnail point cloud that is downscaled from the original input, and a learning-based model at the enhancement layer to compress and restore the full-resolution input point cloud conditioned on the base layer reconstruction. As such, the base layer provides a coarse reconstruction of the input point cloud and the enhancement layer further improves the quality. We then adopt a cross-layer rate allocation strategy that flexibly determines the resolution downscaling factor, the quantization parameter of the base layer, and the quality controlling factor of the enhancement layer to adapt the bitrate of the two layers for approximately optimal Rate-Distortion (R-D) performance. We conduct extensive experiments on popular point clouds following the MPEG common test conditions. Results demonstrate that the proposed ScalablePCAC achieves >10% BD-BR reduction against the latest G-PCC version 22 (TMC13v22) on the Y component; it also significantly outperforms existing learning-based solutions for point cloud attribute compression, e.g., compared with a recent work showing state-of-the-art performance, it achieves >20% BD-BR reduction.

Abstract:
Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV-VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.

Affiliations: College of Computer Science, Wuhan University of Science and Technology and Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China; School of Computer Science, China University of Geosciences, Wuhan, China; School of Computer, National University of Defense Technology, Changsha, China; School of Information Science and Technology, Wuhan University of Science and Technology, Wuhan, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China; School of Electrical and Information Engineering, Tianjin University, Tianjin, China

Abstract:
Multi-view clustering (MVC) exploits the information captured from diverse views to partition data into different groups and has attracted much attention recently. Despite significant progress, most MVC methods fuse multi-view information via one-stage fusion while neglecting the merits of multi-stage fusion, which leads to insufficient utilization of the rich information within the data and therefore degrades the clustering performance. To this end, designing a functional framework that can fully exploit multi-view information becomes a key challenge in multi-view clustering research. In this paper, we propose a novel multi-stage fusion method, which unifies late and early fusion into one framework, to capture sufficient information underlying the multi-view data and to effectively reduce the effect of low-quality views. Specifically, we construct a low dimensional latent representation from multi-view data by learning proper correlation among multi-view data in the early fusion stage. The late fusion establishes a new optimal combinational data partition from base partitions constructed by spectral clustering, which suppresses the influence of low-quality basic partitions. Then we couple the low dimensional latent representation with the learned combinational data partition to share the same cluster structure by k-means and maximization alignment. As a result, we collaboratively learn an accurate and robust partition representation for the following clustering task. Besides, the late fusion and early fusion are jointly learned to achieve mutual collaboration for better performance. Finally, an alternating optimization algorithm is designed to solve the resultant optimization problem. Extensive experiments conducted on eight datasets show the superiority of our method in terms of effectiveness and efficiency.

Abstract:
Images captured on rainy days often contain rain streaks that can obscure important scenery and degrade the performance of high-level vision tasks, such as image segmentation in autonomous vehicles. As a result, image deraining, a low-level vision task focused on removing rain streaks from images, has gained popularity over the past decade. Recent advancements have primarily concentrated on supervised image deraining methods, which rely on paired rain-clean image datasets to train deep neural network models. However, collecting such paired real data is challenging and time-consuming. To address this, our method introduces a novel self-supervised approach that leverages the proposed locally dominant gradient prior and non-local self-similarity stochastic sampling. This approach extracts potential rain streaks and generates stochastic derained references for image deraining. Experimental results on public benchmark image-deraining datasets show that our proposed method performs favorably against state-of-the-art few-shot and self-supervised image deraining methods.

Abstract:
Deep video compression has made impressive progress in recent years, with the majority of advancements concentrated on P-frame coding. Although efforts to enhance B-frame coding are ongoing, their compression performance is still far behind that of traditional bi-directional video codecs. In this article, we introduce a bi-directional deep contextual video compression scheme tailored for B-frames, termed DCVC-B, to improve the compression performance of deep B-frame coding. Our scheme has three key innovations. First, we develop a bi-directional motion difference context propagation method for effective motion difference coding, which significantly reduces the bit cost of bi-directional motions. Second, we propose a bi-directional contextual compression model and a corresponding bi-directional temporal entropy model, to make better use of the multi-scale temporal contexts. Third, we propose a hierarchical quality structure-based training strategy, leading to an effective bit allocation across large groups of pictures (GOP). Experimental results show that our DCVC-B achieves an average reduction of 26.6% in BD-Rate compared to the reference software for H.265/HEVC under random access conditions. Remarkably, it surpasses the performance of the H.266/VVC reference software on certain test datasets under the same configuration. We anticipate our work can provide valuable insights and bring deep B-frame coding to the next level.

Abstract:
When photographing through glass, reflections are often observed, which negatively impact the quality of the captured images or videos. In this article, we summarize and rethink depth guided reflection removal methods and, inspired by the human binocular vision system, investigate how to utilize depth for effective binocular video reflection removal. We propose an end-to-end learning-based reflection removal method that learns the transmission depth and designs a unified structure to achieve depth guided, cross-view, and cross-frame feature enhancement in a cascaded manner. Within the unified structure, different gating controllers are custom-designed to emphasize the direction of feature interaction. A dataset containing synthetic and real binocular mixture videos is built for network training and testing. Experimental results on both synthetic and real data from the proposed dataset demonstrate that the proposed method achieves superior performance in binocular video reflection removal.

Abstract:
Vision-language tracking is a crucial branch of multi-modal object tracking, aiming to jointly locate an object by utilizing visual information and language descriptions. Typically, existing vision-language trackers employ language and visual encoders to extract features from language descriptions and visual information, respectively. Based on these extracted visual and language features, a cross-modal interaction module is used to extract multi-modal features to locate the targets. However, they ignore the differences between visual and language modalities. Due to the lack of pixel-level position information in language descriptions, the positional information of the multi-modal features is greatly weakened by the cross-modal interaction modules. As a result, the vision-language trackers cannot effectively capture subtle changes in the target's positions. To address this problem, we propose a multi-modal hybrid interaction vision-language tracking method (named MHITrack), in which a multi-modal hybrid interaction decoder is designed to enhance the positional information of multi-modal features. The proposed multi-modal hybrid interaction decoder consists of a visual-language interaction module, a multi-level position interaction module, and a hybrid interaction module. Firstly, the multi-level position interaction module is utilized to capture fine-grained position information of the target from multi-level features. Meanwhile, the visual-language interaction module performs cross-modal interaction between visual and language features to obtain multi-modal features. Furthermore, the hybrid interaction module is employed to integrate the multi-modal features with target position information, enhancing the positional information of the multi-modal features. Finally, the proposed tracker can effectively capture subtle changes in the target's positions. 
Through extensive experiments on four benchmark datasets, namely TNL2k, LaSOT, OTB-Lang, and LaSOText, we demonstrate that the proposed vision-language tracker achieves promising performance compared to existing state-of-the-art vision-language trackers.

Abstract:
Cross-modality object detection aims to fuse complementary information from different modalities to improve model performance, which enables a wider range of applications. However, traditional cross-modality fusion methods, based on CNN or Transformer, inadequately address the issue of pseudo-target information, which disperses model attention and degrades object detection performance. In this paper, we investigate a novel cross-modality fusion approach by associating cross-modal features in a hidden state space based on an improved Mamba with a gating attention mechanism. We propose the Fusion-Mamba Block (FMB), designed to map cross-modal features into a hidden state space for interaction, thereby refining the model’s attention on true target areas and enhancing overall performance. The FMB comprises two key modules: the State Space Channel Swapping (SSCS) module, which facilitates the fusion of shallow features, and the Dual State Space Fusion (DSSF) module, which enables deep fusion and effectively suppresses pseudo-target information within the hidden state space. Our proposed method outperforms state-of-the-art approaches, achieving improvements of 5.9%, 3.5% and 2.1% mAP on M^3FD, DroneVehicle and FLIR-Aligned, respectively. To the best of our knowledge, this work establishes a new baseline for cross-modality object detection, providing a robust foundation for future research in this area.

Abstract:
Visual saliency modelling is of fundamental importance in modern video processing and its applications. Our previous eye-tracking study revealed that signal distortions caused by editing, compression, or transmission alter gaze patterns and consequently induce saliency shifts in both spatial and temporal domains. Saliency shifts provide crucial insights into viewers’ behavioural responses to video distortions, facilitating the perception-based optimisation of video algorithms. However, the spatio-temporal saliency shifts and their measurable effects on perception related applications remain largely unexplored. In this paper, we first investigate the measurement of distortion-induced saliency shifts (DSS) in videos and analyse DSS behaviours as functions of video content, time order and critical distortion disruption. Second, based on our findings, we construct three vision models to quantitatively simulate distinct DSS behaviours and integrate them into a comprehensive DSS behaviour model. Finally, we demonstrate that the computational DSS model can enhance emerging video technologies.

Abstract:
With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual-modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, massive point-wise and pixel-wise labeling procedures are time-consuming and labor-intensive. To address this issue, we propose a parallel dual-stream network to handle the semi-supervised dual-modal semantic segmentation task, called PD-Net, by jointly utilizing a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. The proposed PD-Net consists of two parallel streams (called original stream and pseudo-label prediction stream). The pseudo-label prediction stream predicts the pseudo labels of unlabeled point clouds and their corresponding images. Then, the unlabeled data is sent to the original stream for self-training. Each stream contains two encoder-decoder branches for 3D and 2D data respectively. In each stream, multiple dual-modal fusion modules are explored for fusing the dual-modal features. In addition, a pseudo-label optimization module is explored to optimize the pseudo labels output by the pseudo-label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD-Net not only outperforms the comparative semi-supervised methods but also achieves competitive performances with some fully-supervised methods in most cases.

Abstract:
With the increasing popularity of autonomous driving and 3D reconstruction, keypoint detection, as a key link in visual localization, has become a hot topic in current research. However, existing keypoint detection methods rarely pay attention to the difficulty differences of samples and lack a progressive learning mechanism, which often leads to overfitting for simple samples and underfitting for complex samples, limiting the overall performance of the model. To address these issues, we propose a novel progressive gradient-guided self-distillation method (PG^2SD) for keypoint detection, which possesses self-evolutionary learning capabilities. Specifically, we propose a progressive gradient constraint strategy (PGCS) that dynamically adjusts the gradient contributions of different samples, enabling the model to adapt to the evolving learning capability during training. On this basis, we propose a gradient-guided self-distillation strategy (G^2SDS), which integrates seamlessly with PGCS to alleviate the insufficient feature representation of hard samples in the early training stage. We further design a novel loss function to achieve dynamic collaboration between PGCS and G^2SDS, allowing G^2SDS to adaptively adjust the self-distillation parameters through the PGCS. Experimental results on multiple benchmark datasets show that our method achieves state-of-the-art performance on image matching, visual localization, and 3D reconstruction tasks without designing a proprietary network, indicating broad application prospects.

Abstract:
Attribute-missing deep graph clustering, which aims to categorize the graph nodes with partial attribute-missing samples into distinct categories in an unsupervised manner, has gained significant popularity. However, most existing studies have at least one of the following issues: 1) they seldom exploit diverse clustering structural information to facilitate non-Euclidean data imputation and refine the clustering pattern, and 2) they ignore the positive effect of diverse information on feature imputation and representation extraction, resulting in sub-optimal missing feature estimation and inferior clustering performance. To solve these issues, we propose a novel Prototype-driven Multi-view Attribute-missing Graph Clustering (PMAGC) model that leverages rich structural and diverse information to assist the processes of imputing missing attributes and learning clustering-friendly features. Specifically, we design a multi-view augmentation module that extracts attribute-complete samples as the node view and constructs feature and edge views using feature pre-imputation and edge masking techniques. Then, guided by clustering pseudo-labels, we promote the proximity between the prototypes of attribute-missing samples and those of attribute-complete samples within the feature space. Thus, PMAGC employs both clustering structural information and reliable attribute-complete sample data to assist feature imputation. In addition, we design a prototype-wise contrastive loss, which considers prototypes from different views within the same cluster as positive samples, while treating others as negative samples. Hence, the optimized features can more accurately guide the attribute learning process. Extensive experiments on six graph datasets with missing attributes are conducted to demonstrate the effectiveness of the proposed PMAGC.
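
The prototype-wise contrastive idea described above can be illustrated with a small sketch. It assumes an InfoNCE-style loss over cosine similarities between cluster prototypes from two views; the function name, the temperature value, and the choice of negatives (prototypes of other clusters in the second view) are assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np

def prototype_contrastive_loss(protos_a, protos_b, temperature=0.5):
    """Hypothetical InfoNCE-style loss between cluster prototypes of two views.

    protos_a, protos_b: (K, D) arrays; row k is the prototype of cluster k.
    Same-cluster prototypes across views are positives, all others negatives.
    """
    a = protos_a / np.linalg.norm(protos_a, axis=1, keepdims=True)
    b = protos_b / np.linalg.norm(protos_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (K, K) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives sit on the diagonal

protos = np.eye(3)
loss_aligned = prototype_contrastive_loss(protos, protos)
loss_shuffled = prototype_contrastive_loss(protos, protos[[1, 2, 0]])
```

With this toy input, the loss is lower when the prototypes of the two views agree cluster by cluster than when they are permuted.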

Abstract:
Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, which is represented by a pixel-wise segmentation mask for application scenarios such as multi-modal video editing, augmented reality, and intelligent robot systems. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip could only provide a Global semantic label in each sequence, but the video frame covers multiple semantic objects across different Local regions, which leads to mislocalization of the representationally similar but semantically different object. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. Firstly, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-agnostic label embeddings. Then, we feed the unified-modal label back to the visual backbone as the explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the interested object. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.

Abstract:
Crowd counting has drawn increasing attention across various fields. However, existing crowd counting tasks primarily focus on estimating the overall population, ignoring the behavioral and semantic information of different social groups within the crowd. In this paper, we aim to address a newly proposed research problem, namely fine-grained crowd counting, which involves identifying different categories of individuals and accurately counting them in static images. In order to fully leverage the categorical information in static crowd images, we propose a two-tier salient feature propagation module designed to sequentially extract semantic information from both the crowd and its surrounding environment. Additionally, we introduce a category difference loss to refine the feature representation by highlighting the differences between various crowd categories. Moreover, our proposed framework can adapt to a novel problem setup called few-example fine-grained crowd counting. This setup, unlike the original fine-grained crowd counting, requires only a few exemplar point annotations instead of dense annotations from predefined categories, making it applicable in a wider range of scenarios. The baseline model for this task can be established by substituting the loss function in our proposed model with a novel hybrid loss function that integrates point-oriented cross-entropy loss and category contrastive loss. Through comprehensive experiments, we present results in both the formulation and application of fine-grained crowd counting.

Abstract:
Sketch-based 3D shape retrieval (SBSR) can be approached by learning domain-invariant descriptors or ranking metrics from sketches and 2D view images of 3D shapes rendered through numerous viewpoints. However, determining the most appropriate viewpoints that convey discriminative geometric features to benefit the task of SBSR remains an essential yet not fully explored area. Existing works extract 3D features from multi-view images observed through pre-defined viewpoints to match 2D sketches. Those methods, however, fail to dynamically select viewpoints by considering the SBSR task. In this work, we introduce a fully differentiable viewpoint learning paradigm driven by the downstream SBSR task, which supports the task-aware and sketch-dependent dynamic viewpoint determination process. We naturally integrate this task-specific and sketch-dependent viewpoint learning process into a meta-learning framework to develop a novel Dynamic Viewer (DV) module for category-level SBSR. The DV module comprises a Meta View Learner (MVL) block and a View Generator (VG) block. Specifically, as the first part of the DV module, the MVL block learns to initialize the necessary network parameters of the VG block. Then, the VG block, which serves as the second part, learns the best viewpoints to render 2D images. To learn the optimal viewpoints for category-level SBSR, we further introduce a view mining loss that aims to maximize the similarity of feature-level information among rendered 2D views and the query sketch. Further, we adopt a variational autoencoder (VAE) to retrieve 3D shapes by setting the newly rendered images and query sketch as inputs. Comprehensive experimental results on popular SBSR datasets demonstrate that our proposed framework achieves state-of-the-art performance on the category-level SBSR task. Furthermore, our approach can be easily adapted to address the instance-level SBSR task with promising results.

Abstract:
Self-attention learns to capture the long-range dependencies between embeddings (e.g., image pixels). However, its memory overhead and computation cost are prohibitive because they are quadratic in terms of the spatial resolution. The structure analysis reveals two crucial roles in the attention: the correlation-based dependency structure and feature normalization. In this work, an efficacious Local-Global Semantics (LGS) module is proposed to alleviate the above issues by modeling the local semantic aggregation and global semantic interaction. Our LGS module contains a group convolution and an Efficient Global Semantic Attention (EGSA). Firstly, the group convolution aggregates local semantics. Secondly, considering a feature map as a sequence of 2-D channel representations, EGSA formulates a general model for the global semantic interaction. The linear correlation is computed between global semantics. LGS has linear memory overhead and computation cost in terms of the spatial resolution. The LGS module can be smoothly incorporated into object detection frameworks. The experimental results verify its effectiveness on two popular detection datasets: MS COCO and PASCAL VOC.
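
To make the quadratic-versus-linear cost argument concrete, the sketch below computes a generic channel-wise attention in which the correlation matrix has shape (C, C) instead of (N, N), where N = H*W is the number of spatial positions. This is only an illustration of why treating channels as the sequence gives cost linear in N; it is not the EGSA formulation, and the function name and normalization choices are assumptions.

```python
import numpy as np

def channel_attention(x):
    """Generic channel-wise attention on a flattened feature map.

    x: (C, N) array, C channels and N = H*W spatial positions.
    The correlation matrix is (C, C), so memory is O(C^2 + C*N) and compute
    is O(C^2 * N): linear in N, unlike the (N, N) map of spatial attention.
    """
    c, n = x.shape
    corr = x @ x.T / n                                   # (C, C) channel correlation
    corr = corr - corr.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(corr)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x                                   # (C, N) re-weighted channels

x = np.arange(12, dtype=float).reshape(3, 4)
out = channel_attention(x)
```

For a 64-channel feature map at 128x128 resolution, a spatial attention map would hold 16384^2 entries, whereas the channel correlation above holds only 64^2.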

Abstract:
The increasing popularity of online food blogs and food ordering services has made personalized recipe recommendation a vital aspect of our emotional well-being. However, existing solutions, mainly based on graph neural networks, still face significant challenges, such as (a) focusing on exploiting the user-recipe interactions while neglecting other crucial pairwise and high-order relationships, and (b) failing to explicitly distinguish the distinct factors, e.g., hedonic and healthy, that influence recipe selection. To address these issues, we propose a progressively-passing-then-disentangling approach named P2D. Our approach utilizes a three-stage progressive message-passing mechanism for better representation learning. Specifically, we incorporate the extra pairwise relationships between recipes and nutrients, ingredients, and visual contents to create fine-grained and multimodal recipe representations. We next refine these representations via message passing between high-order recipe relationships to learn people's shared food preferences. Based on them, we could derive comprehensive user representations, which are subsequently transformed into disentangled forms that correspond to various decision factors through contrastive and mutual information regularization. Experimental results demonstrate both the superiority and the rationality of our method: (a) P2D outperforms the state-of-the-art recipe recommendation methods by a large margin under various metrics, (b) ablation studies confirm the positive impact of each of its components, and (c) our visualization analysis empirically supports the advantage of explicitly disentangling decision factors.

Abstract:
Visual transformers have achieved great success in representation learning. This is mainly due to efficient token dependency modeling via self-attention. However, the computational burden increases sharply as the input pixels increase. Although recent Fourier-based global frequency-domain mixing methods attempt to improve the efficiency of transformers for high-resolution image inputs, the Fourier operator has limited ability to capture the local geometric structure. Complex wavelets can perform local attention in both the spatial domain and the frequency domain. Therefore, we propose the complex wavelet informed transformer operator that uses the real and imaginary wavelets of the dual-tree complex wavelet transform to simulate the interaction in the attention kernel. In order to further reduce the computational burden of operators, we introduce an adaptive local block shared attention mechanism in the channel domain for our wavelet informed operators. Further, we construct the deep multi-head operator network consisting of a hybrid stack of complex wavelet informed transformer operators and self-attention layers. This enables the Transformer to more sparsely capture multi-scale and multi-directional structured features in the process of learning dependencies. Extensive experimental results show that our adaptive complex wavelet informed transformer operator under the Transformer architecture achieves highly competitive accuracy performance on multiple image classification benchmark datasets. And the proposed operators can be flexibly and effectively migrated to vision tasks in dynamic video scenarios.

Abstract:
Current timestamp-supervised temporal action segmentation (TS-TAS) methods typically follow a two-phase pipeline: initializing the model with timestamp labels and refining it with pseudo-labels. However, limited by the sparsity of timestamp annotations, current methods' performance is sub-optimal. Specifically, initializing the model with only timestamp annotations may cause overfitting to labeled frames. Additionally, sparse timestamp annotations cannot capture the diverse action representations throughout the whole instance, especially those near the ambiguous action boundaries, leading to pseudo-label noise. Inspired by the cluster assumption of semi-supervised learning (SSL) that points within the same manifold likely share the same label, we here model TS-TAS as an SSL problem. Specifically, we propose a Temporal Embedding Consistency (TEC) strategy to mitigate the excessive focus on annotated frames. The TEC strategy encourages frames with similar representations within the video to have similar classification probability distributions, thereby propagating labeled frames' information to implicit ones. Besides, we design a TS-Mix strategy to further leverage unlabeled data to mitigate the influence of pseudo-label noise in a consistency regularization manner. The TS-Mix strategy includes intra-mix, which adds linear interpolation of two adjacent timestamps to every frame between them, and inter-mix, which mixes frames from two different untrimmed videos frame-by-frame. Then the mixed video is trained with the correspondingly mixed pseudo-labels. Comprehensive experimental results on different benchmarks show that we achieve new state-of-the-art performances. Furthermore, the proposed method can seamlessly enhance existing methods, significantly improving their performances.
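
The inter-mix step described above can be sketched as a mixup-style blend of two videos. The sketch assumes per-frame feature arrays, soft pseudo-labels, a fixed mixing weight `lam`, and truncation to the shorter video; none of these details are specified in the abstract, so they should be read as illustrative assumptions.

```python
import numpy as np

def inter_mix(feats_a, labels_a, feats_b, labels_b, lam=0.5):
    """Hypothetical frame-by-frame mix of two untrimmed videos.

    feats_*:  (T, D) per-frame features.
    labels_*: (T, C) per-frame pseudo-label distributions.
    Assumption: the longer video is truncated to the shorter one's length.
    """
    t = min(len(feats_a), len(feats_b))
    feats = lam * feats_a[:t] + (1.0 - lam) * feats_b[:t]
    labels = lam * labels_a[:t] + (1.0 - lam) * labels_b[:t]  # labels mixed with the same weight
    return feats, labels

feats_a, feats_b = np.ones((3, 2)), np.zeros((4, 2))
labels_a = np.eye(2)[[0, 0, 1]]        # one-hot pseudo-labels, video A
labels_b = np.eye(2)[[1, 1, 1, 0]]     # one-hot pseudo-labels, video B
mixed_feats, mixed_labels = inter_mix(feats_a, labels_a, feats_b, labels_b, lam=0.75)
```

Because the features and pseudo-labels share the same weight, each mixed frame keeps a valid label distribution that sums to one.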

Abstract:
In this article, we present an innovative solution tailored for the intricate challenges of the virtual try-on task—our novel Hierarchical Cross-Attention Network, HCANet. HCANet is meticulously crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic and visually convincing virtual try-on outcomes. A distinctive feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities. The HCA block functions as a cornerstone, enhancing the depth and robustness of the network. By adopting a hierarchical approach, it facilitates a nuanced representation of the interaction between the person and clothing, capturing intricate details essential for an authentic virtual try-on experience. Our extensive set of experiments establishes the prowess of HCANet. The results showcase its cutting-edge performance across both objective quantitative metrics and subjective evaluations of visual realism. HCANet stands out as a state-of-the-art solution, demonstrating its capability to generate virtual try-on results that not only excel in accuracy but also satisfy subjective criteria of realism. This marks a significant step forward in advancing the field of virtual try-on technologies.

Abstract:
Video-to-text generation is a challenging task that involves translating video contents into accurate and expressive sentences. Existing methods often ignore the importance of establishing fine-grained semantics within visual representations and exploring textual knowledge implied by video contents, leading to difficulty in generating satisfactory sentences. To address these problems, a vision-language relational transformer model is proposed for video-to-text generation. Three key novel aspects are investigated. First, a visual relation modeling block is designed to obtain higher-order feature representations and establish semantic relationships between regional and global features. Second, a knowledge attention block is developed to explore hierarchical textual information and capture cross-modal dependencies. Third, a video-centric conversation system is constructed to complete multi-round dialogues by incorporating the proposed modules including visual relation modeling, knowledge attention and text generation. Extensive experiments on five benchmark datasets including MSVD, MSRVTT, ActivityNet, Charades and EMVPC demonstrate that the proposed scheme achieves remarkable performance compared with the state-of-the-art methods. Besides, the qualitative experiment reveals the system's favorable conversation capability and provides a valuable exemplar for future video understanding works.

Abstract:
Event-based Synthetic Aperture Imaging (E-SAI) extends the SAI technique to observe targets behind extremely dense occlusions. Existing approaches remain confined to the de-occlusion of a specific depth plane, i.e., single depth in focus, unable to be applied to observe occluded targets with varying depths due to the decreased focus range. To achieve All-in-Focus E-SAI, i.e., recovering the occlusion-free image of all depth planes, the depth information behind the occlusions should be given to ensure accurate event refocusing. In this paper, we first prove the feasibility of predicting the depth map from captured events in the presence of dense occlusions. Then, we propose the ESAI-AF network, which consists of a Depth Estimation Module (DEM) designed to estimate the depth information from multi-view events and an Image Enhancement Module (IEM) designed to reconstruct high-quality occlusion-free images from the refocused events. We employ only multi-view occlusion-free images as supervised signals for end-to-end training of the above modules. Extensive experiments have shown that the proposed method can effectively perform All-in-Focus image reconstruction of occluded multi-depth targets and achieves superior performance to existing methods.

Abstract:
With its well-designed network architecture, the deep learning-based infrared and visible image fusion (IVIF) method shows its efficiency and effectiveness by realizing a fine feature extraction and fusion mechanism. However, disparities in cross-modal features often result in an imbalance between texture details and contextual information, causing detailed features to be overshadowed by prevailing contextual information. To tackle this issue, this study introduces PIDFusion, a fusion model driven by a PID controller, designed to dynamically optimize cross-modal feature fusion deviations. The core of PIDFusion is the dynamic adaptation capability of the PID controller, which facilitates real-time corrections for deviations encountered during the fusion process, thereby maintaining a harmonious balance between texture details and contextual information. Additionally, we introduce the Cyclic Self-Supervised Feature Refinement (CSSFR), which, under the constraint of a self-supervised loss, minimizes redundant information within the feature flow and ensures the preservation of salient features through the cyclic input of decoupled features. Concurrently, we develop the Iterative Attention Module (IAM), which utilizes the unique gating mechanism of LSTM to capture feature changes across successive iterations, thereby driving the model to cultivate more discriminative feature representations. Extensive experiments reveal that PIDFusion outperforms SOTA methods in terms of both efficiency and cost-effectiveness, as measured by static statistics and high-level vision tasks.
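
For readers unfamiliar with PID control, the following is a textbook discrete PID update on a scalar error signal. It is not the feature-level controller used in PIDFusion, whose details the abstract does not give; the class name and gain values are hypothetical and serve only to show how the proportional, integral, and derivative terms combine into a correction.

```python
class PID:
    """Textbook discrete PID controller on a scalar error (illustrative only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # accumulated error (I term state)
        self.prev_error = 0.0    # last error (D term state)

    def step(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # Correction = present deviation + accumulated deviation + deviation trend
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=1.0, ki=0.5, kd=0.25)
first = pid.step(2.0)    # integral = 2.0, derivative = 2.0
second = pid.step(1.0)   # integral = 3.0, derivative = -1.0
```

The proportional term reacts to the current deviation, the integral term to its history, and the derivative term to its rate of change.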

Abstract:
Sarcasm Explanation in Dialogue (SED) is a new yet challenging task, which aims to generate a natural language explanation for the given sarcastic dialogue that involves multiple modalities (i.e., utterance, video, and audio). Although existing studies have achieved great success based on the generative pretrained language model BART, they overlook exploiting the sentiments residing in the utterance, video and audio, which play important roles in reflecting sarcasm that essentially involves subtle sentiment contrasts. Nevertheless, it is non-trivial to incorporate sentiments for boosting SED performance, due to three main challenges: 1) diverse effects of utterance tokens on sentiments; 2) gap between video-audio sentiment signals and the embedding space of BART; and 3) various relations among utterances, utterance sentiments, and video-audio sentiments. To tackle these challenges, we propose a novel sEntiment-enhanceD Graph-based multimodal sarcasm Explanation framework, named EDGE. In particular, we first propose a lexicon-guided utterance sentiment inference module, where a heuristic utterance sentiment refinement strategy is devised. We then develop a module named Joint Cross Attention-based Sentiment Inference (JCA-SI) by extending the multimodal sentiment analysis model JCA to derive the joint sentiment label for each video-audio clip. Thereafter, we devise a context-sentiment graph to comprehensively model the semantic relations among the utterances, utterance sentiments, and video-audio sentiments, to facilitate sarcasm explanation generation. Extensive experiments on the publicly released dataset WITS verify the superiority of our model over cutting-edge methods.

Abstract:
Late fusion-based algorithms have attracted extensive attention because of their low time and space complexity for handling incomplete multiview data. However, these methods have certain limitations. First, the basic clustering indicator matrices generated by incomplete views are susceptible to low-quality imputation. Second, traditional methods often fail to adequately consider the high-order correlations between these basic clustering indicator matrices, leading to suboptimal performance. Third, conventional methods focus primarily on improving speed, with less emphasis on enhancing clustering performance. To address these issues, we propose two novel models. The first is called tensor-based late fusion incomplete multiview clustering (TLF-IMVC-1). Specifically, TLF-IMVC-1 first seeks a consensus clustering matrix from the basic clustering indicator matrices and subsequently imputes the incomplete portions of these matrices via the learned consensus matrix. This approach seamlessly integrates the clustering process with the imputation of missing elements into a unified framework. Furthermore, we construct a third-order tensor from these basic clustering matrices, constrained by the tensor nuclear norm, to capture their high-order correlations. Although this model is effective, it lacks proper guidance in the learning process of the basic clustering indicator matrices, making them susceptible to low-quality imputation. Therefore, we introduce the second novel model, i.e., TLF-IMVC-2, to address this issue. Specifically, TLF-IMVC-2 uses the learned consensus representation matrix as a new component to construct the third-order tensor. This strategy leverages the robust clustering structure inherent in the consensus matrix to guide the learning process of the basic clustering matrices. The experimental results demonstrate that both models outperform state-of-the-art methods in clustering.

Abstract:
In this paper, we present a novel task of source-free cross-modal adversarial example generation, which generates adversarial examples based on textual descriptions of attackers. This task poses two challenges. The first is how to generate adversarial examples when the clean examples are missing or inaccessible. The second is how to achieve fine-grained custom adversarial example generation according to the semantic descriptions of the attackers. Existing adversarial example generation methods cannot effectively deal with these two challenges. To address these challenges, we propose a Source-Free Cross-Modal Adversarial Example Generation framework, abbreviated as SFCM-AEG. Within the SFCM-AEG model, we first leverage a pre-trained GPT as a simulator to construct textual descriptions of attackers from labels. Following this, we employ a diffusion model to synthesize an image that aligns with the generated textual description. Finally, the generated images are converted into adversarial examples using an adversarial example generation method. Experimental results demonstrate that our proposed SFCM-AEG method can generate adversarial examples with customized semantic descriptions, without relying on clean examples, while achieving strong attack performance in a white-box setting.

Abstract:
In this paper, we present and study a new instance-level retrieval task: PointCloud-Text Matching (PTM), which aims to identify the exact cross-modal instance that matches a given point-cloud query or text query. PTM has potential applications in various scenarios, such as indoor/urban-canyon localization and scene retrieval. However, there is a lack of suitable and targeted datasets for PTM in practice. To address this issue, we present a new PTM benchmark dataset, namely SceneDepict-3D2T. We observe that the data poses significant challenges due to its inherent characteristics, such as the sparsity, noise, or disorder of point clouds and the ambiguity, vagueness, or incompleteness of texts, which render existing cross-modal matching methods ineffective for PTM. To overcome these challenges, we propose a PTM baseline, named Robust PointCloud-Text Matching method (RoMa). RoMa consists of two key modules: a Dual Attention Perception module (DAP) and a Robust Negative Contrastive Learning module (RNCL). Specifically, DAP leverages token-level and feature-level attention mechanisms to adaptively focus on useful local and global features, and aggregate them into common representations, thereby reducing the adverse impact of noise and ambiguity. To handle noisy correspondence, RNCL enhances robustness against mismatching by dividing negative pairs into clean and noisy subsets and assigning them forward and reverse optimization directions, respectively. We conduct extensive experiments on our benchmarks and demonstrate the superiority of our RoMa.

Affiliations: College of Mathematics, Sichuan University, Chengdu, China; School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University-Anker Embodied AI Lab, Peking University, Beijing, China; School of Information Technology and Management, University of International Business and Economics, Beijing, China; Department of Computer Science, University of Washington, Seattle, WA, USA; Terminus Group, Beijing, China; College of Computer Science, Sichuan University, Chengdu, China

Abstract:
This paper studies the problem of semi-supervised learning on graphs, which has recently aroused widespread interest in relational data mining. The focal point of exploration in this area has been the utilization of graph neural networks (GNNs), which stand out for excellent performance. Previous methods, however, typically rely on the limited labeled data while ignoring the abundant structural information in unlabeled nodes inherently on graphs, easily resulting in overfitting, especially in scenarios where only a few labeled nodes are available. Even worse, GNNs, despite their success, are constrained by their ability to solely capture local neighborhood information through message-passing mechanisms, thereby falling short in modeling higher-order dependencies among nodes. To circumvent the above drawbacks, we propose a simple yet effective framework called Hypergraph COnsistency LeArning (HOLA). Specifically, we employ a collaborative distillation framework consisting of a teacher network and a student network. To achieve effective interaction, we propose momentum distillation, a self-training method that enables the student network to learn from pseudo-targets generated by a momentum teacher network. Further, a novel hypergraph structure learning network is developed to model complex high-order relations among nodes with relational consistency learning, thereby transferring the knowledge to the student network. Extensive experiments conducted on a variety of benchmark datasets demonstrate the superior performance of HOLA over various state-of-the-art methods.
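
A momentum teacher is commonly realized as an exponential moving average (EMA) of the student's parameters. The sketch below shows that update on plain lists of floats; the abstract does not state the exact update rule, so the EMA form, the function name, and the momentum value are assumptions.

```python
def momentum_update(teacher, student, momentum=0.99):
    """One EMA step: teacher <- m * teacher + (1 - m) * student (assumed rule)."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]   # toy teacher parameters
student = [1.0, 1.0]   # toy student parameters
teacher = momentum_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly, which tends to make its pseudo-targets more stable than the student's own predictions.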

Abstract:
Image layout representation learning, which converts layouts into compact vectors, is essential for tasks such as image retrieval, editing, and generation. However, existing methods—especially those applied to photographic images—face several challenges: supervised methods rely on expensive labeled datasets, weakly-supervised methods struggle with generalization, and self-supervised methods are limited in handling the diversity of photographic layouts. To address these issues, we propose a novel heterogeneous layout graph that efficiently captures the layout information in images. The vertices of this graph represent the compositional primitives of the image, capturing their attributes, while the edges encode the relationships between these primitives. We also design effective pretext tasks to guide a layout encoder-decoder in self-supervised training, ultimately generating the layout graph embedding vector. Additionally, we introduce a new layout evaluation dataset—LODB—which features a richer variety of layout categories, significantly better label quality than existing datasets, and a more balanced distribution of semantic scenes across layout categories, providing a comprehensive benchmark for evaluation. Experiments on the LODB dataset demonstrate that our method outperforms existing approaches in representing photographic image layouts.

Abstract:
In this paper, we propose a novel neural network called Topology Learning Network (TL-Net), which exploits local and global geometric relations via topology graphs to handle the problem of correspondence filtering in complex scenes. Specifically, we first design a Multi-level Topology Encoder (MLTE), which fuses local and global topology graphs through channel attention, to sufficiently extract the geometric relations among correspondences. MLTE not only includes local topology graphs by gathering the information of relative motion and multi-resolution group convolution, but also includes a global topology graph by aggregating the information of the similarity and the Graph Laplacian. In addition, inspired by Transformer, we design the backbone of TL-Net to generate enriched feature maps for correspondence filtering. Meanwhile, by simplifying the global context aggregation, we keep the backbone lightweight, retaining the advantages of Transformer while avoiding extra parameters and calculations. Empirical experiments on several computer vision tasks show that the performance and generalization ability of TL-Net are significantly superior to those of state-of-the-art methods. Notably, on relative pose estimation, we achieve 5.63% and 5.03% mAP improvements under an error threshold of 5° outdoors and indoors, respectively.

Abstract:
The diagnosis of colon polyps is important for the prevention of colorectal cancer. Polyp segmentation, however, is still a challenging problem given that recent medical computer-aided equipment suffers from situations of polyp variations in terms of size, color, texture, and poor illuminations in endoscopy videos. These obstacles hinder the prediction of polyp boundaries. Inspired by the observation that the values of pixels on the border region change more sharply than others, we propose the oriented-derivative (OD) representation to capture the relationship between pixels and the boundary region given distance and orientation. To adaptively use the proposed representation in arbitrary frameworks, we design plug-in modules to learn the representation and aggregate features to improve the accuracy of boundary predictions in the polyp segmentation task, which can be implemented in frameworks including the encoder-decoder and top-down architectures. Extensive experimental results show the improvement from the proposed oriented-derivative representation for the polyp segmentation task and the extendibility of our proposed modules in different architectures. Our methods achieved an improvement ranging from 0.3% to 2.5% (mDice) compared with the baseline on five publicly available datasets, including Kvasir, CVC-ClinicDB, EndoScene, CVC-ColonDB, and ETIS.

Abstract:
The Visual Question Generation (VQG) task generally aims to produce questions based on images in natural language. Existing studies often handle VQG as a reverse Visual Question Answering (VQA), training data-driven generators on VQA datasets. However, this solution pipeline struggles to generate high-quality questions that effectively challenge robots and humans, even by leveraging the most advanced large-scale foundational models. Some other VQG methods depend heavily on elaborate and costly manual preprocessing. To address these limitations, we propose a novel method with a two-module framework for automatically generating inferential visual questions that also follow commonsense. The “Scene Graph Generation” module constructs specialized scene graphs by progressively expanding connections from high-confidence nodes. This module ensures semantic consistency by aligning visual, textual, and salient features. Additionally, we incorporate external knowledge to extend abstract semantic concepts and associated facts, enriching the content of generated questions and helping them better follow human commonsense. The other module, “Question Generation”, uses the above scene graph as a foundation to search for and instantiate questions. The generated questions match the program templates and have diverse inferential paths. Experimental results demonstrate that our method is both effective and highly scalable. The generated questions are controllable in terms of semantic richness and difficulty, exhibiting clear inferential and commonsense properties. Furthermore, we use our method to automatically create a large-scale dataset, ICVQA, which includes approximately 160,000 images and 800,000 question-answer pairs, thereby facilitating further research in VQA and visual dialogue.

Abstract:
Incomplete multi-view clustering focuses on mining useful information from low-quality multiple sources, such as the missing and distorted data that are prevalent in real life. However, after representation learning and the processing of incomplete information, existing methods often leave representations containing task-irrelevant information. In addition, the separation between missing data imputation and clustering tasks leads to sub-optimal multi-view clustering performance. To address these issues, we propose an incomplete multi-view clustering method based on mutual information. For the problem of task-irrelevant information, we use incomplete view prediction to extract sufficient and minimal task-relevant information and provide theoretical proof from the perspective of mutual information. For the problem of separation between missing data imputation and clustering tasks, we integrate incomplete-view prediction with contrastive clustering, collaboratively enhancing the clustering performance. Comparative experiments on five public datasets, under both complete and incomplete scenarios, reveal that our method outperforms nine other competing approaches, demonstrating its effectiveness and robustness in handling multi-view data.

Abstract:
Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

Abstract:
The integration of conversational artificial intelligence (AI) into mental health care promises a new horizon for therapist-client interactions, aiming to closely emulate the depth and nuance of human conversations. Despite the potential, the current landscape of conversational AI is markedly limited by its reliance on single-modal data, constraining the systems’ ability to empathize and provide effective emotional support. This limitation stems from a paucity of resources that encapsulate the multimodal nature of human communication essential for therapeutic counseling. To address this gap, we introduce the Multimodal Emotional Support Conversation (MESC) dataset, a first-of-its-kind resource enriched with comprehensive annotations across text, audio, and video modalities. This dataset captures the intricate interplay of user emotions, system strategies, system emotions, and system responses, setting a new precedent in the field. Leveraging the MESC dataset, we propose a general Sequential Multimodal Emotional Support framework (SMES) grounded in Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES framework incorporates an LLM-based reasoning model that sequentially generates user emotion recognition, system strategy prediction, system emotion prediction, and response generation. Our rigorous evaluations demonstrate that this framework significantly enhances the capability of AI systems to mimic therapist behaviors with heightened empathy and strategic responsiveness. By integrating multimodal data in this innovative manner, we bridge the critical gap between emotion recognition and emotional support, marking a significant advancement in conversational AI for mental health support. This work not only pushes the boundaries of AI’s role in mental health care but also establishes a foundation for developing conversational agents that can provide more empathetic and effective emotional support.

Abstract:
Object description plays an important role in helping visually impaired individuals understand and compare the differences between objects. Recent multimodal large language models (MLLMs) exhibit powerful perceptual abilities and demonstrate impressive potential for generating object-centric descriptions. However, the descriptions generated by such models often still contain much content that is irrelevant to the user intent or miss important object dimension details. In special scenarios, users may only need the details of certain dimensions of an object. In this paper, we propose a training-free object description refinement pipeline, Dimension Tailor, designed to enhance user-specified details in object descriptions. This pipeline includes three steps: dimension extracting, erasing, and supplementing, which decompose the description into user-specified dimensions. Dimension Tailor can not only improve the quality of object details but also offer flexibility in including or excluding specific dimensions based on user preferences. We conducted extensive experiments to demonstrate the effectiveness of Dimension Tailor on controllable object descriptions. Notably, the proposed pipeline can consistently improve the performance of recent MLLMs.

Abstract:
Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is accomplished through joint optimization of five constraints: a decorrelation function applied to disentangled segments, three constraints ensuring common feature consistency and specific feature distinctiveness, and a prediction calibration term. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.2% in FGDG performance. Beyond that, the explainability analysis on explicit concept matching intensity between the shared concepts among categories and the model channels, along with experiments on various mainstream model architectures, substantiates the validity of FS.

Abstract:
Nowadays, generative models are shaping various fields such as art, design, and human-computer interaction, yet they are accompanied by copyright infringement and content management challenges. In response, existing research seeks to identify the unique fingerprints on the images they generate, which can be leveraged to attribute the generated images to their source models. However, existing methods are restricted to identifying models within a static set included in classifier training, incapable of adapting dynamically to newly emerging unseen models. To bridge this gap, this paper aims to develop a generalized model fingerprint extractor capable of zero-shot attribution that effectively attributes unseen models without exposure during training. Central to our method is a model synthesis technique, which generates numerous synthetic models that mimic the fingerprint patterns of real-world generative models. The design of the synthesis technique is motivated by observations on how the basic generative model's architecture building blocks and parameters influence fingerprint patterns, and it is validated through designed metrics to examine synthetic models' fidelity. Our experiments demonstrate that the fingerprint extractor, trained solely on synthetic models, achieves impressive zero-shot generalization on a wide range of real-world generative models, improving model identification and verification accuracy on unseen models by over 40% and 15%, respectively, compared to existing approaches.

Abstract:
Cross-domain few-shot segmentation (CD-FSS) is a challenging vision task that involves segmenting novel classes from unseen domains using only a few annotated examples. Recent methods typically rely on powerful fundamental models for feature extraction, combined with complex, parameter-based decoders for segmentation. Due to the scarcity of annotated samples in the target domain, training a large number of parameters from scratch only on the source domain can lead to overfitting, which harms generalization in cross-domain settings. These methods often assume that the fundamental feature extractor provides a sufficiently robust feature space, and freeze it during training to reduce the number of parameters. However, there is an inherent discrepancy between the tasks of training the fundamental model and performing segmentation, leading to a feature space mismatch. This misalignment results in features that, while useful for general tasks, may not fully satisfy the specific requirements of the segmentation task. To address this issue, we propose a metric-based approach called the High Specificity Guided Prototype Network (HSGNet). Our method is lightweight and focuses on fine-tuning the embedding network to align the feature space for the segmentation task. Specifically, we introduce a novel Feature Enrichment Module with extremely few parameters, which enhances the embedding network’s ability to better align with the segmentation requirements. Instead of using a parameter-based decoder, our approach employs a non-parameter, similarity-based, high-specificity segmentation strategy. Additionally, we introduce a non-parameter test-time refinement mechanism to further improve prediction accuracy. Extensive experiments on cross-domain benchmarks demonstrate that our method achieves state-of-the-art performance with minimal additional parameters.

Abstract:
Camouflaged object detection (COD) aims to segment camouflaged objects that exhibit patterns very similar to the surrounding environment. Recent work has shown that enhancing the feature representation with frequency information can greatly alleviate the ambiguity between foreground objects and the background. With the emergence of vision foundation models, such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module is a novel and promising research direction. Existing adapter modules mainly focus on feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets, and the proposed method outperforms 26 state-of-the-art methods by large margins. Code will be released.
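
The grouping of frequency components described in this abstract can be illustrated with a minimal sketch. It assumes a centred (fftshift-style) magnitude spectrum stored as a nested list and reads the "non-overlapping circles" as concentric rings of equal width; the function names `ring_index_map` and `reweight_spectrum` and the per-ring gains are hypothetical and are not taken from the paper, where the enhancement is learned rather than fixed.

```python
import math


def ring_index_map(height, width, num_rings):
    """Assign every position of a centred spectrum to exactly one ring.

    Assumption: rings are concentric and equally wide, measured from the
    spectrum centre (low frequencies) out to the corners (high frequencies).
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    max_radius = math.hypot(cy, cx)
    ring_width = max_radius / num_rings if max_radius > 0 else 1.0
    rings = []
    for y in range(height):
        row = []
        for x in range(width):
            radius = math.hypot(y - cy, x - cx)
            # Clamp so the outermost corner falls into the last ring.
            row.append(min(int(radius / ring_width), num_rings - 1))
        rings.append(row)
    return rings


def reweight_spectrum(magnitudes, rings, ring_gains):
    """Scale each frequency magnitude by the gain of the ring it belongs to."""
    return [
        [value * ring_gains[rings[y][x]] for x, value in enumerate(row)]
        for y, row in enumerate(magnitudes)
    ]


rings = ring_index_map(5, 5, num_rings=3)
spectrum = [[1.0] * 5 for _ in range(5)]
# Hypothetical gains: keep low frequencies, damp the middle band, boost high ones.
adjusted = reweight_spectrum(spectrum, rings, ring_gains=[1.0, 0.5, 2.0])
```

Because each position receives a single ring index, the groups cannot overlap, and a gain above or below 1 enhances or weakens that band as a whole.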

Abstract:
Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily builds upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.

Abstract:
In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image continuously, since in many cases we may want to get a slighter or stronger enhancement effect rather than one fixed adjusted result. In this paper, we propose a quality-guided image enhancement paradigm that enables image enhancement models to learn the distribution of images with various quality ratings. By learning this distribution, image enhancement models can associate image features with their corresponding perceptual qualities, which can be used to adjust images continuously according to different quality scores. To validate the effectiveness of our proposed method, a subjective quality assessment experiment is first conducted, focusing on skin tone adjustment in portrait photography. Guided by the subjective quality ratings obtained from this experiment, our method can adjust the skin tone corresponding to different quality requirements. Furthermore, an experiment conducted on 10 natural raw images corroborates the effectiveness of our model in situations with fewer subjects and fewer shots, and also demonstrates its general applicability to natural images.

Abstract:
Image inpainting aims to restore visually realistic contents from a corrupted image, while inpainting forensic methods focus on locating the inpainted regions to fight against inpainting manipulations. Motivated by these two mutually interdependent tasks, in this paper, we propose a novel image inpainting network called Adversarial Collaborative Network (AdvColabNet), which leverages the contradictory and collaborative information from the two tasks of image inpainting and inpainting forensics to enhance the progress of the inpainting model through adversarial collaborative training. Specifically, the proposed AdvColabNet is a coarse-to-fine two-stage framework. In the coarse training stage, a simple generative adversarial model-based U-Net-style network generates initial coarse inpainting results. In the fine stage, the authenticity of inpainting results is assessed using the estimated forensic mask. A forensics-driven adaptive weighting refinement strategy is developed to emphasize learning from pixels with higher probabilities of being inpainted, which helps the network to focus on the challenging regions, resulting in more plausible inpainting results. Comprehensive evaluations on the CelebA-HQ and Places2 datasets demonstrate that our method achieves state-of-the-art robustness performance in terms of PSNR, SSIM, MAE, FID, and LPIPS metrics. We also show that our method effectively deceives the proposed inpainting forensic method compared to state-of-the-art inpainting methods, further demonstrating the superiority of the proposed method.

Abstract:
Deep multi-view subspace clustering aims to reveal a common subspace structure by exploiting rich multi-view information. Despite promising progress, current methods focus only on multi-view consistency and complementarity, often overlooking the adverse influence of entangled superfluous information in features. Moreover, most existing works lack scalability and are inefficient for large-scale scenarios. To this end, we propose a deep subspace clustering method via Multi-view Feature Decoupling (MvFD). First, MvFD incorporates well-designed multi-type auto-encoders with self-supervised learning, explicitly decoupling consistent, complementary, and superfluous features for every view. The disentangled and interpretable feature space can then better serve unified representation learning. By integrating these three types of information within a unified framework, we employ information theory to obtain a minimal and sufficient representation with high discriminability. Besides, we introduce a deep metric network to model self-expression correlation more efficiently, where network parameters remain unaffected by changes in sample numbers. Extensive experiments show that MvFD yields state-of-the-art performance on various types of multi-view datasets.

Abstract:
Learning-based video denoisers have attained state-of-the-art (SOTA) performance on public evaluation benchmarks. Nevertheless, they typically encounter significant performance drops when applied to unseen real-world data, owing to inherent data discrepancies. To address this problem, this work delves into model pretraining techniques and proposes masked central frame modeling (MCFM), a new video pretraining approach that significantly improves the generalization ability of the denoiser. This proposal stems from a key observation: pretraining a denoiser by reconstructing intact videos from corrupted sequences, where the central frames are masked at a suitable probability, contributes to achieving superior performance on real-world data. Building upon MCFM, we introduce a robust video denoiser, named MVDenoiser, which is first pretrained on massive available ordinary videos for general video modeling, and then finetuned on costly real-world noisy/clean video pairs for noisy-to-clean mapping. Additionally, beyond the denoising model, we further establish a new paired real-world noisy video dataset (RNVD) to facilitate cross-dataset evaluation of generalization ability. Extensive experiments conducted across different datasets demonstrate that the proposed method achieves superior performance compared to existing methods.
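
The corruption step behind masked central frame modeling can be sketched as follows. This is a minimal illustration, assuming each frame is a flat list of pixel values and that masking means replacing the central frame with a constant; the function name `mask_central_frame`, the `fill_value` parameter, and the zero fill are hypothetical choices, and the abstract does not specify the masking probability or the mask content.

```python
import random


def mask_central_frame(frames, mask_prob, rng=None, fill_value=0.0):
    """Return (corrupted_clip, target_clip) for one pretraining sample.

    With probability `mask_prob` the central frame is replaced by a constant
    frame; the target is always the intact clip, so a model trained on these
    pairs has to reconstruct the central frame from its neighbours.
    """
    if rng is None:
        rng = random.Random()
    centre = len(frames) // 2
    corrupted = [list(frame) for frame in frames]  # copy, keep the input intact
    if rng.random() < mask_prob:
        corrupted[centre] = [fill_value] * len(frames[centre])
    target = [list(frame) for frame in frames]
    return corrupted, target


clip = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three tiny two-pixel frames
corrupted, target = mask_central_frame(clip, mask_prob=0.5, rng=random.Random(0))
```

Setting `mask_prob` to 0 reduces this to plain reconstruction, while 1 always hides the central frame; the abstract reports that a suitable intermediate probability works best.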

Abstract:
Semantic segmentation in complex scenes relies not only on object appearance but also on object location and the surrounding environment. Nonetheless, it is difficult to model long-range context in the format of pairwise point correlations due to the huge computational cost for large-scale point clouds. In this article, we propose using regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden. We introduce a novel Region-Enhanced Feature Learning Network (REFL-Net) that leverages region correlations to enhance point feature learning. We design a region-based feature enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction stage and a Region Dependency Modeling stage. In the first stage, the input points are grouped into a set of regions based on their semantic and spatial proximity. In the second stage, we explore inter-region semantic and spatial relationships by employing a self-attention block on region features and then fuse point features with the region features to obtain more discriminative representations. Our proposed RFE module is plug-and-play and can be integrated with common semantic segmentation backbones. We conduct extensive experiments on ScanNetV2 and S3DIS datasets and evaluate our RFE module with different segmentation backbones. Our REFL-Net achieves 1.8% mIoU gain on ScanNetV2 and 1.7% mIoU gain on S3DIS with negligible computational cost compared with backbone models. Both quantitative and qualitative results show the powerful long-range context modeling ability and strong generalization ability of our REFL-Net.

Abstract:
Compositional Zero-Shot Learning (CZSL) aims to learn visual concepts (i.e., attributes and objects) from seen compositions and combine them to predict unseen compositions. Existing CZSL methods typically use traditional visual encoders (i.e., CNN and Transformer) or image encoders from Visual-Language Models (VLMs) to encode image features. However, traditional visual encoders need more multi-modal textual information, and image encoders of VLMs exhibit dependence on pre-training data, making them less effective when used independently for predicting unseen compositions. To overcome this limitation, we propose a novel approach based on the joint modeling of traditional visual encoders and VLM visual encoders to enhance the prediction ability for uncommon and unseen compositions. Specifically, we design an adaptive fusion module that automatically adjusts the weighted parameters of similarity scores between traditional and VLM methods during training, and these weighted parameters are inherited during the inference process. Given the significance of disentangling attributes and objects, we design a Multi-Attribute Object Module that, during the training phase, incorporates multiple pairs of attributes and objects as prior knowledge, leveraging this rich prior knowledge to facilitate the disentanglement of attributes and objects. Building upon this, we select the text encoder from VLMs to construct the Adaptive Fusion Network. We conduct extensive experiments on the Clothing16K, UT-Zappos50K, and C-GQA datasets, achieving excellent performance on the Clothing16K and UT-Zappos50K datasets.

Abstract:
Thanks to the remarkably expressive power for depicting structural data, Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, GCN is designed to operate on irregular graphs of skeletons, making it difficult to deal with other modalities represented on regular grids directly. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore the fusion of skeleton and other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed GCN-based Multi-modality Fusion Network (GMFNet), to efficiently utilize complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine finer and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a skeleton-like (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in GMFM. The fusion results are fed into the following main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared to the multi-stream approaches, GMFNet can be implemented within an end-to-end training pipeline and thereby reduces the training complexity. Experimental results show the proposed GMFNet achieves impressive performance on two large-scale data sets of NTU RGB+D 60 and 120.

Abstract:
In object detection, the cost of labeling is very high because it requires not only confirming the categories of multiple objects in an image but also accurately determining the bounding box of each object. Thus, integrating active learning into object detection is of considerable practical significance. In this paper, we propose a classification committee for active deep object detection, introducing a discrepancy mechanism among multiple classifiers for sample selection when training object detectors. The model contains a main detector and a classification committee. The main detector is the target object detector, trained on a labeled pool composed of the selected informative images. The role of the classification committee is to select the most informative images according to their uncertainty values from the view of classification, which is expected to focus more on the discrepancy and representativeness of instances. Specifically, the committee computes the uncertainty of a given instance within the image by measuring the discrepancy among the outputs of its classifiers, which are pre-trained via the proposed Maximum Classifiers Discrepancy Group Loss (MCDGL). The most informative images are finally determined by selecting the ones with many high-uncertainty instances. Besides, to mitigate the impact of interference instances, we design a Focusing on Positive Instances Loss (FPIL) to give the committee the ability to automatically focus on the representative instances as well as precisely encode their discrepancies for the same instance. Experiments are conducted on the Pascal VOC and COCO datasets with several popular object detectors. The results show that our method outperforms state-of-the-art active learning methods, which verifies the effectiveness of the proposed method.
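
The selection rule in this abstract (score each instance by committee disagreement, then pick images with many high-uncertainty instances) can be sketched as below. The abstract does not say how discrepancy is measured, so the mean pairwise L1 distance between class distributions used here is an assumption, and the names `instance_uncertainty`, `image_score`, `select_images`, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def instance_uncertainty(committee_probs):
    """Mean pairwise L1 distance between committee members' class distributions.

    `committee_probs` holds one probability vector per classifier for the same
    instance; at least two classifiers are required.
    """
    pairs = list(combinations(committee_probs, 2))
    total = sum(sum(abs(a - b) for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)


def image_score(instances, threshold):
    """Count the instances in one image whose uncertainty exceeds the threshold."""
    return sum(1 for probs in instances if instance_uncertainty(probs) > threshold)


def select_images(pool, budget, threshold):
    """Pick the `budget` images with the most high-uncertainty instances."""
    ranked = sorted(pool, key=lambda name: image_score(pool[name], threshold), reverse=True)
    return ranked[:budget]


agree = [[0.9, 0.1], [0.9, 0.1]]     # two classifiers, same prediction
disagree = [[0.9, 0.1], [0.1, 0.9]]  # two classifiers, opposite predictions
pool = {"a": [agree, agree], "b": [disagree, agree], "c": [disagree, disagree]}
chosen = select_images(pool, budget=2, threshold=0.5)
```

In an active learning loop, the chosen images would be sent for annotation and added to the labeled pool before the main detector is retrained.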

Abstract:
Multi-view clustering aims to study the complementary information across views and discover the underlying structure. To reduce the relatively high computational cost of existing approaches, anchor-based methods have been presented recently. Even with acceptable clustering performance, these methods tend to map the original representation from multiple views into a fixed shared graph based on the original dataset. However, most studies ignore the discriminative property of the learned anchors, which impairs the representation capability of the built model. Moreover, the complementary information among anchors across views is not ensured when the shared anchor graph is simply learned without considering the quality of view-specific anchors. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) to handle the above issues. We learn discriminative view-specific feature representations from the original dataset and build anchors from different views based on these representations, which increases the quality of the shared anchor graph. The discriminative feature learning and consensus anchor graph construction are integrated into a unified framework so that they refine each other. The optimal anchors from multiple views and the consensus anchor graph are learned with orthogonal constraints. We give an iterative algorithm to solve the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.

Abstract:
Underwater image enhancement (UIE) is a highly challenging task due to the complexity of the underwater environment and the diversity of underwater image degradation. Thanks to the application of deep learning, current UIE methods have made significant progress. However, most existing deep learning-based UIE methods follow a single-stage network that cannot effectively address the diverse degradations simultaneously. In this paper, we propose to address this issue by designing a two-stage deep learning framework and taking advantage of cascaded contrastive learning to guide the network training of each stage. The proposed method is called CCL-Net for short. Specifically, the proposed CCL-Net involves two cascaded stages, i.e., a color correction stage tailored to the color deviation issue and a haze removal stage tailored to improve the visibility and contrast of underwater images. To guarantee that the underwater image is progressively enhanced, we also apply a contrastive loss as an additional constraint to guide the training of each stage. In the first stage, the raw underwater images are used as negative samples for building the first contrastive loss, ensuring the enhanced results of the first color correction stage are better than the original inputs. In the second stage, the enhanced results of the first color correction stage, rather than the raw underwater images, are used as the negative samples for building the second contrastive loss, thus ensuring the final enhanced results of the second haze removal stage are better than the intermediate color-corrected results. Extensive experiments on multiple benchmark datasets demonstrate that our CCL-Net can achieve superior performance compared to many state-of-the-art methods. In addition, a series of ablation studies also verify the effectiveness of each key component involved in the proposed CCL-Net.
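
The choice of negative samples in the two stages can be made concrete with a small sketch. It assumes images are flat lists of pixel values, that the positive sample is a clean reference image, and that the loss is a ratio of mean L1 distances in pixel space; none of these details is given in the abstract, and the names `l1_distance`, `contrastive_loss`, and `cascaded_losses` are hypothetical.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def contrastive_loss(anchor, positive, negative, eps=1e-7):
    """Small when the anchor is close to the positive and far from the negative."""
    return l1_distance(anchor, positive) / (l1_distance(anchor, negative) + eps)


def cascaded_losses(raw, stage1_out, stage2_out, reference):
    """Stage 1 uses the raw input as negative; stage 2 uses the stage-1 output."""
    loss_stage1 = contrastive_loss(stage1_out, reference, raw)
    loss_stage2 = contrastive_loss(stage2_out, reference, stage1_out)
    return loss_stage1, loss_stage2


raw = [0.0, 0.0, 0.0, 0.0]         # degraded input
reference = [1.0, 1.0, 1.0, 1.0]   # assumed clean target (positive sample)
stage1_out = [0.5, 0.5, 0.5, 0.5]  # after colour correction
stage2_out = [0.9, 0.9, 0.9, 0.9]  # after haze removal
loss1, loss2 = cascaded_losses(raw, stage1_out, stage2_out, reference)
```

Swapping the negative from the raw image to the stage-1 output is what makes the constraint cascaded: the second stage is only rewarded for moving beyond the intermediate result, not merely beyond the original input.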

Abstract:
Recent progress in remote sensing image (RSI) super-resolution (SR) has exhibited remarkable performance using deep neural networks, e.g., Convolutional Neural Networks and Transformers. However, existing SR methods often suffer from either a limited receptive field or quadratic computational overhead, resulting in sub-optimal global representation and unacceptable computational costs in large-scale RSI. To alleviate these issues, we make the first attempt to integrate the Vision State Space Model (Mamba) for RSI-SR, which specializes in processing large-scale RSI by capturing long-range dependency with linear complexity. To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR, to explore the spatial and frequency correlations. In particular, our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM) to grasp their merits for effective spatial-frequency fusion. Considering that global and local dependencies are complementary and both beneficial for SR, we further recalibrate these multi-level features for accurate feature fusion via learnable scaling adaptors. Extensive experiments on AID, DOTA, and DIOR benchmarks demonstrate that our FMSR outperforms the state-of-the-art Transformer-based method HAT-L in terms of PSNR by 0.11 dB on average, while consuming only 28.05% and 19.08% of its memory consumption and complexity, respectively.

Affiliations: School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China; Department of Computer Science, University of Hong Kong, Hong Kong; School of Computer Science and Informatics, Cardiff University, Cardiff, U.K.; National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen, China

Abstract:
Low-light image enhancement (LIE) aims to restore images taken under poor lighting conditions, thereby extracting more information and details to robustly support subsequent visual tasks. While past deep learning (DL)-based techniques have achieved certain restoration effects, these existing methods treat all samples equally, ignoring the fact that difficult samples may be detrimental to the network's convergence at the initial stages of network training. In this paper, we introduce a self-paced learning (SPL)-based LIE method named SPNet, which consists of three key components: the feature extraction module (FEM), the low-light image decomposition module (LIDM), and a pre-trained denoise module. Specifically, for a given low-light image, we first input the image, its pseudo-reference image, and its histogram-equalized version into the FEM to obtain preliminary features. Second, to avoid ambiguities during the early stages of training, these features are then adaptively fused via an SPL strategy and processed for retinex decomposition via LIDM. Third, we enhance the network performance by constraining the gradient prior relationship between the illumination components of the images. Finally, a pre-trained denoise module reduces noise inherent in LIE. Extensive experiments on nine public datasets reveal that the proposed SPNet outperforms eight state-of-the-art DL-based methods in both qualitative and quantitative evaluations and outperforms three conventional methods in quantitative assessments.

Abstract:
In recent years, zero-shot image recognition with semantic knowledge has achieved good performance thanks to vision-language models. However, because of the complexity of 3D shapes, such models cannot fully use the semantic knowledge of 3D shapes, which results in low accuracy in zero-shot 3D shape recognition. To address this problem, we propose a Semantic-enhanced ULIP for Zero-shot 3D Shape Recognition (SE-ULIP). This method uses contrastive learning to fine-tune the text encoder in two stages: domain adaptation fine-tuning and triplets-based text encoder fine-tuning. In the domain adaptation fine-tuning, we fine-tune the image encoder and the text encoder using the views and the Semantic Descriptive Text (SDT) of each view generated by the Visual Question Answering (VQA) model, which aims to align the view features with the semantic knowledge. In the triplets-based text encoder fine-tuning, we propose an Adaptive Conditional Adjustment Context Optimization (ACACoOp) to learn the optimal context vectors. The optimal context vectors are used as the input to fine-tune the text encoder again, which enhances SE-ULIP's understanding of the semantic knowledge of 3D shapes. Experiments show that our method achieves state-of-the-art performance through the fine-tuned text encoder on three 3D backbone networks for both zero-shot and standard 3D shape recognition.

Abstract:
Scribble supervised salient object detection (SSSOD) learns to segment attractive objects from their surroundings under the supervision of sparse scribble labels. For better segmentation, depth and thermal infrared modalities supplement RGB images in complex scenes. Existing methods design separate feature extraction and multi-modal fusion strategies for RGB, RGB-Depth, RGB-Thermal, and Visual-Depth-Thermal image inputs, leading to a flood of similar models. As the recently proposed Segment Anything Model (SAM) possesses extraordinary segmentation and prompt interactive capability, we propose an SSSOD family based on SAM, named SSFam, for combined inputs of different modalities. Firstly, different modal-aware modulators are designed to attain modal-specific knowledge, which cooperates with modal-agnostic information extracted from the frozen SAM encoder for a better feature ensemble. Secondly, a siamese decoder is tailored to bridge the gap between training with scribble prompts and testing with no prompt, for stronger decoding ability. Our model demonstrates remarkable performance across combinations of different modalities, sets a new state of the art among scribble supervised methods, and comes close to fully supervised methods.

Abstract:
Visible-infrared image re-identification (VIReID) aims to match objects with the same identity appearing across different modalities. Given the significant differences between visible and infrared images, VIReID poses a formidable challenge. Most existing methods focus on extracting modality-shared features while ignoring modality-specific features, which often also contain crucial discriminative information. In addition, high-level semantic information of the objects, such as shape and appearance, is also crucial for the VIReID task. To further enhance the retrieval performance, we propose a novel one-stage CLIP-based Modality Compensation (CLIP-MC) method for the VIReID task. Our method introduces a new prompt learning paradigm that leverages the semantic understanding capabilities of CLIP to recover missing modality information. CLIP-MC comprises three key modules: Instance Text Prompt Generation (ITPG), Modality Compensation (MC), and Modality Context Learner (MCL). Specifically, the ITPG module facilitates effective alignment and interaction between image tokens and text tokens, enhancing the text encoder's ability to capture detailed visual information from the images. This ensures that the text encoder generates fine-grained descriptions of the images. The MCL module captures the unique information of each modality and generates modality-specific context tokens, which are more flexible compared to fixed text descriptions. Guided by the modality-specific context, the text encoder discovers missing modality information from the images and produces compensated modality features. Finally, the MC module combines the original and compensated modality features to obtain complete modality features that contain more discriminative information. We conduct extensive experiments on three VIReID datasets and compare the performance of our method with other existing approaches to demonstrate its effectiveness and superiority.

Abstract:
Existing pretraining methods for semantic segmentation are hampered by the task gap between global image-level pretraining and local pixel-level finetuning. Joint dense-level pretraining is a promising alternative to exploit off-the-shelf annotations from diverse segmentation datasets but suffers from low-quality class embeddings and inconsistent data and supervision signals across multiple datasets by directly employing CLIP. To overcome these challenges, we propose a novel Multi-datasEt harmoNized pretraining framework for Semantic sEgmentation (MENSA). MENSA incorporates high-quality language embeddings and momentum-updated visual embeddings to effectively model the class relationships in the embedding space and thereby provide reliable supervision information for each category. To further adapt to multiple datasets, we achieve one-to-many pixel-embedding pairing with cross-dataset multi-label mapping through cross-modal information exchange to mitigate inconsistent supervision signals and introduce region-level and pixel-level cross-dataset mixing for varying data distribution. Experimental results demonstrate that MENSA is a powerful foundation segmentation model that consistently outperforms popular supervised or unsupervised ImageNet pretrained models for various benchmarks under standard fine-tuning. Furthermore, MENSA is shown to significantly benefit frozen-backbone fine-tuning and zero-shot learning by endowing pixel-level distinctiveness to learned representations.

Abstract:
Diagram question answering (DQA), which is defined as answering natural language questions according to the visual diagram context, has attracted attention and has recently become a new benchmark for evaluating the complex reasoning ability of models. However, this reasoning task is extremely challenging because of the inclusion of abstract visual objects and specialized textual terms, as well as the complex relationships between them. The scarcity of data caused by the high cost of annotation also makes large-scale deep models ineffective for the DQA task. To address the above challenges, this paper proposes the cross-modal alignment-guided self-supervised learning model for DQA (CAS-DQA). Unlike previous works, the CAS-DQA model focuses on learning internal visual-textual object relationships, proposes an attention mechanism module based on object alignment, and effectively integrates cross-modal knowledge units for diagram understanding. In addition, the CAS-DQA model constructs two self-supervised learning (SSL) tasks via intermediate results of visual-textual object alignment. These two tasks exploit the unnoticed objects inside the diagram to understand the diagram fully. They also effectively increase the amount of diagram question-answering data to address the challenge of data scarcity. To the best of our knowledge, the CAS-DQA model is the first to extend SSL strategies to the diagram question-answering task. We evaluate the CAS-DQA model on three different datasets. The results of extensive experiments show that our model significantly outperforms baselines in different scenarios and that the internal object alignment module and self-supervised tasks produce excellent results.

Abstract:
Owing to the proliferation of user-generated videos on the Internet, blind video quality assessment (BVQA) at the edge attracts growing attention. The use of deep-learning-based methods at the edge is restricted by their large model sizes and high computational complexity. In light of this, a novel lightweight BVQA method called GreenBVQA is proposed in this work. GreenBVQA features a small model size, low computational complexity, and high performance. Its processing pipeline includes: video data cropping, unsupervised representation generation, supervised feature selection, and mean-opinion-score (MOS) regression and ensembles. We conduct experimental evaluations on three BVQA datasets and show that GreenBVQA can offer state-of-the-art performance in the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC) metrics while demanding significantly smaller model sizes and lower computational complexity. Thus, GreenBVQA is well-suited for edge devices.
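
For readers unfamiliar with the two metrics named above, the following is a minimal pure-Python sketch of how PLCC and SROCC can be computed between predicted scores and MOS labels. It is not taken from GreenBVQA; the function names and the sample values are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): PLCC computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predictions and MOS labels with a monotonic but non-linear relation.
predicted = [1.0, 2.0, 3.0, 4.0]
mos = [1.0, 4.0, 9.0, 16.0]

print("PLCC :", pearson(predicted, mos))
print("SROCC:", spearman(predicted, mos))
```

PLCC measures linear agreement, while SROCC depends only on ordering, so a monotonic but non-linear predictor scores perfectly on SROCC and slightly lower on PLCC.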

Abstract:
The ever-increasing demands for intuitive interactions in virtual reality have led to surging interest in facial expression recognition (FER). There are, however, several issues commonly seen in existing methods, including narrow receptive fields and homogenous supervisory signals. To address these issues, we propose in this paper a novel multimodal supervision-steering transformer for facial expression recognition in the wild, referred to as FER-former. Specifically, to address the limitation of narrow receptive fields, a hybrid feature extraction pipeline is designed by cascading both prevailing CNNs and transformers. To deal with the issue of homogenous supervisory signals, a heterogeneous domain-steering supervision module is proposed to incorporate text-space semantic correlations to enhance image features, based on the similarity between image and text features. Additionally, a FER-specific transformer encoder is introduced to characterize conventional one-hot label-focusing and CLIP-based text-oriented tokens in parallel for final classification. Based on the collaboration of multifarious token heads, global receptive fields with multimodal semantic cues are captured, delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over the existing state-of-the-art methods.

Abstract:
An ideal full-reference image quality assessment (FR-IQA) model should exhibit both high separability for images with different quality and compactness for images with the same or indistinguishable quality. However, existing learning-based FR-IQA models that directly compare images in deep-feature space usually overemphasize quality separability, neglecting to maintain compactness when images are of similar quality. In our work, we identify that the perception bias mainly stems from an inappropriate subspace where images are projected and compared. For this issue, we propose a Debiased Mapping based quality Measure (DMM), leveraging orthonormal bases formed by singular value decomposition (SVD) in the deep features domain. The SVD effectively decomposes the quality variations into singular values and mapping bases, enabling quality inference with more reliable feature difference measures. Extensive experimental results reveal that our proposed measure mitigates the perception bias effectively and demonstrates excellent quality prediction performance on various IQA datasets.

Abstract:
Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method, can suppress distracting objects and enhance the tracker.

Abstract:
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation.

Abstract:
Current clothes-changing person re-identification (re-id) approaches usually perform retrieval based on clothes-irrelevant features, while neglecting the potential of clothes-relevant features. However, we observe that relying solely on clothes-irrelevant features for clothes-changing re-id is limited, since they often lack adequate identity information and suffer from large intra-class variations. On the contrary, clothes-relevant features can be used to discover same-clothes intermediaries that possess informative identity clues. Based on this observation, we propose a Feasibility-Aware Intermediary Matching (FAIM) framework to additionally utilize clothes-relevant features for retrieval. First, an Intermediary Matching (IM) module is designed to perform an intermediary-assisted matching process. This process involves using clothes-relevant features to find informative intermediaries, and then using clothes-irrelevant features of these intermediaries to complete the matching. Second, in order to reduce the negative effect of low-quality intermediaries, an Intermediary-Based Feasibility Weighting (IBFW) module is designed to evaluate the feasibility of the intermediary matching process by assessing the quality of intermediaries. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on several widely-used clothes-changing re-id benchmarks.

Abstract:
Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference. One popular approach involves fine-tuning the model with cross-entropy loss according to estimated pseudo-labels. However, its performance is significantly affected by noisy pseudo-labels. This study reveals that minimizing the classification error of each sample causes the cross-entropy loss's vulnerability to label noise. To address this issue, we propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation. First, we decouple the optimization of class prototypes. For each class prototype, we reduce its distance with positive samples and enlarge its distance with negative samples in a contrastive manner. This strategy prevents the model from overfitting to noisy pseudo-labels. Second, we propose a memory-based strategy to enhance DPL's robustness for the small batch sizes often encountered in TTA. We update each class's pseudo-feature from a memory in a momentum manner and insert an additional DPL loss. Finally, we introduce a consistency regularization-based approach to leverage samples with unconfident pseudo-labels. This approach transfers feature styles of samples with unconfident pseudo-labels to those with confident pseudo-labels. Thus, more reliable samples for TTA are created. The experimental results demonstrate that our methods achieve state-of-the-art performance on domain generalization benchmarks, and reliably improve the performance of self-training-based methods on image corruption benchmarks.

Abstract:
Recently, Vision Transformers (ViT) and Convolutional Neural Networks (CNN) have started to be combined into hybrid deep architectures with a better trade-off among model capacity, generalization, and latency. Most of these hybrid architectures directly stack a self-attention module with static convolution or fuse their outputs through two pathways within each block. Instead, we present a new Transformer architecture (namely Stream-ViT) that integrates ViT with streamlined convolutions, i.e., a series of high-to-low resolution convolutions. The kernels of each convolution are dynamically learnt on the basis of current input features plus pre-learnt kernels throughout the whole network. The new architecture incorporates a critical pathway to streamline kernel generation that triggers the interactions between dynamically learnt convolutions across different layers. Moreover, the introduction of a layer-wise streamlined convolution is functionally equivalent to a squeezed version of a multi-branch convolution structure, thereby improving the capacity of the self-attention module with enlarged cardinality in a cost-efficient manner. We validate the superiority of Stream-ViT over multiple vision tasks, and its performance surpasses state-of-the-art ViT and CNN backbones with comparable FLOPs.

Abstract:
Vision-and-Language Navigation (VLN) requires an agent to follow given instructions to navigate. Despite significant progress, a model trained on seen environments suffers a performance drop on unseen environments due to distribution shift. To improve generalization, an existing method attempts to apply test-time adaptation to VLN. However, it needs to access the training data and all testing data to update the model before inference. This setting is not suitable for real applications because it is hard for the agent to access training data and all testing data when it is deployed in a new environment. In this paper, we consider a more practical setting with source-free and online-inference test-time adaptation. In other words, the model can only access one testing sample for test-time adaptation. In this setting, the model may suffer from catastrophic forgetting of the learned knowledge and unstable parameter update issues. To solve these challenges, we propose an elastic adaptation model (EAM) that consists of an auxiliary decision model and a sample replay mechanism. We use the online testing samples to adapt the auxiliary decision model to new environments, which cooperates with the frozen original model to make better action decisions. The sample replay mechanism stores the historical testing samples to make the adaptation process more stable. Our method is model-agnostic and can be applied effortlessly to most existing methods. Experimental results show that our method achieves stable performance improvement based on three existing methods on three VLN benchmark datasets.

Abstract:
Cross-modal retrieval (CMR) typically involves learning common representations to directly measure similarities between multimodal samples. Most existing CMR methods assume multimodal samples come in pairs and employ joint training to learn common representations, limiting the flexibility of CMR. Although some methods adopt independent training strategies for each modality to improve flexibility in CMR, they utilize randomly initialized orthogonal matrices to guide representation learning, which is suboptimal since they assume inter-class samples are independent of each other, limiting the potential of semantic alignments between sample representations and ground-truth labels. To address these issues, we propose a novel method termed Deep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL includes two core modules, i.e., Selective Prior Learning (SPL) and Reversible Semantic Consistency learning (RSC). More specifically, SPL first learns a transformation weight matrix on each modality and selects the best one based on the quality score as the Prior, which largely avoids indiscriminate selection of priors learned from low-quality modalities. Then, RSC employs a Modality-invariant Representation Recasting mechanism (MRR) to recast the potential modality-invariant representations from sample semantic labels by the generalized inverse matrix of the prior. Since labels are devoid of modal-specific information, we utilize the recast features to guide the representation learning, thus maintaining semantic consistency to the fullest extent possible. In addition, a feature augmentation mechanism (FA) is introduced in RSC to encourage the model to learn over a wider data distribution for diversity. Finally, extensive experiments conducted on five widely used datasets and comparisons with 15 state-of-the-art baselines demonstrate the effectiveness and superiority of our DRCL.

Abstract:
Existing hand gesture recognition methods predominantly rely on a close-set assumption, which in essence limits the viewpoints, gesture categories, and hand shapes at test time to closely resemble those seen during training. This requirement is however rarely met in practice, as images are often captured from unconstrained viewpoints, with novel gestures and unseen hand shapes that can differ significantly from the training data. This motivates us to investigate an open-set hand gesture recognition problem, where hand gestures are still recognizable from unconstrained viewpoints, and novel gesture classes and hand shapes can be incrementally learned with just a few examples. To address this, we propose a viewpoint influence elimination network that extracts view-independent features, significantly improving performance in scenarios with unconstrained viewpoints. Moreover, a joint-weighted classification scheme is introduced to augment the cosine similarity metric for evaluating few-shot incremental learning of novel gestures and shapes. Finally, as existing hand gesture recognition datasets primarily adhere to the close-set assumption, a new hand gesture recognition dataset, OHG, is introduced in this paper, that includes a wide range of viewpoints, diverse gesture classes, and distinct hand shapes. Experimental hand gesture recognition results demonstrate the superior performance of our approach in both unconstrained viewpoint and few-shot incremental learning scenarios.

Abstract:
Unsupervised domain adaptation for object detection aims to bridge the domain gap by transferring knowledge from a labeled source domain to an unlabeled target domain, thus improving the performance of detection models. Common strategies focus on aligning the feature distributions between source and target domains to reduce their discrepancies. However, achieving complete alignment is often not feasible in real-world situations due to a lack of annotations in the target domain. Recently, Teacher-Student approaches have achieved feature alignment by generating reliable target pseudo-labels and have become the dominant solution for addressing this issue. However, due to the domain shift, the teacher model is biased toward the source domain, making it challenging to enhance the quality of target pseudo-labels. Some methods within this framework attempt to overcome the domain shift by incorporating distribution alignment components, yet these approaches also face challenges in achieving perfect alignment between domains. In this paper, we propose the Dual-Domain Teacher (DDT) method to address the domain adaptation detection problem by simultaneously detecting objects in both domains, thereby decreasing the need for perfect alignment. To address the issue of duplicate detection results produced by the Dual-Domain detection process, a candidate set refinement strategy is proposed to eliminate these duplicates across domains. Moreover, when teachers generate pseudo-labels by selecting reliable predictions with fixed confidence thresholds, valuable predictions may be overlooked in mutual learning. In our approach, a minimum variance-based dynamic threshold module is designed to mine valuable pseudo-labels by adaptively adjusting to the optimal threshold. Extensive experiments show that DDT achieves 56.7% mAP on the CityScapes-to-Foggy CityScapes task, marking a 4.8 point improvement over the latest methods. On the PASCAL VOC-to-Clipart1k task, our method reaches 51.2% mAP, outperforming the previous state of the art.

Abstract:
We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with quantitative comparisons validating the effectiveness of our approach.

Abstract:
Benefiting from the advantages of low storage cost and high retrieval efficiency, hash learning could significantly speed up large-scale cross-modal retrieval. Based on the prior annotations, most available cross-modal hashing methods introduce a margin-based constraint to generate different boundaries for each class in the inference phase, optimizing the model. However, these label-guided penalty boundaries may differ from the primitive semantic relationships between heterogeneous modalities, impairing retrieval performance. Besides, the margin-based constraint is too weak to penalize the classes with low intra-class variances or inter-class correlations, which makes it hard to learn high-quality embeddings. In this paper, we propose a novel Deep Semantic-consistent Penalizing Hashing framework (DScPH) to learn the consistent penalizing fields for all classes, achieving accurate and efficient cross-modal retrieval. Specifically, by exploring unbalanced intra-class and inter-class correlations, the consistent penalizing loss is introduced into cross-modal retrieval to learn consistent decision boundaries across classes. During training, the dice-like optimization strategy is developed to balance the pulling penalizing elements and pushing penalizing elements, facilitating the model convergence. Besides, based on the invariance of similarity measures under orthogonal transformations, the alternative quantization is proposed to minimize the errors between the learned continuous embeddings and binary discretization, maintaining the consistency of semantic relationships after performing binary projection. Extensive experiments are conducted on three benchmark datasets, and the comprehensive results validate the efficacy of our proposed DScPH framework, which outperforms the current mainstream deep cross-modal hashing algorithms.
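
The invariance property mentioned above can be checked with a small sketch: an orthogonal transformation (here a 2D rotation) leaves inner-product similarity unchanged, yet it can change how far an embedding lies from its sign-based binary code. This is only an illustration of the general principle, not DScPH's quantization procedure; all names and values are hypothetical.

```python
from math import cos, sin, pi


def dot(u, v):
    """Inner-product similarity of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def rotate(v, theta):
    """Apply a 2D rotation, which is an orthogonal transformation."""
    x, y = v
    return [cos(theta) * x - sin(theta) * y, sin(theta) * x + cos(theta) * y]


def sign_code(v):
    """Binary code in {-1, +1} obtained by taking the sign of each coordinate."""
    return [1 if x >= 0 else -1 for x in v]


def quantization_error(v):
    """Squared distance between a continuous embedding and its binary code."""
    return sum((x - b) ** 2 for x, b in zip(v, sign_code(v)))


# Hypothetical continuous embeddings of two samples.
a = [1.0, 0.0]
b = [0.6, 0.8]
theta = pi / 4

ra, rb = rotate(a, theta), rotate(b, theta)

# Similarity is preserved because both vectors undergo the same rotation.
print(dot(a, b), dot(ra, rb))

# The distance to the binary code differs before and after the rotation.
print(quantization_error(a), quantization_error(ra))
```

Because similarities are unaffected, a rotation can in principle be chosen freely to bring embeddings closer to their binary codes, which is the degree of freedom that rotation-based quantization schemes exploit.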

Abstract:
Recently, diffusion models have significantly improved the performance of Camouflaged Object Detection (COD) by adding noise to a mask and iteratively denoising it to match the target distributions. Due to the direct extraction of features from noisy masks and the lack of conditional constraints on a prediction area, the diffusion model may deviate from a correct prediction range and produce mispredictions in regions with high uncertainty. To address this issue, we propose an uncertainty-guided diffusion model (UGDNet) for COD, which explicitly quantifies uncertainty and integrates it as an anchor condition into the diffusion models to provide an initialization of the diffusion regions. The core idea is first to utilize a probability representation and transformer to explicitly model uncertainty, aiming to identify areas where a model may generate overconfident mispredictions. Then, we use the uncertainty as an anchor condition to provide a reference prediction range for the diffusion model, guiding each step of the diffusion process. Furthermore, we use uncertainty to guide feature aggregation, prompting the model to pay extra attention to the semantic features of regions with high uncertainty to refine the segmentation results further. The experimental results indicate that our proposed UGDNet achieves higher accuracy than existing state-of-the-art models on five COD benchmarks, including COD10K, NC4K, CAMO, CHAMELEON, and CDS2K.

Abstract:
Zero-shot visual grounding is the task of identifying and localizing an object in an image based on a referring expression without task-specific training. Existing methods employ heuristic rules to step-by-step perform visual perception for visual grounding. Despite their remarkable performance, there are still two limitations. First, such a rule-based manner struggles with expressions that are not covered by predefined rules. Second, existing methods lack a mechanism for identifying and correcting visual perceptual errors of incomplete information, resulting in cascading errors caused by reasoning based on incomplete visual perception results. In this article, we propose an Error-Aware Generative Reasoning (EAGR) method for zero-shot visual grounding. To address the limited adaptability of existing methods, a reasoning chain generator is presented, which prompts LLMs to dynamically generate reasoning chains for specific referring expressions. This generative manner eliminates the reliance on human-written heuristic rules. To mitigate visual perceptual errors of incomplete information, an error-aware mechanism is presented to elicit LLMs to identify these errors and explore correction strategies. Experimental results on four benchmarks show that EAGR outperforms state-of-the-art zero-shot methods by up to 10% and an average of 7%.

Abstract:
Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of text ambiguity, we propose a Multi-Modal Reference learning framework to learn robust representations. We first propose a multi-modal reference construction module to aggregate all visual and textual details of the same object into a comprehensive multi-modal reference. The multi-modal reference hence facilitates the subsequent representation learning and retrieval similarity computation. Specifically, a reference-guided representation learning module is proposed to use multi-modal references to learn more accurate visual and textual representations. Additionally, we introduce a reference-based refinement method that employs the object references to compute a reference-based similarity that refines the initial retrieval results. Extensive experiments are conducted on five fine-grained text-to-image retrieval datasets for different text-to-image retrieval tasks. The proposed method has achieved superior performance over state-of-the-art methods. For instance, on the text-to-person image retrieval dataset RSTPReid, our method achieves the Rank1 accuracy of 56.2%, surpassing the recent CFine by 5.6%.

Abstract:
Social and environmental interactions, as well as the pedestrian's goal, are crucial for pedestrian trajectory prediction, because they capture both the complex interactions in a scene and the intentions of pedestrians. However, most existing methods either learn social interactions at a single moment or supervise pedestrian trajectories using a long-term goal, resulting in suboptimal prediction performance. In this paper, we propose a novel network named Completed Interaction Network (CINet) that simultaneously considers the social interactions at all moments, the environmental interactions, and the short-term goal of pedestrians in a unified framework for pedestrian trajectory prediction. Specifically, we propose the Spatio-Temporal Transformer Layer (STTL) to fully mine the spatio-temporal information among the historical trajectories of all pedestrians in order to obtain the social interactions at all moments. Additionally, we present the Gradual Goal Module (GGM) to capture the environmental interactions under the supervision of the short-term goal, which helps in understanding the intentions of the pedestrian. Afterwards, we employ cross-attention to effectively integrate the all-moment social and environmental interactions. The experimental results on three standard pedestrian datasets, i.e., ETH/UCY, SDD, and inD, demonstrate that our method achieves new state-of-the-art performance. Furthermore, the visualization results indicate that our method predicts more reasonable trajectories in complex scenarios such as sharp turns and infeasible areas.

Abstract:
The computer vision community has witnessed an extensive exploration of vision transformers in the past two years. Drawing inspiration from traditional schemes, numerous works focus on introducing vision-specific inductive biases. However, the implicit modeling of permutation invariance and fully-connected interaction with individual tokens disrupts the regional context and spatial topology, further hindering higher-order modeling. This deviates from the principle of perceptual organization that emphasizes the local groups and overall topology of visual elements. Thus, we introduce the concept of hypergraph for perceptual exploration. Specifically, we propose a topology-aware vision transformer called HyperGraph Transformer (HGFormer). Firstly, we present a Center Sampling K-Nearest Neighbors (CS-KNN) algorithm for semantic guidance during hypergraph construction. Secondly, we present a topology-aware HyperGraph Attention (HGA) mechanism that integrates hypergraph topology as perceptual indications to guide the aggregation of global and unbiased information during hypergraph messaging. Using HGFormer as visual backbone, we develop an effective and unitive representation, achieving distinct and detailed scene depictions. Empirical experiments show that the proposed HGFormer achieves competitive performance compared to the recent SoTA counterparts on various visual benchmarks. Extensive ablation and visualization studies provide comprehensive explanations of our ideas and contributions.
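As a rough illustration of nearest-neighbor hypergraph construction, the sketch below groups each sampled center token with its k nearest tokens into one hyperedge. The function name `knn_hyperedges`, the hand-picked centers, and the squared Euclidean distance are assumptions made for illustration; the abstract does not specify how CS-KNN samples centers or measures similarity.

```python
def knn_hyperedges(tokens, center_indices, k):
    """Build one hyperedge per center: the indices of its k nearest tokens.

    Hypothetical helper, not the paper's CS-KNN implementation.
    """
    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hyperedges = []
    for c in center_indices:
        # Rank all tokens by distance to the center (the center itself ranks first).
        ranked = sorted(range(len(tokens)), key=lambda i: sq_dist(tokens[i], tokens[c]))
        hyperedges.append(ranked[:k])
    return hyperedges


# Toy 2D token features forming two loose groups.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
edges = knn_hyperedges(tokens, center_indices=[0, 3], k=3)
```

Unlike an ordinary graph edge, each hyperedge here joins three tokens at once, and a token may belong to several hyperedges.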

Abstract:
Inconsistent accuracy between classification and localization tasks is a common challenge in modern object detection. Task decoupling, which employs distinct features or labeling strategies for each task, is a widely used approach to address this issue. Although it has led to noteworthy advancements, this approach is insufficient as it neglects task interdependence and lacks an explicit consistency constraint. To bridge this gap, this paper proposes the Progressive Semi-Decoupled Detector (ProSDD) to enhance both classification and localization accuracy. Specifically, a new detection head is designed that incorporates feature suppression and enhancement mechanism (FSEM) and bidirectional interaction module (BIM). Compared with the decoupled head, it not only filters out task-irrelevant information and enhances task-related information, but also avoids excessive decoupling at the feature level. Moreover, both FSEM and BIM are used multiple times, thus forming a progressive semi-decoupled head. Then, a novel consistency loss is proposed and integrated into the loss function of object detection, ensuring harmonic performance in classification and localization. Experimental results demonstrate that the proposed ProSDD effectively alleviates inconsistent accuracy and achieves high-quality object detection. Taking the pretrained ResNet-50 as the backbone, ProSDD achieves a remarkable 43.3 AP on the MS COCO dataset, surpassing contemporary state-of-the-art detectors by a substantial margin under the equivalent configurations.

Abstract:
Text-driven style transfer for Neural Radiance Fields (NeRFs) is an emerging research topic that leverages text descriptions instead of reference style images to apply style transfer. However, existing methods for stylizing NeRFs predominantly struggle to extend to 4D dynamic scenes, due to NeRFs’ inherent limitation to static environments. Moreover, these current methods require training for each specific text input, which limits them to a single style description and significantly hampers generalizability and applications. In this paper, we introduce a novel approach to zero-shot text-driven 4D style transfer that adopts text inputs into the CLIP’s style space with a canonical feature volume. Specifically, using geometric priors from pre-trained dynamic Neural Radiance Fields, we train a canonical feature volume by rendering feature maps under the supervision of a pre-trained VGG encoder. Then we utilize CLIP’s multi-modal embedding to connect the text descriptions with style images and learn a canonical style transformation matrix in CLIP’s feature space. Experiments show that our method achieves zero-shot text-driven style transfer for dynamic neural radiance fields and maintains good multi-view and cross-time consistency.

Abstract:
Implicit neural representations (INRs), which leverage neural networks to represent signals by mapping coordinates to their corresponding attributes, have garnered significant attention. They are extensively utilized for image representation, with pixel coordinates as input and pixel values as output. In contrast to prior works focusing on investigating the effect of the model's inside components (activation function, for instance), this work pioneers the exploration of the effect of kernel transformation of input/output while keeping the model itself unchanged. A byproduct of our findings is a simple yet effective method that combines scale and shift to significantly boost INR with negligible computation overhead. Moreover, we present two perspectives, depth and normalization, to interpret the performance benefits caused by scale and shift transformation. Overall, our work provides a new avenue for future works to understand and improve INR through the lens of kernel transformation.
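A minimal sketch of a scale-and-shift transformation applied to the values an INR is fitted to, assuming the simple affine form y' = scale * y + shift. The function names and the example constants are hypothetical; the abstract does not state which values the method uses or whether the transformation is applied to the input, the output, or both.

```python
def scale_shift(values, scale, shift):
    """Apply the affine map y' = scale * y + shift element-wise."""
    return [scale * v + shift for v in values]


def inverse_scale_shift(values, scale, shift):
    """Undo the affine map so that predictions return to the original range."""
    return [(v - shift) / scale for v in values]


# Pixel intensities in [0, 1] mapped to [-1, 1] before fitting (assumed constants).
pixels = [0.0, 0.25, 0.5, 1.0]
targets = scale_shift(pixels, scale=2.0, shift=-1.0)
recovered = inverse_scale_shift(targets, scale=2.0, shift=-1.0)
```

The network itself is untouched in this sketch: only the data it is fitted to is transformed, and the inverse map is applied to its predictions.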

Abstract:
Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, with the unknown test set (negative) having a disjoint label space from the known test set (positive), a scenario referred to as full-label shift. This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with both known and unknown classes co-occurring in the negatives, leading to a more complex super-label shift that better reflects real-world scenarios. To tackle this challenge, we propose the OpenSlot framework, based on object-centric learning, which uses slot features to represent diverse class semantics and generate class predictions. The proposed anti-noise slot (ANS) technique helps mitigate the impact of noise (invalid or background) slots during classification training, addressing the semantic misalignment between class predictions and ground truth. We evaluate OpenSlot on both mixed and conventional OSR benchmarks. Without elaborate designs, our method not only outperforms existing approaches in detecting super-label shifts across OSR tasks, but also achieves state-of-the-art performance on conventional benchmarks. Meanwhile, OpenSlot can localize class objects without using bounding boxes during training, demonstrating competitive performance in open-set object detection and potential for generalization.

Abstract:
A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications.

Abstract:
This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model’s comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 $\text{AP}_{\text{3D}}$@0.7 improvement on the KITTI dataset.
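For readers unfamiliar with the TSDF, the sketch below shows the common textbook definition for a single voxel along a camera ray: the signed distance between the observed surface depth and the voxel depth, divided by a truncation distance and clipped to [-1, 1]. The function name and the example numbers are hypothetical, and the abstract does not describe how 3DGeoDet derives its TSDF from predicted depth.

```python
def tsdf_value(surface_depth, voxel_depth, truncation):
    """Truncated signed distance for one voxel along a camera ray.

    Positive values lie in front of the surface, negative values behind it.
    """
    signed_distance = surface_depth - voxel_depth
    # Normalize by the truncation distance and clip to [-1, 1].
    return max(-1.0, min(1.0, signed_distance / truncation))


# A surface predicted at depth 2.0 with a truncation distance of 0.5 (assumed values).
in_front = tsdf_value(2.0, 1.0, 0.5)      # far in front of the surface, clipped
on_surface = tsdf_value(2.0, 2.0, 0.5)    # exactly on the surface
just_behind = tsdf_value(2.0, 2.25, 0.5)  # slightly behind the surface
```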

Abstract:
Consistency regularization has prevailed in semi-supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image-augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi-Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image-augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point-to-point alignment and prototype-based intra-class compactness. Moreover, we propose a self-adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature-perturbation based Prediction consistency learning. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.

Abstract:
Human mesh recovery (HMR) holds significant utility in many applications. Studying HMR involving various types of sensors is necessary, as it enables the acquisition of human meshes in diverse scenes. Unlike HMR based on RGB images, HMR based on LiDAR has received considerably less attention in previous works. The major challenge in estimating human poses and meshes from LiDAR point clouds lies in their sparsity, noise, and incompleteness. To address these challenges, we propose a LiDAR-based 3D human mesh recovery algorithm, called LiDAR-HMR. This algorithm involves estimating a sparse representation of a human (3D human pose) and gradually reconstructing the body mesh. To better leverage the 3D structural information of point clouds, we propose a point-cloud-to-SMPL pipeline that uses the original point cloud features to guide the reconstruction. The experimental results on four publicly available datasets demonstrate the effectiveness of LiDAR-HMR.

Abstract:
The rapidly evolving realm of Extended Reality (XR) demands high-bandwidth and low-latency communication to support immersive experiences such as high-resolution 360-degree videos and real-time interactions in virtual reality gaming. In this study, “resources” are defined as digital assets essential for XR applications, divided into “static resources” (immutable media files such as textures and video segments) and “dynamic resources” (real-time user data and interactive elements crucial for user interactions). A primary challenge in XR environments is optimizing the delivery and caching of these resources within existing network infrastructures to enhance the Quality of Experience (QoE) for users. We introduce a novel hybrid network architecture that integrates resource caching, user-to-user communication, and central server oversight. This architecture not only ensures reliable delivery but also significantly reduces communication latency. Preliminary experiments, conducted under conditions where each node in the network has a 10% chance of failing at any given time, demonstrate that our approach enhances the delivery efficiency of static resources by 68%, affecting 38% of communications, while latency increases by 22% in 4% of cases. For dynamic resources, it reduces latency in 89% of cases by an average of 30%, though 8% of cases experienced a 36% increase in latency. These results affirm the effectiveness of our architecture in enhancing user experience in XR environments under challenging network conditions.

Abstract:
Few-shot open-set recognition (FSOSR) poses a significant challenge as it requires identifying unknown classes while maintaining the classification performance of known classes, despite having limited access to labeled training samples. Current methods often employ non-directional metric-based losses to encapsulate feature attributes within the embedding space, inadvertently disregarding the potential influence of spatial distribution deviations of feature representations on open-set recognition performance. To address this, we present a novel directional metric-based method termed Bilevel Direction Preserving (BiDirP). This method incorporates two direction-preserving regularizers operating at distinct levels, specifically at the instance and prototype levels. The combined application of these two direction-preserving regularizers effectively enhances the spatial separation between prototypes of different classes and refines the classification decision boundaries, which results in an improved discriminative ability to differentiate unknown classes within a broader open space. Comprehensive experiments on public benchmarks show that BiDirP can significantly improve the detection ability of unknown classes while correctly classifying known classes.

Abstract:
Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs’ limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.

Abstract:
Face recognition owes its success to the availability of large-scale training data. Recent adaptive margin-based loss functions pay more attention to hard (misclassified) samples, resulting in more discriminative face embeddings. However, large-scale datasets inevitably include open-set noise samples, which are usually mistaken for hard samples by mining-based methods and thus mislead the training of the model. In this work, we redefine hard samples and further design a dynamic association learning strategy for mining hard samples while ignoring noise. We argue that the difficulty of recognizing a sample depends on both identity-related and objective factors. On one hand, intrinsic attributes such as facial structure and face shape inherently influence the ease of identity recognition. On the other hand, external factors, including pose, occlusion, and resolution, directly affect the recognizability of a sample. In particular, noise samples should not be regarded as hard samples, although they pose challenges for the deep network similar to those posed by hard samples. To this end, we propose an associated prototype learning method to achieve an approximation of face identity difficulty by exploring the fitting trends of the identity prototype. Furthermore, we design a dynamic sample learning method to distinguish noise samples from hard samples by observing the distance fluctuation from the class center during sample learning. All observations are integrated into the loss function through adaptive margins and sample weights. Extensive experiments and visualizations on several datasets demonstrate that our method significantly outperforms state-of-the-art counterparts.

Abstract:
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. MLLMs involve significant external knowledge within their parameters; however, it is challenging to continually update these models with the latest knowledge, which involves huge computational costs and poor interpretability. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs. We first leverage the question to instruct the extraction of visual information through interactions with one set of learnable queries, minimizing irrelevant interference and redundancy during retrieval and generation. Besides, we introduce a pre-trained multimodal adaptive fusion module to achieve question text-to-multimodal retrieval and integration of multimodal knowledge by projecting visual and language modalities into a unified semantic space. Furthermore, we present an Adaptive Selection Knowledge Generation (ASKG) strategy to train the generator to autonomously discern the relevance of retrieved knowledge, which realizes excellent denoising performance. Extensive experiments on open multimodal question-answering datasets demonstrate that RA-BLIP achieves significant performance and surpasses the state-of-the-art retrieval-augmented models.

Abstract:
Complex action recognition aims to identify multiple actions over a long time. Multiple actions may occur at the same time (defined as simultaneous actions), and may occur after each other (defined as each action). Complex action recognition may suffer from two challenges. (1) Temporal repeated bias. The same action may repeat in a temporal duration. In this duration, the prediction may be biased to the majority of actions, which occur repeatedly in the past temporal frames. (2) Epistemic uncertainty of multiple actions. When there are multiple simultaneous actions in one frame, this frame’s feature may result in the distributions of multiple actions overlapping each other. Without modeling proper relations between actions, the model may be hindered from accurately explaining certain categories in multiple actions (defined as the model’s epistemic uncertainty). In this work, we propose an Instructive Probabilistic Transformer, which contains a probabilistic temporal memorizer and a probabilistic prototype Transformer. First, to alleviate temporal repeated bias, we design a probabilistic temporal memory module, which learns probabilistic temporal gates to localize each action. The probabilistic gates instruct the selective memory of each action in long-term frames. Second, we cluster features to capture common action semantics among features (defined as action prototypes). To alleviate the epistemic uncertainty of multiple actions, we design a probabilistic prototype Transformer module. This module learns probabilistic relations depending on each prototype, which can ensure the separation between different prototypes. Third, to ensure the proper probabilistic relations depending on each prototype, we extend action loss with distribution loss to learn uncertainty-aware action loss. In uncertainty-aware action loss, the distribution loss measures the consistency between probabilistic relations and prototype relation distribution. The prediction uncertainty is learned by analyzing the entropy of multiple predictions, and helps to ensure the effect between action loss and distribution loss. Extensive experiments demonstrate that our method achieves state-of-the-art performance on Charades, Breakfast Actions, and MultiTHUMOS.

Abstract:
Camouflaged object detection (COD) is a challenging task that aims to accurately detect objects concealed in the surrounding environment. The difficulty is largely attributed to the intrinsic similarity between the camouflaged objects and the surrounding environment. To address this challenge, we propose a Spatial-Frequency Collaborative Learning network for COD (SFCNet). Specifically, we propose a Domain Transformation Fusion (DTF) module to handle the similarity between the camouflaged objects and the background, because when processed in the frequency domain, the features of the camouflaged object and the background become easy to discriminate. Then, we design a Cross-domain Integration Unit (CIU) to integrate the high-level features progressively through a Spatial-Frequency Coordinated Fusion (SFCF) module and a Multi-scale Feature Enhancement (MFE) module. Finally, the low-level features are combined with the high-level features from different decoding stages to correct the camouflaged objects in detail. In addition, an Edge Amplification (EA) module is designed to enable the model to pay attention to the global contour of the camouflaged object, which facilitates the generation of prediction maps with accurate object boundaries. Extensive experiments on four benchmark COD datasets show that SFCNet outperforms state-of-the-art (SOTA) COD models. Meanwhile, it also has a low parameter count (21.01 M) and low computational complexity (24.14 G).

Abstract:
Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision is limited in its ability to specify the emotional concept of different facial expressions. In this paper, we observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as a process of matching the similarity between a facial expression representation and text embeddings. Then, we transform the facial expression representation to a neutral representation by simulating the difference in text embeddings from textual facial expression to textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression, while pushing it farther from the neutral representation. We conduct evaluation with diverse pre-trained visual encoders including ResNet-18 and Swin-T on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods.
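The two steps described above, similarity matching and the emotional-to-neutral transformation, can be sketched as follows. This is one plausible reading and not the paper's implementation: cosine similarity as the matching score, the additive form `feature + (neutral_text - expression_text)`, all function names, and the toy 2D embeddings are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(feature, text_embeddings):
    """Pick the label whose text embedding is most similar to the image feature."""
    return max(text_embeddings, key=lambda label: cosine(feature, text_embeddings[label]))


def to_neutral(feature, expression_text, neutral_text):
    """Shift the feature by the text-space offset from expression to neutral (assumed form)."""
    return [f + (n - e) for f, e, n in zip(feature, expression_text, neutral_text)]


# Toy embeddings standing in for vision-language model outputs.
text_embeddings = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
neutral_text = [0.5, 0.5]
feature = [0.9, 0.1]

predicted = classify(feature, text_embeddings)
neutral_feature = to_neutral(feature, text_embeddings[predicted], neutral_text)
```

A self-contrast objective would then compare `feature` against both the matched text embedding and `neutral_feature`; that loss is not sketched here because the abstract does not give its form.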

Abstract:
The popularity of blended-target domain adaptation (BTDA) is growing since target data in the real world often come from multiple domains with different data distributions. Most BTDA studies adapt directly from the source domain to the target domains without considering which kinds of semantic information embedded in images should be explored. Therefore, some irrelevant semantic information is inevitably used, which leads to negative transfer. To address these issues, we propose a semantic dual-adversarial network (SDN) method for BTDA. Specifically, to suppress irrelevant semantic information, we adopt a min-max game strategy between the classifier and the feature extractor. The classifier tries to maximize the prediction distribution discrepancy, whereas the extractor endeavors to minimize this discrepancy. In this process, irrelevant semantic information is suppressed and the principal semantic information is emphasized. To align the categorical distributions, we train a category-aware domain discriminator and a feature extractor with category labels. In addition, we introduce a random ratio-based feature fusion scheme to augment the source domain, which can decrease domain gaps. At last, we propose a weighted negative self-supervised learning method to enhance the model’s generalization. Extensive experiments on multiple benchmarks showcase that our method significantly outperforms the prior state-of-the-art methods in BTDA.

Abstract:
3D face editing is a significant task in multimedia, aimed at the manipulation of 3D face models across various control signals. The success of 3D-aware GAN provides expressive 3D models learned from 2D single-view images only, encouraging researchers to discover semantic editing directions in its latent space. However, previous methods face challenges in balancing quality, efficiency, and generalization. To solve the problem, we explore the possibility of introducing the strength of diffusion model into 3D-aware GANs. In this paper, we present Face Clan, a fast and text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions. To achieve disentangled editing, we propose to diffuse on the latent space under a pair of opposite prompts to estimate the mask indicating the region of interest on latent codes. Based on the mask, we then apply denoising to the masked latent codes to reveal the editing direction. Our method offers a precisely controllable manipulation method, allowing users to intuitively customize regions of interest with the text description. Experiments demonstrate the effectiveness and generalization of our Face Clan for various pre-trained GANs. It offers an intuitive and wide application for text-guided face editing that contributes to the landscape of multimedia content creation.

Abstract:
Adversarial examples can assess the robustness of machine learning models, which has attracted the attention of many researchers to adversarial example generation methods. Transferability and imperceptibility stand out as two crucial metrics for evaluating the quality of adversarial examples. However, achieving a balance between these two indicators poses a formidable challenge. In this paper, we propose a low-frequency guided adversarial attack method (LGA) to generate adversarial examples with strong transferability and good imperceptibility. Specifically, we enhance the transferability of adversarial examples by increasing the diversity of attack algorithms, and introduce the guiding principle and the triplet loss constraint to ensure that the generated adversarial examples are optimized away from the class regions of the clean examples. We find that the low-frequency component in the frequency domain of the image contains the vast majority of the semantic information of the image. Therefore, we constrain the attack perturbations to low-frequency component space to enhance the covert nature while maintaining visual coherence, rendering the adversarial examples more difficult to perceive. We conduct extensive experiments on various models with different network structures and multiple defense strategies, and the experimental results demonstrate that our method outperforms existing methods in the tradeoff between transferability and imperceptibility, achieving the SOTA performance.
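One way to picture a low-frequency constraint on a perturbation is a Fourier-domain low-pass projection, sketched below with NumPy. The abstract does not say which frequency transform or cutoff LGA uses, so the FFT, the square window, the `keep` parameter, and the function name are all assumptions swapped in for illustration.

```python
import numpy as np


def low_frequency_projection(perturbation, keep):
    """Keep only 2D Fourier coefficients with |frequency| <= keep on both axes."""
    # Move the zero-frequency coefficient to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(perturbation))
    height, width = perturbation.shape
    cy, cx = height // 2, width // 2
    # Boolean window centered on the zero-frequency coefficient.
    mask = np.zeros((height, width), dtype=bool)
    mask[cy - keep:cy + keep + 1, cx - keep:cx + keep + 1] = True
    filtered = np.where(mask, spectrum, 0)
    # Back to the spatial domain; the imaginary part is numerical noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


# A constant offset is purely low-frequency; a checkerboard is purely high-frequency.
constant = np.ones((8, 8))
rows, cols = np.indices((8, 8))
checkerboard = np.where((rows + cols) % 2 == 0, 1.0, -1.0)

kept = low_frequency_projection(constant, keep=1)
removed = low_frequency_projection(checkerboard, keep=1)
```

In an iterative attack, such a projection could be applied to the perturbation after each update step so that only smooth, low-frequency changes remain.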

Abstract:
Out-of-distribution detection aims to protect models against overconfidently categorizing samples from unknown categories, i.e., out-of-distribution data (OOD), into known categories, i.e., in-distribution data (ID). From the perspective of feature distribution, the difference between OOD samples and ID samples can be decomposed into semantic shifts and covariate shifts. Most DL-based methods only extract deeper features, which represent semantic shifts, to discern feature variances in the data, ignoring the exploration of covariate shifts. In this paper, we propose a Shallow Layer-driven Enhanced OOD detection method (SLE), which enhances the difference of OOD samples by exploiting covariate shifts in shallow features. Specifically, it contains three main components: Hierarchical Feature Extractor (HFE), Adaptive Dimensionality Reduction Strategy (ADR), and Cross-layer Score Aggregator (CSA). HFE is responsible for extracting both deeper and shallow features from the deep network. ADR adaptively reduces all hierarchical feature dimensionality according to sample characteristics, avoiding feature redundancy. CSA defines a novel confidence score for OOD samples, which effectively prevents confusion in the feature representation space at each layer. In SLE, these three closely related components cooperate with each other to effectively enhance the representation ability of OOD samples and divide OOD data better. We conduct extensive experiments to examine the performance of SLE on four benchmarks and discuss its individual components. This method performs well on the OOD datasets.

Abstract:
Transformers have achieved remarkable success in the field of computer vision due to their advantage in capturing the global information of images. However, they fail to model the variance of rotation, resulting in significant performance loss in target detection in remote sensing imagery. In this paper, a rotation-invariant transformer plus model, namely RIFormer+, is proposed to enhance the capabilities of transformers in rotation-invariant feature learning at both the long-overlooked local level and the acknowledged global level. At the local level, a rotation-invariant cross-patch embedding (RICPE) module is designed to generate dense patches, which handles encoding inconsistency of tokens with similar semantic information before and after rotation. Moreover, response-enhanced attention (REA) is proposed to extract more rotation-robust global features, which highlights overly dispersed responses to ensure sustained attention on discriminative regions. Extensive experiments on three datasets demonstrate the effectiveness of RIFormer+. Without bells and whistles, RIFormer+ increases the classification accuracy by an average of 10% and improves the accuracy on rotated datasets by 20% compared with some state-of-the-art transformers.

Abstract:
Prompt tuning has been proven effective for Domain Generalization (DG) by enhancing the generalization capability of visual-language models with fewer learnable tokens. Existing methods mostly infer global-level individual prompts for the whole dataset to capture domain-invariant knowledge across different domains. However, since domain shifts exist, a single global-level individual prompt is easily overfitted to source domain datasets, thus lacking generalizability to the whole dataset’s feature distribution. Moreover, fluctuations in the generalization performance during the training process in DG problems often pose significant challenges to model selection strategies. To address the aforementioned problems, inspired by the Mixture-of-Experts (MoE) and knowledge distillation, we propose a novel Multiple Local Prompts Distillation (MLPD) method to inject the knowledge of multiple local prompts into a unique global prompt, improving both the generalization and discriminative ability. To ensure the diversity of local prompts, we split the whole dataset into several subsets to infer the discriminative local prompts for each subset, which are further applied to generate the generalizable global prompt. Formally, for each subset, Meta Prompt Tuning (MPT) is proposed to constrain each local prompt to capture both the domain-specific and domain-shared generalization knowledge on the basis of the domain label and meta-learning mechanism. After that, Prompt Knowledge Distillation (PKD) is proposed to distill the knowledge captured in the local-level prompts into the global-level prompt with prompt-level and feature-level knowledge distillations. The final evaluation on multiple benchmarks underscores the effectiveness of the proposed MLPD, e.g., achieving mAPs of 97.3%, 84.8%, 85.2%, 57.3%, and 60.7% on PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, respectively.

Abstract:
In the field of multi-view clustering, latent representations are often employed to address the challenge posed by low-quality data. Traditional approaches typically assume that multiple views are fully dependent, directly learning a common latent representation from the observed data. However, this assumption is overly restrictive in real-world scenarios and may overlook valuable information, as the independence of different views can reveal critical view-specific characteristics. To overcome this limitation, we propose learning Orthogonal Latent Representations for Multi-View Clustering (OLR-MVC), which jointly captures both cross-view dependence and independence. Specifically, our model maps multi-view data into shared and private latent spaces using distinct projection bases. To accurately capture both dependence and independence, we enforce orthogonality between the shared and private latent representations while also encouraging pairwise orthogonality among private representations. Furthermore, we leverage the self-expressive property of these latent representations to capture global data structures. Extensive experimental evaluations demonstrate that OLR-MVC outperforms state-of-the-art multi-view clustering methods.
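
One common way to express the orthogonality constraints described above is a squared Frobenius-norm penalty on cross products of the latent matrices. The sketch below is an illustration under assumptions, not the OLR-MVC objective itself: each representation is assumed to be a matrix whose rows are latent dimensions and whose columns are samples, and the function names are hypothetical.

```python
def cross_product(a, b):
    """Return a b^T for row-major matrices a (p x n) and b (q x n)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

def squared_frobenius(m):
    """Sum of squared entries of a matrix."""
    return sum(value * value for row in m for value in row)

def orthogonality_penalty(shared, privates):
    """Penalty that is zero when shared/private and private/private pairs are orthogonal.

    shared:   latent matrix common to all views (rows = latent dimensions)
    privates: list of view-specific latent matrices with the same number of columns
    """
    penalty = 0.0
    # Shared versus each private representation.
    for private in privates:
        penalty += squared_frobenius(cross_product(shared, private))
    # Pairwise terms among private representations.
    for i in range(len(privates)):
        for j in range(i + 1, len(privates)):
            penalty += squared_frobenius(cross_product(privates[i], privates[j]))
    return penalty
```

The penalty vanishes only when every row of the shared representation is orthogonal to every row of each private representation and the private representations are mutually orthogonal, which is the property the abstract asks for.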

Abstract:
In crowded scenarios, achieving the counting task of dynamically evolving categories is extremely challenging. In addition to grappling with challenges such as scale variations, severe occlusion and complex backgrounds, it is imperative to mitigate the issue of catastrophic forgetting. Previous approaches have heavily relied on leveraging historical data for knowledge distillation to tackle these difficulties. However, this strategy encounters two prominent obstacles: 1) Employing the teacher network from the previous stage for distillation incurs additional computational overhead during the training stage. 2) Although knowledge distillation can facilitate effective knowledge transfer, some inaccurate predictions from the teacher network may affect the knowledge acquisition in the current stage. To overcome these issues, we introduce a novel solution: a self-reflection neural network for class-incremental object counting. First, we construct a global-aware incremental regression branch that uses stacked transformer layers as backends to capture global information, while the final regression layers dynamically expand as categories increase. Furthermore, we introduce an uncertain estimation branch that selectively isolates certain feature maps to avoid some neurons updated with excessive gradient information, thereby enhancing the network plasticity while preserving stability. The output of this branch functions as a regularization signal, steering the learning process of the incremental regression branch. To foster a more robust retention of past knowledge, we propose a self-reflection loss. It employs the rectified outputs of global-aware incremental regression branch to encourage the network to reflect upon and refine its grasp of historical knowledge, effectively averting the pitfalls of inaccurate information. Our extensive experiments validate the effectiveness of our proposed method, achieving state-of-the-art results.

Abstract:
Despite the significant advancements achieved in image matting, the existing models heavily depend on manually drawn trimaps to produce accurate results in natural image scenarios. However, the process of obtaining trimaps is time-consuming and lacks user-friendliness and device compatibility. This greatly limits the practical applicability of all trimap-based matting methods. To address this issue, we introduce Click2Trimap, an interactive model that is capable of predicting high-quality trimaps and alpha mattes with minimal user click inputs. By analyzing real users’ behavioral logic and the characteristics of trimaps, we successfully propose a powerful iterative three-class training strategy and a dedicated simulation function, making Click2Trimap exhibit versatility across various scenarios. Compared with all existing trimap-free matting methods, Click2Trimap achieves superior performance in quantitative and qualitative assessments conducted on synthetic and real-world matting datasets. In particular, in a user study, Click2Trimap yields high-quality trimap and matting predictions in just 5 seconds per image on average, demonstrating its substantial practical value for use in real-world applications.

Abstract:
In data mining, subspace clustering is a crucial technique which determines the union of the underlying subspace to cluster data points in an unsupervised manner. Although deep-learning-based subspace clustering, typically referred to as deep subspace clustering (DSC), has significantly improved clustering accuracy, existing DSC models still struggle to capture a comprehensive and compact latent representation as they generally explore the spatial domain to extract useful information and face difficulty in balancing the high mutual and low redundant information between the original input space and latent subspace. This leads to the performance of the model being dependent on initialization, resulting in a lack of stability. In this study, a novel network is proposed to extract features in both the frequency domain and spatial domain. We introduce three types of ResBlocks in the discrete Fourier transform (DFT), discrete cosine transform (DCT), or discrete wavelet transform (DWT) frequency domains separately to learn both the low-frequency and high-frequency information in the proposed networks. Additionally, to extract concise and rich latent representations, IB loss is employed by deriving a variational lower bound on the IB objective. Extensive experiments on several benchmark datasets verify the effectiveness of our networks compared to state-of-the-art models. In addition, detailed ablation studies are performed to demonstrate the advantages of the two introduced components.

Abstract:
Action Quality Assessment (AQA) aims to evaluate and score human actions in videos accurately. Existing approaches involve extracting features from the input video and implementing regression based on those features. However, representations derived from a single branch often lack the necessary diversity and flexibility to capture the complexity of human actions effectively. This work addresses these limitations by introducing a multi-branch architecture designed to capture a broad spectrum of video dynamics at varying levels of granularity. Specifically, we enhance video representation in the flow-guided branch by integrating optical flow with video features. This combination of multimodal features offers a more comprehensive context of global motion. Meanwhile, the moment-focused branch is tailored to extract frame-specific features, constructing two distinct quality-based representations with different focuses on moments, which achieves adaptive clues aggregation. Furthermore, the detail-aware branch leverages multiscale deep embeddings from a hierarchy convolutional neural network to capture fine-grained spatial information, which is useful when objects have complex spatial changes. Finally, a post-fusion strategy is employed to merge outputs from all branches, contributing to the comprehensive action quality assessment. Experimental evaluations on three benchmark datasets, FineDiving, MTL-AQA, and AQA-7, demonstrate the superiority of our model in providing reliable assessments of action quality.

Abstract:
Multimodal sentiment analysis has garnered increasing attention. The bulk of existing work in multimodal sentiment analysis primarily focuses on designing various networks to align and subsequently fuse representations from individual modalities. Contrastive learning, recognized for its intrinsic alignment capabilities, has also been extensively applied in multimodal sentiment analysis. However, current contrastive learning methods are often limited to pairwise modalities and typically perform contrastive learning prior to modality fusion, neglecting the consistency of interactions across multiple modalities. Moreover, they overlook the overall consistency within samples. To address these issues, we introduce a novel Multi-Level Contrastive Learning (MLCL) framework for multimodal sentiment analysis, composed of Uni-Modal Contrastive Learning (UMCL), Bi-Modal Contrastive Learning (BMCL) and Tri-Modal Contrastive Learning (TMCL). UMCL enhances intra-modal representations by creating positive pairs using modality-specific random dropout, while BMCL leverages the asymmetry of attention mechanisms, using two directional attentions as positive samples. TMCL aligns non-overlapping uni-modal and bi-modal representations, underscoring the complementarity of tri-modal information. Comprehensive experiments across multiple datasets demonstrate the superiority of the MLCL framework, which achieves new state-of-the-art performance.

Abstract:
Pre-training has greatly boosted scene text detection methods by learning the representation of text. However, they still suffer from two drawbacks: 1) The learned representation for text is not discriminative due to the insufficient annotated real data and the domain gap between synthetic data. 2) Existing methods perform poorly on text lacking visual information (e.g., occluded text). To address them, this paper explores the potential of the CLIP model and proposes a novel self-supervised pre-training network with masked text modeling (MTM) and text knowledge distillation (TKD), which aims at obtaining discriminative representation for text. First, a Text Perception Module is proposed to perceive coarse text area in an unsupervised manner. Second, we design a Text-aware Masking Strategy to mask the text area with a certain ratio and reconstruct the masked texts by the MTM Module. Compared to random pixel-level masking in classic masked image modeling, we perform a targeted text-aware masking and reconstruction. MTM obtains linguistic reasoning ability of text occlusion with reconstruction of masked text. Besides, to better utilize the multimodal knowledge of text in the CLIP model, this paper devises a TKD Module to guide the representation learning of masked texts at the semantic level. This robust feature extraction learned by reconstructing masked text and knowledge distillation ensures a more discriminative representation for text. Extensive experiments on four challenging datasets verify the effectiveness and superiority of our pre-training method. Specifically, our method achieves F-measures of 86.5%, 87.1% and 88.5% for DBNet++ on CTW1500, Total-Text and MSRA-TD500, respectively.

Abstract:
While the diffusion transformer (DiT) has become a focal point of interest in recent years, its application in low-light image enhancement remains a blank area for exploration. Current methods recover the details from low-light images while inevitably amplifying the noise in images, resulting in poor visual quality. In this paper, we are the first to introduce DiT into the low-light enhancement task and design a novel Structure-guided Diffusion Transformer based Low-light image enhancement (SDTL) framework. We compress the feature through wavelet transform to improve the inference efficiency of the model and capture the multi-directional frequency band. Then we propose a Structure Enhancement Module (SEM) that uses structural prior to enhance the texture and leverages an adaptive fusion strategy to achieve a more accurate enhancement effect. In addition, we propose a Structure-guided Attention Block (SAB) to pay more attention to texture-rich tokens and avoid interference from noisy areas in noise prediction. Extensive qualitative and quantitative experiments demonstrate that our method achieves SOTA performance on several popular datasets, validating the effectiveness of SDTL in improving image quality and the potential of DiT in low-light enhancement tasks.

Abstract:
Image set compression (ISC) refers to compressing the sets of semantically similar images. Traditional ISC methods typically aim to eliminate redundancy among images at either signal or frequency domain, but often struggle to handle complex geometric deformations across different images effectively. Here, we propose a new Hybrid Neural Representation for ISC (HNR-ISC), including an implicit neural representation for Semantically Common content Compression (SCC) and an explicit neural representation for Semantically Unique content Compression (SUC). Specifically, SCC enables the conversion of semantically common contents into a small-and-sweet neural representation, along with embeddings that can be conveyed as a bitstream. SUC is composed of invertible modules for removing intra-image redundancies. The feature level combination from SCC and SUC naturally forms the final image set. Experimental results demonstrate the robustness and generalization capability of HNR-ISC in terms of signal and perceptual quality for reconstruction and accuracy for the downstream analysis task.

Abstract:
An emotional support conversation (ESC) system aims to reduce users' emotional distress by engaging in conversation using various reply strategies as guidance. To develop instructive reply strategies for an ESC system, it is essential to consider the dynamic transitions of users' emotional states through the conversational turns. However, existing methods for strategy-guided ESC systems struggle to capture these transitions as they overlook the inference of fine-grained user intentions. This oversight poses a significant obstacle, impeding the model's ability to derive pertinent strategy information and, consequently, hindering its capacity to generate emotionally supportive responses. To tackle this limitation, we propose a novel dynamic strategy prompt reasoning model (DSR), which leverages sparse context relation deduction to acquire adaptive representation of reply strategies as prompts for guiding the response generation process. Specifically, we first perform turn-level commonsense reasoning with different approaches to extract auxiliary knowledge, which enhances the comprehension of user intention. Then we design a context relation deduction module to dynamically integrate interdependent dialogue information, capturing granular user intentions and generating effective strategy prompts. Finally, we utilize the strategy prompts to guide the generation of more relevant and supportive responses. DSR model is validated through extensive experiments conducted on a benchmark dataset, demonstrating its superior performance compared to the latest competitive methods in the field.

Affiliations: National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen, China; College of Management, Shenzhen University, Shenzhen, China; School of Mechanical and Control Engineering, Baicheng Normal University, Baicheng, China; School of Computer Science, Guangdong University of Technology, Guangzhou, China; Faculty of Information Science and Engineering, Ningbo University, Ningbo, China

Abstract:
Camouflaged object detection (COD) aims to segment targeted objects that have similar colors, textures, or shapes to their background environment. Due to the limited ability in distinguishing highly similar patterns, existing COD methods usually produce inaccurate predictions, especially around the boundary areas, when coping with complex scenes. This paper proposes a Progressive Region-to-Boundary Exploration Network (PRBE-Net) to accurately detect camouflaged objects. PRBE-Net follows an encoder-decoder framework and includes three key modules. Specifically, firstly, both high-level and low-level features of the encoder are integrated by a region and boundary exploration module to explore their complementary information for extracting the object's coarse region and fine boundary cues simultaneously. Secondly, taking the region cues as the guidance information, a Region Enhancement (RE) module is used to adaptively localize and enhance the region information at each layer of the encoder. Subsequently, considering that camouflaged objects usually have blurry boundaries, a Boundary Refinement (BR) decoder is used after the RE module to better detect the boundary areas with the assistance of boundary cues. Through top-down deep supervision, PRBE-Net can progressively refine the prediction. Extensive experiments on four datasets indicate that our PRBE-Net achieves superior results over 21 state-of-the-art COD methods. Additionally, it also shows good results on polyp segmentation, a COD-related task in the medical field.

Abstract:
Local feature detectors and descriptors serve various computer vision tasks, such as image matching, visual localization, and 3D reconstruction. To address the extreme variations of rotation and light in the real world, most detectors and descriptors capture as much invariance as possible. However, these methods ignore feature discriminability and perform poorly in indoor scenes. Indoor scenes have too many weak-textured and even repeatedly textured regions, so it is necessary for the extracted features to possess sufficient discriminability. Therefore, we propose a semantic-guided method (called SDE2D) enhancing feature discriminability to improve the performance of descriptors for indoor scenes. We develop a kind of semantic-guided discriminability enhancement (SDE) loss function that uses semantic information from indoor scenes. To the best of our knowledge, this is the first deep research that applies semantic segmentation to enhance discriminability. In addition, we design a novel framework that allows semantic segmentation network to be well embedded as a module in the overall framework and provides guidance information for training. Besides, we explore the impact of different semantic segmentation models on our method. The experimental results on indoor scenes datasets demonstrate that the proposed SDE2D performs well compared with the state-of-the-art models.

Abstract:
Visible-Infrared Person Re-identification aims to retrieve images of specific identities across modalities. To relieve the large cross-modality discrepancy, researchers introduce the auxiliary modality within the image space to assist modality-invariant representation learning. However, the challenge persists in constraining the inherent quality of generated auxiliary images, further leading to a bottleneck in retrieval performance. In this paper, we propose a novel Auxiliary Representation Guided Network (ARGN) to explore the potential of auxiliary representations, which are directly generated within the modality-shared embedding space. In contrast to the original visible and infrared representations, which contain information solely from their respective modalities, these auxiliary representations integrate cross-modality information by fusing both modalities. In our framework, we utilize these auxiliary representations as modality guidance to reduce the cross-modality discrepancy. First, we propose a High-quality Auxiliary Representation Learning (HARL) framework to generate identity-consistent auxiliary representations. The primary objective of our HARL is to ensure that auxiliary representations capture diverse modality information from both modalities while concurrently preserving identity-related discrimination. Second, guided by these auxiliary representations, we design an Auxiliary Representation Guided Constraint (ARGC) to optimize the modality-shared embedding space. By incorporating this constraint, the modality-shared embedding space is optimized to achieve enhanced intra-identity compactness and inter-identity separability, further improving the retrieval performance. In addition, to improve the robustness of our framework against the modality variation, we introduce a Part-based Adaptive Gaussian Module (PAGM) to adaptively extract discriminative information across modalities. Finally, extensive experiments are conducted to demonstrate the superiority of our method over state-of-the-art approaches on three VI-ReID datasets.

Abstract:
In cross-domain recognition tasks, the divergent distributions of data acquired from various domains degrade the effectiveness of knowledge transfer. Additionally, in practice, cross-domain data also contain a massive amount of redundant information, usually disturbing the training processes of cross-domain classifiers. Seeking to address these issues and obtain efficient domain-invariant knowledge, this paper proposes a novel cross-domain classification method, named cross-scatter sparse dictionary pair learning (CSSDL). Firstly, a pair of dictionaries is learned in a common subspace, in which the marginal distribution divergence between the cross-domain data is mitigated, and domain-invariant information can be efficiently extracted. Then, a cross-scatter discriminant term is proposed to decrease the distance between cross-domain data belonging to the same class. As such, this term guarantees that the data derived from same class can be aligned and that the conditional distribution divergence is mitigated. In addition, a flexible label regression method is introduced to match the feature representation and label information in the label space. Thereafter, a discriminative and transferable feature representation can be obtained. Moreover, two sparse constraints are introduced to maintain the sparse characteristics of the feature representation. Extensive experimental results obtained on public datasets demonstrate the effectiveness of the proposed CSSDL approach.

Abstract:
Finding reliable correspondences in two-view images and recovering the camera poses are key problems in photogrammetry and image signal processing. The multilayer perceptron (MLP) is widely applied in two-view correspondence learning because it is good at learning disordered sparse correspondences, but it is susceptible to the dominant outliers and requires additional functional blocks to capture context information. A CNN can naturally extract local context information, but it cannot handle disordered data or extract global context and channel information. In order to overcome the shortcomings of MLP and CNN, we design a correspondence learning network based on the Transformer, named Vector Rectifier Transformer (VRTNet). The Transformer is an encoder-decoder structure which can handle disordered sparse correspondences and output sequences of arbitrary length. Therefore, we design two sub-Transformers in VRTNet to achieve the mutual conversion between disordered and ordered correspondences. The self-attention and cross-attention mechanisms in them allow VRTNet to focus on the global context relations of all correspondences. To capture local context and channel information, we propose a rectifier network (including a CNN and a channel attention block) as the backbone of VRTNet, which avoids the complex design of additional blocks. The rectifier network can correct the errors of ordered correspondences to obtain rectified correspondences. Finally, outliers are removed by comparing original and rectified correspondences. VRTNet performs better than the state-of-the-art methods in the tasks of relative pose estimation, outlier removal and image registration.

Abstract:
Visible-infrared person re-identification (VI-ReID) seeks to identify and match individuals across visible and infrared ranges within intelligent monitoring environments. Most current approaches predominantly explore a two-stream network structure that extracts global or rigidly split part features and introduce an extra modality for image compensation to guide networks in reducing the huge differences between the two modalities. However, these methods are sensitive to misalignment caused by pose/viewpoint variations and to additional noise produced when generating the extra modality. In this article, we address the above issues and propose a Cross-modality Semantic Consistency Learning (CSCL) network to excavate the semantically consistent features in different modalities by utilizing human semantic information. Specifically, a Parsing-aligned Attention Module (PAM) is introduced to filter out the irrelevant noise with channel-wise attention and dynamically highlight the semantic-aware representations across modalities in different stages of the network. Then, a Semantic-guided Part Alignment Module (SPAM) is introduced, aimed at efficiently producing a collection of semantic-aligned fine-grained features. This is achieved by incorporating parsing loss and division loss constraints, ultimately enhancing the overall person representation. Finally, an Identity-aware Center Mining (ICM) loss is presented to reduce the distribution gap between modality centers within classes, thereby further alleviating intra-class modality discrepancies. Extensive experiments indicate that CSCL outperforms the state-of-the-art methods on the SYSU-MM01 and RegDB datasets. Notably, the Rank-1/mAP accuracy on the SYSU-MM01 dataset reaches 75.72%/72.08%.

Abstract:
In this paper, we focus on Class-Incremental Unsupervised Domain Adaptation (CI-UDA), where the labeled source domain already includes all classes, and the classes in the unlabeled target domain emerge sequentially over time. This task involves addressing two main challenges. The first is the domain gap between the labeled source data and the unlabeled target data, which leads to weak generalization performance. The second is the inconsistency between the source and target category spaces at each time step, which causes catastrophic forgetting during the testing stage. Previous methods focus solely on the alignment of similar samples from different domains, which overlooks the underlying causes of the domain gap/class distribution difference. To tackle the issue, we rethink this task from a causal perspective for the first time. We first build a structural causal graph to describe the CI-UDA problem. Based on the causal graph, we present Memory-Enhanced Confidence Calibration (MECC), which aims to improve confidence in the predicted results. In particular, we argue that the domain discrepancy caused by the different styles is prone to make the model produce less confident predictions and thus weakens the generalization and continual learning abilities. To this end, we first explore using the gram matrix to generate source-style target data, which is combined with the original data to jointly train the model and thereby reduce the domain-shift impact. Second, we utilize the model of the previous time step to select corresponding samples that are used to build a memory bank, which is instrumental in alleviating catastrophic forgetting. Extensive experimental results on multiple datasets demonstrate the superiority of our method.
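
The Gram matrix mentioned above is a standard style statistic of a feature map. The sketch below only shows how such a matrix is computed; it assumes features flattened to channels by spatial positions and a normalization by the number of positions, neither of which is specified in the abstract, and it does not reproduce how MECC generates source-style target data from it.

```python
def gram_matrix(features):
    """Channel-by-channel Gram matrix of a flattened feature map.

    features: list of channels, each a list of activations over the same
              spatial positions (assumed layout: channels x positions).
    """
    num_positions = len(features[0])
    return [
        [
            # Inner product of two channels, normalized by the number of positions
            # (the normalization constant is an assumption of this sketch).
            sum(a * b for a, b in zip(channel_i, channel_j)) / num_positions
            for channel_j in features
        ]
        for channel_i in features
    ]
```

For example, `gram_matrix([[1.0, 2.0], [3.0, 4.0]])` yields a symmetric 2 x 2 matrix whose diagonal holds the mean squared activation of each channel and whose off-diagonal entries measure how strongly the two channels co-activate.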

Abstract:
Current supervised methods for 3D shape representation learning have achieved satisfying performance, yet require extensive human-labeled datasets. Unsupervised learning-based methods provide a viable solution by learning shape representations without using ground truth labels. In this study, we develop a contrastive learning framework for unsupervised representation learning of 3D shapes. Specifically, in order to encourage models to pay more attention to useful information during representation learning, we first introduce a new paradigm for critical points search based on the adversarial mechanism. We extract critical points with a larger impact on the global feature by attacking a pre-trained auto-encoder model, and apply data augmentations on these points to generate adversarial examples. Taking a pair of adversarial examples as inputs, we obtain their intermediate embeddings and global representations of corresponding inputs, which are then transformed into latent spaces by two predictor heads. Finally, we train the proposed model by maximizing the agreements on these latent spaces via Normalized Temperature-scaled Cross Entropy (NT-Xent) loss and a newly designed Cross-layer Normalized Temperature-scaled Cross Entropy (Cross-NT-Xent) loss, where the latter is proposed in this article to enforce cross-layer feature similarities. The effectiveness, robustness, and transferability of learned representations are validated on three downstream tasks, including object classification, few-shot classification, and shape retrieval. Experiments on three benchmark datasets show that our learned representations achieve better or competitive performance than current state-of-the-art methods in these downstream tasks. Moreover, our model can easily be extended to 3D part segmentation and scene segmentation tasks.
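
The NT-Xent loss named above has a well-known form, sketched here for reference in plain Python. The sketch assumes two lists of embeddings where `view_a[k]` and `view_b[k]` form a positive pair, and a temperature of 0.5 chosen only for illustration; the Cross-NT-Xent variant proposed in the paper is not shown because the abstract does not define it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nt_xent(view_a, view_b, temperature=0.5):
    """Normalized temperature-scaled cross entropy over a batch of positive pairs."""
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of sample i
        positive = cosine(z[i], z[j]) / temperature
        # The denominator runs over every sample except i itself.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - positive
    return total / (2 * n)
```

The loss is small when each embedding is closer to its positive partner than to the other samples in the batch, and it grows when the pairs are mismatched.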

Abstract:
Registration of multiview point clouds obtained from 3D scanners is a common method for 3D reconstruction. However, most existing registration methods are designed to handle point clouds with known overlap relationships that are ensured by external equipment (e.g., manipulators, turntables) or acquisition sequences, which limits the application range and increases the acquisition cost. To overcome these limitations, an unknown overlap registration (UOR) method for multiview point clouds is proposed, which can estimate overlap confidence, construct a connected graph, and remove outlier point clouds automatically. First, the overlap confidence between two point clouds is estimated by calculating the average nearest neighbor feature distance within the predicted overlap region. We then construct a minimal spanning tree based on the confidence levels and search for the central node to serve as the world coordinate. Finally, the Lie algebra-based SE(3)-sensitive perturbation scheme is introduced to solve the fine transformations, in which a robust weighting function is designed to weight point correspondences. Our method can find reliable connections among point clouds, and the proposed graph can be combined with different pairwise registration methods. The experimental results on both indoor and industrial datasets demonstrate the accuracy and effectiveness of our method.
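
Illustrative note (not part of the abstract): the graph step described above, building a spanning tree from pairwise overlap confidences and picking a central node, can be sketched as below. Several choices here are assumptions: higher confidence is treated as a better edge (equivalent to a minimal spanning tree on a cost such as 1 - confidence), Prim's algorithm is used, and the "central node" is taken to be the node with the smallest hop eccentricity in the tree. The abstract does not state which definitions the authors use.

```python
from collections import deque


def spanning_tree(conf):
    """Prim's algorithm that keeps the highest-confidence edges.

    `conf` is a symmetric n x n matrix of pairwise overlap confidences.
    Returns a list of (u, v) tree edges.
    """
    n = len(conf)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or conf[u][v] > best[0]):
                    best = (conf[u][v], u, v)
        _, u, v = best
        edges.append((u, v))
        in_tree.add(v)
    return edges


def central_node(n, edges):
    """Return the tree node whose farthest node is fewest hops away."""
    adjacency = {i: [] for i in range(n)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    def eccentricity(start):
        hops = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        return max(hops.values())

    return min(range(n), key=eccentricity)
```

Choosing a central node as the world frame keeps the chain of pairwise transformations from any scan to the reference short, which limits error accumulation.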

Abstract:
Transformer-based self-supervised representation learning methods learn generic features from unlabeled datasets to provide useful network initialization parameters for downstream tasks. Recently, methods based on masked autoencoders have been explored in this area. The input can be intuitively masked when its content is regular, as with word sequences and 2D pixels. However, the extension to 3D point clouds is challenging due to their irregularity. In this article, we propose masked autoencoders for 3D point cloud representation learning (abbreviated as MAE3D), a novel autoencoding paradigm for self-supervised learning. First, we split the input point cloud into patches and mask a portion of them, then use our Patch Embedding Module to extract the features of unmasked patches. Second, we employ patch-wise MAE3D Transformers to learn both local features of point cloud patches and high-level contextual relationships between patches, and then complete the latent representations of masked patches. Finally, we use our Point Cloud Reconstruction Module with a multi-task loss to complete the incomplete point cloud. We conduct self-supervised pre-training on ShapeNet55 with the point cloud completion pretext task and fine-tune the pre-trained model on ModelNet40 and ScanObjectNN (PB_T50_RS, the hardest variant). Comprehensive experiments demonstrate that the local features extracted by our MAE3D from point cloud patches are beneficial for downstream classification tasks, soundly outperforming state-of-the-art methods (93.4% and 86.2% classification accuracy, respectively).
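
Illustrative note (not part of the abstract): the "split into patches and mask a portion" step can be sketched as below. The abstract does not say how patches are formed; farthest point sampling for patch centers followed by nearest-neighbor grouping is an assumed, commonly used choice, and all function names and parameters are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Pick k point indices that are spread out over the cloud."""
    centers = [start]
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=min_d.__getitem__)
        centers.append(nxt)
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return centers


def make_patches(points, num_patches, patch_size):
    """Group the patch_size nearest points around each sampled center."""
    patches = []
    for c in farthest_point_sampling(points, num_patches):
        order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], points[c]))
        patches.append(order[:patch_size])
    return patches


def split_masked(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) lists."""
    order = list(range(num_patches))
    rng.shuffle(order)
    n_mask = int(round(mask_ratio * num_patches))
    return sorted(order[n_mask:]), sorted(order[:n_mask])
```

Only the visible patches would be embedded and passed to the encoder; the masked ones are what the reconstruction stage has to recover.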

Abstract:
We present an efficient approach to generating uniformly distributed resampling points for raw 3D point clouds. A key contribution that makes such a resampling method both practical and efficient is the construction of the centroidal Voronoi tessellation on the given point cloud, achieved efficiently by applying the proposed Anderson-accelerated Lloyd's method. The calculations involved in the method are mainly carried out over a group of locally approximated quadratic surfaces instead of directly on the given point cloud, which gives a clear advantage in filtering out the influence of the original point distribution on the output. Once the resampling points are initialized, the resampling quality can be improved progressively by optimizing the resampling points and updating the locally approximated surfaces. In addition, by restricting the movement of resampling points, we can deal with unclosed point clouds without any boundary detection. Our approach outperforms existing resampling methods in generating uniform results, and extensive experiments are conducted to demonstrate its efficacy.
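
Illustrative note (not part of the abstract): a plain Lloyd iteration, the basic scheme that the paper accelerates, can be sketched as below. This is a simplified discrete version that works directly on sample points; it includes neither the Anderson acceleration nor the locally approximated quadratic surfaces described above, and the function names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_step(sites, samples):
    """One Lloyd iteration: assign samples to nearest site, move to centroid.

    Sites whose region receives no sample are left where they are.
    """
    regions = [[] for _ in sites]
    for p in samples:
        nearest = min(range(len(sites)), key=lambda i: sq_dist(p, sites[i]))
        regions[nearest].append(p)
    new_sites = []
    for site, region in zip(sites, regions):
        if region:
            dim = len(site)
            new_sites.append(
                tuple(sum(p[d] for p in region) / len(region) for d in range(dim))
            )
        else:
            new_sites.append(site)
    return new_sites
```

Repeating `lloyd_step` until the sites stop moving yields an approximately centroidal Voronoi configuration; acceleration schemes aim to reduce the number of such iterations.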

Abstract:
3D point cloud completion is very challenging because it relies on accurately understanding complex 3D shapes (e.g., high-curvature, concave/convex, and hollowed-out 3D shapes) and the unknown and diverse patterns of the partially available point clouds. In this paper, we propose a novel solution, i.e., Point-block Carving (PC), for completing complex 3D point clouds. Given the partial point cloud as guidance, we carve a 3D block that contains uniformly distributed 3D points, yielding the entire point cloud. We propose a new network architecture to achieve PC, i.e., CarveNet. This network conducts an exclusive convolution on each block point, where the convolutional kernels are trained on the 3D shape data. CarveNet determines which points should be carved to recover the details of the complete shape effectively. Furthermore, we propose a sensor-aware method for data augmentation, i.e., SensorAug, for training CarveNet on richer patterns of partial point clouds, thus enhancing the completion power of the network. Extensive evaluations on the ShapeNet, ShapeNet-55/34, and KITTI datasets demonstrate the generality of our approach on partial point clouds with diverse patterns. On these datasets, CarveNet outperforms the state-of-the-art methods.

Affiliations: State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China; State Key Laboratory of Integrated Services Networks, School of Cyber Engineering, Xidian University, Xi'an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; Sydney AI Center, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW, Australia

Abstract:
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM), i.e., improving the performance on unseen classes while maintaining the performance on seen classes. Compared with existing generalizable methods that neglect the degradation on seen classes, the setting of this problem is stricter and fits more closely with practical applications. To solve this problem, we start from the optimization perspective and leverage the relationship between loss landscape geometry and model generalization ability. By analyzing the loss landscapes of the state-of-the-art method and a vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates with both loss value and loss sharpness, and that both are indispensable. However, we find that the optimizing gradient of existing methods cannot maintain high relevance to both loss value and loss sharpness during optimization, which severely affects their trade-off performance. To this end, we propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), which dynamically constrains the optimizing gradient and thus achieves the above two-fold optimization objective simultaneously. Extensive experiments verify the effectiveness of GCSCoOp in the trade-off problem.
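
Illustrative note (not part of the abstract): the vanilla SAM update that the paper takes as a starting point can be sketched as below. This is only the standard two-step SAM rule on a plain parameter vector; the gradient constraint that defines GCSCoOp is not described in enough detail in the abstract to reproduce. The function name, `lr`, and `rho` defaults are assumptions.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One vanilla SAM update.

    1. Take the gradient at the current weights.
    2. Move a distance rho along the normalized gradient to an
       approximate worst-case neighbor.
    3. Update the original weights with the gradient measured there.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)  # stationary point: nothing to do
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]
```

For the toy loss `sum(w_i ** 2)` with gradient `2 * w`, calling `sam_step([1.0, 0.0], lambda w: [2 * x for x in w])` moves the first weight from 1.0 to about 0.79, slightly further than plain gradient descent (0.8), because the gradient is evaluated at the perturbed point.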

Abstract:
Multi-focus image fusion (MFIF) aims at merging multiple images captured at different focal lengths to create an all-in-focus image. This paper introduces a fully unsupervised learning approach for MFIF that uses only pairs of defocused images for end-to-end training, bypassing the need for ground-truths in supervised learning. Unlike existing methods training via a similarity loss between fused and source images, we propose a dual-path learning framework comprising two networks: an image fuser and a mask predictor. The mask predictor is modeled as a self-supervised denoising network on imperfect fusion masks, trained with a masking-based unsupervised learning scheme. The image fuser, crafted with deep unrolling, leverages the output from the mask predictor to supervise its mask generation at each unrolled step. Moreover, we introduce a fusion consistency loss to ensure the alignment between the image fuser and the mask predictor. In extensive experiments, our proposed approach shows superiority over existing end-to-end unsupervised methods and competitive performance against the supervised ones.
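
Illustrative note (not part of the abstract): in mask-based multi-focus fusion, a fusion mask is conventionally used to blend the two source images pixel by pixel. The sketch below shows only that generic blending rule, assuming grayscale images stored as nested lists and a mask with values in [0, 1]; the function name is hypothetical, and the paper's unrolled image fuser and mask predictor are not reproduced.

```python
def fuse_with_mask(image_a, image_b, mask):
    """Blend two same-sized grayscale images with a per-pixel mask.

    mask == 1 keeps the pixel from image_a (assumed in focus there),
    mask == 0 keeps the pixel from image_b, values in between mix both.
    """
    return [
        [m * a + (1 - m) * b for a, b, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(image_a, image_b, mask)
    ]
```

An imperfect mask produces visible seams or blur at focus boundaries, which is why the quality of the predicted mask matters so much in this family of methods.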

Abstract:
Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still encounter some limitations. (1) Most of them explore the correlations among different affinity matrices, making them unscalable to large-scale data. (2) Although some methods address it by introducing bipartite graphs, they may result in sub-optimal solutions caused by an unstable anchor selection process. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches.

Abstract:
Classical continuous sign language recognition (CSLR) suffers from two main challenges in real-world scenarios: accurate inter-frame movement trajectories may fail to be captured by traditional RGB cameras due to motion blur, and valid information may be insufficient under low-illumination scenarios. In this paper, we leverage an event camera for the first time to overcome these challenges. Event cameras are bio-inspired vision sensors that can efficiently record high-speed sign language movements under low-illumination scenarios and capture human information while eliminating redundant background interference. To fully exploit the benefits of the event camera for CSLR, we propose a novel event-guided multi-modal CSLR framework, which achieves strong performance under complex scenarios. Specifically, a time redundancy correction (TRCorr) module is proposed to rectify redundant information in the temporal sequences, directing the model to focus on distinctive features. A multi-modal cross-attention interaction (MCAI) module is proposed to facilitate information fusion between the event and frame domains. Furthermore, we construct the first event-based CSLR dataset, named EvCSLR, which will be released as the first event-based CSLR benchmark. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the EvCSLR and PHOENIX-2014T datasets.

Abstract:
Freeform handwriting authentication verifies a person's identity from their writing style and habits in messy handwriting data. This technique has gained widespread attention in recent years as a valuable tool for various fields, e.g., fraud prevention and cultural heritage protection. However, it still remains a challenging task in reality due to three reasons: (i) severe damage, (ii) complex high-dimensional features, and (iii) lack of supervision. To address these issues, we propose SherlockNet, an energy-oriented two-branch contrastive self-supervised learning framework for robust and fast freeform handwriting authentication. It consists of four stages: (i) pre-processing: converting manuscripts into energy distributions using a novel plug-and-play energy-oriented operator to eliminate the influence of noise; (ii) generalized pre-training: learning general representation through two-branch momentum-based adaptive contrastive learning with the energy distributions, which handles the high-dimensional features and spatial dependencies of handwriting; (iii) personalized fine-tuning: calibrating the learned knowledge using a small amount of labeled data from downstream tasks; and (iv) practical application: identifying individual handwriting from scrambled, missing, or forged data efficiently and conveniently. Considering the practicality, we construct EN-HA, a novel dataset that simulates data forgery and severe damage in real applications. Finally, we conduct extensive experiments on six benchmark datasets including our EN-HA, and the results prove the robustness and efficiency of SherlockNet.

Abstract:
Learning-based local feature extraction algorithms have advanced considerably in terms of robustness. While excelling at enhancing feature robustness, some outstanding algorithms tend to neglect discriminability—a crucial aspect in vision tasks. With the increase of deep learning convolutional layers, we observe an amplification of semantic information within images, accompanied by a diminishing presence of spatial structural information. This imbalance primarily contributes to the subpar feature discriminability. Therefore, this paper introduces a novel network framework aimed at imbuing feature descriptors with robustness and discriminative power by reinforcing spatial structural information. Our approach incorporates a spatial structure enhancement module into the network architecture, spanning from shallow to deep layers, ensuring the retention of rich structural information in deeper layers, thereby enhancing discriminability. Finally, we evaluate our method, demonstrating superior performance in visual localization and feature-matching tasks.

Abstract:
Recently, an increasing number of few-shot image classification methods have been proposed, and they aim at seeking a learning paradigm to train a high-performance classification model with limited labeled samples. However, the neglect of part-level relationships causes few-shot methods to struggle to distinguish between closely similar subcategories, which makes it difficult for them to solve the fine-grained image classification problem. To tackle this challenging task, this paper proposes a fine-grained few-shot image classification method that exploits both intra-part and inter-part relationships among different samples. To establish comprehensive relationships, we first extract multiple discriminative descriptors from the input image, representing its different parts. Then, we propose to define the metric spaces by interpolating intra-part relationships, which can help the model adaptively find clear boundaries for these confusing classes. Finally, since the unlabeled image has high similarities to all classes, we project these similarities into a high-dimension space according to the inter-part relationship and interpolate a parameterized classifier to discover the subtle differences among these similar classes. To evaluate our proposed method, we conduct extensive experiments on various fine-grained datasets. Without any pre-train/fine-tuning process, our approach clearly outperforms previous few-shot learning methods, which demonstrates the effectiveness of our approach.

Abstract:
Cloth-changing person re-identification (CC-ReID) aims to match persons who change clothes over long periods. The key challenge in CC-ReID is to extract cloth-irrelevant features, such as face, hairstyle, body shape, and gait. Current research mainly focuses on modeling body shape using multi-modal biological features (such as silhouettes and sketches). However, it does not fully leverage the personal description information hidden in the original RGB image. Considering that certain attribute descriptions remain unchanged after a change of clothes, we propose a Masked Attribute Description Embedding (MADE) method that unifies personal visual appearance and attribute description for CC-ReID. Specifically, variable cloth-sensitive information, such as color and type, is difficult to model effectively. To address this, we mask the clothes type and color information (upper body type, upper body color, lower body type, and lower body color) in the personal attribute description extracted through an attribute detection model. The masked attribute description is then connected and embedded into Transformer blocks at various levels, fusing it with the low-level to high-level features of the image. This approach compels the model to discard cloth information. Experiments are conducted on several CC-ReID benchmarks, including PRCC, LTCC, Celeb-reID-light, and LaST. Results demonstrate that MADE effectively utilizes attribute description, enhancing cloth-changing person re-identification performance, and compares favorably with state-of-the-art methods.

Abstract:
Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason about complex intra-modal relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network in two respects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves a deeper and more complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. In addition, our method learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our model.

Abstract:
As densely sampled Light Field (LF) images are beneficial to many applications, LF reconstruction has become an important technology in related fields. Recently, neural rendering has shown great potential in reconstruction tasks. However, volume rendering in existing methods needs to sample many points along the whole camera ray or epipolar line, which is time-consuming. In this paper, specifically for LF images with regular angular sampling, we propose a novel Structure-Aware Pre-Selected neural rendering framework for LF reconstruction. Instead of sampling along the whole epipolar line, we propose to sample at several specific positions, which are estimated using the color and inherent scene structure information available in the regularly angular-sampled LF images. By sampling only a few points that closely match the target pixel, the feature of the target pixel is rendered quickly and with high quality. Finally, we fuse the features and decode them in the view dimension to obtain the final target view. Experiments show that the proposed method outperforms the state-of-the-art LF reconstruction methods in both qualitative and quantitative comparisons across various tasks. Our method also surpasses most existing methods in terms of speed. Moreover, without any retraining or fine-tuning, the performance of our method without per-scene optimization is even better than that of methods with per-scene optimization.

Abstract:
Data hallucination or augmentation is a straightforward solution for few-shot learning (FSL), where FSL is proposed to classify a novel object under limited training samples. Common hallucination strategies use visual or textual knowledge to simulate the distribution of a given novel category and generate more samples for training. However, the diversity and capacity of generated samples through these techniques can be insufficient when the knowledge domain of the novel category is narrow. Therefore, the performance improvement of the classifier is limited. To address this issue, we propose a Symmetric data hallucination strategy with Knowledge Transfer (SHKT) that interacts with multi-modal knowledge in both visual and textual spaces. Specifically, we first calculate the relations based on semantic knowledge and select the most related categories of a given novel category for hallucination. Second, we design two parameter-free data hallucination strategies to enrich the training samples by mixing the given and selected samples in both visual and textual spaces. The generated visual and textual samples improve the visual representation and enrich the textual supervision, respectively. Finally, we connect the visual and textual knowledge through transfer calculation, which not only exchanges content from different modalities but also constrains the distribution of the generated samples during the training. We apply our method to four benchmark datasets and achieve state-of-the-art performance in all experiments. Specifically, compared to the baseline on the Mini-ImageNet dataset, it achieves 12.84% and 3.46% accuracy improvements for 1 and 5 support training samples, respectively.

Abstract:
Deep recognition models are widely vulnerable to adversarial examples, which change the model output by adding quasi-imperceptible perturbation to the image input. Recently, Segment Anything Model (SAM) has emerged to become a popular foundation model in computer vision due to its impressive generalization to unseen data and tasks. Realizing flexible attacks on SAM is beneficial for understanding the robustness of SAM in the adversarial context. To this end, this work aims to achieve a targeted adversarial attack (TAA) on SAM. Specifically, under a specific prompt, the goal is to make the predicted mask of an adversarial example resemble that of a given target image. The task of TAA on SAM has been realized in the white-box setup by assuming access to prompt and model, which is thus less practical. To address the issue of prompt dependence, we propose a simple yet effective approach by only attacking the image encoder. Moreover, we propose a novel regularization loss to enhance the cross-model transferability by increasing the feature dominance of adversarial images over random natural images. Extensive experiments verify the effectiveness of our proposed method to conduct a successful black-box TAA on SAM.

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, National Supercomputer Center in Jinan, Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Shandong Academy of Sciences, Qilu University of Technology, Jinan, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract:
Weakly supervised object localization (WSOL) is both a promising and challenging task that aims to achieve object localization exclusively through image category labels for supervision. Visual transformers have recently been applied to WSOL, demonstrating significant success through the exploitation of long-range feature dependencies in self-attention mechanisms. However, the transformer-based approach suffers from the same partial activation problem as the CNN-based approach due to the use of the classification task to train self-attention map, i.e., only a few discriminative regions are assigned high attention response and thus the localization map does not cover the whole object. To alleviate this problem, we propose a plug-and-play Token Masking Transformer (TMT) method to help transformer-based WSOL methods to obtain a more complete localization map by dynamic discriminative token masking. Specifically, a batch-wise discriminative token selection strategy is first introduced to flexibly determine the tokens to be masked in each image. Then, we design a token masking transformer block to perform token masking and inspire the network to mine more object-related tokens. Besides, we also design an intermediate token activation loss to further improve the performance of TMT by imposing constraints on intermediate tokens. Extensive experiments demonstrate that our TMT can substantially improve the performance of existing transformer-based methods without increasing the computational cost, and achieves state-of-the-art performance on two mainstream benchmarks.

Abstract:
When presented with a question regarding an image, Visual Commonsense Reasoning (VCR) offers not only a correct answer but also a rationale to justify the answer. Existing methods simply combine features from multiple modalities onto a shared dimension space, which doesn't align with human reasoning patterns, resulting in inadequate cross-modal and intra-modal reasoning behaviors. On the one hand, inadequate cross-modal reasoning arises from existing models relying on semantic correlations between answers and rationales in both textual modalities rather than the generative process of human reasoning from visual to textual modality. On the other hand, inadequate intra-modal reasoning arises from the incapacity of existing models to leverage previously acquired object relations beyond current observations like humans. To this end, we propose a novel Relation Inference Enhancement Network (RIE-Net), which enhances reasoning ability based on cross-modal image analysis and introduces intra-modal relational reasoning modules to memorize reasoning knowledge. To enhance the cross-modal association between images and rationales, RIE-Net introduces a cross-modal image analysis module, which eliminates language bias between answers and rationales by generating rationale from images. In addition, to comprehend and retain relational knowledge, RIE-Net introduces intra-modal relational reasoning modules to capture prior knowledge associated with various object categories and enhance the model's understanding of visual-spatial relationships. Quantitative and qualitative evaluations of the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.

Abstract:
In recent years, the use of large-scale pre-trained models for vision-language tasks has gained significant attention and has shown promising results in video question answering. However, the increasing size of these models has made full fine-tuning impractical. Therefore, there is a growing need for research in parameter-efficient transfer learning for downstream tasks. To address this challenge, we introduce a novel parameter-efficient transfer learning technique based on a temporal reasoning adapter for the video question answering task. Our proposed approach captures the temporal relationships within videos, giving the model visual reasoning ability and the ability to acquire knowledge from language models. Our extensive experiments on four video question answering datasets indicate that our method can match or even outperform full fine-tuning strategies and state-of-the-art models, while having the advantage of parameter efficiency.

Abstract:
A predator can quickly respond to a misjudged decision and hunt a camouflaged target by analyzing its movement. This decision compensation and movement analysis for hunting are closely tied to temporal and spatial information. The same holds in the video camouflaged object detection (VCOD) task, where the captured temporal information may be misjudged and the spatial information tends to be inaccurate in complex scenes. Thus, two key questions should be considered in the VCOD task: how can a model cope with misjudged temporal information, and how can spatial features interact with temporal information to understand dynamic scenes? To this end, we propose a predator-mimicking network (PMNet) equipped with a selective temporal alignment module (STAM) and a temporal-spatial feedback module (T-SFM). The STAM is designed to alleviate the influence of a misjudged motion trajectory through our adaptive selection mechanism. In T-SFM, the temporal information works as self-knowledge that assists and interacts with spatial features, enabling the model to effectively detect the camouflaged object. Experimental results demonstrate that our method achieves state-of-the-art performance on VCOD benchmarks. Furthermore, our model generalizes to the video salient object detection (VSOD) task, where it also outperforms existing state-of-the-art methods. The source code will be publicly available at https://github.com/LiuTingWed/CriDiff.

Abstract:
Few-Shot Action Recognition (FSAR) aims to recognize actions of novel classes from limited annotated training data of those classes. Most FSAR methods implicitly follow few-shot image classification solutions by focusing solely on appearance-level matching between support and query videos, such as part-level matching, frame-level matching, and segment-level matching. However, these methods almost always have two main limitations: 1) they generally ignore the relationship among these part-, frame-, and segment-level features, and 2) they may mismatch actions of the same class under fast-term and slow-term dynamics. To this end, we present a novel Hierarchical Motion-enhanced Matching (HM^2) framework to hierarchically learn relation-aware multi-modal features and jointly promote multi-modal matching, including appearance-level matching on segments, frames, and parts, as well as motion-level matching on dynamics. Specifically, we first propose a new Hierarchical Tokenizer (HT) to learn multi-modal features, namely utilizing a hierarchical Transformer to learn appearance-level features, along with a Slow-Fast Aware Motion (SFAM) strategy to learn motion-level features covering fast- and slow-term dynamics. Next, we propose a new Relation-aware Matcher (RM) to match the multi-modal features by leveraging a Hierarchical Relational Graph Convolutional Network (H-RGCN) to capture the relationship among these appearance-level features. Further, a Dual Sample-to-Class Matching (DSCM) strategy is proposed to measure the bidirectional similarities among appearance- and motion-modal features by sample-to-class matching and class-to-sample matching. Extensive experiments on four standard FSAR datasets demonstrate significant performance improvements of HM^2 compared with the state-of-the-art methods.

Abstract:
Most existing super-resolution (SR) methods assume that the degradation is fixed (e.g., bicubic downsampling), whereas their performance would be degraded if the actual degradation differs from this assumption. To deal with unknown degradations, existing unknown SR methods are committed to learning degradation representation to generate high-resolution images. Nevertheless, they ignore that the impact of degradations on images is related to image content, or they learn degradation representations without any constraints. In this article, we propose a degradation maps extractor for unknown SR. Specifically, we learn degradation maps and condense them into a one-dimensional representation space to distinguish various degradations, which obtains distinguishable degradation maps and preserves the connection with the image contents. Furthermore, we propose a degradation map-guided SR (DMGSR) network, in which the degradation maps adaptively influence the SR process by applying channel attention and spatial attention to middle features. With the cooperation of the degradation maps extractor and the degradation maps-guided SR network, our network can flexibly handle various degradations. Experimental results show that our model achieves state-of-the-art performance in quantitative and qualitative metrics for the unknown SR task.

Abstract:
The field of image aesthetics assessment (IAA) is rapidly advancing due to its wide applications. However, relying solely on single-modal information for aesthetic evaluation presents inherent limitations. While multimodal IAA models incorporating user comments have achieved significant advancements, these comments are often unavailable due to privacy concerns and practical considerations, and they also introduce additional computational overhead during inference. To address this issue, we propose a cross-modal hierarchical knowledge distillation method, termed HKD-IAA, to enhance the performance of unimodal image models effectively. Specifically, HKD-IAA comprises four components: feature extraction, feature decomposition, hierarchical knowledge distillation, and dynamic decay. During training, we first decompose the extracted features into a weighted sum of basic aesthetic elements and their corresponding weights, thereby reducing the learning difficulty for the student model. Building on this, we design a new hierarchical knowledge distillation framework, which aligns features at the feature, relation, and response levels to effectively transfer the knowledge from the teacher model. Finally, we introduce a dynamic decay strategy to adjust the weight of the distillation loss, thereby enhancing the student model's learning effectiveness during training. Extensive experiments on two benchmark datasets validate that the proposed method achieves state-of-the-art performance using only visual modal data. Our code is available at https://github.com/Hangwei-Chen/HKD-IAA.

Abstract:
Food categorization is pivotal in numerous aspects of everyday life, assisting in the selection of food, managing diets, and addressing essential survival requirements. By leveraging the complementary information of various views, multi-view learning usually achieves superior performance compared to single-view learning methods. However, owing to the unrestrained openness of internet platforms and potential inconsistencies in food data collection processes, multi-view features often suffer from data loss, resulting in incomplete multi-view food data. Conventional multi-view clustering methods often fail to capitalize effectively on the diverse correlations contained in food data, and exhibit limitations in dealing with the noise and irregularities pervading different views. Addressing these challenges, this paper presents the Robust Multi-Graph Contrastive network (RMGC) for multi-view food clustering. RMGC combines multi-view representation learning with multi-graph contrastive regularization, creating a cohesive framework to manage incomplete multi-view data. By developing a multi-view encoding network, RMGC blends various views into a cohesive representation while assessing the significance of each view. More importantly, the proposed robust multi-graph contrastive regularization enhances the precision of the learned representation and successfully counteracts the noise and unreliability in multi-view data. Experiments conducted across several multi-view datasets demonstrate the effectiveness of RMGC, showing its superiority over existing methods. Our method not only advances food categorization but also contributes to the broader field of multi-view learning, offering innovative solutions for handling incomplete and noisy multi-view data.

Abstract:
Although there are several food-level benchmarks for food-related learning, the lack of fine-grained ingredient annotation significantly impedes progress in food scene understanding. In this study, we focus on Chinese food understanding, which involves fine-grained ingredient detection and cross-modal ingredient retrieval. Specifically, to support studies on Chinese food understanding, we build the first cross-modal ingredient-level dataset, called CMIngre, which contains 8,001 image-text pairs from three different sources, i.e., dishes, recipes, and user-generated content, covering 429 distinct ingredients and 95,290 bounding boxes. Based on CMIngre, we evaluate the performance of traditional CNN-based detection algorithms and transformer-based pre-trained large models for ingredient detection. We also propose baseline methods for the cross-modal ingredient retrieval task in both the end-to-end and two-stage settings. Extensive experiments on CMIngre demonstrate the effectiveness of our proposed methods on food understanding.

Abstract:
Image segmentation under low-light conditions is essential in real-world applications, such as autonomous driving and video surveillance systems. The recent Segment Anything Model (SAM) exhibits strong segmentation capability in various vision applications. However, its performance can be severely degraded under low-light conditions. On the other hand, multimodal information has been exploited to help models construct a more comprehensive understanding of scenes under low-light conditions by providing complementary information (e.g., depth). Therefore, in this work, we present a pioneering attempt to elevate a unimodal vision foundation model (e.g., SAM) to a multimodal one by efficiently integrating additional depth information under low-light conditions. To achieve that, we propose a novel method called Depth Perception SAM (DPSAM) based on the SAM framework. Specifically, we design a modality encoder to extract the depth information and the Depth Perception Layers (DPLs) for mutual feature refinement between RGB and depth features. The DPLs employ the cross-modal attention mechanism to mutually query effective information from both RGB and depth for the subsequent feature refinement. Thus, DPLs can effectively leverage the complementary information from depth to enrich the RGB representations and obtain comprehensive multimodal visual representations for segmenting anything in the dark. As a result, our DPSAM maximally maintains the inherent expertise of SAM for RGB image segmentation and further leverages the strength of depth for an enhanced segment-anything capability, especially for cases that are likely to fail with RGB only (e.g., low-light or complex textures). As demonstrated by extensive experiments on four RGBD benchmark datasets, DPSAM clearly improves segment-anything performance in the dark, e.g., +12.90% mIoU and +16.23% mIoU on LLRGBD and DeLiVER, respectively.
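
The mutual querying between RGB and depth features described above can be illustrated with a minimal sketch of bidirectional cross-attention. This is not the DPSAM implementation: the function names (`cross_attention`, `mutual_refine`), the absence of learned projection matrices, and the simple residual update are all assumptions made to keep the example small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query token attends over all key/value tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mutual_refine(rgb, depth):
    """Hypothetical mutual refinement: RGB queries depth, depth queries RGB, residual add."""
    rgb_from_depth = cross_attention(rgb, depth, depth)
    depth_from_rgb = cross_attention(depth, rgb, rgb)
    rgb_out = [[a + b for a, b in zip(r, u)] for r, u in zip(rgb, rgb_from_depth)]
    depth_out = [[a + b for a, b in zip(d, u)] for d, u in zip(depth, depth_from_rgb)]
    return rgb_out, depth_out


rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]   # two toy RGB tokens
depth_tokens = [[0.5, 0.5]]             # one toy depth token
rgb_refined, depth_refined = mutual_refine(rgb_tokens, depth_tokens)
```

A real layer would add learned query, key, and value projections and normalization; the sketch only shows the direction of information flow in each branch.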

Abstract:
Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with points, boxes, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements pose challenges to its practical applicability in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize the EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we designed the Inter-Frame Coherence module, which independently integrates the temporal information from both visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge, and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.

Abstract:
Few-shot multi-label recognition (FS-MLR) presents a significant challenge due to the need to assign multiple labels to images with limited examples. Existing methods often struggle to balance the learning of novel classes and the retention of knowledge from base classes. To address this issue, we propose a novel Cross-Modality Prompts (CMP) approach. Unlike conventional methods that rely on additional semantic information to mitigate the impact of limited samples, our approach leverages multimodal prompts to adaptively tune the feature extraction network. A new FS-MLR benchmark is also proposed, which includes single-label training and multi-label testing, accompanied by benchmark datasets constructed from MS-COCO and NUS-WIDE. Extensive experiments on these datasets demonstrate the superior performance of our CMP approach, highlighting its effectiveness and adaptability. Our results show that CMP outperforms CoOp on the MS-COCO dataset with a maximal improvement of 19.47% and 23.94% in mAPharmonic for 5-way 1-shot and 5-way 5-shot settings, respectively.

Abstract:
Personalized text-to-image generation aims to synthesize images tailored to individual user preferences. Existing methods primarily generate customized content using a few reference images, but they often struggle to mine user preferences from historical records and thus fail to synthesize truly personalized content. In addition, it is difficult to directly incorporate the extracted features of user preferences into the feature space of the generation model, since there exists a considerable gap between them. In this paper, we propose a novel multi-view personalized text-to-image generation method based on the diffusion model, named MVP-Diffusion, which learns instance- and user-level preferences from historical records and integrates them into the generation model. For instance-level user preference modeling, we employ a chain-of-thought prompting strategy to deduce preference keywords and integrate them into input prompts with the aid of a large language model. For user-level preference modeling, we construct a learnable embedding for each user to capture more comprehensive preferences by analyzing their historical records. An adaptive user preference fusion module is proposed to inject user preferences into the generation model via a set of learnable parameters. Experimental results demonstrate that the proposed method significantly enhances the personalization of the generated images compared to other personalized text-to-image generation methods.

Abstract:
Cross-modal hash learning has drawn widespread attention for large-scale multimodal retrieval because of its stability and efficiency in approximate similarity searches. However, most existing cross-modal hashing approaches employ discrete label-guided information to coarsely reflect intra- and intermodality correlations, making them less effective at measuring the semantic similarity of data with multiple modalities. In this paper, we propose a new heterogeneous pairwise-semantic enhancement hashing (HPsEH) method for large-scale cross-modal retrieval by distilling higher-level pairwise-semantic similarity from supervision information. First, we adopt a supervised self-expression to learn a data-specific quantified semantic matrix, which uses real values to measure both the similarity and dissimilarity ranks of paired instances, such that the intrinsic semantics of the data can be well captured. Then, we fuse the label-based information and quantified semantic similarity to collaboratively learn the hash codes of multimodal data, such that both the intermodality consistency and modality-specific features can be simultaneously obtained during hash code learning. Moreover, we employ effective iterative optimization to address the discrete binary solution and massive pairwise matrix calculation, making HPsEH scalable to large-scale datasets. Extensive experimental results on three widely used datasets demonstrate the superiority of our proposed HPsEH method over most state-of-the-art approaches.

Abstract:
In recent years, there has been a notable increase in interest in image-based surface normal estimation. These approaches are capable of predicting the surface normal of real scenes using only an image, thereby facilitating a more profound comprehension of the actual scene and providing assistance with other perceptual tasks. However, dense regression predictions are susceptible to misdirection when encountering intricate details, which presents a paradoxical challenge for image-based surface normal estimation in reconciling detail and density. By introducing quaternion rotations as a fusion module with geometric properties, we propose a quaternion-based refined network structure that fuses detailed and structural information. Specifically, we design a high-resolution surface normal baseline with a streamlined structure to extract fine-grained features while reducing the angular error in surface normal regression values caused by downsampling. Additionally, we propose a subtle angle loss function that prevents subtle changes from being overlooked without extra information, further enhancing the model's ability to learn detailed information. The proposed method demonstrates state-of-the-art performance compared to existing techniques on three real-world datasets comprising indoor and outdoor scenes. The results demonstrate the effectiveness of our deep learning approach that incorporates geometric prior guidance, highlighting improved robustness in applying deep learning methods. The source code will be released upon acceptance.
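
As background for the quaternion rotations mentioned above, the following minimal sketch rotates a unit surface normal by a unit quaternion using the standard Hamilton product, v' = q v q*. How the paper wires such rotations into its fusion module is not stated in the abstract, so the helper names (`quat_mul`, `axis_angle_quat`, `rotate_normal`) are illustrative assumptions only.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def axis_angle_quat(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def rotate_normal(normal, q):
    """Rotate a 3D normal by unit quaternion q via q * (0, n) * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *normal)), conj)
    return (rx, ry, rz)


# Rotating the x-axis normal by 90 degrees about z gives the y-axis normal.
q = axis_angle_quat((0.0, 0.0, 1.0), math.pi / 2.0)
rotated = rotate_normal((1.0, 0.0, 0.0), q)
```

Because a unit quaternion rotation preserves vector length, the rotated normal stays unit length without renormalization, which is one reason such a parameterization can be convenient for normal maps.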

Abstract:
Human motion recognition is extremely important for many practical applications in several disciplines, such as surveillance, medicine, sports, gait analysis, and computer graphics. Graph convolutional networks (GCNs) enhance the accuracy and performance of skeleton-based action recognition. However, this approach has difficulties in modeling long-term temporal dependencies. In addition, the fixed topology of the skeleton graph is not sufficiently robust to extract features for skeleton motions. Although transformers that rely entirely on self-attention have demonstrated great success in modeling global correlations between inputs and outputs, they ignore the local correlations between joints. In this study, we propose a novel segmented spatiotemporal skeleton graph-attention network (S3GAAR) to effectively learn different human actions and concentrate on the most operative part of the human body for each action. The proposed S3GAAR models spatial-temporal features through spatiotemporal attention for each segment to capture short-term temporal dependencies. Because many human actions, such as mutual actions, focus on one or more body parts, our method divides the human skeleton into three segments: superior, inferior, and extremity joints, and extracts the features of each segment individually. Moreover, our segmented spatiotemporal graph introduces additional edges between important distant joints in the same segment. The experimental results show that our method outperforms state-of-the-art methods by up to 1.1% on two large-scale benchmark datasets, NTU-RGB+D 60 and NTU-RGB+D 120.

Abstract:
As imaging sensor technology in remote sensing has advanced quickly, multimodal fusion classification has become an important research direction in land cover and urban planning classification tasks. While generative models and image classification have greatly benefited from diffusion models, the present ones primarily concentrate on single-modality-driven diffusion processes. Therefore, this paper presents a 3D self-awareness diffusion network (3DSA-DiffNet) for multispectral (MS) and panchromatic (PAN) image fusion classification, which would make it easier to classify heterogeneous data from various sensors. First, in order to model the relationship between multi-channel spectra and multi-pixel spatial distributions as well as samples, respectively, a spatial-spectral joint denoising network (S^2JD-Net) is proposed. It can incorporate the diffusion process into the neural network to enhance the quality of diffusion features. Secondly, to imitate the brain's spatial-spectral coexistence learning mechanism, this work offers a 3D self-awareness module (3DSA-Module) that can learn the weight of each pixel in 3D space, resulting in extraordinarily high feature representation capabilities. Finally, experimental verification demonstrates that the 3D self-awareness diffusion fusion network driven by brain inspiration outperforms more sophisticated approaches on the Xi'an, Huhhot, and Muufl datasets.

Abstract:
By exploiting a substantial amount of unlabeled data, semi-supervised learning (SSL) boosts recognition performance when only a limited number of labels are provided. However, conventional methods assume a class-balanced data distribution, which is difficult to realize in practice due to the long-tailed nature of real-world data. While addressing data imbalance is a well-explored area in supervised learning paradigms, directly transferring existing approaches to SSL is nontrivial, as the distribution of the unlabeled data is unknown in SSL. In light of this, we introduce the Balanced Memory Bank (BMB), a framework for long-tailed semi-supervised learning. The core of BMB is an online-updated memory bank that caches historical features alongside their corresponding pseudo-labels, and the memory is also carefully maintained to ensure the data therein are class-rebalanced. Furthermore, an adaptive weighting module is incorporated to work jointly with the memory bank to further re-calibrate the biased training process. Experimental results across various datasets demonstrate the superior performance of BMB compared with state-of-the-art approaches. For instance, BMB achieves an improvement of 8.2% on the 1% labeled subset of ImageNet127 and 4.3% on the 50% labeled subset of ImageNet-LT.
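
One simple way to picture a class-rebalanced memory bank is a set of per-class buffers with equal capacity, so frequent (head) classes cannot crowd out rare (tail) classes. The sketch below is an illustrative assumption, not the BMB update rule: the abstract does not specify how entries are admitted or evicted, and the class name, the fixed per-class cap, and the first-in-first-out eviction are all hypothetical.

```python
from collections import deque


class BalancedMemoryBank:
    """Toy memory bank: one bounded FIFO buffer per class (hypothetical design)."""

    def __init__(self, num_classes, per_class_capacity):
        # deque(maxlen=...) silently drops the oldest entry once the cap is reached.
        self.buckets = [deque(maxlen=per_class_capacity) for _ in range(num_classes)]

    def update(self, features, pseudo_labels):
        """Cache each feature under its pseudo-label."""
        for feature, label in zip(features, pseudo_labels):
            self.buckets[label].append(feature)

    def class_counts(self):
        return [len(bucket) for bucket in self.buckets]

    def items(self):
        """Return all cached (feature, pseudo_label) pairs."""
        return [(f, y) for y, bucket in enumerate(self.buckets) for f in bucket]


bank = BalancedMemoryBank(num_classes=2, per_class_capacity=3)
# A long-tailed stream: ten samples pseudo-labeled 0, only two pseudo-labeled 1.
bank.update(features=list(range(10)), pseudo_labels=[0] * 10)
bank.update(features=[100, 101], pseudo_labels=[1, 1])
counts = bank.class_counts()
```

With the cap in place, the head class contributes at most three cached entries here, so a batch sampled from the bank is far less skewed than the incoming stream.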

Abstract:
Cloud services have attracted extensive attention due to their low cost, agility, and mobility. However, when processing data on cloud servers, users may worry about semi-honest third parties stealing their private information; hence, data encryption is applied for privacy protection. Inpainting is a technique that reconstructs certain undesirable regions in an image in an imperceptible manner, which can be accomplished by searching for well-matching candidate patches and copying them to the to-be-inpainted locations. However, when the image is encrypted, searching for matched candidate patches becomes challenging. Therefore, to tackle these data-privacy issues for image inpainting over a cloud infrastructure, we propose an image inpainting scheme using Markov random field (MRF) modeling in the encrypted domain. In this scheme, the sender encrypts the to-be-inpainted image by using a homomorphic cryptosystem that supports homomorphic ciphertext comparison. Then, the cloud realizes the MRF-based inpainting for encrypted images through some specific homomorphic operations. In addition, secure context descriptors are utilized to improve the inpainting of textures and structures. Finally, the receiver obtains the inpainted result through image decryption. The proposed scheme is shown to be secure against various cryptographic attacks. Qualitative and quantitative results demonstrate that our scheme achieves better inpainted results in structure compared with state-of-the-art schemes in the encrypted domain.

Abstract:
We propose a framework for creating articulated human avatars, editing their styles, and animating the human avatars from three different types of text instructions. The three types of instructions, identity, edit, and action, are fed into three models that generate, edit, and animate human avatars. Specifically, the proposed framework takes identity instruction and multi-view pose condition images to generate the images of a human using the avatar generation model. Then, the avatar can be edited with text instructions by changing the style of the images generated. We apply the Neural Radiance Field (NeRF) and Poisson reconstruction to extract a human mesh model from images and assign linear blend skinning (LBS) weights to the vertices. Finally, the action instructions can animate human avatars, where we use the off-the-shelf method to generate the motions from text instructions. Notably, our proposed method adapts the appearance of hundreds of different individuals to construct a conditionally editable avatar-generated model, allowing easy creation of 3D avatars using text instructions. We demonstrate high-fidelity 3D animatable avatar creation with text instructions on various datasets and highlight a superior performance of the proposed method compared to the previous studies.
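
The linear blend skinning (LBS) step mentioned above follows a standard formula: each deformed vertex is the weighted sum of the vertex transformed by every bone, v' = sum_i w_i (R_i v + t_i), with weights summing to one. The sketch below shows that formula only; the function names and the 3x4 `[R | t]` matrix layout are assumptions, and nothing here reflects how the paper assigns the weights.

```python
def transform_point(transform, point):
    """Apply a 3x4 rigid transform [R | t] (list of three rows) to a 3D point."""
    return [
        sum(transform[r][c] * point[c] for c in range(3)) + transform[r][3]
        for r in range(3)
    ]


def linear_blend_skinning(vertex, weights, bone_transforms):
    """Blend the per-bone transformed positions of one vertex by its skinning weights."""
    blended = [0.0, 0.0, 0.0]
    for weight, transform in zip(weights, bone_transforms):
        moved = transform_point(transform, vertex)
        blended = [b + weight * m for b, m in zip(blended, moved)]
    return blended


identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0],   # translate by +2 along x
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]

# A vertex influenced equally by a static bone and a bone shifted by 2 moves by 1.
skinned = linear_blend_skinning([1.0, 1.0, 1.0], [0.5, 0.5], [identity, shift_x])
```

Blending transformed positions linearly is cheap and widely used, though it is known to produce volume loss around strongly twisted joints.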

Abstract:
Deep learning methods often struggle with the domain shift problem, leading to poor generalization on out-of-domain (OOD) data. To address the problem, domain generalization (DG) has been proposed to leverage the source domains to train a model that can generalize to OOD data. Existing domain generalization methods primarily focus on learning domain invariance, but they fail to ensure proximity among samples within the same category when domains are aligned for domain-invariant learning. Consequently, their generalization performance remains suboptimal. In this paper, we propose a novel approach to address this issue by iteratively approximating the category domain-invariant distribution from all domains. Our method involves an iterative loop where we initially estimate the domain-invariant distribution for each category by averaging the statistical characteristics across all domains. Then the adversarial perturbation alignment is adopted to keep each sample close to its corresponding category domain-invariant distribution. With the iterative loop, the deep network is optimized for robust domain invariance learning. Extensive experiments demonstrate that our proposed method consistently outperforms state-of-the-art approaches across various scenarios.
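
The initial estimation step described above, averaging statistical characteristics across all domains for each category, can be sketched as follows. The abstract does not say which statistics are used, so this example assumes the per-dimension feature mean and variance; the function names and the nested-dictionary input layout are likewise hypothetical.

```python
def mean_and_var(vectors):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    var = [sum((v[j] - mean[j]) ** 2 for v in vectors) / n for j in range(dim)]
    return mean, var


def category_invariant_stats(features):
    """features[domain][category] is a list of feature vectors.

    Returns {category: (mean, var)} where each statistic is the average of the
    per-domain statistics, so every domain counts equally regardless of size.
    """
    per_category = {}
    for domain_features in features.values():
        for category, vectors in domain_features.items():
            per_category.setdefault(category, []).append(mean_and_var(vectors))

    result = {}
    for category, stats in per_category.items():
        k = len(stats)
        dim = len(stats[0][0])
        mean = [sum(s[0][j] for s in stats) / k for j in range(dim)]
        var = [sum(s[1][j] for s in stats) / k for j in range(dim)]
        result[category] = (mean, var)
    return result


toy = {
    "photo": {"cat": [[0.0], [2.0]]},    # mean 1.0, variance 1.0
    "sketch": {"cat": [[4.0], [4.0]]},   # mean 4.0, variance 0.0
}
estimate = category_invariant_stats(toy)
```

In an iterative scheme such as the one described, these estimates would be recomputed as the network's features change, with an alignment step pulling each sample toward its category's estimate between updates.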

Abstract:
Repetitive Action Counting (RAC) aims to count the number of repetitive actions occurring in videos. In the real world, repetitive actions have great diversity and bring numerous challenges (e.g., viewpoint changes, non-uniform periods, and action interruptions). Existing methods based on the temporal self-similarity matrix (TSSM) for RAC are limited by their insufficient ability to capture action periods when applied to complicated daily videos. To tackle this issue, we propose a novel method named Hybrid Temporal Relation Modeling Network (HTRM-Net) to build diverse TSSMs for RAC. The HTRM-Net mainly consists of three key components: bi-modal temporal self-similarity matrix modeling, random matrix dropping, and local temporal context modeling. Specifically, we construct temporal self-similarity matrices by bi-modal (self-attention and dual-softmax) operations, yielding diverse matrix representations from the combination of row-wise and column-wise correlations. To further enhance matrix representations, we propose incorporating a random matrix dropping module to guide channel-wise learning of the matrix explicitly. After that, we inject the local temporal context of video frames and the learned matrix into temporal correlation modeling, which can make the model robust enough to cope with error-prone situations, such as action interruption. Finally, a multi-scale matrix fusion module is designed to aggregate temporal correlations adaptively in multi-scale matrices. Extensive experiments across intra- and cross-datasets demonstrate that the proposed method not only outperforms current state-of-the-art methods but also exhibits robust capabilities in accurately counting repetitive actions in unseen action categories. Notably, our method surpasses the classical TransRAC method by 20.04% in MAE and 22.76% in OBO.
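
To make the two matrix constructions concrete, the sketch below builds a temporal self-similarity matrix from per-frame features in two ways: a row-wise softmax (as in self-attention) and a dual-softmax, taken here to mean the element-wise product of the row-wise and column-wise softmax. That reading of "dual-softmax" is an assumption borrowed from feature-matching literature, and the helper names and dot-product similarity are illustrative rather than the HTRM-Net definitions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(frames):
    """Raw T x T dot-product similarity between per-frame feature vectors."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in frames] for fi in frames]


def row_softmax(matrix):
    """Normalize each row: the self-attention style matrix."""
    return [softmax(row) for row in matrix]


def col_softmax(matrix):
    """Normalize each column by transposing, applying softmax, transposing back."""
    cols = [softmax(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


def dual_softmax(matrix):
    """Element-wise product of row-wise and column-wise softmax (assumed definition)."""
    rows, cols = row_softmax(matrix), col_softmax(matrix)
    return [[r * c for r, c in zip(rr, cc)] for rr, cc in zip(rows, cols)]


# Toy video with period 2: frames 0 and 2 match, frames 1 and 3 match.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
sim = similarity(frames)
attention_tssm = row_softmax(sim)
dual_tssm = dual_softmax(sim)
```

In the toy example, both matrices assign more mass to frame pairs one period apart than to adjacent frames, and the dual-softmax version keeps an entry high only when the match is strong in both its row and its column.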

Abstract:
Semi-supervised semantic segmentation pursues a holistic pixel-wise understanding of unseen images with limited annotation. To this end, existing methods focus on regularizing per-pixel prediction consistency within unlabeled data, while rarely modeling contextual relationships. But in fact, contextual semantics can provide valuable clues for scene understanding like inner-object continuity and spatial relationships' causality. Thus, in this paper, we propose a Dual-level Masked Semantics Inference (DMSI) that takes the initiative to explicitly learn contextual relationships via enforcing our model to infer the semantics of a pixel according to its surrounding contexts. This allows our model to exhaust accurate semantics by incorporating inter-pixel context clues, further leading to comprehensive segmentation. Specifically, DMSI comprises two main components. 1) Dual-level mask consistency regularization (DMCR) that learns the ability of semantics inference by aligning the predictions of masked views with the prediction of the complete view. The masked views here come from both the image level and feature level, where our model captures low-level attributes and high-level representations respectively. 2) AdaMask that provides a proper mask position and ratio for each image, guiding our model to focus on semantic-rich regions while providing balanced training between hard and easy samples. Through learning the ability of semantic inferring, DMSI remarkably enhances the interaction between pixels, further progressively intensifying the understanding of semantics. Extensive experiments under various settings on Cityscapes and Pascal VOC 2012 show that DMSI achieves new state-of-the-art performances. Furthermore, analysis indicates that our method has superiority in mining inter-pixel semantic relationships and improving robustness facing noise corruption.

Abstract:
Hyperspectral videos contain richer spectral and physical features than RGB videos and thus have greater potential for use in object tracking. The mainstream hyperspectral object tracking approach involves the integration of multiple RGB-based video tracking models. Although ensembles of multiple models can effectively utilize spectral information and improve tracker performance, this approach has high computational complexity, making it difficult to meet the real-time requirements of video object tracking. To bridge the gap, we propose a new hyperspectral object tracking framework (HotMoE) based on Mixture-of-Experts (MoE). HotMoE leverages a divide-and-conquer strategy, where only a subset of expert models is computed for each input, reducing computational complexity while maintaining performance. In this paper, we first design a splitter to group multiple spectral bands into multiple false-color images based on spectral correlations. Then, we design a hyperspectral MoE router that can adaptively learn to aggregate spectral image feature information and route it to suitable experts. Different experts can handle various scenarios, and HotMoE effectively utilizes the capabilities of different experts to obtain better overall performance. Compared with previous state-of-the-art hyperspectral object tracking networks, our model has significantly reduced inference time and performs well, with a processing speed of 43.7 FPS and an AUC of 0.704 with the HOT2022 dataset.
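
The computational saving of a Mixture-of-Experts design comes from sparse routing: a gate scores all experts, but only the top-k are evaluated and their outputs are mixed. The sketch below shows generic top-k routing under stated assumptions; the function names, the softmax over the selected gate logits, and the toy experts are hypothetical and are not taken from HotMoE.

```python
import math


def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = route_top_k(gate_logits, k)
    out = [0.0] * len(x)
    for index, weight in weights.items():
        y = experts[index](x)  # unselected experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


calls = []


def make_expert(scale):
    """Toy expert that scales its input and records that it ran."""
    def expert(x):
        calls.append(scale)
        return [scale * v for v in x]
    return expert


experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
# The gate strongly prefers experts 1 and 2, so expert 0 is skipped.
output = moe_forward([1.0, 2.0], gate_logits=[0.0, 5.0, 5.0], experts=experts, k=2)
```

In a trained model the gate logits would come from a learned router applied to the input features; here they are supplied directly to keep the routing logic visible.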

Abstract:
Visible-infrared person re-identification (VIReID) primarily deals with matching identities across person images from different modalities. Due to the modality gap between visible and infrared images, cross-modality identity matching poses significant challenges. Recognizing that high-level semantics of pedestrian appearance, such as gender, shape, and clothing style, remain consistent across modalities, this paper intends to bridge the modality gap by infusing visual features with high-level semantics. Given the capability of Contrastive Language-Image Pre-training (CLIP) to sense high-level semantic information corresponding to visual representations, we explore the application of CLIP within the domain of VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of a Modality-specific Prompt Learner, Semantic Information Integration (SII), and High-level Semantic Embedding (HSE). Specifically, considering the diversity stemming from modality discrepancies in language descriptions, we devise bimodal learnable text tokens to capture modality-private semantic information for visible and infrared images, respectively. Additionally, acknowledging the complementary nature of semantic details across different modalities, we integrate text features from the bimodal language descriptions to achieve comprehensive semantics. Finally, we establish a connection between the integrated text features and the visual features across modalities. This process embeds rich high-level semantic information into visual representations, thereby promoting the modality invariance of visual representations. The effectiveness and superiority of our proposed CSDN over existing methods have been substantiated through experimental evaluations on multiple widely used benchmarks.

Abstract:
Semi-supervised domain adaptation (SSDA) has been extensively researched due to its ability to improve classification performance and generalization ability of models by using a small amount of labeled data on the target domain. However, existing methods cannot effectively adapt to the target domain due to difficulty in fully learning rich and complex target semantic information and relationships. In this paper, we propose a novel SSDA learning framework called semantic regularization learning (SERL), which captures the target semantic information from multiple perspectives of regularization learning to achieve adaptive fine-tuning of the source pre-trained model on the target domain. SERL includes three robust semantic regularization techniques. Firstly, semantic probability contrastive regularization (SPCR) helps the model learn more discriminative feature representations from a probabilistic perspective, using semantic information on the target domain to understand the similarities and differences between samples. Additionally, adaptive weights in SPCR can help the model learn the semantic distribution correctly through the probabilities of different samples. To further comprehensively understand the target semantic distribution, we introduce hard-sample mixup regularization (HMR), which uses easy samples as guidance to mine the latent target knowledge contained in hard samples, thereby learning more complete and complex target semantic knowledge. Finally, target prediction regularization (TPR) regularizes the target predictions of the model by maximizing the correlation between the current prediction and the past learned objective, thereby mitigating the misleading of semantic information caused by erroneous pseudo-labels. Extensive experiments on three benchmark datasets demonstrate that our SERL method achieves state-of-the-art performance.

Abstract:
Incremental object detection (IOD) aims to train an object detector on non-stationary data streams without forgetting previous knowledge. Prevalent replay-based methods keep a buffer composed of carefully selected instances towards this goal. However, due to the limited storage space and uniform feature distribution, existing methods are prone to overfitting on replayed instances, leading to poor generalization on diverse test data. Additionally, the imbalance in data quantity makes the detector fail to distinguish old and new classes that are visually similar, introducing bias toward new classes. To enhance the diversity of stored instances and eliminate bias, we propose a Local Response Exploration (LRE) framework, which comprises three modules. First, Region-Entropy Instance Selector (REIS) introduces a novel metric to assess instance diversity based on the entropy of local responses. Second, Confusion-Guided Instance Replay (CGIR) replaces the previous random replay approach by replaying specific old class instances based on class similarity, ensuring that parameters for similar new and old classes are updated together, thereby mitigating bias and helping to mine discriminative patterns. Third, Confusion-Aware Region Segregation (CARS) adaptively differentiates biased regions from other regions based on local responses, reducing bias toward new classes while preserving relationships between new and old classes. Extensive evaluations on Pascal-VOC and MS COCO datasets demonstrate that our approach outperforms state-of-the-art methods in incremental object detection.

Abstract:
Large contrastive vision-language models (VLMs) have recently shown promise in skeleton-based action recognition. However, given the lack of skeleton frame-text training datasets for VLMs, aligning the representations between the skeleton frames and labels remains challenging. Specifically, two key limitations must be addressed. First, VLMs struggle to align abstract action labels' language representations with sequential skeleton frames containing primary action semantics, impeding the ability of language representations to represent primary action information effectively. Second, vision representations with high-order action information are difficult to align with labels' language representations because of the risk of homogenizing discriminative features from different data streams. To address these challenges, we propose a Contrastive Feedback Vision-Language (CFVL) model for 3D skeleton-based action recognition that consists of a language representations' feedback decoder and a data stream-adaptive projection module. The feedback decoder aligns the decoded language representations with the original skeleton inputs to help the model comprehend primary action vision information. The projection module employs adaptive structures to further extract spatiotemporal information from various data streams. Additionally, the data stream-adaptive projection module projects vision and text language representations into a unified high-latency semantic space. Discriminative action vision representations, along with consistent representation spaces, support the effective alignment of vision-language representations with high-order action information. The experimental results demonstrate the superior performance of the proposed CFVL model on the Northwestern-UCLA, PKU MMD, NTU RGB+D 60/120, and FSD-10 datasets.

Abstract:
Image rectangling involves filling in the blanks created during image stitching through deformation techniques. However, existing methods still struggle with incomplete filling and distortion of content, ultimately affecting the overall visual impression and potentially hindering subsequent tasks such as recognition. In this work, we design a pixel-wise deformation framework that utilizes explicit edge guidance to maintain consistency of texture and structure, yielding rectangular images with natural structure. Specifically, we decouple motion into region-level and pixel-level components through uniform mesh warping and pixel-wise deformation to precisely rearrange the spatial distribution of all pixels. Uniform deformation preserves local structure within divided patches, while pixel-wise motion coordinates the consistency between patches. Their combination provides robust and accurate pixel-wise offsets for structure-preserved rectangling. To further bolster the consistency of structure and texture, we leverage edge information to establish structural constraints and design an edge-guided enhancement module to aid in restoring fine texture details. Additionally, because stitched images encompass both meaningful content and blank spaces, we incorporate a mask predictor that directs the network's attention solely towards content-rich regions to facilitate precise pixel-wise motion estimation. Experimental results demonstrate that our approach achieves state-of-the-art performance in rectifying irregular boundaries while contributing to downstream visual perception tasks.

Abstract:
Recent pan-sharpening methods have predominantly utilized techniques tailored for natural image scenes, often overlooking the unique features arising from non-overlapping spectral responses. In light of this, we have reevaluated the utility of panchromatic (PAN) images and introduced a theory anchored in the spectral response of satellite sensors. This posits that a PAN image is effectively a linear weighted summation of individual bands from its corresponding multi-spectral (MS) image, offset by an error map. We developed a deep unmixing network termed “DUN” that integrates an unmixing network, a fusion mechanism, and a distinctive mutual information contrastive loss function. Notably, the unmixing network is adept at decomposing a PAN image into its MS counterpart and error map. Further, the demixed image alongside the low-resolution MS image is channeled into the fusion network for pan-sharpening. Recognizing the challenges of achieving robust supervised learning directly from the unmixing phase, we have innovated a mutual information contrastive learning loss function, ensuring enhanced separation and minimizing overlap during the unmixing process. Preliminary experiments underscore both the quantitative and qualitative prowess of the proposed method.
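
The linear model stated above (a PAN image as a weighted sum of MS bands plus an error map) can be written out as a small sketch. It assumes the MS bands are already at PAN resolution and flattened to pixel lists; the names `synthesize_pan`, `weights`, and `error_map` are illustrative and do not come from the paper.

```python
def synthesize_pan(ms_bands, weights, error_map):
    """Return PAN pixels as a weighted sum of MS bands plus an error map.

    Per pixel i: pan[i] = sum_b weights[b] * ms_bands[b][i] + error_map[i].
    """
    if len(ms_bands) != len(weights):
        raise ValueError("one weight is required per MS band")
    if any(len(band) != len(error_map) for band in ms_bands):
        raise ValueError("every band must match the error map in size")
    return [
        sum(w * band[i] for w, band in zip(weights, ms_bands)) + error_map[i]
        for i in range(len(error_map))
    ]


# Two hypothetical bands with two pixels each.
ms_bands = [[0.25, 0.5], [0.75, 1.0]]
weights = [0.5, 0.5]
error_map = [0.0, 0.125]
pan = synthesize_pan(ms_bands, weights, error_map)
```

Unmixing, as described in the abstract, goes the other way: given the PAN image, recover the MS counterpart and the error map.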

Abstract:
In the multi-view domain, it is challenging to correctly label multiple people across viewpoints because of occlusions, visual ambiguities, appearance variation, etc. Deep learning, although having witnessed remarkable success in computer vision tasks, still remains underexplored for the multi-view labelling task, due to the lack of labelled multi-view datasets. In this paper, we propose a novel end-to-end deep neural network named Multi-View Labelling network (MVL-net) that addresses this issue. To overcome the dataset shortage, a large-scale multi-view dataset is generated by combining 3D human models and panoramic backgrounds, along with human poses and realistic rendering. In the proposed MVL-net, we first incorporate Transformer blocks to capture the non-local information for multi-view feature extraction. A matching net is then introduced to achieve multiple people labelling, by predicting matching confidence scores for pairwise instances from two views, thus addressing the problem of the unknown number of people when labelling across views. An additional geometry feature obtained from the epipolar geometry is integrated to leverage multi-view cues during training. To the best of our knowledge, the MVL-net is the first work using deep learning to train a multi-view labelling network. Comprehensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed method, which outperforms the existing state-of-the-art approaches.

Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan, China; School of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia; JD Explore Academy, Beijing, China; School of Information Science and Engineering, Yunnan University, Kunming, China; School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan, China

Abstract:
Many supervised learning-based facial expression recognition (FER) methods achieve good performance with the assistance of expression labels and a complex framework. However, annotations are inconsistent across expression datasets, which puts the above methods at a disadvantage on new expression datasets or datasets with limited training data. The objective of this paper is to learn self-supervised facial expression features that enable the FER model not to rely on the annotation consistency of the different datasets. Most current self-supervised learning algorithms based on contrastive learning learn the representation by forcing different augmented views of the same image to be close in the embedding space, but they cannot cover all variances within a semantic class. We propose a heatmap neighbor contrastive learning (HNCL) method for FER. It treats the images corresponding to the heatmap nearest neighbors of expressions as other positives, providing more semantic variations than pre-defined augmented transformations. Therefore, our HNCL can learn better expression features covering more intra-class variances, improving the performance of the FER model based on self-supervised learning. After fine-tuning, HNCL with a simple framework achieves top-three performance on the in-the-lab datasets and even matches the performance of state-of-the-art supervised learning methods on the in-the-wild datasets.

Abstract:
Event-based video reconstruction has emerged as an appealing research direction to break through the limitations of traditional cameras to better record dynamic scenes. Most existing methods reconstruct each frame from its corresponding event subset in chronological order. Since the temporal information contained in the whole event sequence is not fully exploited, these methods suffer from inferior reconstruction quality. In this paper, we propose to enhance event-based video reconstruction by leveraging the bidirectional temporal information in event sequences. The proposed model processes event sequences in a bidirectional fashion, allowing it to exploit bidirectional information in the whole sequence. Furthermore, a transformer-based temporal information fusion module is introduced to aggregate long-range information in both temporal and spatial dimensions. Additionally, we propose a new dataset for the event-based video reconstruction task which contains a variety of objects and movement patterns. Extensive experiments demonstrate that the proposed model outperforms existing state-of-the-art event-based video reconstruction methods both quantitatively and qualitatively.

Abstract:
The prevalence of digital content leakage via screen capture highlights the urgent need for robust watermarking solutions capable of withstanding cross-media transmission. Current approaches primarily focus on developing watermarking techniques resilient to screen-shooting distortions, where distinguishing the watermark signal from these distortions is paramount. In contrast, our study addresses an inverse problem by investigating the generation patterns of noise during screen-shooting and considering them as feasible representations of watermark signals. Leveraging Moiré patterns as one of the distortion signals naturally generated by the interaction between electronic screens and camera sensors, we propose Moiré-watermark, presenting watermark information encoded into meticulously crafted Moiré patterns within images. To enhance the naturalness of Moiré-watermark amidst the irregularities of screen-shooting Moiré patterns, we encode watermark signals using gratings at different angles. A corresponding angle-based decoding method facilitates effective blind extraction of watermarks. Comprehensive experimental evaluations under diverse conditions of distance, angle, lighting, and across various capturing and display devices, alongside comparisons with existing methods, validate the superior performance of Moiré-watermark.

Abstract:
The purpose of robust image watermarking is to embed a watermark into a carrier image in an invisible form and extract the watermark successfully even under noise interference conditions to achieve copyright confirmation and traceability. Although watermarking methods based on deep learning can improve the robustness by adding a noise simulation layer, few theoretical analyses of the codec structure have been conducted. Theoretical explainability is the theoretical basis for developing a network architecture, which plays a guiding role in network development. On the basis of the interpretability of convolutional networks, this paper analyzes the mathematical process of embedding and extracting watermarks in codecs and proposes a novel watermarking framework based on multi-layer watermark feature fusion. Specifically, the encoder can be a convolutional network structure of arbitrary depth, whereas the decoder needs only to adopt its corresponding deconvolution structure. To improve the quality and robustness of the generated watermarked image, the watermark is associated with an arbitrary layer feature space in the decoder. In the decoder, the network quickly converges to each original encoding feature space through the deconvolution structure, thus decoupling the watermark features. Finally, the watermark is extracted via the automatic fusion of multi-layer watermark features. The experimental results show that the proposed method is suitable for few-shot learning, and its invisibility, robustness and generalization performance on multiple datasets are significantly better than those of other advanced methods.

Abstract:
Recent years have witnessed remarkable advances in spatiotemporal predictive learning, with methods incorporating auxiliary inputs, complex neural architectures, and sophisticated training strategies. While SimVP has introduced a simpler, CNN-based baseline for this task, it still relies on heavy Unet-like architectures for spatial and temporal modeling and therefore still suffers from high complexity and computational overhead. In this paper, we propose SimVPv2, a streamlined model that eliminates the need for Unet architectures and demonstrates that plain stacks of convolutional layers, enhanced with an efficient Gated Spatiotemporal Attention mechanism, can deliver state-of-the-art performance. SimVPv2 not only simplifies the model architecture but also improves both performance and computational efficiency. On the standard Moving MNIST benchmark, SimVPv2 achieves superior performance compared to SimVP, with fewer FLOPs, about half the training time, and 60% faster inference. Extensive experiments across eight diverse datasets, including real-world tasks such as traffic forecasting and climate prediction, further demonstrate that SimVPv2 offers a powerful yet straightforward solution, achieving robust generalization across various spatiotemporal learning scenarios. We believe the proposed SimVPv2 can serve as a solid baseline to benefit the spatiotemporal predictive learning community.

Abstract:
Interactive recommender systems have garnered widespread attention due to their ability to dynamically update recommendation strategies based on user feedback, enhancing the user's interactive experience. To maximize long-term user satisfaction, existing research has incorporated reinforcement learning into interactive recommender systems and combined it with meta-learning to form a meta-reinforcement learning framework that further addresses the cold-start problem in interactive recommendation. However, on one hand, there are latent confounders affecting user feedback; on the other hand, since training samples are observed rather than experimentally obtained, selection bias and exposure bias exist in the interactive data. Most existing studies remove biases using the method of Inverse Propensity Score, which often utilizes fixed propensity scores and neglects the latent confounders affecting user feedback. In this paper, we propose an unbiased interactive recommender system (UIRS) based on a meta-reinforcement learning framework. To eliminate the impact of latent confounders in the state encoding process, we design a user preference representer consisting of three interconnected gated recurrent units. Additionally, we use the item recommendation probabilities output from the policy network as propensity scores and design the objective functions based on these scores, to eliminate biases while addressing latent confounders. Extensive experiments conducted on three benchmark datasets demonstrate that our proposed UIRS model achieves significant improvements over existing state-of-the-art baseline models.
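
For readers unfamiliar with Inverse Propensity Score weighting, the generic estimator can be sketched as follows. This is the textbook form, not the UIRS objective: each observed reward is divided by the probability that its item was exposed, and the function name `ips_estimate` is hypothetical.

```python
def ips_estimate(rewards, propensities):
    """Inverse-propensity-weighted mean of observed rewards.

    Each reward is divided by the probability that its item was exposed,
    so rarely exposed items count more. The propensities may be fixed
    constants or, as the abstract suggests, a policy's recommendation
    probabilities.
    """
    if not rewards or len(rewards) != len(propensities):
        raise ValueError("need one propensity per observed reward")
    if any(p <= 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must lie in (0, 1]")
    return sum(r / p for r, p in zip(rewards, propensities)) / len(rewards)


rewards = [1.0, 0.0, 1.0]          # observed feedback
propensities = [0.5, 0.25, 0.25]   # exposure probabilities (assumed known)
estimate = ips_estimate(rewards, propensities)
```

With fixed propensities the weights never change during training; replacing them with policy outputs lets the weights track the current recommendation strategy.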

Abstract:
Existing image inpainting methods face limitations in detail restoration. Although transformer-based models have made certain progress recently, the lack of hierarchical feature interaction and insufficient consideration of the importance of features at different network levels lead to semantic ambiguity in image reconstruction. To enhance the visual quality and accuracy of image inpainting, we adopt a multi-level feature fusion approach and propose a novel, efficient hierarchical feature collaboration transformer (HFCT). Our approach comprises two modules: dual stream gated feature fusion (DSGF) and region-separated attention module (RSAM), effectively capturing features at different levels of the network and enhancing inter-level information exchange. The DSGF module uses soft gating to fuse primary and advanced features, strengthening the connection from local to global consistency and reducing artifacts. The RSAM module resolves attention isolation issues in feature fusion through region-separated attention, strengthening the understanding of feature relationships, capturing more image semantics, and improving restoration accuracy. Extensive experiments on the Paris StreetView, CelebA-HQ, and Places2 benchmark datasets demonstrate that our proposed method achieves superior image inpainting quality compared to several state-of-the-art inpainting algorithms.
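
Soft gating, as used for fusing two feature streams, is commonly written as a sigmoid-weighted blend. The sketch below assumes that common form; the abstract does not give the DSGF equations, so `soft_gate_fuse` and its per-element `gate_logits` are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_gate_fuse(primary, advanced, gate_logits):
    """Blend two feature vectors element-wise with a soft gate in (0, 1).

    Assumed form: fused = g * primary + (1 - g) * advanced, g = sigmoid(logit).
    """
    if not len(primary) == len(advanced) == len(gate_logits):
        raise ValueError("all inputs must have the same length")
    fused = []
    for p, a, z in zip(primary, advanced, gate_logits):
        g = sigmoid(z)
        fused.append(g * p + (1.0 - g) * a)
    return fused


# A logit of 0 gives an even blend; a large logit favours the primary stream.
fused = soft_gate_fuse([1.0, 1.0], [0.0, 0.0], [0.0, 10.0])
```

Because the gate is continuous rather than binary, gradients flow to both streams, which is the usual motivation for soft over hard gating.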

Abstract:
Cross-domain recommendation (CDR) aims to address the data-sparsity problem by transferring knowledge across domains. Existing CDR methods generally assume that the user-item interaction data is shareable between domains, which leads to privacy leakage. Recently, some privacy-preserving CDR (PPCDR) models have been proposed to solve this problem. However, they primarily transfer simple representations learned only from user-item interaction histories, overlooking other useful side information, leading to inaccurate user preferences. Additionally, they transfer differentially private user-item interaction matrices or embeddings across domains to protect privacy. However, these methods offer limited privacy protection, as attackers may exploit external information to infer the original data. To address these challenges, we propose a novel Federated User Preference Modeling (FUPM) framework. In FUPM, first, a novel comprehensive preference exploration module is proposed to learn users' comprehensive preferences from both interaction data and additional data including review texts and potentially positive items. Next, a private preference transfer module is designed to first learn differentially private local and global prototypes, and then privately transfer the global prototypes using a federated learning strategy. These prototypes are generalized representations of user groups, making it difficult for attackers to infer individual information. Extensive experiments on four CDR tasks conducted on the Amazon and Douban datasets validate the superiority of FUPM over SOTA baselines.

Abstract:
With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in various applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers mainly focus on attacking naive 3D classification models by perturbing 3D objects. However, since real-world 3D applications generally rely on more complicated scene-based point cloud data, these attack methods are impractical to deploy in realistic scenarios. Therefore, in this paper, we attempt to introduce the adversarial attacks into a more practical yet challenging large-scale scene-based 3D task, i.e., text-guided 3D scene grounding. To make perturbations both effective and imperceptible in scene cases, we investigate the vulnerability of 3D grounding models to backdoor attacks, which implant backdoor triggers into 3D models via data poisoning so as to control the models' predictions at test time. Specifically, we propose a novel Joint Scene-Text Backdoor Attack (JSTBA) method to embed triggers in each of the input modalities and activate the malicious behavior only when both triggers are present. We further design a visual trigger optimization strategy to place the visual trigger appropriately in the 3D scene, aiming to make it natural and imperceptible. Extensive experiments are conducted on seven classic 3D grounding models and three datasets, showing that our JSTBA attack significantly degrades the performance of 3D models on the poisoned data while gaining comparable performance with the benign models on the clean data.

Abstract:
Deep learning-based remote sensing (RS) image segmentation significantly impacts several real application scenarios. Behind its success, massive labeled data plays an important role. However, annotating high-resolution RS images is time-consuming and requires relevant expertise. To address this, many works turn to semi-supervised learning, which utilizes raw information embedded in unlabeled data to improve the segmentation model. Nevertheless, previous studies ignore the integrity and effectiveness of the potential context information hidden in RS data. In this work, we propose an uncertainty-aware masked consistency learning (U-MCL) framework that contains an uncertainty-aware masked denoising (U-MD) module and an uncertainty-aware masked image consistency (U-MIC) module. U-MCL initially generates a patch-wise uncertainty map for each unlabeled image during each training iteration, which is then used to derive an adaptive mask ratio for pseudo-label denoising in U-MD. Simultaneously, the uncertainty map is adopted to model a masked unlabeled image for reasoning about unseen areas in U-MIC. Consequently, U-MCL is capable of enhancing model performance by engaging in accurate and stable consistency learning while preserving the integrity of the context and employing the context to infer the predictions of the masked regions safely. Extensive experiments on six RS datasets, i.e., ISPRS Vaihingen, FloodNet, MiniFrance, LoveDA, MER, and MSL, demonstrate the superiority of our U-MCL over the most advanced recent methods, achieving new state-of-the-art performance on all benchmarks.

Abstract:
Translating readily available visible (VIS) images into thermal infrared (TIR) images effectively alleviates the shortage of TIR data. While current methods have yielded commendable results, they fall short in generating diverse and realistic thermal infrared images, primarily due to insufficient consideration of temperature variations. In this paper, we propose a Thermally Controlled GAN (TC-GAN) that leverages VIS images to generate diverse TIR images, with the ability to control the relative temperatures of multiple objects, particularly those with temperature variations. Firstly, we introduce the physical coding module, which employs a conditional variational autoencoder GAN to learn the distributions of relative temperature information for the objects and environmental state information. Then, the physical information can be obtained by sampling the distribution. When this information is fused with the visible image, it facilitates the generation of diverse TIR images. To ensure authenticity and strengthen the physical constraints across different regions of the image, we introduce a self-attention mechanism in the generator that prioritizes the relative temperature relationships within the image. Additionally, we utilize a local discriminator that focuses on objects with actively changing temperatures and their interactions with the surrounding environment, thereby reducing the discontinuity between the target and the background. Experiments on the Drone Vehicle and AVIID datasets show that our approach outperforms mainstream diversity generation methods in terms of authenticity and diversity.

Affiliations: School of Information and Communications Engineering, Xi'an Jiaotong University, Xi'an, China; Department of Biomedical Engineering, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China; CRRC Qingdao Sifang Rolling Stock Research Institute Company Ltd., Qingdao, China; Zhengzhou Vocational College of Finance And Taxation, Zhengzhou, China; Shaanxi Key Laboratory of Clothing Intelligence and the School of Computer Science, Xi'an Polytechnic University, Xi'an, China; School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China

Abstract:
Retinex theory-based low-light image enhancement methods have received increasing attention and achieved tremendous advancements. However, there still exist two seldom-explored issues: 1) The above methods only formally simulate the Retinex decomposition, resulting in a lack of explicit interpretability. 2) They are usually performed in a single-scale space, leading to suboptimal enhancement results. In this paper, we propose an interpretable Multi-scale Retinex Unfolding Network (MRUNet) for low-light image enhancement, which can tackle both of the aforementioned issues simultaneously. Specifically, we formulate low-light image enhancement as a multi-scale Retinex optimization problem and design an iterative minimization solution to solve it. The optimization solution is further unfolded to construct MRUNet, which is endowed with clear physical significance and multi-scale prior knowledge in favor of image enhancement. However, exploiting multiple proximal mapping networks to extract multi-scale priors from multi-scale inputs increases model size and reduces efficiency. To surmount this issue, we propose a Scale-Aware Proximal mapping Module (SAPM), which efficiently collects multi-scale prior knowledge via a weight-sharing strategy. In SAPM, we tailor a scale-aware transformer to model the specific scale-similarity among different scales. Extensive experiments demonstrate that MRUNet surpasses other Retinex-based low-light image enhancement methods on multiple benchmarks.

Abstract:
Direct pose estimation networks aim to directly regress the 6D poses of target objects in the scene image using a neural network. These direct methods offer efficiency and an optimal optimization target, presenting significant potential for practical applications. However, due to the complex and implicit mappings between input features and target pose parameters, direct methods are challenging to train and prone to overfitting on mappings seen during training, resulting in limited effectiveness and generalization capability on unseen mappings. Existing methods focus primarily on improvements of the network architecture and training strategies, with less attention given to mappings. In this work, we propose a geometric constraints learning approach, which enables networks to explicitly capture and utilize the geometric mappings between inputs and optimization targets for pose estimation. Specifically, we introduce a residual pose transformation formula that preserves pose transformation constraints within both the 2D image plane and the 3D space while decoupling the absolute pose distribution, thereby addressing the pose distribution gap issue. We further design a Geo6D mechanism based on the formula, which enables the network to explicitly utilize geometric constraints for pose estimation by reconstructing the inputs and outputs. We select two different methods as our baseline and extensive experiments show that Geo6D enhances the performance and reduces the dependence on extensive training data, remaining effective even with only 10% of the typical data volume.

Abstract:
How humans understand and recognize the actions of others is a complex neuroscientific problem that involves a combination of cognitive mechanisms and neural networks. Research has shown that humans have brain areas that recognize actions and process top-down attentional information, such as the temporoparietal association area. Also, humans have brain regions dedicated to understanding the minds of others and analyzing their intentions, such as the medial prefrontal cortex of the temporal lobe. Skeleton-based action recognition creates mappings for the complex connections between human skeleton movement patterns and behaviors. Although existing studies have encoded meaningful node relationships and synthesized action representations for classification with good results, few of them have considered incorporating a priori knowledge to aid potential representation learning for better performance. We propose LA-GCN, a graph convolution network assisted by knowledge from large language models (LLMs). First, the LLM knowledge is mapped into a priori global relationship (GPR) topology and a priori category relationship (CPR) topology between nodes. The GPR guides the generation of new “bone” representations, aiming to emphasize essential node information at the data level. The CPR mapping simulates category prior knowledge in human brain regions; it is encoded by the PC-AC module and used to add additional supervision, forcing the model to learn class-distinguishable features. In addition, to improve information transfer efficiency in topology modeling, we propose multi-hop attention graph convolution. It aggregates each node's k-order neighbors simultaneously to speed up model convergence. LA-GCN achieves state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.

Abstract:
Associating driver attention with driving scene across two fields of view is a challenging cross-domain perception problem, which requires comprehensive consideration of cross-view mapping, dynamic driving scene analysis and driver status tracking. Previous methods typically analyze a single view or map attention to the scene through a two-step projection, failing to exploit their implicit connections and establish accurate associations. Moreover, simple fusion modules are inadequate for modeling the complex relationships between the two views, making information integration complicated. To address these issues, we propose EraW-Net, a novel end-to-end framework for scene-associated driver attention estimation by aggregating information from dual views. This method enhances the most discriminative dynamic cues, refines feature representations, and facilitates semantically aligned cross-domain integration through a W-shaped architecture, termed W-Net. Specifically, a Dynamic Adaptive Filter Module (DAF-Module) is proposed to address the challenges of frequently changing driving environments by extracting vital regions. It suppresses the indiscriminately recorded dynamics and highlights crucial ones by innovative joint frequency-spatial analysis, enhancing the model's ability to parse complex dynamics. Additionally, to track driver states during non-fixed facial poses, we propose a Global Context Sharing Module (GCS-Module) to construct refined feature representations by capturing hierarchical features that adapt to various scales of head and eye movements. Finally, W-Net achieves systematic cross-view information integration through its unique two-stage decoding strategy, addressing semantic misalignment in heterogeneous data integration. Experiments demonstrate that the proposed method robustly and accurately estimates scene-associated driver attention on large public datasets.

Abstract:
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks, respectively. More importantly, the performance gains are significantly higher for localization of complex queries.

Abstract:
The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all the previous methods to a large extent.

Abstract:
Object detection from point clouds is a fundamental task for 3D scene understanding and has a wide range of applications in the field of multimedia data processing and analysis, such as autonomous driving and virtual interaction. The IoU evaluates the overlap between two bounding boxes to ensure consistency across network optimization and testing, and it has become a recognized regression loss in the field of 3D object detection. However, there is a kind of error coupling between the IoU and the angle, i.e., the IoU does not decrease as the angle error increases and vice versa. This problem leads to sub-optimal solutions for the neural network model, which severely hampers the improvement of 3D object detection accuracy. In this paper, a novel 4DIoU method is introduced for detecting 3D objects from point clouds, which provides a comprehensive rethinking of IoU computation by integrating angular information as an additional dimension. 4DIoU not only solves the problem of error coupling between the IoU and the angle but also facilitates neural network optimization using angle information. Furthermore, to account for the different impacts of various object shapes on IoU variations, a special 4DIoU called TV4DIoU is proposed to fuse shape information based on three orthogonal projection views, which can adaptively learn the information of objects with different shapes. In addition, to enhance the generalization of the 4DIoU method, a high-flexibility anchor encoding method and a cyclic consistent computation formula for angular errors are designed to make 4DIoU a plug-and-play module for both anchor-based and anchor-free frameworks. Extensive evaluations conducted on the nuScenes, Waymo, and KITTI datasets have confirmed the effectiveness of the proposed method.
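
The idea of an angular error that stays consistent across the wrap-around point can be illustrated with a minimal sketch. This is not the formula from the paper, which the abstract does not give. It assumes that box headings are equivalent modulo π (a box rotated by 180° covers the same footprint); the function name `cyclic_angle_error` and the `period` parameter are hypothetical.

```python
import math

def cyclic_angle_error(pred, target, period=math.pi):
    """Smallest absolute angular difference, treating angles as periodic.

    Assumption: headings that differ by a multiple of `period` are equivalent.
    """
    diff = (pred - target) % period   # wrapped into [0, period)
    return min(diff, period - diff)   # distance to the nearest equivalent angle

# A naive |pred - target| would report about 2.94 rad for these two headings,
# although the boxes are only about 0.2 rad apart once periodicity is considered.
print(cyclic_angle_error(0.1, math.pi - 0.1))
```

A periodic error of this kind avoids penalizing predictions that are numerically far from the target but geometrically close to it.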

Abstract:
Semantic Edge Detection (SED) is crucial for intelligent agents to understand and interact with their environments, as it enables them to locate and recognize semantic boundaries. The prevailing framework in the field of SED is multi-label learning, which identifies edges and their semantics by learning to assign multiple labels that indicate the categories of the objects forming the edges. However, this framework has demonstrated limited performance when dealing with complex scenarios. In this paper, we propose a mask classification framework specifically tailored for the SED task, termed EdgeMaskFormer. Within this framework, we develop a query-based edge semantic extractor to learn semantic embeddings for edge mask classification with assistance from regional semantic supervision. Additionally, we design a context-aware hierarchical edge extractor to serve as an edge mask head, which can capture multi-scale edges of different categories under the guidance of the semantic embeddings via dynamic convolution. Furthermore, we develop matching and supervision mechanisms specifically for edge mask classification in order to reduce edge noise and address the imbalance between edge and non-edge samples. Our extensive experiments on three public datasets demonstrate that the proposed approach achieves outstanding performance in semantic edge detection, particularly on those datasets with complex scenarios.

Abstract:
Recent few-shot action recognition (FSAR) methods typically perform semantic matching on learned discriminative features to achieve promising performance. However, most FSAR methods focus on single-scale (e.g., frame-level or segment-level) feature alignment, which ignores that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos with different velocity scales and then merge all similarity scores in a residual fashion. To prevent the multiple velocity features from deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains at different velocities. The above two modules complement each other to make more accurate query sample predictions under the few-shot settings. Experimental results show our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, SSv2-full, and SSv2-small).

Abstract:
Navigating in continuous environments with vision-language cues presents critical challenges, particularly in the accuracy of waypoint prediction and the quality of navigation decision-making. Traditional methods, which predominantly rely on spatial data from depth images or straightforward RGB-depth integrations, frequently encounter difficulties in environments where waypoints share similar spatial characteristics, leading to erroneous navigational outcomes. Additionally, the capacity for effective navigation decisions is often hindered by the inadequacies of traditional topological maps and the issue of uneven data sampling. In response, this paper introduces a robust memory-observation synergistic vision-language navigation framework to substantially enhance the navigation capabilities of agents operating in continuous environments. We present an advanced observation-driven waypoint predictor that effectively utilizes spatial data and integrates aligned visual and textual cues to significantly improve the accuracy of waypoint predictions within complex real-world scenarios. Additionally, we develop a strategic memory-observation planning approach that leverages memory panoramic environmental data and detailed current observation information, enabling more informed and precise navigation decisions. Our framework sets new performance benchmarks on the VLN-CE dataset, achieving a 60.25% Success Rate (SR) and a 50.89% Success weighted by Path Length (SPL) on the R2R-CE dataset’s unseen validation splits. Furthermore, when adapted to a discrete environment, our model also shows exceptional performance on the R2R dataset, achieving a 74% SR and a 64% SPL on the unseen validation split.
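
The SR and SPL figures quoted above are standard navigation metrics. As a minimal sketch, assuming the usual definition of SPL (each successful episode is weighted by the ratio of the shortest-path length to the length actually travelled), the two metrics can be computed as follows. The function names and the toy episode values are illustrative only and are not taken from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (1) rather than failure (0)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    Each episode contributes s * l / max(p, l), where s is the success flag,
    l the shortest-path length, and p the length of the path actually taken.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Three toy episodes: an optimal success, a success with a detour, a failure.
successes = [1, 1, 0]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 5.0]
print(success_rate(successes), spl(successes, shortest, taken))
```

Because detours reduce the per-episode weight, SPL can never exceed SR, which is consistent with the pairs of values reported in the abstract.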

Abstract:
The goal of Image-to-Music Generation is to create pure music according to the given image. Unlike existing tasks such as text-to-image generation, there is no explicit connection between image content and musical melody. Some existing studies attempt to generate music by directly mapping image features (such as color and edges) into musical notes, which may result in melodic incoherence. Inspired by neuroscience, it is desirable to employ emotion to bridge these two modalities. However, the continuity and complexity of emotions make it difficult to capture the cross-modal correlation. Drawing from human perception mechanisms of emotions, a Progressive Image-to-Music Generation (PIMG) framework is proposed. The framework designs a mean-teacher based association network to guide the music generation process progressively, starting from highly correlated image-music pairs. The generation network receives more challenging sample pairs gradually, eventually capturing complex cross-modal emotional correspondences. Additionally, a contrastive learning strategy is introduced into the diffusion models to better capture the consistency between pieces of music with similar emotions. Extensive experimental results demonstrate that the proposed framework is able to generate high-quality and emotionally consistent music from images.

Abstract:
Nowadays, virtual reality technology is advancing rapidly and becoming increasingly mature. Omnidirectional images have become integrated into the daily lives of many individuals. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique characteristics of deformation and distortion in omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network through cross-projection knowledge distillation. To implement this method, a Dual Projection Format Fusion (DPFF) module is designed to complement and fuse the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, and have utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.

Abstract:
In low-light environments, human vision is severely limited by weak light sources, leading to significantly reduced visual capabilities. Similarly, in machine vision, low-light recognition tasks such as nighttime autonomous driving and surveillance tasks involving the detection of small faces in low-light conditions are more challenging than tasks in normal lighting. Current low-light face detection models lack adaptability to different low-light conditions, and the accuracy of face detection remains unsatisfactory. In this paper, we propose a novel face detection framework, DSLL-Face, specifically designed to tackle the challenges of face detection in low-light environments. Our proposed DarkHead features a specialized branch designed to predict the distribution of bounding boxes, thereby substantially enhancing the supervision of bounding box localization. This approach effectively resolves the issue of blurry bounding boxes and significantly increases the accuracy of predicted positions. We employ a novel loss function tailored for detecting small faces, enhancing the sensitivity and effectively addressing the blurriness issues in small face detection. Furthermore, we leverage the Channel Grouping and Partial Convolution block (CGP) to enhance multi-scale expression capabilities. We develop the EMNet-pro model with the aim of further enhancing images to improve their adaptability under various low-light conditions. Extensive experiments demonstrate that our model exhibits outstanding capability in low-light face detection on the DARK FACE dataset and achieves significantly better performance compared to existing state-of-the-art frameworks.

Abstract:
Source-free domain adaptive object detection (SFOD) aims to transfer models pre-trained on the source domain to the unlabeled target domain without requiring access to the source data. Most existing SFOD methods leverage pseudo-labels for self-supervised training in the target domain. We investigate the limitations of threshold techniques to obtain high-quality pseudo-labels. In response, we design the Sequential Source-Free domain adaptive Object Detection (S-SFOD) algorithm, which enhances the quality of pseudo-labels at both the image and instance levels. At the image level, we reconstruct the training dataset, prioritizing the training of images that yield more reliable pseudo-labels to help the model acquire valuable target domain knowledge in the initial training stages. At the instance level, we introduce an adaptive local-global threshold method to balance the quality and quantity of pseudo-labels by dynamically adjusting the thresholds based on the model’s learning progress. By improving the quality of pseudo-labels through these complementary techniques at both the image and instance levels, we effectively transfer knowledge from the source domain to the target domain. Extensive experiments on multiple cross-domain object detection datasets demonstrate that our proposed method outperforms current state-of-the-art SFOD algorithms. The code and model will be released.

Affiliations: Fujian Key Laboratory of Pattern Recognition and Image Understanding, School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, China; Research Center for Frontier Fundamental Studies, Zhejiang Lab, Hangzhou, China; School of Electronics, Electrical Engineering and Computer Science, Institute of Electronics Communications and Information Technology, Queen’s University Belfast, Belfast, U.K.; School of Informatics, Xiamen University, Xiamen, China

Abstract:
Visible-infrared person re-identification (VI-ReID) aims to query the same pedestrian’s visible (infrared) images in the gallery set from the infrared (visible) images. VI-ReID not only needs to deal with challenging factors such as pose variation and occlusion, but also requires handling the large modality discrepancy. Previous methods mainly focus on learning single-scale modality-shared features and do not effectively explore the multi-scale features of two modalities from both short-range and long-range perspectives. In order to solve these problems, this paper proposes a novel Hierarchical Token-Aware Cross-Modality Reconstruction (HTCR) network to significantly mitigate the modality discrepancy for effective VI-ReID. The HTCR network consists of two main components, i.e., Hierarchical Token-aware Fusion (HTF) and Cross-modality Feature Reconstruction (CFR). The HTF module first bidirectionally exchanges the short-range and long-range multi-scale modality-shared features with a few learnable tokens to achieve discriminative pedestrian features by making full use of the advantages of both Convolutional Neural Networks (CNNs) and Transformers. Moreover, the CFR module reconstructs global and local pedestrian features of one modality by using the token sequence of the other modality with multi-scale cues to further explore the relationship between the two distinct modalities and alleviate the modality discrepancy. In addition, the Modality-shared feature Reconstruction (MR) loss is leveraged to reduce the noise between the reconstructed and the target features. Experimental results indicate that the proposed HTCR can significantly improve the VI-ReID performance and outperform the state-of-the-art methods on the cross-modality SYSU-MM01, RegDB, and LLCM datasets.

Affiliations: School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-Inspired Intelligence Technology, Tianjin University, Tianjin, China; School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University, Tianjin, China; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN) and the Key Laboratory of Intelligent Interaction and Applications, Ministry of Industry and Information Technology, Northwestern Polytechnical University, Xi’an, China

Abstract:
Multi-Query Image Retrieval (MQIR) aims to establish connections between vision and language by exploring fine-grained region-query alignments. It is still a challenging task owing to its intrinsic ambiguity, where a query matches multiple semantically similar regions and introduces misleading noise. Although researchers have made great efforts to alleviate the ambiguity in many retrieval-related tasks, few attempts have considered this bottleneck in MQIR, which greatly limits present performance. To this end, we propose a novel Visual Semantic Contextualization Network (VSCN) to mitigate ambiguity by capturing the contextual knowledge within each image-text pair. Specifically, we first develop a Context Semantic Perception (CSP) module to capture the dual-level context, where a visual context transformer explores the intra-context within regions, and a cross-modal context transformer mines the inter-context among concatenated visual-linguistic embeddings. Then, to yield superior contextual understanding, we strengthen the connotations in context via a Context Semantic Interaction (CSI) module. In particular, knowledge distillation is first employed to transfer the CLIP-guided semantics into the regional intra-context to complement the potential background information. Then, the interaction between intra-context and inter-context is conducted via the self-attention mechanism to link the dual-level context and obtain the interacted contextual knowledge. Our method is evaluated on the Visual Genome dataset and substantially outperforms the state-of-the-art methods (a 30.3% improvement in Recall@1 in the first round).

Abstract:
With growing privacy and portability concerns, source-free domain adaptation requires only a source pre-trained model and an unlabeled target domain, allowing for effective adaptation to the target data. Most existing self-training methods focus on selecting and exploiting samples with reliable predictions, often neglecting others. Inspired by the finding that deep models learn clean samples faster than noisy ones, we propose a domain-division based progressive learning method named DPL. Specifically, our approach consists of two alternating stages, each beginning with the division of the target domain into easy-to-adapt and hard-to-adapt subdomains based on adaptation difficulty, followed by neighborhood-based pseudo label assignment. In stage one, we enhance classification accuracy through uncertainty-aware self-training and alignment of corresponding classes between subdomains. Stage two then applies tailored learning strategies to each subdomain, starting with consistency learning on the easy-to-adapt samples and progressing to utilizing local structural information for the more challenging ones, thereby mining the intrinsic properties of the target data. Extensive experiments on several widely used benchmarks validate the effectiveness of our approach, demonstrating superior performance compared to state-of-the-art methods.

Abstract:
Diffusion models have demonstrated remarkable capabilities for text-to-video (T2V) editing tasks, relying on fine-tuning of pretrained text-to-image (T2I) diffusion models with only one video-prompt pair. However, conventional fine-tuning approaches require tuning and storing numerous parameters for each video, leading to substantial parameter and memory costs. To mitigate these issues, we propose Truncate Diffusion, an efficient fine-tuning method for video editing that optimizes both parameter and memory usage. Specifically, we propose the Truncate Diffusion module, which is designed with a focus on module architecture and initialization, specifically targeting optimization with a small training set, such as a single video-prompt pair. Theoretical analysis using the Johnson-Lindenstrauss lemma and the Eckart-Young-Mirsky theorem shows that Truncate Diffusion can achieve a minimal Frobenius norm distance to the original attention algorithm with appropriate initialization, which enhances the ease of optimization and improves video editing performance. During fine-tuning, the Truncate Diffusion module integrates seamlessly with the original diffusion model. We freeze the weights of the denoising network within the original pretrained diffusion model, updating only the introduced low-rank alternatives to ensure parameter and memory efficiency. Additionally, we propose Latent Flow Loss and Bidirectional Inter-Frame Attention (BIFA) to improve temporal consistency in synthesized videos. The Latent Flow Loss leverages global temporal information from the input video during training, while BIFA utilizes local temporal information from adjacent frames during inference. These enhancements do not incur additional memory or parameter costs during fine-tuning. Comparisons with state-of-the-art approaches demonstrate that Truncate Diffusion provides superior text alignment, video quality, and inter-frame temporal consistency in video editing. Importantly, Truncate Diffusion requires fine-tuning only 3.2% of the parameters and uses just 62% of the memory compared to the baseline model.
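
The Eckart-Young-Mirsky theorem mentioned in this abstract states that truncating the singular value decomposition gives the closest low-rank matrix in the Frobenius norm. The sketch below illustrates only that theorem using NumPy. It is not the Truncate Diffusion module, whose architecture the abstract does not specify; the random matrix `weight` is a stand-in for a pretrained projection matrix, and the rank of 2 is an arbitrary choice.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Return the best rank-`rank` approximation (Frobenius norm) and all singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return approx, s

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 6))  # hypothetical stand-in for a pretrained weight
approx, s = truncated_svd(weight, rank=2)

# Eckart-Young-Mirsky: the approximation error equals the norm of the
# discarded singular values, and no other rank-2 matrix can do better.
error = np.linalg.norm(weight - approx, ord="fro")
tail = np.sqrt(np.sum(s[2:] ** 2))
print(error, tail)
```

Initializing low-rank factors from such a truncation keeps the starting point as close as possible to the original weights, which is one plausible reading of the "appropriate initialization" the abstract refers to.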

Abstract:
The goal of the fusion process in RGB-D reconstruction systems is to verify and update the 3D model while ensuring both completeness and accuracy. However, achieving precise dense correspondences in a point-to-point or pixel model during this process is challenging and computationally intensive. To address this challenge, we propose a Manifold Embedding framework that facilitates rapid point-to-surface fusion, removing the need for direct point-to-point or pixel correspondences. Our approach consists of three main steps: 1) Manifold Voxel: We transform discrete point sets into smooth surfaces using the Implicit Moving Least Squares (IMLS) method; 2) Two-Step Filtering: We enhance reconstruction accuracy through a two-step filtering technique that evaluates sampling points based on probabilistic measures; 3) Embedding for Smooth Surface: Lastly, we embed points into a smooth manifold surface represented via IMLS, ensuring high-quality reconstructed surfaces. Extensive experiments on both real and synthetic 3D scenes demonstrate the effectiveness of our Manifold Embedding framework. For instance, on the public Replica dataset, our method surpasses state-of-the-art fusion techniques regarding both completeness and accuracy. Our average accuracy is 2.11 cm and completeness is 2.80 cm, while NICE-SLAM achieves 2.85 cm and 3.00 cm, respectively (with lower values indicating better performance). Overall, our proposed method provides superior reconstruction quality and enhanced computational efficiency (See Fig. 1).
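
For readers unfamiliar with IMLS, the following minimal sketch evaluates the commonly used IMLS implicit function: a Gaussian-weighted average of signed point-to-plane distances, whose zero level set defines the smooth surface. It is a generic illustration, not the fusion pipeline of the paper. The function name `imls_value`, the bandwidth `h`, and the toy planar point set are assumptions made for this example.

```python
import math

def imls_value(x, points, normals, h=0.5):
    """Evaluate the IMLS implicit function at query position `x`.

    f(x) = sum_i w_i * n_i . (x - p_i) / sum_i w_i,
    with Gaussian weights w_i = exp(-|x - p_i|^2 / h^2).
    The surface is the set of positions where f(x) = 0.
    """
    num = 0.0
    den = 0.0
    for p, n in zip(points, normals):
        d = [xi - pi for xi, pi in zip(x, p)]
        w = math.exp(-sum(di * di for di in d) / (h * h))
        num += w * sum(ni * di for ni, di in zip(n, d))
        den += w
    return num / den  # assumes x is close enough to the samples that den > 0

# Oriented samples on the plane z = 0, all with normal (0, 0, 1).
points = [(i * 0.25, j * 0.25, 0.0) for i in range(-2, 3) for j in range(-2, 3)]
normals = [(0.0, 0.0, 1.0)] * len(points)

# For a planar patch the function reduces to the signed height above the plane.
print(imls_value((0.0, 0.0, 0.1), points, normals))
```

A point can then be tested against, or moved onto, the surface by evaluating this function and stepping along its gradient, without establishing point-to-point correspondences.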

Abstract:
Image mosaic is a prevalent technique to conceal critical content in images. However, images processed with conventional mosaic techniques cannot be recovered using a small-sized key, as they require retransmission of the original images for perfect recovery. In this work, we propose a novel, computationally efficient, and effective recoverable image-mosaic technique. A key advantage of our proposed image-mosaic scheme is its robust performance across a range of adjustable key lengths. Our technique effectively conceals original information even with a small-sized key of only a few bits. To evaluate its performance, we introduce a new image-similarity metric based on the magnitude of the discrete cosine transform (DCT). This metric exhibits several advantageous mathematical properties, including the ability to quantify the perceptibility of major content in mosaicked images, invariance under image reflections and 180-degree rotations, and insensitivity to small translations. Finally, numerical experiments demonstrate that our method outperforms existing recoverable image-mosaic techniques and performs consistently across varying key lengths. We also compare the run-times required by our proposed scheme with those required by other existing recoverable image-mosaic methods and state-of-the-art image-encryption methods to demonstrate the computational efficiency of our proposed scheme.
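
The invariance property claimed for the DCT magnitude can be checked numerically. Reversing a signal multiplies its k-th DCT-II coefficient by (-1)^k, so the coefficient magnitudes are unchanged by reflections and, in two dimensions, by 180-degree rotations. The sketch below verifies this on a small array using an unnormalized DCT-II written in plain Python. It demonstrates only this property and is not the similarity metric proposed in the paper; all function names are illustrative.

```python
import math

def dct2_1d(x):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def dct2_2d(img):
    """Separable 2D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(r) for r in img]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order

def dct_magnitude(img):
    """Element-wise absolute value of the 2D DCT coefficients."""
    return [[abs(v) for v in row] for row in dct2_2d(img)]

def same_magnitude(a, b, tol=1e-9):
    """True if two images have identical DCT magnitudes up to `tol`."""
    ma, mb = dct_magnitude(a), dct_magnitude(b)
    return all(abs(x - y) < tol for ra, rb in zip(ma, mb) for x, y in zip(ra, rb))

img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
mirrored = [row[::-1] for row in img]          # horizontal reflection
flipped = img[::-1]                            # vertical reflection
rotated = [row[::-1] for row in img[::-1]]     # 180-degree rotation
print(same_magnitude(img, mirrored), same_magnitude(img, flipped),
      same_magnitude(img, rotated))
```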

Abstract:
Visual Question Answering (VQA) is a prevalent task that can facilitate the perception of the real world by the visually impaired. However, many VQA models tend to rely on superficial correlations in datasets for predictions rather than genuine reasoning, limiting their real-world applicability. While existing methods address this issue by incorporating debiasing strategies during training, they typically assume prior knowledge of out-of-distribution (OOD) test sets and then tailor debiasing strategies and select optimal models on the basis of the OOD samples. This reliance on OOD test data, however, is unrealistic in practical applications. To address this, some works introduce test-time adaptation techniques to mitigate dataset shifts during model deployment. Despite their potential, these methods risk catastrophic forgetting as they update models at test time without access to the ground-truth answers or the source data. An emerging solution involves leveraging the extensive knowledge embedded in Large Language Models (LLMs) to support reasoning tasks, yet their language-only input restricts flexibility in multimodal tasks. To bridge this gap, we propose leveraging the zero-shot capability of Multimodal Large Language Models (MLLMs). To optimise computational efficiency, we introduce a novel VQA Collaborative Inference framework (VQA-CI) that integrates MLLMs (e.g., BLIP-2 Flan T5) with VQA specialists (e.g., UpDn). This framework initially processes samples through VQA specialists and subsequently determines the necessity for re-evaluation with MLLMs based on predefined bias and reliability indicators. Experiments on the GQA-OOD and VQA-CP v2 datasets show that our VQA-CI achieves significant performance gains, with accuracy improvements of around 6% over state-of-the-art methods, underscoring the effectiveness of our VQA-CI.

Abstract:
Multi-person motion prediction is an emerging and intricate task with broad real-world applications. Unlike single-person motion prediction, it considers not just the skeleton structures or human trajectories but also the interactions among individuals. Previous methods achieve impressive predictions using various networks but often overlook the distinct representations of joint relations within individuals (intra-relations) and interactions among groups (inter-relations), inevitably leading to undesired dependencies. To address this issue, we introduce a new collaborative framework for multi-person motion prediction that explicitly models these relations: a GCN-based network for intra-relations and a novel reasoning network for inter-relations. Specifically, we propose a distance-aware cross-attention that incorporates physical distance constraints into inter-relation learning through a learnable distance weighting coefficient. Moreover, we propose a novel plug-and-play aggregation module called the Interaction Aggregation Module (IAM), which employs an aggregate-attention mechanism to seamlessly integrate these relations. Experiments indicate that the module can also be applied to other dual-path models. Extensive experiments on 3DPW, 3DPW-RC, CMU-Mocap, and MuPoTS-3D, as well as the synthesized datasets Mix1 and Mix2 (9 to 15 persons), demonstrate that our method achieves state-of-the-art performance.
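
One way to picture a distance-aware cross-attention is to subtract a distance penalty from the attention logits before the softmax, so that physically closer people receive more weight. The sketch below takes that approach as an assumption; the abstract does not state the actual formulation. The coefficient `alpha` is a fixed float here, whereas in the paper it is described as learnable, and the function name and toy inputs are hypothetical.

```python
import math

def distance_aware_attention(query, keys, values, distances, alpha=1.0):
    """Scaled dot-product attention with logits penalized by physical distance.

    Assumed form: logit_j = (q . k_j) / sqrt(d) - alpha * distance_j.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale - alpha * dist
              for key, dist in zip(keys, distances)]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Two other people with identical key features but different distances (meters).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
output, weights = distance_aware_attention(query, keys, values, [0.5, 3.0])
print(weights)
```

With `alpha=0` the penalty vanishes and the mechanism falls back to ordinary attention, which makes the coefficient a natural quantity to learn from data.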

Abstract:
Long-term action quality assessment poses a challenging visual task since it requires assessing technical actions at different skill levels in a long video. Recent state-of-the-art methods incorporate additional modality information to aid in understanding action semantics, which incurs extra annotation costs and imposes higher constraints on action scenes and datasets. To address this issue, we propose a Quality-Guided Vision-Language Learning (QGVL) method to map visual features into appropriate fine-grained intervals of quality scores. Specifically, we use a set of quality-related textual prompts as quality prototypes to guide the discrimination and aggregation of specific visual actions. To avoid fuzzy rule mapping, we further propose a progressive semantic learning strategy with a Granularity-Adaptive Semantic Learning Module (GSLM) that refines accurate score intervals from coarse to fine at clip, grade, and score levels. The quality-related semantics we designed are universal to all types of action scenarios without any additional annotations. Extensive experiments show that our approach outperforms previous work by a significant margin and establishes a new state of the art on four public AQA benchmarks: Rhythmic Gymnastics, Fis-V, FS1000, and FineFS.

Abstract:
Predicting where people look is crucial for understanding human intentions. Gaze prediction, as a research hotspot, has evolved from predicting the gaze of a single person to simultaneously predicting the positions of all individuals and their corresponding gaze targets. However, the study of the correlation between humans and gaze as two interdependent tasks has largely been neglected. In this paper, inspired by the concept of “mutualistic symbiosis” in ecology, we propose a novel multitask mutualistic transformer (MMTR). MMTR captures paired dependencies by establishing information communication between different branches, thereby enabling comprehensive and interpretable gaze analysis for all individuals and gaze targets. Specifically, we first utilize a transformer encoder to capture the common features of all the tasks. Then, we design a mutualistic attention mechanism (MAM) in the dual-branch transformer decoder to establish cross-task information interaction. The MAM can learn privileged information from other tasks that is helpful for the current task, thereby guiding the current branch to learn the most valuable and distinctive features. To the best of our knowledge, this is the first time that privileged information has been introduced into the gaze estimation task. Furthermore, to more flexibly learn pixel locality and long-range semantic dependencies for different tasks, we construct and embed a learnable global-local position encoding (GLPE) in different branches of MMTR. Experiments demonstrate that our proposed MMTR can guide the two branches to communicate through privileged information, effectively solve the information asymmetry problem between human detection and gaze prediction, and significantly outperform state-of-the-art gaze prediction methods on two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.

Affiliations: College of Mathematics, Sichuan University, Chengdu, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University-Anker Embodied AI Lab, Peking University, Beijing, China; Department of Computer Science, University of Washington, Seattle, WA, USA; School of Computing and Information Technology, Great Bay University, Dongguan, China; Terminus Group, Beijing, China; NITFID, School of Statistics and Data Science, Nankai University, Tianjin, China; College of Computer Science, Sichuan University, Chengdu, China

Abstract:
This paper studies the problem of graph zero-shot learning, which aims at recognizing novel classes of nodes on the graph that are never seen during training. The key to graph zero-shot learning is establishing the mathematical relationship to transfer the prior knowledge of nodes from seen classes to unseen classes. However, the problem is largely under-explored and existing methods typically focus on acquiring supervision signals from seen classes or simply establishing connections between classes based solely on a semantic description matrix, such that the learned representations lack generalizable properties to unseen classes. To address this issue, this paper proposes GraphGCR that learns generalizable contrastive representations from the perspective of uniformity and alignment. Technically, GraphGCR leverages graph diffusion to extend supervised contrastive learning, encouraging the representations of semantics from different classes to be distributed uniformly and meanwhile achieve the alignment of node features and class semantics with the assistance of graph structural information. Moreover, to effectively enhance model generalizability, we further develop a class generator to synthesize features of unseen classes by embedding propagation and interpolation, thereby enriching the diversity of classes. Theoretical analysis also shows that our proposed framework exhibits strong discriminative property, which significantly enhances graph zero-shot learning. Experimental findings reveal that our GraphGCR achieves significant performance improvements over state-of-the-art methods across various benchmark datasets.

Abstract:
Open vocabulary 3D instance segmentation aims to align 3D instance segmentation results with natural language text, thereby achieving semantic prediction without relying on predefined class labels for specific scenes, and has been widely used in the field of multimedia. Current open vocabulary 3D instance segmentation methods mainly rely on 2D masks provided by various 2D segmentation foundation models. However, in complex scenes, the computation of 2D masks often struggles to balance over-segmentation of large objects against under-segmentation of small objects. In this paper, we introduce OV-BIS, a novel zero-shot open vocabulary 3D instance segmentation method that leverages instance boundary information to improve 3D semantic segmentation performance. The key insight of our method is that the edge map, as a projection of 3D boundaries, is suitable for multi-scale tasks and can compensate for the weakness of 2D masks in multi-scale adaptability for complex scenes. Our method aggregates multiview edge maps and 2D masks, iteratively guiding the merging of over-segmented point clouds through region growing to cluster 3D primitives into distinct 3D instances. By projecting 3D instances onto images and using CLIP to compute semantic features from multiple perspectives with an outlier filter, 3D semantic instance segmentation is achieved. Experiments on multiple datasets demonstrate the superiority of our method.

Affiliations: School of Information Science and Engineering, Shandong Normal University, Jinan, China; School of Computer Science and Technology, Tongji University, Shanghai, China; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, NSW, Australia; Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia

Abstract:
Conventional domain adaptation (DA) for person re-identification (ReID) aims to bridge the domain gap but often requires direct use of fully labeled source and target domains, raising significant data privacy concerns due to the inclusion of personal identity information (PII) in raw data. Source-free domain adaptation (SFDA) for person ReID effectively preserves PII within the authorized source model. Nevertheless, these methods are vulnerable to data privacy (e.g., portrait rights) of the target domain during retrieval, where attackers can exploit pedestrian images for malicious generation, leading to damage to an individual’s reputation. Beyond these limitations, we propose a novel framework called SecureDA to address privacy-preserving SFDA for person ReID, which can generate a privacy key to defend against potential attacks on PII. Technically, we introduce domain-specific adversarial attacks into DA, where the protected query and gallery images are encrypted to ensure secure image retrieval. Furthermore, we employ two simultaneous processes: 1) The global–local adversarial pathway (GLAP) leverages encrypted and original images as adversarial pairs, thereby fostering the development of robust ReID models; 2) The global–local collaborative pathway (GLCP) is mastered through positive pairs collected from the same domain, effectively mitigating the pernicious catastrophic forgetting phenomenon. Extensive experiments show that SecureDA achieves state-of-the-art performance on multiple DA benchmarks and even outperforms the conventional DA and SFDA methods, which inherently compromise data privacy.

Abstract:
Video frame interpolation technology improves visual experience with the development of deep learning. However, capturing large motions while synthesizing fine texture details remains a challenging task. Regarding large motion scenarios, some pioneering Transformer-based methods primarily rely on local attention, which does not fully leverage the global receptive field advantage. To address this issue, this paper proposes to further broaden the receptive field of the Transformer to capture more correlations in the video frame interpolation task. Specifically, we propose a global self-attention mechanism in the form of spatial-temporal separation. Regarding texture details, since roughly enlarging the receptive field results in the loss of details, we propose to use large motion information in both feature and pixel spaces as a dual-guided prior to enhance detail synthesis. The separable attention mechanism and the straightforward frame synthesis design significantly enhance the resource efficiency of our model. Extensive experiments show that our method achieves state-of-the-art performance, effectively capturing large motions and preserving texture details.

Abstract:
Graph Contrastive Learning (GCL) has emerged as the foremost approach for self-supervised learning on graph-structured data. GCL reduces reliance on labeled data by learning robust representations from various augmented views. However, existing GCL methods typically depend on consistent stochastic augmentations, which overlook their impact on the intrinsic structure of the spectral domain, thereby limiting the model’s ability to generalize effectively. To address these limitations, we propose a novel paradigm called AS-GCL that incorporates asymmetric spectral augmentation for graph contrastive learning. A typical GCL framework consists of three key components: graph data augmentation, view encoding, and contrastive loss. Our method introduces significant enhancements to each of these components. Specifically, for data augmentation, we apply spectral-based augmentation to minimize spectral variations, strengthen structural invariance, and reduce noise. With respect to encoding, we employ parameter-sharing encoders with distinct diffusion operators to generate diverse, noise-resistant graph views. For contrastive loss, we introduce an upper-bound loss function that promotes generalization by maintaining a balanced distribution of intra- and inter-class distance. To our knowledge, we are the first to encode augmentation views of the spectral domain using asymmetric encoders. Extensive experiments on eight benchmark datasets across various node-level tasks demonstrate the advantages of the proposed method.

Abstract:
As a video understanding task, activity parsing aims at organizing actions into multiple levels of activity components, including activity, sub-activity and atomic action, enabling understanding of complex video scenes within multimedia systems. Existing methods formulate activity parsing as a multi-task learning problem that predicts multi-granular activity labels simultaneously, which ignores the hierarchical structure and the fine-grained transitions of activity components at different levels. In this paper, we propose a Hierarchical Adaptive Reasoning Graph (HARG) to model the hierarchical structure (i.e., object level → atomic action level → activity level) dynamically and precisely. To achieve that, an object reasoning graph (ORG) and an atomic action reasoning graph (ARG) are designed to reason about fine-grained information transitions between multiple actors at different levels. In addition, an adaptive segmentation module (ASM) is introduced to bridge the gap among different levels, permitting step-by-step reasoning from the object level to the atomic action level. Experimental results show our method outperforms state-of-the-art methods on two activity parsing datasets, achieving hierarchical modeling and fine-grained reasoning for activity understanding.

Abstract:
Currently, fine-grained skeleton action recognition based on graph convolutional networks (GCNs) has become an important research focus. Fine-grained action recognition refers to the accurate recognition of subtle, complex or detailed actions. This task is particularly challenging due to the limited appearance information in skeleton data and the limitations of predefined single-topology skeleton structures. To address these challenges, we propose an action-responsive contrastive network (ARCN). The network consists of two main components: an action-responsive graph convolutional network (ARGCN) with enhanced skeleton topology and a fine-grained action comparator (FAC) that uses feature contrastive learning to explore the latent space of motion features. The ARGCN contains two specialized modules: the action-responsive topology (ART) module, which captures important motion features through the learned action-specific topology structure matrix and multiscale temporal features; and the action-responsive attention (ARA) module, which learns complex spatiotemporal skeleton attention information. These modules jointly generate a multichannel cross-temporal dynamic skeleton joint attention topology map tailored for the specific action being analysed. To further clarify the fine-grained action feature differences, the FAC is integrated in some stages of the ARGCN. The FAC performs spatiotemporal decoupling of feature maps, classifies and contrasts similar and different fine-grained motion features, and builds a learnable latent space for fine-grained motion, thereby improving classification performance. Our model is evaluated on six public datasets: NTU RGB+D, NTU RGB+D 120, NW-UCLA, UAV-Human, Finegym, and Diving48. It achieves 91.2% accuracy on the NTU RGB+D 120 dataset X-Set, 97.2% accuracy on the NW-UCLA dataset, 44.6% accuracy on the UAV-Human dataset CSv1, 72.0% accuracy on the UAV-Human dataset CSv2, 95.3% accuracy on the Finegym dataset, and 54.3% accuracy on the Diving48 dataset, which are competitive results compared with the state-of-the-art methods.

Abstract:
Transformers display impressive capabilities on vision tasks. However, the built-in self-attention carries a computational burden that is quadratic in the spatial resolution of image features. Traditional downsampling (e.g., average pooling) can reduce the resolution, but it may drop detailed information. In this work, we propose an Efficient Wavelet Attention (EWA), which incorporates the wavelet transform and a Mean GELU (MGELU) function. Firstly, the wavelet transform enables the detailed information to participate in the efficient interaction modeling. Secondly, MGELU takes the statistical mean as a reference and loosely passes the responses that are high relative to it. Building upon EWA, we present an effective Semantic-aware Wavelet Transformer (SWFormer), which is then employed for pyramid learning, including the CNN feature hierarchy or Region of Interest (RoI) features. For the feature hierarchy, a Pyramid SWFormer (PSWFormer) incorporates SWFormer at each level to fit the bidirectional features. For RoIs, a Recognition-Localization SWFormer (RLSWFormer) is inserted into the head to fit their features from all levels. The effectiveness of our SWFormer is demonstrated experimentally on the MS COCO detection dataset and the Pascal VOC dataset. With a Swin-small backbone, our SWFormer-based method achieves an AP of 52.1 in the single-scale evaluation on the COCO test-dev set.

Abstract:
How to effectively explore inter-frame information is critical for video denoising. Existing methods often rely on complex architectures, such as optical flow estimation and cross-frame self-attention, which introduce high computational costs and limit their practicality in real-world scenarios. To address this limitation, we propose a simple yet efficient deep Frequency-Separable Temporal Network (FSTN) for video denoising. FSTN utilizes the multi-scale analysis capability of wavelet transform to extract high-frequency and low-frequency information at the feature level, enabling faster processing while maintaining high-quality reconstruction. To further reduce computational complexity and enhance detail preservation, we develop a learnable high-frequency processing module that adaptively filters noise and recovers edge details. Additionally, to effectively utilize information from long-range frames, we propose a low-frequency propagation method equipped with a temporal feature alignment module. This method enables the efficient transfer of structural information from distant frames, ensuring temporal consistency and enhancing denoising performance. Extensive experiments demonstrate that our method has 1.28× fewer network parameters than state-of-the-art efficient video denoising methods, such as BasicVSR++, and requires less computational cost while achieving comparable performance.
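As a rough illustration of the frequency separation described above, the following minimal sketch splits a 1-D signal into low-frequency and high-frequency bands with a single-level Haar transform and then reconstructs it. The choice of the Haar wavelet, the 1-D setting, and the function names `haar_split` and `haar_merge` are assumptions made for illustration only; the abstract does not specify which wavelet FSTN uses, and FSTN operates on learned features rather than raw signals.

```python
import math

def haar_split(signal):
    """Single-level 1-D Haar transform.

    Returns (low, high): pairwise scaled sums (coarse structure) and
    pairwise scaled differences (fine detail), each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split: interleave the reconstructed sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

if __name__ == "__main__":
    x = [4.0, 4.0, 2.0, 6.0]
    low, high = haar_split(x)
    # A flat pair (4, 4) carries no high-frequency energy.
    print(low, high, haar_merge(low, high))
```

Because the transform is invertible, the two bands can be processed separately (for example, filtering noise in the high band) without losing the ability to reconstruct the input.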

Abstract:
Contemporary displays enable video content rendering with high dynamic range (HDR) and wide color gamut (WCG). However, the majority of existing content remains in standard dynamic range (SDR) format. Therefore, the conversion of SDR content to HDRTV standards holds significant value. This paper delineates and analyzes the SDRTV-to-HDRTV conversion by modeling the formation of SDRTV/HDRTV content. The findings reveal that a naive end-to-end supervised training pipeline suffers from severe gamut transition errors. To address this, we propose a new three-step solution called HDRTVNet++, which includes adaptive global color mapping, local enhancement, and highlight refinement. The adaptive global color mapping step utilizes global statistics for image-adaptive color adjustments, followed by a local enhancement network for detail improvement. These two components are integrated as a generator, with GAN-based joint training ensuring highlight consistency. Our method, tailored for ultra-high-definition TV content, offers both effectiveness and computational efficiency in processing 4K resolution images. We also construct HDRTV1K, a dataset comprising HDR videos adhering to the HDR10 standard, featuring 1235 training and 117 testing images at 4K resolution. Furthermore, we employ five metrics to assess SDRTV-to-HDRTV performance. Our results demonstrate state-of-the-art performance both quantitatively and visually.

Abstract:
Video anomaly detection (VAD) is an important intelligent system application, but most current research views it as a coarse binary classification task that lacks a fine-grained understanding of abnormal video sequences. We explore a new task for video anomaly analysis called Comprehensive Video Anomaly Caption (CVAC), which aims to generate comprehensive textual captions (containing scene information such as time, location, anomalous subject, anomalous behavior, etc.) for surveillance videos. CVAC is more consistent with human understanding than VAD, but it has not been well explored. We constructed a large-scale benchmark CVACBench to lead this research. For each video clip, we provide 6 fine-grained annotations, including scene information and abnormal keywords. A new evaluation metric Abnormal-F1 (A-F1) is also proposed to more accurately evaluate the caption generation performance of the model. We also designed a method called Anomaly-Led Generating Prompting Transformer (AGPFormer) as a baseline. In AGPFormer, we introduce an anomaly-led language modeling mechanism (Anomaly-Led MLM, AMLM) to focus on anomalous events in videos. To achieve more efficient cross-modal semantic understanding, we design the Interactive Generating Prompting (IGP) module and Scene Alignment Prompting (SAP) module to explore the divide between video and text modalities from multiple perspectives, and to improve the model’s performance in understanding and reasoning about the complex semantics of videos. We conducted experiments on CVACBench by using traditional caption metrics and the proposed metrics, and the experimental results demonstrate the effectiveness of AGPFormer in the field of anomaly caption.

Abstract:
Taking screenshots, a common practice in office work, has become a significant threat to organizations such as companies and research institutions. Malicious users can easily leak sensitive information such as business secrets and research data by taking a screenshot and spreading it on the Internet. While existing watermarking schemes serve as useful tools for leakage tracing, they fall short in the scenario of arbitrary screenshots. Most current methods are file-targeted, focusing on embedding a watermark into a single file of one type at a time, which makes it hard to handle arbitrary content on screen. To address these issues and better meet the needs of this scenario, we propose ScreenGuard, a novel watermarking scheme that targets the screen itself to protect arbitrary content shown on it. Unlike previous watermarking schemes, ScreenGuard does not modify the content itself. Instead, we generate a transparent mask template based on the watermark, tile it to the size of the screen to form a complete transparent mask, and overlay this mask onto the screen. This ensures that any screenshots taken will contain our watermark. We then train a locator and a decoder to extract watermarks from suspected leaked screenshots to trace leaks to their source. We summarize five properties that need to be satisfied in the scenario of arbitrary screenshots (Generalizable, Unseeable, Adaptable, Robust, Dynamic) and evaluate our method on these criteria. Extensive experiments demonstrate that ScreenGuard meets these five properties effectively, showcasing its superiority and broad practical applications.
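The tile-and-overlay step can be sketched as follows. This is only a toy illustration on grayscale pixel grids: the function names `tile_mask` and `overlay`, the constant opacity `alpha`, and the use of plain alpha blending are assumptions, since the abstract does not say how ScreenGuard generates its template or composites the mask.

```python
def tile_mask(template, height, width):
    """Repeat a small 2-D template until it covers a height x width screen."""
    th, tw = len(template), len(template[0])
    return [[template[y % th][x % tw] for x in range(width)]
            for y in range(height)]

def overlay(screen, mask, alpha=0.02):
    """Alpha-blend the mask over the screen (assumed compositing rule).

    A small alpha keeps the mask nearly invisible while still changing
    every pixel that a screenshot would capture.
    """
    return [[(1.0 - alpha) * s + alpha * m for s, m in zip(srow, mrow)]
            for srow, mrow in zip(screen, mask)]

if __name__ == "__main__":
    template = [[0, 255], [255, 0]]           # hypothetical 2x2 watermark tile
    screen = [[100.0] * 5 for _ in range(3)]  # hypothetical 3x5 gray screen
    mask = tile_mask(template, 3, 5)
    print(overlay(screen, mask, alpha=0.5))
```

Tiling means any cropped region of a screenshot that is at least one template in size still contains a full copy of the pattern, which is presumably why a locator is trained before decoding.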

Abstract:
With the development of deep learning, salient object detection (SOD) has made significant progress. However, this advancement is often constrained by the requirement for extensive training data and expensive manual annotation. To eliminate the laborious cost of dataset collection and pixel-level annotation, in this work, we employ Stable Diffusion to synthesize data and subsequently automate annotation for the SOD task. Firstly, we design a unified prompt and diverse ChatGPT4-driven prompts, which guide Stable Diffusion to generate images with simple and complex scenes. Secondly, reliable pseudo-labels are generated for these synthetic images. For simple images, we propose the simple pseudo-label generation (SPLG) strategy, which combines SAM segmentation and a CLIP classifier, and then train the initial SOD model. For complex images, we utilize the inference capability of the initial SOD model to generate pseudo-labels using the complex pseudo-label generation (CPLG) strategy, and employ iterative training to dynamically update the pseudo-labels. Finally, we design a simple yet effective SOD model that combines a feature fusion module (FFM) and an edge enhancement module (EEM): the former extracts saliency by fusing high-level features, and the latter extracts spatial positional information from low-level features to enhance the edges of saliency results. Experiments on five benchmarks show that our method outperforms methods that use no annotations, and performs better than or comparably to weak-annotation-based methods.

Abstract:
To further promote the development of multimodal point cloud completion, we contribute a large-scale multimodal point cloud completion benchmark ModelNet-MPC with richer shape categories and more diverse test data, which contains nearly 400,000 pairs of high-quality point clouds and rendered images of 40 categories. Besides the fully supervised point cloud completion task, two additional tasks including denoising completion and zero-shot learning completion are proposed in ModelNet-MPC, to simulate real-world scenarios and verify the robustness to noise and the transfer ability across categories of current methods. Meanwhile, considering that existing multimodal completion pipelines usually adopt a unidirectional fusion mechanism and ignore the shape prior contained in the image modality, we propose a Dual-Modality Feature Interaction Network (DuInNet) in this paper. DuInNet iteratively interacts features between point clouds and images to learn both geometric and texture characteristics of shapes with the dual feature interactor. To adapt to specific tasks such as fully supervised, denoising, and zero-shot learning point cloud completions, an adaptive point generator is proposed to generate complete point clouds in blocks with different weights for these two modalities. Extensive experiments on the ShapeNet-ViPC and ModelNet-MPC benchmarks demonstrate that DuInNet exhibits superiority, robustness and transfer ability in all completion tasks over state-of-the-art methods. The code and dataset will be available at https://github.com/xinpuliu/DuInNet.

Abstract:
Long-Tailed Recognition (LTR) poses significant challenges due to the heavily imbalanced nature of real-world data, which severely skews data-driven deep neural networks. Despite the rapid progress of Vision-Language Models (VLMs), they still face challenges in effectively learning from long-tailed visual data. In this paper, we present a comprehensive analysis of the reasons behind the underperformance of VLMs and propose a hierarchical inference framework to address this issue. Specifically, we prompt the large language models to generate sentence-level descriptors for class labels and conduct the open vocabulary classification by computing the average similarity between the image and each descriptor. A reweighting mechanism is further proposed to filter out uninformative descriptors. To mitigate model bias incurred by the long-tail distribution, we propose a feature adapter with the logit adjustment technique and fine-tune the CLIP model via visual prompt tokens. We introduce the Shared Feature space Mixup (SFM) to enhance the interaction between modalities to address tail visual feature insufficiency. Finally, we propose a hierarchical inference manner to combine the aforementioned proposals. Extensive evaluations demonstrate that our approach achieves state-of-the-art performance by fine-tuning only a few parameters on the Places-LT, ImageNet-LT, and iNaturalist 2018 benchmarks.
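The descriptor-averaging classification step can be sketched as follows. This is a minimal illustration that assumes the image and descriptor embeddings already exist (in practice they would come from CLIP's image and text encoders) and uses cosine similarity as the similarity measure. The names `cosine` and `classify` and the toy 2-D vectors are hypothetical, and the reweighting mechanism that filters uninformative descriptors is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_emb, class_descriptors):
    """Score each class by its mean image-descriptor similarity.

    class_descriptors maps a class label to a list of descriptor embeddings.
    Returns the best label and the full score table.
    """
    scores = {
        label: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for label, descs in class_descriptors.items()
    }
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    image = [1.0, 0.0]
    descriptors = {
        "cat": [[1.0, 0.0], [0.8, 0.6]],
        "dog": [[0.0, 1.0], [0.6, 0.8]],
    }
    print(classify(image, descriptors))
```

Averaging over several descriptors per class makes the score less sensitive to any single poorly worded description than matching against one class-name prompt.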

Abstract:
The training phase of deep neural networks requires substantial resources and as such is often performed on cloud servers. However, this raises privacy concerns when the training dataset contains sensitive content, e.g., facial or medical images. In this work, we propose a method to perform the training phase of a deep learning model on both an edge device and a cloud server that prevents sensitive content being transmitted to the cloud while retaining the desired information. The proposed privacy-preserving method uses adversarial early exits to suppress the sensitive content at the edge and transmits the task-relevant information to the cloud. This approach incorporates noise addition during the training phase to provide a differential privacy guarantee. We extensively test our method on different facial and medical datasets with diverse attributes using various deep learning architectures, showcasing its outstanding performance. We also demonstrate the effectiveness of privacy preservation through successful defenses against different white-box, deep and GAN-based reconstruction attacks. This approach is designed for resource-constrained edge devices, ensuring minimal memory usage and computational overhead.

Abstract:
Utilizing generative adversarial networks (GANs) for oversampling imbalanced data has demonstrated its effectiveness. However, many GAN-based oversampling methods are confronted with a significant challenge, namely, mode collapse, especially when dealing with tabular imbalanced data. In this paper, two unique penalty terms are incorporated into the loss functions of the discriminator and the generator of the GAN, respectively, to encourage the generated samples to exhibit not just statistical but also spatial information consistency with the minority samples, thereby alleviating the issue of mode collapse. In contrast to other studies that fix the coefficients of the penalty terms, the optimal coefficients are adaptively searched using a meta-learning approach, where Bayesian optimization is first employed to effectively handle situations involving a small number of minority samples in the imbalanced data. We call the proposed model META_GAN. Experimental results demonstrate that META_GAN outperforms alternative oversampling methods on general tabular and image imbalanced datasets and long-tailed datasets in terms of different metrics.

Abstract:
Long Document Classification (LDC) has gained significant attention recently. However, multi-modal data in long documents such as texts and images are not being effectively utilized. Prior studies in this area have attempted to integrate texts and images in document-related tasks, but they have primarily focused on short text sequences and images of pages. How to classify long documents with hierarchical structure texts and embedding images is a new problem and faces multi-modal representation difficulties. In this paper, we propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification. The HMT conducts multi-modal feature interaction and fusion between images and texts in a hierarchical manner. Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features, and the section and sentence features. Furthermore, we introduce a new interaction strategy called the dynamic mask transfer module to integrate these two transformers by propagating features between them. To validate our approach, we conduct cross-modal LDC experiments on two newly created and two publicly available multi-modal long document datasets, and the results show that the proposed HMT outperforms state-of-the-art single-modality and multi-modality methods.

Abstract:
Data uncertainty refers to the degree of uncertainty in model predictions caused by data variability, challenging the model robustness in multimedia applications. The increased variability of uncontrolled face images, such as mask occlusion and image blur, increases the intra-class differences and inter-class similarities. As a result, the ambiguity in the learned features exacerbates the uncertainty of sample-to-class membership. Traditional face recognition models are deterministic point embedding models that fail to measure data uncertainty. Probabilistic embedding models, such as advanced Data Uncertainty Learning (DUL), represent each face image as a Gaussian distribution to measure data uncertainty. However, these models perform random sampling from the distribution once per training, and the sampled points may fall into different class regions, leading to training oscillation. Therefore, we propose a robust Region Uncertainty Learning (RUL) method, which adopts the entire Gaussian distribution of each sample during each training epoch, and estimates the region relations between the sample distribution region and the class region to measure the sample-to-class membership. In fact, DUL is a special case of the proposed RUL. Specifically, DUL estimates point-with-region relations and only represents absolute membership and non-membership. In contrast, RUL estimates region-with-region relations, enabling it to additionally represent incomplete membership. This more comprehensive membership measurement fully represents the uncertainty of membership, enhancing the model performance and robustness in uncontrolled scenes. Furthermore, for robust face recognition, we propose two RUL-based angular margin losses, AngleFace and RegionFace, to adaptively adjust the learning weights according to the uncertainty of membership. Finally, we comprehensively evaluate the effectiveness of RUL on various face datasets, and profoundly analyze the role of region relations. In the future, we will explore the applicability of RUL in other tasks.

Abstract:
Noisy label learning aims to learn robust networks under the supervision of noisy labels, which plays a critical role in deep learning. Existing work either conducts sample selection or label correction to deal with noisy labels during the model training process. In this paper, we design a simple yet effective sample selection framework, termed Two-Stream Sample Distillation (TSSD), for noisy label learning, which can extract more high-quality samples with clean labels to improve the robustness of network training. Firstly, a novel Parallel Sample Division (PSD) module is designed to generate a certain training set with sufficient reliable positive and negative samples by jointly considering the sample structure in feature space and the human prior in loss space. Secondly, a novel Meta Sample Purification (MSP) module is further designed to mine adequate semi-hard samples from the remaining uncertain training set by learning a strong meta classifier with extra golden data. As a result, more and more high-quality samples will be distilled from the noisy training set to train networks robustly in every iteration. Extensive experiments on four benchmark datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet and Clothing-1M, show that our method has achieved state-of-the-art results over its competitors.

Abstract:
Unsupervised anomaly detection generally aims to identify irregularities using only normal samples, where feature reconstruction-based methods demonstrate greater robustness to noise by comparing reconstructed results with original data. However, they encounter issues with detailed information loss and insufficient anomaly discriminability. To address these challenges, we propose a progressive-difference-aware feature reconstruction network for image anomaly detection, named PADNet. To enhance context interaction, we develop a harmonic symmetric reconstruction framework integrated with a progressive feature harmonizer (PFH). The PFH mitigates detailed information loss to reduce undesired reconstruction errors through the progressive fusion of information flows. To enhance anomaly discriminability, we introduce the neighbor-aided residual feature representation module (NRFR) to strengthen difference-aware feature representations. The NRFR innovatively captures discriminative cues by interacting with neighboring reference samples in the feature cache pool. Experimental results on the MVTec, VisA, and BTAD datasets demonstrate that our method achieves superior performance while requiring only 25.3% of the parameters compared to the state-of-the-art baseline.

Abstract:
With coordinates as the input and RGB pixel values as the output, a neural network can be used to represent an image, which is widely known as an implicit neural representation (INR). INRs have gained popularity as a promising approach for representing a variety of data types, including images. Previous works on INR have mainly focused on learning an invariant image target without exploring the impact of learning strategies on learning INR. It is observed that there is a substantial variation in PSNR among different images, and our preliminary investigation shows that, in the early training stage, learning complex image content yields significantly better performance than simple image content. Inspired by this finding, we conjecture that increasing INR task complexity in the early stage of training might boost INR performance and thus propose to intentionally contaminate the target image with another complex image. Our proposed method is called Mix-INR, which adopts a two-stage training to first learn a pseudo-target image (contaminated target) and then learn the real-target image (uncontaminated target). To generate the pseudo-target image, we experiment with two contamination methods (blending and replacement), both of which show superior performance and verify our conjecture. Regarding the task complexity of the pseudo-target image, we set the contamination image from a complex natural image to a random-noise image. Moreover, we propose a dynamic contamination method to smoothly transition from the pseudo-target image to the real-target image. Experimental results demonstrate that our proposed method achieves competitive performance, which suggests that INR can be improved by manipulating the task complexity in the early stage of training.
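The blending contamination and the smooth transition from pseudo-target to real target can be sketched as follows. This is a minimal sketch under stated assumptions: images are flattened to lists of pixel values, blending is a convex combination with weight `lam`, and the transition is a linear decay of `lam` to zero. The names `blend` and `linear_schedule`, the starting weight of 0.5, and the linear schedule itself are hypothetical; the abstract does not give the actual formulas used by Mix-INR.

```python
def blend(target, contaminant, lam):
    """Pseudo-target = (1 - lam) * target + lam * contaminant, per pixel."""
    return [(1.0 - lam) * t + lam * c for t, c in zip(target, contaminant)]

def linear_schedule(step, total_steps, lam_start=0.5):
    """Hypothetical schedule: contamination weight decays linearly to 0."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start * (1.0 - frac)

if __name__ == "__main__":
    target = [0.0, 1.0]        # toy real-target pixels
    contaminant = [1.0, 0.0]   # toy complex or random-noise pixels
    for step in (0, 50, 100):
        lam = linear_schedule(step, 100)
        # The regression target the network would fit at this step.
        print(step, lam, blend(target, contaminant, lam))
```

At the final step the weight reaches zero, so the training target coincides with the uncontaminated image, which matches the two-stage idea of ending on the real target.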

Abstract:
The nonlocal low-rank (NLR) optimization has shown promise for generalized multispectral filter array (MSFA) demosaicing. However, it faces challenges in balancing efficiency and accuracy. To tackle these challenges, we report here the multi-channel global low-rank optimization technique, achieving efficient high-fidelity MSFA demosaicing. Inspired by the cross-band correlations of natural multispectral images, we introduce the multi-channel matching and low-rank strategies that jointly optimize image patches of all channels, exhibiting higher efficiency and accuracy than existing approaches. Furthermore, we present global structural matching (GSM) which performs structure-aware multi-channel matching across the entire multispectral image. GSM extracts structurally important patches and efficiently searches their similar patches via parallel correlation, providing an order-of-magnitude improvement in efficiency. By combining the aforementioned techniques, we have achieved superior performance over the state-of-the-art NLR demosaicing technique, leading to up to 3.9 dB peak signal-to-noise ratio (PSNR) gain and over a 150-fold increase in computational speed. Experiments validated that the technique outperforms existing methods in reconstructing fine textures and details and exhibits superior robustness to noise.
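
The low-rank idea can be sketched generically: similar patches (here imagined as gathered from all channels) are stacked as columns of a matrix, and its singular values are soft-thresholded. This is a textbook singular-value-thresholding step under assumed names (`low_rank_approx`, `tau`), not the multi-channel optimization or the GSM matching of the paper.

```python
import numpy as np

def low_rank_approx(patch_matrix, tau):
    """Soft-threshold the singular values (generic low-rank step, assumed form)."""
    u, s, vt = np.linalg.svd(patch_matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # small singular values are zeroed
    return (u * s_shrunk) @ vt

rng = np.random.default_rng(0)
base = np.outer(rng.random(16), rng.random(8))           # rank-1 group of similar patches
noisy = base + 0.01 * rng.standard_normal(base.shape)    # add small noise
denoised = low_rank_approx(noisy, tau=0.1)

print("rank before:", np.linalg.matrix_rank(noisy))
print("rank after: ", np.linalg.matrix_rank(denoised))
```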

Abstract:
Point cloud oversegmentation methods obtain a series of superpoints by grouping points that are semantically and geometrically consistent. The generated superpoints can be treated as the basic processing units in various downstream tasks to improve task performance and processing efficiency. However, due to the high semantic and geometric complexity of point cloud scenes, obtaining high-quality superpoints is still challenging. Aiming to generate high-quality indoor superpoints, we propose SCL-OverSeg, an end-to-end supervised contrastive learning framework for indoor point cloud oversegmentation. Firstly, to balance the importance of geometric similarity and the spatial proximity constraint between points and superpoints in indoor scenes, we integrate both into the supervision signal by generating the superpoint ground truth. To prevent superpoints from crossing objects, we utilize instance labels rather than semantic labels to generate the ideal superpoint ground truth as the object-level supervision signal. Secondly, to construct a distinguishable embedding space that facilitates the assignment of points to superpoints, we propose point-superpoint contrastive learning to compel the network to project each point closer to the appropriate superpoint in the embedding space. Besides, to improve superpoint performance on object boundaries, we use the instance labels in an object boundary contrastive learning scheme that enhances the feature distinguishability between hard points across object boundaries. Extensive experiments demonstrate that SCL-OverSeg can effectively improve indoor oversegmentation performance, especially on object boundaries.
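
A minimal sketch of what a point-superpoint contrastive objective could look like is given below, using an InfoNCE-style loss in which each point's positive is its ground-truth superpoint and all other superpoints act as negatives. The function name, the cosine similarity, and the `temperature` value are assumptions; the abstract does not give the exact loss.

```python
import numpy as np

def point_superpoint_nce(point_emb, superpoint_emb, assignment, temperature=0.1):
    """InfoNCE-style loss (assumed form): pull each point toward its assigned superpoint."""
    p = point_emb / np.linalg.norm(point_emb, axis=1, keepdims=True)
    s = superpoint_emb / np.linalg.norm(superpoint_emb, axis=1, keepdims=True)
    logits = p @ s.T / temperature                  # (num_points, num_superpoints)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(assignment)), assignment].mean())

points = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
superpoints = np.array([[1.0, 0.0], [0.0, 1.0]])

good = point_superpoint_nce(points, superpoints, np.array([0, 0, 1, 1]))
bad = point_superpoint_nce(points, superpoints, np.array([1, 1, 0, 0]))
print(good, bad)   # the consistent assignment yields the lower loss
```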

Abstract:
The goal of underwater image enhancement (UIE) is to increase the quality of acquired underwater images, which significantly increases the value of these images. However, without effective underwater enhanced image quality assessment (UEIQA) measures that benchmark UIE, the UIE process becomes directionless, and the enhanced results produced by different UIE algorithms cannot be fairly compared. To this end, in this work, we construct a dedicated UEIQA scheme on the basis of a deep investigation of the characteristics of enhanced underwater images. Specifically, in our proposed method, we design deep neural networks to represent the unique attributes of enhanced underwater images, such as color casts, local distortions, degrees of naturalness, sharpness levels, contrast levels, and fog densities, which are highly correlated with image quality. Then, we introduce a vision transformer (ViT) to capture the dependencies among different image attributes and infer the quality level of the examined images. Extensive experiments conducted on three typical UEIQA databases, i.e., SOTA, UID2021 and SAUD, show that the proposed UEIQA model yields notably higher prediction accuracy than the representative IQA and UEIQA metrics, e.g., achieving SRCC values of 0.891 (vs. 0.749) on SAUD and 0.933 (vs. 0.798) on UID2021.
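
For readers unfamiliar with the SRCC figures quoted above, the Spearman rank correlation coefficient is the Pearson correlation computed on ranks. The sketch below is a plain NumPy version that assumes there are no tied values; the score arrays are made-up examples, not data from the paper.

```python
import numpy as np

def srcc(pred, mos):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_pred = np.argsort(np.argsort(pred))   # rank of each predicted score
    rank_mos = np.argsort(np.argsort(mos))     # rank of each subjective score
    return float(np.corrcoef(rank_pred, rank_mos)[0, 1])

predicted = [0.2, 0.5, 0.4, 0.9]    # hypothetical model outputs
subjective = [1.0, 3.1, 2.5, 4.8]   # hypothetical mean opinion scores
print(srcc(predicted, subjective))  # identical ordering gives a value of 1 up to rounding
```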

Abstract:
Temporal Action Localization (TAL) aims to localize the start and end timestamps of actions with specific categories in untrimmed videos. Despite great success, noisy action boundary labels may be included due to the inherent subjectivity of manual annotations. This can lead TAL models to learn inaccurate action boundaries during training, potentially impairing their localization performance. To systematically analyze and enhance the TAL models’ robustness against noisy action boundary labels, we introduce a new task termed TAL with Noisy Label. We demonstrate that introducing even minimal random noise to action boundary labels in training data can substantially degrade the performance of leading TAL methods, thereby underscoring their vulnerability to noisy action boundary labels. To address this vulnerability, we propose a novel plug-and-play method called Energy-based Meta Boundary Refinement (EMBR), where a meta-learning pipeline is employed to rectify noisy action boundary labels, reducing the extent to which noisy labels mislead model training. Under this meta-learning pipeline, EMBR utilizes an energy function to calculate the magnitude of label noise and re-weights samples, assigning lower weights to samples with higher noise, alleviating the impact of noisy samples on model training. In addition, considering the energy difference between action and background segments, an energy-based loss function is proposed to achieve larger energy differences across the boundary, assisting in the boundary refinement. Experimental results on the THUMOS14, ActivityNet1.3, and HACS datasets demonstrate the effectiveness of EMBR in enhancing the robustness of TAL models.

Abstract:
The role of Camouflaged Object Detection (COD) is to identify objects that blend seamlessly into the surrounding environment. Due to the high intrinsic similarity between the objects and their background, this task presents greater challenges than traditional object detection. Most existing COD methods have a large number of parameters and high computational complexity in the pursuit of detection accuracy, which hinders the application of COD in practical scenarios. To address this issue, we propose a UNet-like Transformer Network for COD, termed UTNet, which achieves competitive detection accuracy with fewer parameters. Specifically, we propose a Camouflaged Region Awareness Module (CRAM) consisting of a Hierarchical Attention Mechanism (HAM) that groups features to reveal intrinsic consistency between sub-features. This CRAM can be embedded into the backbone network, giving it powerful modeling capabilities. We also present a Contextual Knowledge Collector (CKC) that exploits a cross-aggregation approach for neighboring feature layers, promoting the flow of semantic information from high-level to low-level features and ensuring the integrity of camouflaged objects at each level of features. Furthermore, we introduce a progressive decoder that utilizes a cascade of attention units to filter noise and explores knowledge aggregation to emphasize features from different levels, ensuring that camouflaged objects have complete spatial details at the local level. Extensive experimental results show that UTNet achieves competitive results compared to 20 state-of-the-art methods.

Abstract:
Recently, graph convolutional network-based dual-view multimodal recommendation methods have achieved great success. They extract multimodal and behavior features based on item-item and user-item graphs, respectively. However, they still have two limitations. First, the relevance between multimodal semantics and user preferences is ignored, resulting in the propagation and coupling of preference-irrelevant noise. Second, the direct use of uneven factual user-item graphs is suboptimal, as both redundant noisy edges and missing positive interaction edges impair recommendations. To solve the above issues, we propose a DisentAngled deNoising and Counterfactual balancE method for multimodal recommendation, dubbed DANCE. Specifically, for multimodal features, we explicitly disentangle them into preference-relevant and preference-irrelevant representations, so that the latter absorb the irrelevant noise and can be discarded. An orthogonal regularization and a contrastive learning task on preference relevance score prediction are proposed as a dual safeguard to prevent preference-relevant representations from encoding irrelevant noise. For behavior feature extraction, we construct a balanced user-item graph by integrating factual and counterfactual graphs. In this process, we pre-train a behavior simulator to build the counterfactual graph with full interactions. Top-K sampling is adopted to omit noisy edges and add missing edges in the graph. The final recommendation is performed upon the fused representation of preference-relevant multimodal and behavior representations. Extensive experiments on three public datasets verify the effectiveness of DANCE.
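
The Top-K sampling step can be pictured as follows: given a dense user-item score matrix (a stand-in here for the output of the pre-trained behavior simulator), only the K highest-scoring items per user are kept as edges. The function name `topk_graph` and the score values are hypothetical, and how the resulting edges are merged with the factual graph is not specified in the abstract.

```python
import numpy as np

def topk_graph(scores, k):
    """Boolean user-item adjacency keeping the k highest-scoring items per user."""
    top_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # indices of the k largest per row
    adj = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(adj, top_idx, True, axis=1)
    return adj

# Hypothetical simulator scores for 2 users and 3 items.
scores = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.3]])
adj = topk_graph(scores, k=2)
print(adj)
```

Low-scoring pairs are dropped even if they were observed, and high-scoring pairs are kept even if they were never observed, which is one way to read "omit noisy edges and add missing edges".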

Abstract:
Recent advances in large pretrained text-to-image generation models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity is still an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images for each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image from a person seen for the first time. More specifically, we employ a face encoder with an identity prior to encode the input face, and then calibrate the face representation to align with the distribution of a space that carries an editability prior and is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step to unify image, video, and 3D customized generation models. The code is available at https://github.com/qinghew/StableIdentity.

Abstract:
Change detection is essential for regional development, and pseudo-changes between bitemporal images induced by imaging environmental factors are a key challenge. Existing transformation-based methods regard pseudo-changes as a kind of style shift and alleviate it by transforming bitemporal images into the same style using generative adversarial networks (GANs). However, their efforts are limited by two drawbacks: 1) Transformed images suffer from distortion that reduces feature discrimination. 2) Alignment prevents the model from learning domain-agnostic representations, which degrades performance on scenes with domain shifts from the training data. Therefore, starting from pseudo-changes caused by style differences, we present a generalizable domain-agnostic difference learning network (DonaNet). For drawback 1), we argue for local-level statistics as style proxies to assist against domain shifts. For drawback 2), DonaNet learns domain-agnostic representations by removing the domain-specific style of encoded features and highlighting the class characteristics of objects. In the removal, we propose a domain difference removal module to reduce feature variance while preserving discriminative properties, and propose an enhanced version that can eliminate more style by decorrelating features. In the highlighting, we propose a cross-temporal generalization learning strategy to imitate latent domain shifts, thus actively enabling the model to extract feature representations that are more robust to shifts. Extensive experiments conducted on three public datasets demonstrate that DonaNet outperforms existing state-of-the-art methods with a smaller model size and is more robust to domain shift.

Abstract:
Assessing image quality is crucial in image processing tasks such as compression, super-resolution, and denoising. While subjective assessments involving human evaluators provide the most accurate quality scores, they are impractical for large-scale or continuous evaluations due to their high cost and time requirements. Pairwise comparison subjective assessment tests, which rank image pairs instead of assigning scores, offer more reliability and accuracy but require numerous comparisons, leading to high costs. Although objective quality metrics are more efficient, they lack the precision of subjective tests, which are essential for benchmarking and training learning-based quality metrics. This paper proposes an uncertainty-based sampling method to optimize the pairwise comparison subjective assessment process. By utilizing deep learning models to estimate human preferences and identify pairs that need human labeling, the approach reduces the number of required comparisons while maintaining high accuracy. The key contributions include modeling uncertainty for accurate preference predictions and for pairwise sampling. The experimental results demonstrate superior performance of the proposed approach compared to traditional active sampling methods.
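
One simple way to picture uncertainty-based pair selection is sketched below: a model predicts, for each image pair, the probability that the first image is preferred, and the pairs with the highest binary entropy are sent to human annotators. Using entropy as the uncertainty measure, and the names `select_pairs_for_labeling` and `budget`, are assumptions for illustration; the abstract does not state how uncertainty is modeled.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli preference probability."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def select_pairs_for_labeling(pref_prob, budget):
    """Indices of the `budget` pairs whose predicted preference is most uncertain."""
    uncertainty = binary_entropy(np.asarray(pref_prob, dtype=float))
    return np.argsort(-uncertainty)[:budget]

# Hypothetical predicted probabilities that image A is preferred over image B.
probs = [0.98, 0.52, 0.10, 0.45, 0.80]
chosen = select_pairs_for_labeling(probs, budget=2)
print(chosen)   # pairs closest to 0.5 are chosen for human labeling
```

Pairs with confident predictions (for example 0.98) keep the model's label, so human effort is spent only where the prediction is close to a coin flip.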

Abstract:
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment between visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon the transformer structure. However, they exhibit two limitations: (1) these conditional queries, identifying the same object across different frames through the same query, lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they handle the temporal feature of a video and build visual-language interaction sequentially, integrating textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused as much as possible on the object of interest. This dynamic evolution of the queries across the video also enables the proxy queries to establish inter-frame dependencies, enhancing the accuracy and coherence of object tracking throughout the video sequence. To mitigate the high computational costs associated with full spatio-temporal interactions between video and proxy queries, we propose to decouple cross-modality interactions into their temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks, i.e., Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences, clearly demonstrate the superiority of our ProxyFormer over state-of-the-art methods.

Abstract:
Open world object detection (OWOD) aims to identify both known instances of trained classes and unknown ones. Despite recent advancements, existing methods exhibit a detection bias towards known classes, as detectors are exclusively trained under the supervision of known classes. To address this problem, we construct a causal graph to scrutinize OWOD from a causal perspective, revealing that the bias problem primarily arises due to the confounding effect of known classes, and the causality between unknown objects and their predictions learned by the detector is weak. Therefore, we propose a causality-inspired debiasing framework for OWOD, aiming to bolster the performance of OWOD models by eliminating confounders and encouraging appropriate features. Specifically, a semantic causal intervention module is proposed to remove the confounding effect from known classes to unknown features, which introduces the known semantics to interact fairly with all unknown features through backdoor adjustment. Moreover, an unknown causality enhancement module is employed to enhance the causality of unknown objects and their predictions acquired by the model, which imposes constraints for different unknown classes in feature space with the contrastive learning paradigm from the perspective of intervention effect. Extensive experiments conducted on the commonly-used OWOD benchmarks demonstrate that our framework consistently yields superior results on unknown classes compared with state-of-the-art methods by a large margin (+25.0% UD-Pre, +10.2% Recall on unknown classes) and even better on known classes (+1.4% mAP on known classes).

Abstract:
Underwater imaging is essential in a variety of fields, including resource exploration, marine observation, and scientific research. However, the quality of underwater images is often compromised by environmental factors such as light scattering, absorption, and the presence of fog, leading to distortions such as color shifts, low contrast, and blurriness. To address these challenges, we propose a novel underwater image quality assessment (UIQA) method, the Attention and Mamba-driven Quality Index (AMQI). The AMQI model employs a multi-stage architecture designed to capture both local and global image features critical for underwater quality evaluation. First, a Shallow Feature Extractor (SFE) captures essential spatial details. Next, the Local Information Representation Network (LIR-Net), equipped with Channel Attention (CA) and Large Kernel-guided Spatial (LKS) mechanisms, enhances fine details and captures long-range dependencies to address underwater-specific distortions. The Global Information Representation Network (GIR-Net) further processes the features using a combination of the Visual State-Space Model (VSSM) and ResNet-50 to capture high-level semantic and contextual information. Finally, the Feature-Quality Mapping Network (FQM) converts the learned features into a quality score, ensuring precise predictions of image quality. Extensive experiments on the Underwater Image Quality Database (UIQD) demonstrate that AMQI outperforms current state-of-the-art IQA and UIQA models in terms of accuracy and correlation with human subjective evaluations. The model’s robustness and generalization capabilities are further validated through detailed ablation studies and cross-database evaluations, showcasing its strong performance across diverse underwater environments.

Abstract:
Recently, the Contrastive Language Image Pre-training (CLIP) model has shown significant generalizability by optimizing the distance between visual and text features. The mainstream CLIP-based action recognition methods mitigate the low “zero-shot” generalization of the 1-of-N paradigm but also lead to a significant degradation in supervised performance. Therefore, strong supervised performance and competitive “zero-shot” generalization need to be effectively balanced. In this work, a Multimodal Independent Prompt CLIP (MIP-CLIP) model is proposed to address this challenge. On the visual side, we propose a novel Video Motion Prompt (VMP) to empower the visual encoder with motion perception, which performs short- and long-term motion modelling via a temporal difference operation. Next, a visual classification branch is introduced to improve the discrimination of visual features. Specifically, the temporal difference and visual classification operations of the 1-of-N paradigm are extended to CLIP to satisfy the need for strong supervised performance. On the text side, we design a Class-Agnostic text prompt Template (CAT) under the constraint of a Semantic Alignment (SA) module to solve the label semantic dependency problem. Finally, a Dual-branch Feature Reconstruction (DFR) module is proposed to complete cross-modal interactions for better feature matching, which uses the class confidence of the visual classification branch as input. The experiments are conducted on four widely used benchmarks (HMDB-51, UCF-101, Jester, and Kinetics-400). The results demonstrate that our method achieves excellent supervised performance while preserving competitive generalizability.
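
The temporal difference operation mentioned above can be sketched as subtracting frames that are a fixed number of steps apart, with a small stride for short-term motion and a larger one for long-term motion. The function name, the `stride` parameter, and the toy video are illustrative assumptions and do not reproduce the VMP design.

```python
import numpy as np

def temporal_difference(frames, stride=1):
    """Difference between frames `stride` steps apart; frames has shape (T, H, W, C)."""
    return frames[stride:] - frames[:-stride]

# Toy video of 8 frames whose pixel values grow by 4 from one frame to the next.
video = np.arange(8 * 2 * 2 * 1, dtype=float).reshape(8, 2, 2, 1)

short_term = temporal_difference(video, stride=1)   # adjacent-frame motion cue
long_term = temporal_difference(video, stride=4)    # motion cue over a longer window
print(short_term.shape, long_term.shape)
```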

Abstract:
Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we search for the optimal action sequence that agrees with their semantics. Besides, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to refine temporal boundaries further, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on the THUMOS14 and ActivityNet1.3 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.

Abstract:
3D shape segmentation is a crucial task in the field of multimedia analysis and processing, and recent years have seen a surge in research on this topic. However, many existing methods only consider geometric features of 3D shapes and fail to explore the potential connections between faces, limiting their segmentation performance. In this paper, we propose a novel segmentation approach that mines and enhances the potential consistency of 3D shapes to overcome this limitation. The key idea is to mine the consistency between different partitions of 3D shapes and to use the unique consistency enhancement strategy to continuously optimize the consistency features for the network. Our method also includes a comprehensive set of network structures to mine and enhance consistent features, enabling more effective feature extraction and better utilization of contextual information around each face when processing complex shapes. We evaluate our approach on public benchmarks through extensive experiments and demonstrate its effectiveness in achieving higher accuracy than existing methods.

Abstract:
Oriented object detection typically adds an additional rotation angle to the regressed horizontal bounding box (HBB) for representing the oriented bounding box (OBB). However, existing oriented object detectors based on regression angles face inconsistency between metric and loss, boundary discontinuity or square-like problems. To solve the above problems, we propose an anchor-free oriented object detector named PRA-Det, which assigns the center region of the object to regress OBBs represented by the polar radius vectors. Specifically, the proposed PRA-Det introduces a diamond-shaped positive region of category-wise attention factor to assign positive sample points to regress polar radius vectors. PRA-Det regresses the polar radius vector of the edges from the assigned sample points as the regression target and suppresses the predicted low-quality polar radius vectors through the category-wise attention factor. The OBBs defined for different protocols are uniformly encoded by the polar radius encoding module into regression targets represented by polar radius vectors. Therefore, the regression target represented by the polar radius vector does not have angle parameters during training, thus solving the angle-sensitive boundary discontinuity and square-like problems. To optimize the predicted polar radius vector, we design a spatial geometry loss to improve the detection accuracy. Furthermore, in the inference stage, the center offset score of the polar radius vector is combined with the classification score as the confidence to alleviate the inconsistency between classification and regression. The extensive experiments on public benchmarks demonstrate that the PRA-Det is highly competitive with state-of-the-art oriented object detectors and outperforms other comparison methods.

Abstract:
Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely Attribute and Contextual Feature Projection (ACFP) and Relation Exploration Module (REM). In ACFP, attribute-specific features and context-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct an instance-wise adjacency matrix for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks, namely the PETA, RAP, and PA-100K datasets. Moreover, ViT-RE achieved first place in the WACV 2023 UPAR Challenge.

Abstract:
Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all the object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these ‘mask and identity’ annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework that neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identity-free baseline and our mask-identity-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.

Abstract:
Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.

Abstract:
Deep hashing algorithms have demonstrated considerable success in recent years, particularly in cross-modal retrieval tasks. Although hash-based cross-modal retrieval methods have demonstrated considerable efficacy, the vulnerability of deep networks to adversarial examples represents a significant challenge for hash retrieval. In the absence of target semantics, previous non-targeted attack methods attempt to attack deep models by adding perturbations to the input data, yielding some positive outcomes. Nevertheless, they still lack specific instance-level hash codes and fail to consider the diversity and semantic association of different modalities, which is insufficient to meet the attacker's expectations. In response, we present a novel Primary code Guided Targeted Attack (PGTA) against cross-modal hashing retrieval. First, we integrate cross-modal instances and labels to obtain well-fused target semantics, thereby enhancing cross-modal interaction. Second, the primary code is designed to generate discriminable information with fine-grained semantics for target labels. Benign samples and target semantics collectively generate adversarial examples under the guidance of primary codes, thereby enhancing the efficacy of targeted attacks. Extensive experiments demonstrate that our PGTA outperforms the most advanced methods on three datasets, achieving state-of-the-art targeted attack performance.

Abstract:
Self-attention mechanisms have revolutionized natural language processing and computer vision. However, in point cloud analysis, most existing methods focus on point convolution operators for feature extraction and fail to model long-range and hierarchical dependencies. To overcome these issues, we present PointAttention, a novel network for point cloud feature representation and propagation. Specifically, this architecture uses a two-stage Learnable Self-attention for long-range attention weights learning, which is more effective than conventional triple attention. Furthermore, it employs a Hierarchical Learnable Attention Mechanism to formulate momentous global prior representation and perform fine-grained context understanding, which enables our framework to break through the limitation of the receptive field and reduce the loss of contexts. Interestingly, we show that the proposed Learnable Self-attention is equivalent to the coupling of two Softmax attention operations while having lower complexity. Extensive experiments demonstrate that our network achieves highly competitive performance on several challenging publicly available benchmarks, including point cloud classification on ScanObjectNN and ModelNet40, and part segmentation on ShapeNet-Part.

Abstract:
Moving object segmentation is critical to interpret scene dynamics for robotic navigation systems in challenging environments. Neuromorphic vision sensors are tailored for motion perception due to their asynchronous nature, high temporal resolution, and reduced power consumption. However, their unconventional output requires novel perception paradigms to leverage their spatially sparse and temporally dense nature. In this work, we propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series of nonlinear transformations to unveil local and global spatiotemporal correlations between events. Based on these correlations, events belonging to moving objects are segmented from the background without prior knowledge of the dynamic scene geometry. The algorithm is trained on publicly available datasets including MOD, EV-IMO, and EV-IMO2 using the proposed training scheme to facilitate efficient training on extensive datasets. Moreover, we introduce the Dynamic Object Mask-aware Event Labeling (DOMEL) approach for generating approximate ground-truth labels for event-based motion segmentation datasets. We use DOMEL to label our own recorded Event dataset for Motion Segmentation (EMS-DOMEL), which we release to the public for further research and benchmarking. Rigorous experiments are conducted on several unseen publicly available datasets, and the results reveal that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities. GTNN achieves significant performance gains, with average increases of 9.4% and 4.5% in motion segmentation accuracy (IoU%) and detection rate (DR%), respectively.

Abstract:
In this paper, we propose a novel sparsity-driven deep neural network to solve the RGB-D image classification problem. Different from existing classification networks, our network architecture is designed by drawing inspiration from a newly proposed multi-modal discriminative sparse coding (MDSC) model. The key feature of this model is that it can gradually separate the discriminative and non-discriminative features in RGB-D images in a coarse-to-fine manner. Only the discriminative features are integrated and refined for classification, while the non-discriminative features are discarded, to improve the classification accuracy and efficiency. Derived from the MDSC model, the proposed network is composed of three modules, i.e., the shared feature extraction (SFE) module, discriminative feature refinement (DFR) module, and classification module. The architecture of each module is derived from the optimization solution in the MDSC model. To the best of our knowledge, this is the first time a fully sparsity-driven network has been proposed for RGB-D image classification. Extensive results verify the effectiveness of our method on different RGB-D image datasets.

Abstract:
Unlike vanilla long-tailed recognition, which trains on imbalanced data but assumes a uniform test class distribution, test-agnostic long-tailed recognition aims to handle arbitrary test class distributions. Existing methods require prior knowledge of test sets for post-adjustment through multi-stage training, resulting in static decisions at the dataset level. This pipeline overlooks instance diversity and is impractical in real situations. In this work, we introduce Prototype Alignment with Dedicated Experts (PADE), a one-stage framework for test-agnostic long-tailed recognition. PADE tackles unknown test distributions at the instance level, without depending on test priors. It reformulates the task as a domain detection problem, dynamically adjusting the model for each instance. PADE comprises three main strategies: 1) a parameter customization strategy for multiple experts skilled at different categories; 2) normalized target knowledge distillation for mutual guidance among experts while maintaining diversity; 3) re-balanced compactness learning with momentum prototypes, promoting instance alignment with the corresponding class centroid. We evaluate PADE on various long-tailed recognition benchmarks with diverse test distributions. The results verify its effectiveness in both vanilla and test-agnostic long-tailed recognition.

Abstract:
Text-to-image diffusion models have shown powerful ability in conditional image synthesis. With large-scale vision-language pre-training, diffusion models are able to generate high-quality images with rich textures and reasonable structures under different text prompts. However, adapting pre-trained diffusion models for visual perception is an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based visual perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model without explicit text prompts. The explicit branch uses the ground-truth labels of corresponding images as text prompts to condition feature extraction in the diffusion model. During training, we jointly train the diffusion model by sharing the model weights of these two branches. As a result, the implicit and explicit branches can jointly guide feature learning. During inference, we employ only the implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks, semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP achieves an mIoU$^{\text{ss}}$ score of 55.9% on the ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms the baseline method VPD with a relative gain of 11.0%.

Abstract:
In partial-to-complete point cloud completion, it is imperative that every patch in the output point cloud faithfully represent the corresponding patch in the partial input, ensuring similarity in terms of geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are drawn from within the input point cloud itself rather than from the rest of the dataset. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules. These modules also facilitate the learning of various levels of features. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging enhanced point patches from our point patches contrastive learning. Extensive experiments demonstrate that our PPCL achieves better quantitative and qualitative performance than off-the-shelf methods across various datasets.

Abstract:
3D object detection is an important but demanding task, which has become an active research topic in the field of multimedia. Much recent research has been devoted to exploiting end-to-end trainable object detection networks with point clouds. However, most state-of-the-art methods have bottlenecks in detecting occluded objects and small objects, because the sparseness of point clouds is exacerbated on these objects. In this paper, a Density-Aware 3D object detection network (DA-Net) is proposed to improve the perception performance for detecting occluded and small objects. It contains four components: a backbone module with an inverse density scoring module (IDM) and a point-wise attention module (PAM), a 3D intersection over union Estimation Module (3DEM), a Consistent Label Assignment (CLA) method, and an Adaptive-Soft-NMS method. The proposed backbone module makes the network concentrate on low-density points of occluded objects, and suppresses outliers and background points. Then, the 3DEM is introduced to evaluate the localization quality of the prediction boxes. Furthermore, the proposed CLA method can more accurately select positive and negative samples for small objects. Finally, Adaptive-Soft-NMS is proposed to reduce the number of false detections during inference and thereby improve detection performance substantially. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on two large-scale datasets, SUN RGB-D (62.1% in terms of mAP@0.25) and ScanNetV2 (67.1% in terms of mAP@0.25), and in particular, the detection accuracy for small objects and occluded objects is substantially improved.

Affiliations: MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Abstract:
This paper investigates an intriguing yet unsolved problem of cross-scene background subtraction for training only one deep model to process large-scale video streaming. We propose an end-to-end cross-scene background subtraction network via 3D optical flow, dubbed CrossNet. First, we design a new motion descriptor, hierarchical 3D optical flows (3D-HOP), to observe fine-grained motion. Then, we build a cross-modal dynamic feature filter (CmDFF) to enable the motion and appearance feature interaction. CrossNet exhibits better generalization since the proposed modules are encouraged to learn more discriminative semantic information between the foreground and the background. Furthermore, we design a loss function to balance the size diversity of foreground instances since small objects are usually missed due to training bias. Our whole background subtraction model is called Hierarchical Optical Flow Attention Model (HOFAM). Unlike most of the existing stochastic-process-based and CNN-based background subtraction models, HOFAM avoids inaccurate online model updating, does not rely heavily on scene-specific information, and represents ambient motion well in the open world. Experimental results on several well-known benchmarks demonstrate that it outperforms the state of the art by a large margin. The proposed framework can be flexibly integrated into arbitrary streaming media systems in a plug-and-play form.

Abstract:
Depth sensing is essential for intelligent computer vision applications, but it often suffers from low range precision and spatial resolution. To address this problem, we propose a novel framework that combines non-uniform sampling and reconstruction based on graph theory. Our framework consists of two main components: (1) a graph Laplacian induced non-uniform sampling (GLINUS) scheme that samples depth signals more densely around edges and contours than in smooth regions, and (2) an ensemble of priors (EoP) model that reconstructs the high-quality depth map using an adaptive dual-tree discrete wavelet packets (ADDWP) transform, a graph total variation regularizer, and a graph Laplacian regularizer with color guidance. We solve the reconstruction problem using the alternating direction method of multipliers (ADMM). Our experiments demonstrate that our framework can capture fine structures and global information in depth signals and produce superior depth reconstruction results.
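
As a rough illustration of the two graph regularizers named in the abstract above, the sketch below evaluates the textbook forms of the graph Laplacian penalty (the quadratic form x^T L x, written as a sum over edges) and the graph total variation on a tiny weighted graph. The function names, the edge-list representation, and the example weights are assumptions made for illustration; the abstract does not specify how the color-guided weights are built, so this is not the paper's formulation.

```python
def graph_laplacian_penalty(signal, edges):
    """Quadratic smoothness term x^T L x, computed edge by edge.

    `edges` is a list of (i, j, weight) tuples for an undirected graph.
    For a combinatorial Laplacian L, x^T L x equals
    sum over edges of weight * (x_i - x_j)^2.
    """
    return sum(w * (signal[i] - signal[j]) ** 2 for i, j, w in edges)


def graph_total_variation(signal, edges):
    """Edge-preserving term: sum over edges of weight * |x_i - x_j|."""
    return sum(w * abs(signal[i] - signal[j]) for i, j, w in edges)


# Hypothetical 3-node chain: nodes 0 and 1 share a depth, node 2 lies
# across a depth edge, and the 1-2 link is down-weighted (for example,
# because the guiding color image also changes there).
depth = [1.0, 1.0, 3.0]
edges = [(0, 1, 1.0), (1, 2, 0.5)]

print(graph_laplacian_penalty(depth, edges))  # 2.0
print(graph_total_variation(depth, edges))    # 1.0
```

The quadratic term penalizes the depth jump more heavily than the total variation term does, which is the usual reason for combining a smoothness prior with an edge-preserving one.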

Abstract:
Point clouds captured by scanning devices are often incomplete due to occlusion. To overcome this limitation, point cloud completion methods have been developed to predict the complete shape of an object based on its partial input. These methods can be broadly classified as supervised or unsupervised. However, both categories require a large number of 3D complete point clouds, which may be difficult to capture. In this paper, we propose Cross-PCC, an unsupervised point cloud completion method without requiring any 3D complete point clouds. We only utilize 2D images of the complete objects, which are easier to capture than 3D complete and clean point clouds. Specifically, to take advantage of the complementary information from 2D images, we use a single-view RGB image to extract 2D features and design a fusion module to fuse the 2D and 3D features extracted from the partial point cloud. To guide the shape of predicted point clouds, we project the predicted points of the object to the 2D plane and use the foreground pixels of its silhouette maps to constrain the position of the projected points. To reduce the outliers of the predicted point clouds, we propose a view calibrator to move the points projected to the background into the foreground by the single-view silhouette image. To the best of our knowledge, our approach is the first point cloud completion method that does not require any 3D supervision. The experimental results of our method are superior to those of the state-of-the-art unsupervised methods by a large margin. Moreover, our method even achieves comparable performance to some supervised methods.

Abstract:
Unsupervised domain adaptation person re-identification (UDA person re-ID) aims at transferring the knowledge on the source domain with expensive manual annotation to the unlabeled target domain. Most of the recent papers leverage pseudo-labels for the target images to accomplish this task. However, the noise in the generated labels hinders the identification system from learning discriminative features. To address this problem, we propose deep mutual distillation (DMD) to generate reliable pseudo-labels for UDA person re-ID. The proposed DMD applies two parallel branches for feature extraction, and each branch serves as the teacher of the other to generate pseudo-labels for its training. This mutually reinforcing optimization framework enhances the reliability of pseudo-labels, improving the identification performance. In addition, we present a bilateral graph representation (BGR) to describe the pedestrian images. BGR mimics the way humans re-identify people, aggregating identity features according to visual similarity and attribute consistency. Experimental results on Market-1501 and Duke demonstrate the effectiveness and generalization of the proposed method.

Abstract:
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features. In DG, the prevalent practice of constraining models to a fixed structure or uniform parameterization to encapsulate invariant features can inadvertently blend specific aspects. Such an approach struggles with nuanced differentiation of inter-domain variations and may exhibit bias towards certain domains, hindering the precise learning of domain-invariant features. Recognizing this, we introduce a novel method designed to supplement the model with domain-level and task-specific characteristics. This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization. Building on the emerging trend of visual prompts in the DG paradigm, our work introduces the novel Hierarchical Contrastive Visual Prompt (HCVP) methodology. This represents a significant advancement in the field, setting itself apart with a unique generative approach to prompts, alongside an explicit model structure and specialized loss functions. Differing from traditional visual prompts that are often shared across entire datasets, HCVP utilizes a hierarchical prompt generation network enhanced by prompt contrastive learning. These generative prompts are instance-dependent, catering to the unique characteristics inherent to different domains and tasks. Additionally, we devise a prompt modulation network that serves as a bridge, effectively incorporating the generated visual prompts into the vision transformer backbone. Experiments conducted on five DG datasets demonstrate the effectiveness of HCVP, outperforming both established DG algorithms and adaptation protocols.

Abstract:
The dense light field sampling of focused plenoptic images (FPIs) yields substantial amounts of redundant data, necessitating efficient compression in practical applications. However, the presence of discontinuous structures and long-distance properties in FPIs poses a challenge. In this paper, we propose a novel end-to-end approach for learned focused plenoptic image compression (LFPIC). Specifically, we introduce a local-global correlation learning strategy to build the nonlinear transforms. This strategy can effectively handle the discontinuous structures and leverage long-distance correlations in FPI for high compression efficiency. Additionally, we propose a spatial-wise context model tailored for LFPIC to help emphasize the most related symbols during coding and further enhance the rate-distortion performance. Experimental results demonstrate the effectiveness of our proposed method, achieving a 22.16% BD-rate reduction (measured in PSNR) on the public dataset compared to the recent state-of-the-art LFPIC method. This improvement holds significant promise for benefiting the applications of focused plenoptic cameras.

Abstract:
Multimedia file fragment classification (MFFC) aims to identify file fragment types, e.g., image/video, audio, and text without system metadata. It is of vital importance in multimedia storage and communication. Existing MFFC methods typically treat fragments as 1D byte sequences and emphasize the relations between separate bytes (interbytes) for classification. However, the more informative relations inside bytes (intrabytes) are overlooked and seldom investigated. By looking inside bytes, the bit-level details of file fragments can be accessed, enabling a more accurate classification. Motivated by this, we first propose Byte2Image, a novel visual representation model that incorporates previously overlooked intrabyte information into file fragments and reinterprets these fragments as 2D grayscale images. This model involves a sliding byte window to reveal the intrabyte information and a rowwise stacking of intrabyte n-grams for embedding fragments into a 2D space. Thus, complex interbyte and intrabyte correlations can be mined simultaneously using powerful vision networks. Additionally, we propose an end-to-end dual-branch network ByteNet to enhance robust correlation mining and feature representation. ByteNet makes full use of the raw 1D byte sequence and the converted 2D image through a shallow byte branch feature extraction (BBFE) and a deep image branch feature extraction (IBFE) network. In particular, the BBFE, composed of a single fully-connected layer, adaptively recognizes the co-occurrence of several specific bytes within the raw byte sequence, while the IBFE, built on a vision Transformer, effectively mines the complex interbyte and intrabyte correlations from the converted image. Experiments on the two representative benchmarks, including 14 cases, validate that our proposed method outperforms state-of-the-art approaches on different cases by up to 12.2%.
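
The idea of sliding a byte-sized window across byte boundaries and stacking the results row by row can be sketched as follows. This is only one plausible reading of the abstract above: the function name `fragment_to_rows`, the fixed 8-bit window, and the choice of eight bit offsets (one row per offset) are assumptions for illustration, and the actual Byte2Image construction may differ.

```python
def fragment_to_rows(fragment: bytes) -> list[list[int]]:
    """Turn a 1D byte fragment into an 8-row grid of 8-bit values.

    Row k holds the 8-bit windows that start k bits into each byte, so
    row 0 reproduces the original bytes (minus the last one) and rows
    1..7 straddle the boundary between neighbouring bytes.
    """
    if len(fragment) < 2:
        raise ValueError("need at least two bytes to slide across a boundary")

    n_cols = len(fragment) - 1
    rows = []
    for offset in range(8):
        row = []
        for col in range(n_cols):
            # Concatenate two neighbouring bytes into a 16-bit value,
            # then cut out the 8 bits that start `offset` bits in.
            pair = (fragment[col] << 8) | fragment[col + 1]
            row.append((pair >> (8 - offset)) & 0xFF)
        rows.append(row)
    return rows


# Each value lies in 0..255, so the grid can be viewed as a grayscale image.
image = fragment_to_rows(bytes([0xB2, 0x4D, 0xFF]))
print(image[0])  # [178, 77]  -> the original bytes 0xB2, 0x4D
print(image[4])  # [36, 223]  -> 0x24 and 0xDF, half-byte shifted windows
```

Stacking the shifted rows exposes bit patterns that cross byte boundaries to a 2D vision network, which a plain 1D byte sequence would hide.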

Abstract:
Referring video object segmentation (RVOS) is an emerging task for multimodal video comprehension, but the expensive annotation process for object masks restricts the scalability and diversity of RVOS datasets. To relax the dependency on expensive mask annotations and take advantage of large-scale partially annotated data, in this paper, we explore a novel extended RVOS task, namely weakly supervised referring video object segmentation (WRVOS), which employs multiple weak supervision sources, including object points and bounding boxes. Correspondingly, we propose a unified WRVOS framework. Specifically, an object-centric pseudo mask generation method is introduced to provide effective shape priors for the pseudo guidance of spatial object location. Then, a pseudo-guided optimization strategy is proposed to effectively optimize the object outlines in terms of spatial location and projection density with a multi-stage online learning strategy. Furthermore, a multimodal cross-frame level set evolution method is proposed to iteratively refine the object boundaries considering both temporal consistency and cross-modal interactions. Extensive experiments are conducted on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-DAVIS, and Ref-YoutubeVOS. Performance comparison shows that the proposed method achieves state-of-the-art performance in both point-supervised and box-supervised settings.

Abstract:
Generalizable human neural rendering aims to render the target views of the human body by leveraging source views and the skinned multi-person linear (SMPL) model. Despite exhibiting promising performance, the target views rendered by previous methods usually contain corrupted parts of the human body. Two primary challenges hinder high-quality human neural rendering: non-correspondences between 2D pixels and 3D SMPL vertices induced by self-occlusion of the human body, and erroneous appearance predictions caused by occlusion between the source and target views. To solve these two challenges, we propose an advanced generalizable occlusion modeling method for the neural human radiance field, in which the hurdles from the self-occlusion of the human body and the occlusion between source and target views are explored and solved. Specifically, to alleviate the non-correspondence problem induced by self-occlusion, a geometry perception module is designed to obtain 3D geometric representations of SMPL vertices, enabling the prediction of accurate density values. Furthermore, a visibility aggregation module is designed to estimate the visibility maps with respect to different source views by utilizing the predicted density. Then, the complementary information among multiple source views is integrated with the support of the visibility maps in the visibility aggregation module, thus effectively addressing the occlusion between views. Experiments on the ZJU-MoCap and THUman datasets show that the proposed method achieves promising performance compared with the existing state-of-the-art methods.

Abstract:
Incomplete Multi-view Clustering (IMvC) receives increasing attention due to its effectiveness in solving data-missing problems. Given the information loss in incomplete situations, the core of IMvC is to effectively overcome the challenge of missing views, that is, to explore the underlying correlations from available data and recover the missing information. However, most existing IMvC methods overemphasize the recovery-first principle by integrating the existing data from different views, while neglecting the influence of view consistency in the IMvC task together with valuable within-view information. In this paper, a novel Between/Within View Information Completing for Tensorial Incomplete Multi-view Clustering (BWIC-TIMC) is proposed, in which between/within view information is jointly exploited for effectively completing the missing views. Specifically, the proposed method designs a dual tensor constraint module, which focuses on simultaneously exploring the view-specific correlations of incomplete views and enforcing the between-view consistency across different views. With the dual tensor constraint, between/within view information can be effectively integrated for completing missing views for the IMvC task. Furthermore, in order to balance different contributions of multiple views and alleviate the problem of feature degeneration, BWIC-TIMC implements an adaptive fusion graph learning strategy for consensus representation learning. Extensive comparative experiments with state-of-the-art baselines demonstrate the effectiveness of BWIC-TIMC.

Abstract:
Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals has not been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features further interacts within the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.

Abstract:
In this work, we present DeepEraser, an effective deep network for generic text removal. DeepEraser utilizes a recurrent architecture that erases the text in an image via iterative operations. Our idea comes from the process of erasing pencil script, where the text area designated for removal is subject to continuous monitoring and the text is attenuated progressively, ensuring a thorough and clean erasure. Technically, at each iteration, an innovative erasing module is deployed, which not only explicitly aggregates the previous erasing progress but also mines additional semantic context to erase the target text. Through iterative refinements, the text regions are progressively replaced with more appropriate content and finally converge to a relatively accurate status. Furthermore, a custom mask generation strategy is introduced to improve the capability of DeepEraser for adaptive text removal, as opposed to indiscriminately removing all the text in an image. Our DeepEraser is notably compact with only 1.4 M parameters and trained in an end-to-end manner. To verify its effectiveness, extensive experiments are conducted on several prevalent benchmarks, including SCUT-Syn, SCUT-EnsText, and Oxford Synthetic text dataset. The quantitative and qualitative results demonstrate the effectiveness of our DeepEraser over the state-of-the-art methods, as well as its strong generalization ability in custom mask text removal.

Abstract:
Most current multi-view clustering methods necessitate that a sample's features be view-aligned or at least partially aligned across different views. Regrettably, real-world applications often fail to meet this requirement due to spatial, temporal, or spatiotemporal mismatches, resulting in the view-unaligned issue. To tackle this issue, we conceptualize the view-unaligned problem and demonstrate that it can be transformed into a view-aligned problem through reordering. Building on this concept, we introduce an innovative reorder matrix that realigns view-unaligned features. Utilizing these realigned features, we develop a sophisticated and efficient approach called Reordered k-means (RKM), which merges NMF with k-means. Unlike traditional k-means, our method converts the binary challenge into an $\ell_0$ problem, confirming the merit of this advancement. Furthermore, RKM's efficacy is affirmed on benchmarks, indicating substantial enhancements in handling the view-unaligned issue and maintaining competitive results with view-aligned problems.

Abstract:
Text-Pedestrian Image Retrieval employs a textual description of a pedestrian's appearance to identify the corresponding pedestrian image. This task involves modality discrepancy and the challenges posed by the textual diversity of pedestrians with the same identity. Although advancements have been made in text-pedestrian image retrieval, current methods do not comprehensively address these challenges. Thus, this paper proposes a progressive feature mining and external knowledge-assisted feature purification method. Specifically, we implement a progressive mining mode, enabling the model to extract discriminative features from overlooked information. This enhances the model's feature representation capabilities and prevents the loss of discriminative information. To further mitigate the challenges posed by modality discrepancy and text diversity in cross-modal matching, we propose to use external knowledge of other samples from the same modality. This approach accentuates identity-consistent features and diminishes identity-inconsistent ones, refining feature representation and reducing interference from textual diversity and negative sample correlation features of the same modality. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, with its retrieval performance outstripping that of large-scale model-based methods on large-scale datasets.

Abstract:
Visible-infrared person re-identification is a challenging task in video surveillance. Most existing works achieve performance gains by aligning feature distributions or image styles across modalities, whereas the multi-granularity information and domain knowledge are usually neglected. Motivated by these issues, we propose a novel modality-aware domain alignment network (MDANet) for visible-infrared person re-identification (VI-ReID), which utilizes global-local context cues and a generalized domain alignment strategy to address modal differences and poor generalization. Firstly, modality-aware global-local context attention (MGLCA) is proposed to obtain multi-granularity context features and identity-aware patterns. Secondly, we present a generalized domain alignment learning head (GDALH) to relieve the modality discrepancy and enhance the generalization of MDANet, whose core idea is to enrich feature diversity in the domain alignment procedure. Finally, the entire network is trained in an end-to-end fashion with the proposed cross-modality circle, classification, and domain alignment losses. We conduct comprehensive experiments on two standard VI-ReID datasets and their corrupted versions to validate the robustness and generalization of our approach. MDANet clearly outperforms most state-of-the-art methods. Specifically, the proposed method gains 8.86% and 2.50% in Rank-1 accuracy on the SYSU-MM01 (all-search and single-shot mode) and RegDB (infrared to visible mode) datasets, respectively. The source code will be made available soon.

Affiliations: Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; School of Communication and Information Engineering, Shanghai University, Shanghai, China; IVIPLab, Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China; Research Center for Industries of the Future and the School of Engineering, Westlake University, Hangzhou, China; Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract:
Lightweight image super-resolution aims to reconstruct high-resolution images from low-resolution images at low computational cost. However, existing methods lose middle-layer features due to activation functions. To minimize the impact of intermediate feature loss on reconstruction quality, we propose a Feature Interaction Weighted Hybrid Network (FIWHN), which comprises a series of Wide-residual Distillation Interaction Blocks (WDIBs) as the backbone. Every three WDIBs form a Feature Shuffle Weighted Group (FSWG) by applying mutual information shuffle and fusion. Moreover, to mitigate the negative effects of intermediate feature loss, we introduce Wide Residual Weighting units within each WDIB. These units effectively fuse features of varying levels of detail through a Wide-residual Distillation Connection (WRDC) and a Self-Calibrating Fusion (SCF). To compensate for global feature deficiencies, we incorporate a Transformer and explore a novel architecture to combine CNN and Transformer. We show that our FIWHN achieves a favorable balance between performance and efficiency through extensive experiments on low-level and high-level tasks.

Abstract:
In this paper, we propose an audio encryption scheme that supports differentiated decryption, called AES-AUDIO, in which an audio only needs to be encrypted once and can be decrypted into different resolutions as needed. First, we design four security levels, confidential, harsh, noisy, and clear, based on the audio resolution perceived by human auditory perception. Second, the audio data in decimal floating-point numbers (D-FPNs) are unfolded to 32 bits (B-FPNs). Third, we design a region of interest (RoI) encryption algorithm for the audio with the B-FPN format, where the result preserves some perceptual information as needed. Fourth, we construct the AES-AUDIO scheme based on the RoI encryption algorithm, which allows the audio to be encrypted once and then decrypted into different security levels. It supports changing parameters to alter the perception effect corresponding to the security level. Overall, it achieves a balance between the security and usability of the protected audio. User experiments verify that the audios produced by differential decryption can achieve the expected security levels. Some security tests also yielded excellent results, such as an NSCR value of 1.
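The "unfolding" of a floating-point sample into 32 bits can be illustrated with a minimal sketch. It assumes the samples are IEEE 754 single-precision floats, which the abstract does not state explicitly. The function names `unfold_to_bits`, `fold_from_bits`, and `truncate_mantissa` are hypothetical, and the mantissa truncation only illustrates how low-order bits carry fine detail; it is not the paper's RoI encryption algorithm.

```python
import struct


def unfold_to_bits(sample: float) -> str:
    """Return the 32-bit IEEE 754 single-precision pattern of a sample as a bit string."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", sample))
    return format(as_int, "032b")


def fold_from_bits(bits: str) -> float:
    """Inverse of unfold_to_bits: rebuild the float from its 32-character bit string."""
    (value,) = struct.unpack(">f", struct.pack(">I", int(bits, 2)))
    return value


def truncate_mantissa(sample: float, kept_bits: int) -> float:
    """Zero all but the top `kept_bits` of the 23 mantissa bits (illustration only)."""
    bits = unfold_to_bits(sample)
    # Layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
    kept = bits[: 9 + kept_bits] + "0" * (23 - kept_bits)
    return fold_from_bits(kept)


bits = unfold_to_bits(0.15625)
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(sign, exponent, mantissa)
print(truncate_mantissa(0.15625, 1))  # coarser value after dropping low-order bits
```

Splitting the bit string into sign, exponent, and mantissa fields shows which regions dominate the perceived signal: the sign and exponent set the coarse amplitude, while the low-order mantissa bits contribute only fine detail.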

Abstract:
Audio-visual active speaker detection (AV-ASD) aims to identify which visible face is speaking in a scene with one or more persons. Most existing AV-ASD methods prioritize capturing speech-lip correspondence. However, there is a noticeable gap in addressing the challenges from real-world AV-ASD scenarios. Due to the presence of low-quality noisy videos in such cases, AV-ASD systems without a selective listening ability fall short of effectively filtering out disruptive voice components from mixed audio inputs. In this paper, we propose a Multi-modal Speech Extraction-to-Detection framework named ‘MuSED’, which is pre-trained with audio-visual target speech extraction to learn the denoising ability and then fine-tuned on the AV-ASD task. Meanwhile, to better capture the multi-modal information and deal with real-world problems such as missing modality, MuSED is modelled directly in the time domain and integrates the multi-modal plus-and-minus augmentation strategy. Our experiments demonstrate that MuSED substantially outperforms the state-of-the-art AV-ASD methods and achieves 95.6% mAP on the AVA-ActiveSpeaker dataset, 98.3% AP on the ASW dataset, and 97.9% F1 on the Columbia AV-ASD dataset. We will publicly release the code in due course.

Abstract:
Recently, deep learning-based video salient object detection (VSOD) has achieved some breakthroughs, but these methods rely on expensive annotated videos with pixel-wise annotations or weak annotations. In this paper, based on the similarities and differences between VSOD and image salient object detection (SOD), we propose a novel VSOD method via a progressive framework that locates and segments salient objects in sequence without utilizing any video annotation. To efficiently use the knowledge learned from the SOD dataset for VSOD, we introduce dynamic saliency to compensate for the lack of motion information of SOD during the locating process while maintaining the same fine segmenting process. Specifically, we utilize the coarse locating model trained on the image dataset to identify frames with both static and dynamic saliency. Locating results of these frames are selected as spatiotemporal location labels. Moreover, by tracking salient objects in adjacent frames, the number of spatiotemporal location labels is increased. On the basis of these location labels, a two-stream locating network with an optical flow branch is proposed to capture salient objects in videos. The results on five public benchmarks demonstrate that our method outperforms the state-of-the-art weakly supervised and unsupervised methods.

Abstract:
The advent of deep learning has precipitated a surge in public machine learning as a service (MLaaS) for multimedia analysis. However, reliance on a single MLaaS can result in product dependency and a loss of better performance offered by multiple MLaaSes. Consequently, many enterprises opt for an intercloud broker capable of managing jobs across various clouds. Though existing works explore the efficient utilization of inter-cloud computational resources and the enhancement of inter-cloud data transfer throughput, they disregard improving the overall accuracy of multiple MLaaSes. In response, we conduct a measurement study on object detection services, which are designed to identify and locate various objects within an image. We discover that combining predictions from multiple MLaaSes can improve analytical performance. However, more MLaaSes do not necessarily equate to better performance. Therefore, we propose SkyML, a user-side MLaaS federation broker that selects a subset of MLaaSes based on the characteristics of the request to achieve optimal multimedia analytical performance. Initially, we design a combinatorial reinforcement learning approach to select the sound MLaaS combination, thereby maximizing user experience. We also present an ingenious, automated taxonomy unification algorithm to minimize human efforts in merging MLaaS-specific labels into a user-preferred label space. Moreover, we devise an optimized ensemble strategy to aggregate predictions from the selected MLaaSes. Evaluations indicate that our similarity-based taxonomy unification approach can reduce annotation costs by 90%. Moreover, real-world trace-driven evaluations further prove that our MLaaS selection method can achieve similar levels of accuracy with a 67% reduction in inference fees.

Abstract:
Three-dimensional (3D) point cloud, as an emerging visual media format, is increasingly favored by consumers as it can provide more realistic visual information than two-dimensional (2D) data. Similar to 2D plane images and videos, point clouds inevitably suffer from quality degradation and information loss through multimedia communication systems. Therefore, automatic point cloud quality assessment (PCQA) is of critical importance. In this work, we propose a novel no-reference PCQA method by using a graph convolutional network (GCN) to characterize the mutual dependencies of multi-view 2D projected image contents. The proposed GCN-based PCQA (GC-PCQA) method contains three modules, i.e., multi-view projection, graph construction, and GCN-based quality prediction. First, multi-view projection is performed on the test point cloud to obtain a set of horizontally and vertically projected images. Then, a perception-consistent graph is constructed based on the spatial relations among different projected images. Finally, reasoning on the constructed graph is performed by GCN to characterize the mutual dependencies and interactions between different projected images, and aggregate feature information of multi-view projected images for final quality prediction. Experimental results on two publicly available benchmark databases show that our proposed GC-PCQA achieves performance superior to that of state-of-the-art quality assessment metrics.

Abstract:
Single hyperspectral image super-resolution aims to enhance the spatial resolution of a hyperspectral image without relying on any auxiliary information. Despite the abundant spectral information, the inherent high-dimensionality in hyperspectral images still remains a challenge for memory efficiency. Recently, recursion-based methods have been proposed to reduce memory requirements. However, these methods utilize the reconstruction features as feedback embedding to explore context information, leading to sub-optimal performance as they ignore the complementarity of different hierarchical levels of information in the context. Additionally, existing methods equivalently compensate the previous feedback information to the current band, resulting in an indistinct and untargeted introduction of the context. In this paper, we propose a hierarchical context measurement network to construct corresponding measurement strategies for different hierarchical information, capturing comprehensive and powerful complementary knowledge from the context. Specifically, a feature-wise similarity measurement module is designed to calculate global cross-layer relationships between the middle features of the current band and those of the context, so as to explore the embedded middle features discriminatively through generated global dependencies. Furthermore, considering the pixel-wise correspondence between the reconstruction features and the super-resolved results, we propose a pixel-wise similarity measurement module for the complementary reconstruction features embedding, exploring detailed complementary information within the embedded reconstruction features by dynamically generating a spatially adaptive filter for each pixel. Experimental results reported on three benchmark hyperspectral datasets reveal that the proposed method outperforms other state-of-the-art peers in both visual and metric evaluations.

Abstract:
Food recognition applications in human health have recently garnered significant attention in the field of computer vision. With the advancement of mobile devices, robust food recognition in wireless communication has become a practical and challenging application scenario. We propose a novel Multi-scale Spiking Pyramid Transmission Network (MSPTN) to tackle this challenge. The MSPTN learns diverse and complementary local and global feature maps simultaneously, generating a comprehensive description of food images that captures the correlations of food-specific features. The feature sender uses a three-layer Spiking Neural Network (SNN). The proposed sender compresses features into sparse and discrete spike trains, significantly reducing the required transmission bandwidth and improving channel utilization and energy efficiency. Our model introduces the Compressed Factorized Bilinear block (CFB), which employs a low-rank feature approximation to reduce computational complexity and feature transmission volume while preserving the discriminative features. The enhancement reasoning module is proposed to enhance the received features by projecting them into a higher-dimensional space and utilizing the self-attention mechanism and sum pooling to compress them back to the original dimension. We conduct extensive experiments on the ETH Food-101 and Food2k datasets. Our results reveal that the MSPTN demonstrates state-of-the-art recognition performance, even with binary spike trains. Meanwhile, the MSPTN also exhibits remarkable robustness in wireless communication scenarios. With the combination of CFB, SNN, and EFB, our model achieves significant efficiency gains, including a nearly nine-fold decrease in feature transmission volume and a three-fold improvement in runtime and computational memory speed.

Abstract:
Acquiring nutrition information and health-related knowledge about food is a common need among individuals. However, using conventional food names as search queries often fails to yield accurate matches to entries within food nutrition knowledge bases (FoodnKB), which frequently utilize scientific or product names. In this study, we present a method for enriching FoodnKB entries with imagery and facilitating visual access to food-related knowledge through image recognition. We start with an official food nutrition database and propose a consensus-based approach using Large Language Models to identify visually discernible and directly edible foods, expanding food synonyms and harnessing diverse web-based food images for comprehensive visual representation. To minimize manual annotation of noisy web images, we introduce a cyclic training-based area under the margin metric (cAUM) approach that effectively distinguishes appropriate images, including rare instances, from noisy ones. Additionally, we design a generic accuracy gap (AccGap) algorithm to automatically estimate the noise ratio of the web-harnessed data. Our integrated cAUM and AccGap method demonstrates superior performance in noise detection and enhancement of image recognition accuracy compared to existing noise-robust frameworks. Furthermore, we successfully apply the visually enriched FoodnKB and food recognition capabilities within a smart nutritionist mobile application.
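For readers unfamiliar with the area under the margin (AUM) statistic that cAUM builds on, the following minimal sketch computes the standard AUM of one sample: the margin between the logit of the assigned label and the largest other logit, averaged over training epochs. The cyclic-training variant described in the abstract is not specified there and is not reproduced here; the function names and the toy logits are assumptions for illustration.

```python
def margin(logits, label):
    """Assigned-label logit minus the largest logit among the other classes."""
    assigned = logits[label]
    largest_other = max(v for i, v in enumerate(logits) if i != label)
    return assigned - largest_other


def area_under_margin(logits_per_epoch, label):
    """Average the per-epoch margins of a single sample (standard AUM)."""
    margins = [margin(logits, label) for logits in logits_per_epoch]
    return sum(margins) / len(margins)


# Toy logits recorded for one web image over two epochs (hypothetical values).
logits_per_epoch = [[2.0, 0.5, 0.25], [3.0, 0.5, 0.25]]

# If the web label agrees with what the model learns, the AUM is positive.
print(area_under_margin(logits_per_epoch, label=0))
# If the web label is wrong (label 1 here), the AUM turns negative.
print(area_under_margin(logits_per_epoch, label=1))
```

Samples with low or negative AUM are the usual candidates for label noise, which is why a threshold on this score can separate appropriate images from noisy ones.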

Abstract:
Despite significant strides in the field of 3D scene editing, current methods encounter substantial challenges, particularly in preserving 3D consistency during the multi-view editing process. To tackle this challenge, we propose a progressive 3D editing strategy that ensures multi-view consistency via a Trajectory-Anchored Scheme (TAS) with a dual-branch editing mechanism. Specifically, TAS facilitates a tightly coupled iterative process between 2D view editing and 3D updating, preventing the error accumulation arising from the text-to-image process. Additionally, we explore the connection between optimization-based methods and reconstruction-based methods, offering a unified perspective for selecting superior design choices and supporting the rationale behind the designed TAS. We further present a tuning-free View-Consistent Attention Control (VCAC) module that leverages cross-view semantic and geometric reference from the source branch to yield aligned views from the target branch during the editing of 2D views. To validate the effectiveness of our method, we analyze 2D examples to demonstrate the improved consistency with the VCAC module. Extensive quantitative and qualitative results in text-guided 3D scene editing clearly indicate that our method can achieve superior editing quality compared with state-of-the-art 3D scene editing methods.

Abstract:
By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp, a family of highly capable LMMs at the 2B to 4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.

Abstract:
Robotic grasping is a crucial topic in robotics and computer vision, with broad applications in industrial production and intelligent manufacturing. Although some methods have begun addressing instance-level grasping, most remain limited to predefined instances and categories, lacking flexibility for open-vocabulary grasp prediction based on user-specified instructions. To address this, we propose RoG-SAM, a language-driven, instance-level grasp detection framework built on Segment Anything Model (SAM). RoG-SAM utilizes open-vocabulary prompts for object localization and grasp pose prediction, adapting SAM through transfer learning with encoder adapters and multi-head decoders to extend its segmentation capabilities to grasp pose estimation. Experimental results show that RoG-SAM achieves competitive performance on single-object datasets (Cornell and Jacquard) and cluttered datasets (GraspNet-1Billion and OCID), with instance-level accuracies of 91.2% and 90.1%, respectively, while using only 28.3% of SAM's trainable parameters. The effectiveness of RoG-SAM was also validated in real-world environments.

Abstract:
Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.

Abstract:
Efforts in weakly-supervised video anomaly detection center on detecting abnormal events within videos by coarse-grained labels, which has been successfully applied to many real-world applications. However, a significant limitation of most existing methods is that they are only effective for specific objects in specific scenarios, which makes them prone to misclassification or omission when confronted with previously unseen anomalies. Relative to conventional anomaly detection tasks, Open-world Weakly-supervised Video Anomaly Detection (OWVAD) poses greater challenges due to the absence of labels and fine-grained annotations for unknown anomalies. To address the above problem, we propose a multi-scale evidential vision-language model to achieve open-world video anomaly detection. Specifically, we leverage generalized visual-language associations derived from CLIP to harness the full potential of large pre-trained models in addressing the OWVAD task. Subsequently, we integrate a multi-scale temporal modeling module with a multimodal evidence collector to achieve precise frame-level detection of both seen and unseen anomalies. Extensive experiments on two widely-utilized benchmarks have conclusively validated the effectiveness of our method. The code will be made publicly available.

Abstract:
3D point cloud object tracking (3D PCOT) plays a vital role in applications such as autonomous driving and robotics. Adversarial attacks offer a promising approach to enhance the robustness and security of tracking models. However, existing adversarial attack methods for 3D PCOT seldom leverage the geometric structure of point clouds and often overlook the transferability of attack strategies. To address these limitations, this paper proposes an adversarial geometric attack method tailored for 3D PCOT, which includes a point perturbation attack module (non-isometric transformation) and a rotation attack module (isometric transformation). First, we introduce a curvature-aware point perturbation attack module that enhances local transformations by applying normal perturbations to critical points identified through geometric features such as curvature and entropy. Second, we design a Thompson sampling-based rotation attack module that applies subtle global rotations to the point cloud, introducing tracking errors while maintaining imperceptibility. Additionally, we design a fused loss function to iteratively optimize the point cloud within the search region, generating adversarially perturbed samples. The proposed method is evaluated on multiple 3D PCOT models and validated through black-box tracking experiments on benchmarks. For P2B, white-box attacks on KITTI reduce the success rate from 53.3% to 29.6% and precision from 68.4% to 37.1%. On NuScenes, the success rate drops from 39.0% to 27.6%, and precision from 39.9% to 26.8%. Black-box attacks show transferability, with BAT showing a maximum drop of 47.0% in success rate and 47.2% in precision on KITTI, and a maximum drop of 22.5% and 27.0% on NuScenes.
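The distinction between the isometric rotation attack and the non-isometric point perturbation can be checked numerically. The sketch below is only an illustration under stated assumptions: the rotation axis (z), the 2-degree angle, the toy point cloud, and the helper names are all hypothetical, and neither the Thompson sampling of rotations nor the curvature-based selection of critical points is modelled.

```python
import math


def rotate_z(points, angle_rad):
    """Rotate every point about the z axis; a rigid (isometric) transformation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def perturb_along_normal(point, normal, step):
    """Shift a single point along its (unit) normal; this changes the cloud's shape."""
    return tuple(p + step * n for p, n in zip(point, normal))


cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]

# Global rotation: pairwise distances are preserved up to floating-point error.
rotated = rotate_z(cloud, math.radians(2.0))
print(math.dist(cloud[0], cloud[1]), math.dist(rotated[0], rotated[1]))

# Local perturbation of one point: its distance to the other points changes.
moved = perturb_along_normal(cloud[0], (1.0, 0.0, 0.0), 0.1)
print(math.dist(cloud[0], cloud[1]), math.dist(moved, cloud[1]))
```

Because a small rotation leaves the object's geometry intact, it is hard to notice, yet it still shifts every coordinate the tracker consumes; the normal perturbation instead deforms the surface locally.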

Abstract:
Large Multimodal Models (LMMs) excel in English multimedia tasks but face challenges in adapting to other languages due to linguistic diversity, limited non-English multimodal data, and high training costs. Existing approaches rely on machine-translated multimodal corpora or multilingual large language models, yet they demand substantial resources and achieve only modest zero-shot cross-lingual transfer performance, as shown in the IGLUE benchmark. In this work, we propose SMSA, a Syntax-aware Multimodal Semantic Adaptation approach, which efficiently extends vision-language models (VLMs) to multiple languages via a lightweight adaptation module. Instead of learning from scratch, SMSA transfers multimodal knowledge from English-trained models using two key components: (1) a Syntax-aware Adapter (SAA), which restructures multilingual text representations to align better with English syntax, reducing cross-lingual misalignment; (2) a Multimodal Semantic Distillation (MSD) method, which enables the model to mimic English sequence processing and retain multimodal associations across languages. This allows efficient adaptation to new languages while preserving the original model's strong multimodal capabilities. We extend an MoE-based VLM to 8 languages using a small translation dataset. Evaluations on the IGLUE benchmark show that SMSA achieves strong zero-shot transfer, outperforming some multilingual LMMs and demonstrating its effectiveness in cross-lingual vision-language adaptation.

Abstract:
Most recent popular Role-Playing Games (RPGs) allow players to create in-game characters with hundreds of adjustable parameters, including bone positions and various makeup options. Although text-driven auto-customization systems have been developed to simplify the complex process of adjusting these intricate character parameters, they are limited by their single-round generation and lack the capability for further editing and fine-tuning. In this paper, we propose an Interactive Character Editing framework (ICE) to achieve a multi-round dialogue-based refinement process. In a nutshell, our ICE offers a more user-friendly way to enable players to convey creative ideas iteratively while ensuring that created characters align with the expectations of players. Specifically, we propose an Instruction Parsing Module (IPM) that utilizes large language models (LLMs) to parse multi-round dialogues into clear editing instruction prompts in each round. To reliably and swiftly modify character control parameters at a fine-grained level, we propose a Semantic-guided Low-dimension Parameter Solver (SLPS) that edits character control parameters according to prompts in a zero-shot manner. Our SLPS first localizes the character control parameters related to the fine-grained modification, and then optimizes the corresponding parameters in a low-dimension space to avoid unrealistic results. Extensive experimental results demonstrate the effectiveness of our proposed ICE for in-game character creation and the superior editing performance of ICE.

Abstract:
Generating 3D shapes according to specific textual input is a crucial topic in multimedia applications, with the potential to enhance VR/AR/XR usage by enabling more diverse virtual scenes. Due to the recent success of diffusion models, text-guided 3D object generation has drawn a lot of attention recently. However, current latent diffusion-based methods are restricted to shape-only generation, requiring time-consuming and computationally expensive post-processing to obtain colored objects. In this paper, we propose an end-to-end Shape-Color Diffusion Prior framework (SCDiff) to achieve colored text-to-3D object generation. Given a general text description as input, our SCDiff is able to distinguish shape and color-related priors in the text and generate a shape latent and a color latent for a pre-trained 3D object auto-encoder to derive colored 3D objects. Our SCDiff contains two 3D latent diffusion models (LDMs), where one generates the shape latent from the input text and the other generates the color latent. To help the two LDMs focus on shape/color-related information, we further adopt a Large Language Model (LLM) to separate the input text into a shape phrase and a color phrase via an in-context learning technique so that our shape/color LDM is not influenced by irrelevant information. Due to the separation of shape and color latents, we are able to manipulate the color of an object by giving different color phrases while maintaining the original shape. Experiments on a benchmark dataset quantitatively and qualitatively verify the effectiveness and practicality of our proposed model. As an extension, we show the capability of our SCDiff on 3D object generation and manipulation based on various modality conditions, which further confirms the scalability of our proposed framework and its applications in multimedia.

Abstract:
Existing approaches for all-in-one weather-degraded image restoration suffer from inefficiencies in leveraging degradation-aware priors, resulting in sub-optimal performance in adapting to different weather conditions. To this end, we develop an adaptive degradation-aware self-prompting model (ADSM) for all-in-one weather-degraded image restoration. Specifically, our model employs the contrastive language-image pre-training model (CLIP) to facilitate the training of our proposed latent prompt generators (LPGs), which represent three types of latent prompts to characterize the degradation type, degradation property, and image caption. Moreover, we integrate the acquired degradation-aware prompts into the time embedding of the diffusion model to improve degradation perception. Meanwhile, we employ the latent caption prompt to guide the reverse sampling process using the cross-attention mechanism, thereby guiding accurate image reconstruction. Furthermore, to accelerate the reverse sampling procedure of the diffusion model and address the limitations of frequency perception, we introduce a wavelet-oriented noise estimating network (WNE-Net). Extensive experiments conducted on eight publicly available datasets demonstrate the effectiveness of our proposed approach in both task-specific and all-in-one applications.

Abstract:
Infrared image nonuniformity correction aims to remove column-wise stripe noise. Most existing methods consider only stripe noise and fail to handle real captured nonuniformity, as the directional characteristic of stripes is severely disrupted by random Gaussian noise. Moreover, deep learning-based methods proposed in recent years are limited by a restricted receptive field and thus cannot accurately distinguish vertical structures from vertical stripes. To address these issues, we propose a universal infrared image nonuniformity correction method based on a stripe-aware attention network. We seek to improve the performance of our algorithm by first restoring the damaged stripe directional characteristics, then maximizing the utilization of the prior characteristics. On the one hand, we construct a two-stage framework, in which a denoising network is first applied to eliminate Gaussian noise and preserve stripes as scene information. As a result, the prior directional characteristics are restored, thereby enhancing the ability of the subsequent sub-network to perceive stripe noise. On the other hand, due to the distinct long-range pixel correlations of vertical structures and vertical textures, we introduce a column-wise stripe attention mechanism (CSA) that can capture long-range dependencies of target pixels in the vertical direction. This significantly improves the discriminative ability of the algorithm towards vertical structures and stripes, with minimal computational cost. Extensive experiments show that the proposed method can achieve promising results and has better universality for different infrared scenarios.

Abstract:
Recent mainstream image captioning methods usually adopt two-stage captioners, i.e., calculating the object features of the given image by a pre-trained detector and then feeding them into a language model to generate the descriptive sentences. However, such a two-stage procedure will lead to a task-based information gap that decreases the performance of the captioners, because the object features learned from the detection task are suboptimal representations and cannot provide all the necessary information for subsequent sentence generation. Besides, the object features are usually represented by the last pooling features of the detector that lose the local details of images. In this paper, we propose a novel One-Stage Image Captioner using dynamic multi-sight embedding and alignment, called SeaCap, which directly transforms input images into descriptive sentences in one stage to eliminate the information gap. Specifically, to obtain rich features, we use the Swin Transformer to capture multi-level features, followed by a sights alignment module to alleviate the vision confusion, and then feed them into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module to non-locally model the interaction of the embedded features. As a result, SeaCap can obtain rich and useful information to improve the performance of the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verified the superior performance of our method.

Abstract:
Heterogeneous domain adaptation seeks to learn an effective classifier or regression model for unlabeled target samples by using well-labeled source samples that reside in a different feature space and follow a different distribution. Most recent works have concentrated on learning domain-invariant feature representations to minimize the distribution divergence via target pseudo-labels. However, two critical issues need to be further explored: 1) new feature representations should be not only domain-invariant but also category-correlative and discriminative and 2) alleviating the negative transfer caused by incorrectly pseudo-labeled target samples could boost the adaptation performance during the iterative learning process. To address these issues, in this paper, we put forward a novel heterogeneous domain adaptation method to learn category-correlative and discriminative representations, referred to as correlative and discriminative feature learning (CDFL). Specifically, CDFL aims to learn a feature space where class-specific feature correlations between the source and target domains are maximized, the divergences of marginal and conditional distributions between the source and target domains are minimized, and the distances between inter-class distributions are maximized to ensure the discriminative ability. Meanwhile, a selective pseudo-labeling procedure based on the correlation coefficient and classifier prediction is introduced to iteratively boost class-specific feature correlation and discriminative distribution alignment. Extensive experiments certify that CDFL outperforms the state-of-the-art algorithms on five standard benchmarks.

Abstract:
Underwater image quality assessment (UIQA) is a challenging task due to the complexities of underwater environments. Traditional UIQA methods primarily rely on fitting mean opinion scores (MOS), which are limited by human visual biases. To address the above limitation, we propose a no-reference underwater image quality assessment paradigm using reinforcement sequences. Our paradigm leverages reinforcement learning to iteratively merge the input image with the corresponding ground truth, generating an optimized sequence of images. A classifier generates probability arrays for the optimized sequence, which are converted into objective scores by a regression model. Unlike existing methods that focus solely on the final quality score, our paradigm emphasizes dynamic quality changes throughout the image-enhancement process. By employing objective mixing ratio labels, our reinforcement sequence dataset reduces subjective bias. The multiscale classifier captures local and global information differences between the input and ground truth images, effectively preserving the contrast and detail in diverse lighting conditions. Our paradigm combines multi-source data classification with support vector regression, optimizing the mapping of feature vectors to quality scores through fine-tuning libsvm kernel parameters. Experimental results on multiple benchmark datasets demonstrate that our paradigm outperforms the state-of-the-art UIQA methods, providing an effective solution for Underwater Image quality Assessment via Reinforcement Sequences (RSUIA).

Abstract:
To address the security risks and high transmission/storage consumption of cloud-based medical image storage systems (CMISS), reversible data hiding in encrypted images (RDHEI) provides an effective solution. Nevertheless, challenges persist concerning the security risks caused by key transmission and the large file size of encrypted medical images. Consequently, a cloud-based privacy-preserving medical image storage scheme with low consumption is proposed in this paper. First, RDHEI is applied to CMISS, where image encryption achieves privacy protection, reversible data hiding eliminates extra space consumption through index data self-hiding, and reversibility enables lossless recovery and extraction of medical images and index data. Then, hybrid encryption is designed to achieve high security. The security of encrypted images is guaranteed by combining a one-time cryptosystem with symmetric XOR encryption, which enables our scheme to resist various attacks. The time-varying key used in XOR is encrypted by asymmetric RSA, and only the public key is used in RSA, avoiding the risk of private key transmission. Finally, to reduce the file size of encrypted images and achieve low consumption, context Huffman coding is proposed, which adaptively selects the block coding method by context and thresholds and yields an encoded stream up to 98 056 bits shorter than that of Huffman coding. Experimental results show that the proposed scheme performs better in terms of security, consumption, and reversibility. The minimum compression ratio in databases is 32.46%, which is 2.63% lower than that of existing schemes. Moreover, the medical image and index data can be restored losslessly.
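
The hybrid encryption idea above (a one-time XOR key that is itself wrapped with RSA) can be sketched as follows. This is a toy under loudly stated assumptions, not the scheme from the paper: the RSA modulus is the classic textbook example and far too small to be secure, each key byte is wrapped separately without padding, and the names `encrypt`, `decrypt`, and `xor_bytes` are hypothetical.

```python
import secrets

# Textbook RSA parameters (n = 61 * 53). Illustration only: insecure in practice.
N, E, D = 3233, 17, 2753  # modulus, public exponent, private exponent

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Symmetric XOR of two equal-length byte strings."""
    return bytes(d ^ k for d, k in zip(data, key))

def encrypt(pixels: bytes):
    """Encrypt pixel bytes with a fresh one-time key, then wrap the key."""
    key = secrets.token_bytes(len(pixels))     # new random key for every image
    cipher = xor_bytes(pixels, key)
    wrapped_key = [pow(b, E, N) for b in key]  # needs only the public key (N, E)
    return cipher, wrapped_key

def decrypt(cipher: bytes, wrapped_key) -> bytes:
    """Unwrap the key with the private exponent and undo the XOR."""
    key = bytes(pow(c, D, N) for c in wrapped_key)  # private key never leaves the receiver
    return xor_bytes(cipher, key)
```

In this sketch the sender never handles the private exponent `D`, which mirrors the abstract's point that only the public key is used on the encrypting side.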

Abstract:
The purpose of the weakly-supervised temporal action localization (WTAL) task is to simultaneously classify and localize action instances in untrimmed videos with only video-level labels. Previous works fail to extract multi-scale temporal features to identify action instances with different durations, and they do not fully use the temporal cues of action videos to learn discriminative features. In addition, the classifiers trained by current methods usually focus on easy-to-distinguish snippets while ignoring other semantically ambiguous features, which leads to incomplete and over-complete localization. To address these issues, we introduce a new Snippet-inter Difference Attention Network (SDANet) for WTAL, which can be trained end-to-end. Specifically, our model presents three modules, with primary contributions lying in the snippet-inter difference attention (SDA) module and the potential feature mining (PFM) module. Firstly, we construct a simple multi-scale temporal feature fusion (MTFF) module to generate a multi-scale temporal feature representation, so as to help the model better detect short action instances. Secondly, we consider the temporal cues of video features and design the SDA module based on the Transformer to capture global discriminative features for each modality based on multi-scale features. It calculates the differences between temporally neighboring snippets in each modality to explore salient-difference features, and then utilizes them to guide correlation modeling. Thirdly, after learning discriminative features, we devise the PFM module to excavate potential action and background snippets from ambiguous features. By contrastive learning, potential actions are forced closer to discriminative actions and away from the background, thereby learning more accurate action boundaries. Finally, two losses (i.e., similarity loss and reconstruction loss) are further developed to constrain the consistency between two modalities and help the model retain original feature information for better localization results. Extensive experiments show that our model achieves better performance than current WTAL methods on three datasets, i.e., THUMOS14, ActivityNet1.2 and ActivityNet1.3.
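
The step of computing differences between temporally neighboring snippets can be sketched as below. This is only an assumed reading of that one step: the absolute difference, the zero row for the first snippet, and the names `neighbor_differences` and `saliency` are illustrative choices, not details taken from the paper.

```python
def neighbor_differences(snippets):
    """Absolute difference between each snippet feature and its predecessor.

    snippets is a list of equal-length feature vectors, one per snippet.
    The first snippet has no predecessor, so its difference is set to zero.
    """
    diffs = [[0.0] * len(snippets[0])]
    for prev, cur in zip(snippets, snippets[1:]):
        diffs.append([abs(c - p) for c, p in zip(cur, prev)])
    return diffs

def saliency(snippets):
    """One score per snippet: the mean magnitude of its change."""
    return [sum(d) / len(d) for d in neighbor_differences(snippets)]
```

For example, with features `[[0, 0], [0, 0], [1, 3], [1, 3]]` the only nonzero score falls on the third snippet, where the content changes; such scores could then be used to weight or guide attention.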

Abstract:
Video white balance corrects the scene color of video frames to the color under standard white illumination. Due to camera movement, video white balance usually suffers from temporal instability, with unnatural color changes between frames. This paper presents a video white balance stabilization method for spatially correct and temporally stable color correction. It exploits the color invariance at the position of the same object to obtain a consistent illumination color estimate across frames. Specifically, it detects gray pixels that inherit the potential illumination color, and their inter-frame motion, calculated with the assistance of an inertial measurement unit (IMU), is used to carry gray pixels for establishing their correspondence and color fusion between adjacent frames. Because the IMU provides more robust and accurate motion cues under large camera movement and in texture-less regions of the scene, our method can generate better gray pixel correspondences and illumination color estimates for white balance stabilization. Besides, our method is computationally efficient enough to be deployed on mobile phones. Experimental results show that our method can significantly improve the temporal stability while maintaining the spatial correctness of white balance for videos recorded by cameras equipped with IMU sensors.

Affiliations: Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China; School of Mechanical Engineering, Dalian University of Technology, Dalian, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, China; Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland, New Zealand

Abstract:
Multi-modality image fusion (MMIF) entails synthesizing images with detailed textures and prominent objects. Existing methods tend to use general feature extraction to handle different fusion tasks. However, these methods have difficulty breaking fusion barriers across various modalities owing to the lack of targeted learning routes. In this work, we propose a multi-scenario feature joint learning architecture, MLFuse, that employs the commonalities of multi-modality images to deconstruct the fusion process. Specifically, we construct a cross-modal knowledge reinforcing network that adopts a multipath calibration strategy to promote information communication between different images. In addition, two specialized networks are developed to maintain the salient and textural information of fusion results. The spatial-spectral domain optimizing network can learn the vital relationship of the source image context with the help of spatial attention and spectral attention. The edge-guided learning network utilizes convolution operations with various receptive fields to capture image texture information. The desired fusion results are obtained by aggregating the outputs from the three networks. Extensive experiments demonstrate the superiority of MLFuse for infrared-visible image fusion and medical image fusion. The excellent results of downstream tasks (i.e., object detection and semantic segmentation) further verify the high-quality fusion performance of our method.

Abstract:
Few-shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples. However, for visually intensive tasks such as few-shot semantic segmentation, pixel-level annotations are time-consuming and costly. Therefore, in this paper, we utilize the more challenging image-level annotations and propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation (WFSS). Specifically, we first propose a cross-granularity frequency-aware module (CFM) that decouples RGB images into high-frequency and low-frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods using the textual information from the multi-modal language-vision model, e.g., CLIP, in an offline learning manner, we further propose a CLIP-guided spatial-adapter module (CSM), which performs spatial domain adaptive transformation on textual information through online learning, thus providing enriched cross-modal semantic information for CFM. Extensive experiments on the Pascal-5i and COCO-20i datasets demonstrate that AFANet has achieved state-of-the-art performance.

Abstract:
Weakly supervised LiDAR semantic segmentation has made significant strides with limited labeled data. However, most existing methods focus on the network training under weak supervision, while efficient annotation strategies remain largely unexplored. To tackle this gap, we implement LiDAR semantic segmentation using scatter image annotation, effectively integrating an efficient annotation strategy with network training. Specifically, we propose employing scatter images to annotate LiDAR point clouds, combining a pre-trained optical flow estimation network with a foundational image segmentation model to rapidly propagate manual annotations into dense labels for both images and point clouds. Moreover, we propose ScatterNet, a network that includes three pivotal strategies to reduce the performance gap caused by such annotations. First, it utilizes dense semantic labels as supervision for the image branch, alleviating the modality imbalance between point clouds and images. Second, an intermediate fusion branch is proposed to obtain multimodal texture and structural features. Finally, a perception consistency loss is introduced to determine which information needs to be fused and which needs to be discarded during the fusion process. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our method requires less than 0.02% of the labeled points to achieve over 95% of the performance of fully-supervised methods. Notably, our labeled points are only 5% of those used in the most advanced weakly supervised methods.

Abstract:
Compositional zero-shot learning (CZSL) aims to recognize novel compositions formed by known primitives (attribute and object). The key challenge of CZSL is the visual diversity of the primitive caused by the dependencies of attributes and objects. To solve this problem, most existing methods attempt to mine primitive-invariant features shared in all compositions or learn primitive-variant features specialized for each composition. However, these methods overlook that the primitives have inherent similarities and differences in different compositions, i.e., one primitive may exhibit a common visual appearance under some compositions, but have different expressions in other partial compositions. To sufficiently explore the partial similarity and visual diversity of primitives, we propose a compact latent primitive space learning framework, which explicitly leverages various codewords to encode the primitive features to make a balance between generality and diversity. Specifically, we borrow the idea from discriminative sparse coding to learn these representative codewords to build the latent primitive space. Through the sparse reconstruction loss, contrastive loss and orthogonal constraint, our model can adaptively reconstruct the primitive features according to the similarity weights between the primitive features and codewords. Comprehensive experiments on four benchmarks demonstrate that the proposed method achieves better performance than previous methods.

Abstract:
Pedestrian detection plays a crucial role in autonomous driving systems. To ensure reliable and effective detection in challenging conditions, researchers have proposed RGB–T (RGB–thermal) detectors that integrate thermal images with color images for more complementary feature representations. However, existing methods face challenges in capturing the spatial and geometric correlations between different modalities, as well as in assuming perfect synchronization of the two modalities, which is unrealistic in real-world scenarios. In response to these challenges, we present a new deformable-attention-based approach for weakly aligned RGB–T pedestrian detection. The proposed method uses a dual-branch cross-attention mechanism to capture the inherent spatial and geometric correlations between color and thermal images. Furthermore, it incorporates positional information for each image pixel into the sampling offset generation to enhance robustness in scenarios where modalities are not precisely aligned or registered. To reduce computational complexity, we introduce a local attention mechanism that samples only a small set of keys and values within a limited region in the feature maps for each query. Extensive experiments and ablation studies conducted on multiple public datasets confirm the effectiveness of the proposed framework.

Abstract:
The occurrence of frequency aliasing between the camera and high-frequency scene elements causes moiré patterns in images, leading to color distortions and a loss of fine details, thereby reducing image quality. The intricate frequency characteristics and diverse appearances inherent in moiré patterns render their removal, commonly referred to as demoiréing, particularly challenging. Recent advancements in deep learning-based demoiréing methods have showcased notable efficacy. However, prevailing techniques often specialize in mitigating moiré patterns exclusively within either the frequency or spatial domains. Additionally, these methods generally perform well at specific image resolutions, but struggle to maintain effectiveness across different resolutions due to less generalized architectures. To address these issues, we propose a Dual-domain Multi-level Multi-scale Network DMMNet, working in both spatial and frequency domains sequentially. The Multi-scale Multi-level Demoire Stage (MMDS) in our framework focuses on moiré patterns removal in the spatial domain. To adeptly integrate features from various semantic levels, we introduce a pioneering plug-and-play Adjacent Cross Attention (ACA) module within the MMDS. Subsequently, the Frequency Separation and Reconstruction Stage (FSRS) restores high-frequency texture details, reconstructs color information, and eliminates residual moiré patterns in the wavelet frequency domain. Ultimately, the clean image is obtained by converting it back to the spatial domain. Extensive experimental assessments, spanning both quantitative metrics and qualitative visual evaluations, attest to the superior efficacy of DMMNet to State-Of-The-Art (SOTA) demoiréing methods, concurrently exhibiting enhanced generalization for demoiréing across diverse image resolutions. We posit that the proposed methodology presents a viable solution for broader applications in the realm of demoiréing. 
Code will be available on https://github.com/Mr-Ma-yikun/DMMNet.

Abstract:
Existing cross-domain image retrieval (CDIR) methods exhibit a strong dependency on prior knowledge of training categories, which leads to problems of class confusion and domain shift when encountering unseen categories in open-set environments. In this paper, we explore the CDIR task towards open-set environments and introduce the Hypergraph-Based Remaining Prototype Alignment (RePro) framework for this task. Specifically, to address the problem of unseen class confusion caused by the category differences, we utilize the Remaining Prototype Embedding (RPE) module to generate the remaining embeddings of images and treat these embeddings as domain noise, rather than directly mapping them to the explicit domain-unified prototypes. To overcome the problem of domain shift, our method leverages the high-order correlations among both domains and categories through the Heterogeneous Structure Alignment (HSA) module, by constructing a heterogeneous hypergraph based on intra-domain and inter-category correlations. Besides, we build two multi-domain datasets for open-set cross-domain image retrieval, i.e., OCD-PACS and OCD-VLCS. Each dataset is divided into seen and unseen categories for training and testing, and each class has four different domains of images. Extensive experiments and ablation studies on these two datasets demonstrate the superiority of our method over current state-of-the-art methods.

Abstract:
Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets.

Abstract:
Estimating depth maps from monocular underwater images poses one of the most challenging problems in underwater applications. Due to the lack of large-scale paired underwater color-depth datasets for effective training, existing style transfer-based and self-supervision-based approaches can improve the performance of depth estimation to some extent, but they remain unsatisfactory. Leveraging the power of massive training datasets, foundation models designed for terrestrial monocular depth estimation have demonstrated superior performance across various scenes. These models provide rich prior knowledge of 3D perception, which can be valuable for underwater depth estimation. Upon this, we introduce tunable adapters (UW-Adapter) that tailor a pre-trained foundation model specifically for underwater depth estimation, customizing it to the unique characteristics of underwater imagery. Our approach involves freezing the parameters of the pre-trained model and updating only the adapters through self-supervision. To address the complex degradation of underwater images, we propose two adapters: the transmission adapter and the high-frequency adapter. These adapters incorporate depth clues and high-frequency information as prior knowledge, thereby enhancing the performance of pre-trained model in underwater depth estimation. Experimental results demonstrate that by integrating lightweight adapters into off-the-shelf depth estimation foundation models, our method achieves superior performance across multiple datasets.

Abstract:
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training. Previous research has focused on aligning sequences' visual and semantic spatial distributions. However, these methods extract semantic features in a simplistic way and ignore that proper prompt design for rich and fine-grained action cues can provide robust representation space clustering. In order to alleviate the problem of insufficient information available for skeleton sequences, we design an information compensation learning framework from an information-theoretic perspective to improve zero-shot action recognition accuracy with a multi-granularity semantic interaction mechanism. Inspired by ensemble learning, we propose a multi-level alignment (MLA) approach to compensate information for action classes. MLA aligns multi-granularity embeddings with visual embedding through a multi-head scoring mechanism to distinguish semantically similar action names and visually similar actions. Furthermore, we introduce a new loss function sampling method to obtain a tight and robust representation. Finally, these multi-granularity semantic embeddings are synthesized to form a proper decision surface for classification. Significant action recognition performance is achieved on the challenging NTU RGB+D, NTU RGB+D 120, and PKU-MMD benchmarks, validating that multi-granularity semantic features facilitate the differentiation of action clusters with similar visual features.

Abstract:
With the advantages of deep learning (DL) techniques, various infrared (IR) small target detection networks have been proposed. However, many networks aim at single-frame detection through supervised learning, ignoring abundant spatial-temporal information and incurring heavy labeling costs. In this paper, we develop a 3-D spatial-temporal knowledge-aware unsupervised network for IR target detection (STUTD). Specifically, we transform IR sequences into 3-D spatial-temporal tensors as the data foundation. Based on the designed spatial-temporal Swin Transformer block (ST-STB), we introduce a multiscale feature extraction and aggregation (MFEA) module for effective feature extraction. A Variational Autoencoder (VAE)-style background reconstruction module with a multihead gating mechanism is designed for background reconstruction. Besides, a designed sparse cardinality selection of residuals performs element-wise filtering on the residuals between the original tensors and the reconstructed background to obtain a pure target tensor. Through this unsupervised learning approach, STUTD can achieve IR small target detection. Comprehensive experiments illustrate the superiority of STUTD over state-of-the-art methods. It can be concluded that STUTD has satisfactory overall performance and real-time performance.

Abstract:
Reconstructing 3D indoor scenes presents significant challenges, requiring models capable of inferring both planar surfaces and intricate details. Although recent methods can generate complete surfaces, they often struggle to simultaneously reconstruct low-texture regions and high-frequency details due to non-local effects. In this paper, we introduce a novel triangle-based triplane representation, named tri^2plane, specifically designed to account for the diverse spatial feature distribution and information density of indoor environments. Our method begins by projecting point clouds onto three orthogonal planes, followed by 2D Delaunay triangulation. This representation enables adaptive encoding of low-texture and high-frequency regions by employing triangles of variable sizes. Moreover, we develop a dual tri^2plane framework that incorporates both geometric and semantic information, significantly enhancing the reconstruction quality. We combine these key modules and evaluate our method on benchmark indoor scene datasets. The results unequivocally demonstrate the superiority of our proposed method over the state-of-the-art Occ-SDF. Specifically, our method achieves significant improvements over Occ-SDF, with margins of 1.3, 1.7, and 2.3 in F-score on the ScanNet, Tanks & Temples, and Replica datasets, respectively. To facilitate further research, we will make our code publicly available.

Abstract:
Visual Question Answering (VQA) presents a challenging task at the intersection of computer vision and natural language processing, aiming to bridge the semantic gap between visual perception and linguistic comprehension. Traditional VQA approaches do not distinguish between data processing and reasoning, limiting their interpretability and generalizability in complex and diverse scenarios. Conversely, Programmatic Visual Question Answering (PVQA) models leverage large language models (LLMs) to generate executable codes, providing answers with detailed and interpretable reasoning processes. However, existing PVQA models typically rely on simplistic input-output prompting, which struggles to elicit domain-specific knowledge from LLMs and often produces unclear or extraneous outputs. Furthermore, PVQA models typically rely on a basic in-context example (ICE) selection methodology that is heavily influenced by individual word similarity rather than the overall sentence context. This leads to suboptimal ICE selection and a reliance on dataset-specific ICE candidates. In this paper, we propose ContextualCoder, a novel prompting framework tailored for PVQA models. ContextualCoder leverages frozen LLMs for code generation and pre-trained visual models for code execution, eliminating the need for extensive training and enhancing model flexibility. By incorporating an innovative prompting methodology and a novel ICE selection strategy, ContextualCoder facilitates the use of diverse in-context information for code generation, thereby improving the performance of PVQA models. Our approach surpasses state-of-the-art models, as evidenced by comprehensive experiments across diverse VQA datasets, including multilingual scenarios.

Abstract:
Domain adaptive semantic segmentation aims to reduce domain shifts / discrepancies between source and target domains, improving the source domain model's generalization ability to the target domain. Recently, prototypical methods, which primarily use single-source or single-target domain prototypes as category centers to aggregate features from both domains, have achieved competitive performance in this task. However, due to large domain shifts, single-source domain prototypes have finite generalization ability and not all source domain knowledge is conducive to model generalization. Single-target domain prototypes are noisy because they are prematurely initialized with all features filtered by pseudo labels, which causes error accumulation in the prototypes. To address these issues, we propose a covariance-aware cross-domain prototypes method (CACP) to achieve robust domain adaptation. We propose to use both domain prototypes to dynamically rectify pseudo labels in the target domain, effectively reducing the recognition difficulty of hard target domain samples and narrowing the gap between features of the same category in both domains. In addition, to further generalize the model to the target domain, we propose two modules based on covariance correlation, FSPC (Features Selection by Prototypes Covariances) and WSPC (Weighting Source by Prototypes Coefficients), to learn discriminative characteristics. FSPC selects highly correlated features to update target domain prototypes online, denoising and enhancing discriminativeness between categories. WSPC utilizes the correlation coefficients between target domain prototypes and source domain features to weight each point in the source domain, eliminating the information interference from the source domain. In particular, CACP achieves excellent performance on the GTA5 → Cityscapes and SYNTHIA → Cityscapes tasks with minimal computational resources and time.
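
To make the idea of weighting source points by their correlation with a target prototype concrete, here is a minimal sketch using the Pearson correlation coefficient. It is an assumption-laden illustration rather than the WSPC module itself: the choice of Pearson correlation, the clipping of negative correlations to zero, and the names `pearson` and `source_weights` are not taken from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def source_weights(prototype, source_features):
    """Weight each source feature by its correlation with the prototype.

    Negative correlations are clipped to zero (an illustrative choice), so
    source points that disagree with the target prototype are ignored.
    """
    return [max(0.0, pearson(prototype, f)) for f in source_features]
```

Under these assumptions, a source feature that varies in step with the target prototype receives a weight near 1, while anti-correlated or constant features receive a weight of 0 and contribute nothing to a weighted loss.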

Affiliations: School of Computer Science and Technology, Guangdong University of Technology (GDUT), Guangzhou, China; School of Computer Science, Fudan University (FDU), Shanghai, China; School of Computer Science and Engineering, South China University of Technology (SCUT), Guangzhou, China; Department of Computer Science and Engineering (CSE), Shanghai Jiao Tong University (SJTU), Shanghai, China; Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan

Abstract:
Rotation invariance is a crucial requirement for the analysis of 3D point clouds. However, current methods often achieve rotation invariance by employing specific network designs. These networks, though they perform well on rotation-aware tasks, are inferior on general tasks such as classification and segmentation. On the other hand, many powerful point processing networks, such as PointNet++ and DGCNN, have general point processing abilities but lack rotation invariance. In this paper, we propose a standalone rotation-invariant convolution operator called SGGConv (Spherical Geometric Graph-based Convolution) and two ways of integrating it with common point-based networks. The networks equipped with SGGConvs are called SGG-Nets, which improve the rotation invariance of regular point networks without substantially modifying their architectures. Our contributions are three-fold. First, we propose a rotation-invariant feature descriptor, namely Spherical Geometry Descriptor (SGD), which captures point-pair features in a Local Spherical Coordinate System (LSCS). Second, we propose the SGGConv based on SGD and LSCS with an efficient Graph-based Spherical Feature Passing (GSFP) mechanism. Third, we define two modules, S-SGGConvMdl and M-SGGConvMdl, which are used to integrate SGGConv into baseline point networks. We test SGG-Nets, such as SGG-PointNet++, SGG-DGCNN, and SGG-RIConv++, on representative point cloud datasets. These models, equipped with our SGGConvs, not only enhance the rotation invariance of the baseline networks but also improve their performance on point cloud analysis tasks such as classification and part segmentation, without incurring too much computational overhead.

Abstract:
By converting low-frame-rate, low-resolution videos into high-frame-rate, high-resolution ones, space-time video super-resolution techniques can enhance visual experiences and facilitate more efficient information dissemination. We propose a convolutional neural network (CNN) for space-time video super-resolution, namely GIRNet. Our method combines long-term global information and short-term local information from the video to better extract complete and accurate spatial-temporal information. To generate highly accurate features and thus improve performance, the proposed network integrates a feature-level temporal interpolation module with deformable convolutions and a global spatial-temporal information-based residual convolutional long short-term memory (convLSTM) module. In the feature-level temporal interpolation module, we leverage deformable convolution, which adapts to deformations and scale variations of objects across different scene locations. This provides a more efficient solution than conventional convolution for extracting features from moving objects. Our network effectively uses forward and backward feature information to determine inter-frame offsets, leading to the direct generation of interpolated frame features. In the global spatial-temporal information-based residual convLSTM module, the first convLSTM is used to derive global spatial-temporal information from the input features, and the second convLSTM uses the previously computed global spatial-temporal information feature as its initial cell state. This second convLSTM adopts residual connections to preserve spatial information, thereby enhancing the output features. 
Experiments on the Vimeo90K dataset show that the proposed method outperforms open-source state-of-the-art techniques in peak signal-to-noise ratio (by 1.45 dB, 1.14 dB, and 0.2 dB over STARnet, TMNet, and 3DAttGAN, respectively), structural similarity index (by 0.027, 0.023, and 0.006 over STARnet, TMNet, and 3DAttGAN, respectively), and visual quality.

Abstract:
The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector and then fed to an action recognition model for extracting video features and learning the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects could be easily overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning (dubbed DAIR) in one stage. Particularly, after extracting video features using a base network, we design three consecutive modules for simultaneously learning object detection and interaction reasoning. Firstly, we build a Patch-based Object Decoder (PatchDec) to generate object proposals from video patch tokens. Then, we design an Interactive Object Refining and Aggregation (IRA) module to identify the interactive objects that are important for action recognition. The IRA module adjusts the interactiveness scores of proposals based on their relative position and appearance, and aggregates the object-level information into a global video representation. Finally, we build an Object Relation Modeling (ORM) module to encode the object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, thus avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks.
Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we can outperform state-of-the-art methods on both datasets. We hope our DAIR can provide a new perspective for object-centric action recognition.

Abstract:
In recent years, multimodal named entity recognition has gained increasing attention due to its wide applications in social media. The key to multimodal named entity recognition is to effectively fuse information from different modalities. Existing works mainly focus on reinforcing textual representations by fusing image features via the cross-modal attention mechanism. However, these works are limited to reinforcing the text modality at the token level. As a named entity usually contains several tokens, modeling token-level inter-modal interactions is suboptimal for the multimodal named entity recognition problem. In this work, we propose a multimodal named entity recognition approach dubbed Adaptive Multi-scale Language Reinforcement (AMLR) to implement entity-level language reinforcement. To this end, our model first expands token-level textual representations into multi-scale textual representations, which are composed of language units of different lengths. After that, the visual information reinforces the language modality by modeling the cross-modal attention between images and the expanded multi-scale textual representations. Unlike existing token-level language reinforcement methods, the word sequences of named entities can interact directly with the visual features as a whole, making the modeled cross-modal correlations more reasonable. Although the underlying entity is not given, the training procedure can encourage the relevant image contents to adaptively attend to the appropriate language units, so that our approach does not rely on a pipeline design. Comprehensive evaluation results on two public Twitter datasets clearly demonstrate the superiority of our proposed model.

Abstract:
Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, instead of fine-tuning each of the models, which can be prohibitively expensive. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and to provide semantic guidance. Specifically, we propose a data augmentation strategy, based on image sampling, that learns an image prior via prompt learning without any need for paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that maximally takes advantage of existing low-light annotation by introducing both content and context cues about the image training patches. We experimentally show, in a qualitative study, that the proposed prior and semantic guidance help to improve the overall image contrast and hue, as well as improve background-foreground discrimination, resulting in reduced over-saturation and noise over-amplification, which are common in related zero-reference methods. As we target machine cognition, rather than assuming a correlation between human perception and downstream task performance, we conduct and present an ablation study and a comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification, object detection, and face detection, showing the effectiveness of our proposed method.

Abstract:
With the escalating demand for three-dimensional visual applications such as gaming, virtual reality, and autonomous driving, novel view synthesis has become a critical area of research. Current methods mainly depend on multiple views of the same subject to achieve satisfactory results, but there is often a significant lack of available data. Typically, only a single degraded image is available for reconstruction, which may be affected by occlusion, low resolution, or absence of color information. To overcome this limitation, we propose a two-stage feature matching approach designed specifically for single degraded images, leading to the synthesis of high-quality novel perspective images. This method involves the sequential use of an encoder for feature extraction followed by the fine-tuning of a generator for feature matching. Additionally, we propose an information filtering module that is integrated into the GAN inversion process to eliminate misleading information present in degraded images, thereby correcting the inversion direction. Extensive experimental results show that our method outperforms existing state-of-the-art single-view novel view synthesis techniques in handling challenges like occluded, grayscale, and low-resolution images. Moreover, our method remains superior even when the aforementioned methods are integrated with image restoration algorithms.

Abstract:
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets.

Abstract:
Trigger-based backdoor watermarking is an extensively utilized and effective method to safeguard the copyright of deep neural networks (DNNs), in which the trigger set can be taken as the key of the watermark. However, during the verification stage, there is a risk that the trigger set could be leaked and exposed to adversaries. If this occurs, the adversaries might apply this leaked trigger set to claim ownership of the model, posing significant copyright issues for the watermarked DNN. To address this evidence exposure problem, a secure neural network watermarking protocol is put forward in this paper. In the proposed protocol, the trigger set is not fixed: once a trigger is utilized for verification, it becomes invalid and cannot be used for verification in the future. As a result, even if the trigger set is leaked during the verification process and obtained by an attacker, the attacker cannot use it for copyright verification since it is invalid. To support the protocol, a trigger set generation method is designed, in which the auxiliary classifier generative adversarial network (ACGAN) and the target classification model are trained together. The special logits distribution and the labels of the generated trigger samples can be ensured and verified effectively in this way. The performance of the trigger generation method regarding effectiveness, fidelity, and robustness is verified by experiments, and a security analysis of the designed watermarking protocol is conducted.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a cross-modality retrieval task that aims to match visible and infrared pedestrian images across non-overlapping cameras. However, we observe that three crucial challenges remain inadequately addressed by existing methods: (i) limited discriminative capacity for modality-shared representation, (ii) modality misalignment, and (iii) neglect of identity consistency knowledge. To solve the above issues, we propose a novel dual space alignment framework (DSAF) to constrain the modalities in two specific spaces. Specifically, for (i), we design a lightweight and plug-and-play modality invariant enhancement (MIE) module to capture fine-grained semantic information and make the representation identity-discriminative. This facilitates the establishment of correlations between visible and infrared modalities, enabling the model to learn robust modality-shared features. To tackle (ii), a dual space alignment (DSA) is introduced to conduct pixel-level alignment in both Euclidean space and Hilbert space. DSA establishes an elastic relationship between these two spaces, retaining invariant knowledge across the two spaces. To solve (iii), we propose an adaptive identity-consistent learning (AIL) to discover identity-consistent knowledge between visible and infrared modalities in a dynamic manner. Extensive experiments on mainstream VI-ReID benchmarks show the superiority and flexibility of our proposed method, which achieves competitive performance on mainstream datasets.

Abstract:
Multi-source domain adaptation (MSDA) has garnered significant attention due to its emphasis on transferring knowledge from multiple labeled source domains to a single unlabeled target domain. MSDA requires sufficient labeled data from multiple source domains, but in practice, massive unlabeled data exist instead of well-labeled data. Multiple target domains also provide plenty of information, which is useful for domain adaptation. However, most MSDA studies overlook the critical scenario of multi-source and multi-target domain adaptation (MMDA). To address these problems, we propose a Multiple Adaptation Network (MAN) approach for MMDA, which utilizes multiple alignment strategies for each source-target domain pair-group to align relevant specific feature spaces. MAN also aligns multiple classifiers for the relevant feature spaces to optimize the decision boundaries of multiple target domains. Moreover, to consider the task relations of multiple classifiers, we minimize the semantic differences between the target-conditioned classifiers and utilize a weight learning category to optimize this process. To fully utilize the information from multiple target domains, we transfer the style information of the target data to the source data, aiding in the training of multiple classifiers. Extensive experiments on challenging domain adaptation benchmarks, including the ImageCLEF-DA, Office-Home, DomainNet, and RGB-to-thermal datasets, demonstrate the superiority of our method over state-of-the-art approaches.

Abstract:
Image hazing refers to adding haze to a clear image, which is important for improving the amount and diversity of synthetic hazy images that are required to train deep image dehazing models. However, existing image hazing works generate hazy images from a given clear image with a single transmission map. This ignores the fact that hazy images of a natural scene are diverse at different times. The domain shift between synthetic and real-world hazy images constrains the robustness of deep dehazing models when dealing with real-world hazy images. In this work, we propose an unsupervised haze generation method to synthesize multiple hazy images with diverse haze distributions from a clear image, which requires only an atmospheric scattering model without extra labeling information. Instead of estimating a transmission map from a clear image, we propose to customize the transmission maps by redefining the transmission function. In such a controllable way, hazy images with diverse haze distributions are generated, which avoids the labor-intensive collection of paired data and alleviates the common domain-shift issue of deep image dehazing. Incorporating the unsupervised hazy image generator, we also construct a generalizable self-supervised image dehazing (SSID) framework, where deep image dehazing models can be trained without any human annotations. Extensive experiments on real-world hazy images show that the proposed approach is superior to state-of-the-art unsupervised dehazing works and achieves competitive performance with supervised works. Moreover, the proposed SSID framework can be easily generalized to existing deep dehazing models, greatly improving dehazing robustness on real-world hazy images.
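
For readers unfamiliar with the atmospheric scattering model mentioned above, a minimal sketch follows. It uses the standard form I = J * t + A * (1 - t) with the common exponential transmission t = exp(-beta * depth); the redefined transmission function from the paper is not given in the abstract, so varying `beta` here is only an assumed stand-in for producing different haze densities. The function name `synthesize_haze` and its parameters are hypothetical.

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """Apply the atmospheric scattering model to a clear image.

    clear:    H x W x 3 array with values in [0, 1] (scene radiance J)
    depth:    H x W array of non-negative scene depths
    beta:     scattering coefficient; larger values give denser haze
    airlight: global atmospheric light A (scalar, assumed constant)
    """
    # Transmission map t = exp(-beta * depth), broadcast over color channels
    t = np.exp(-beta * depth)[..., None]
    return clear * t + airlight * (1.0 - t)

# Several hazy versions of one clear image, obtained by varying beta
clear = np.full((4, 4, 3), 0.2)
depth = np.linspace(0.0, 3.0, 16).reshape(4, 4)
hazy_images = [synthesize_haze(clear, depth, beta=b) for b in (0.5, 1.0, 2.0)]
```

With zero depth the transmission is 1 and the clear image is returned unchanged; as depth or `beta` grows, pixels approach the airlight value.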

Abstract:
Adversarial attacks have recently challenged the security of deep neural networks (DNNs). The most prominent adversarial attack methods include backdoor attacks, adversarial examples, etc. These attack methods inject triggers or perturbations into images, leading to extremely dangerous security vulnerabilities in the deep learning domain. The various forms of adversarial attacks can contaminate DNNs with their distinct characteristics. The complexity of adversarial attacks poses a great challenge to designing a general defense strategy. In this paper, we propose a novel defense method against most adversarial attacks through Image Decomposition and Reconstruction (IDR). Our method can be applied to poisoned images without the need for internal information about the model or any prior knowledge of the clean/poisoned images. We apply a linear transformation to the poisoned image to destroy the perturbations or triggers and deploy a pre-trained diffusion model to reconstruct the original information. In particular, we propose a novel reverse process that utilizes the consistency of range-null space decomposition to guide the generation of purified images. The decomposition of the range-null space can guarantee the retrieval of image information, which enhances the robustness of our method and contributes to the reliable purification of poisoned images. We assess the effectiveness of our proposed IDR against various prevalent backdoor attacks, adversarial examples, and image-scaling attack methods. The experimental results highlight the outstanding defensive capabilities of our proposed IDR, demonstrating an exceptionally high defense success rate.
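
The consistency property of range-null space decomposition can be sketched with plain linear algebra. For a linear operator A with pseudo-inverse A^+, any vector x splits as x = A^+ A x + (I - A^+ A) x. Replacing the range-space part with A^+ y yields an estimate that reproduces the observation y exactly whenever y lies in the range of A. The sketch below shows only this generic identity; it is not the reverse process from the paper, and the function name `range_null_combine` and the random operator are assumptions for illustration.

```python
import numpy as np

def range_null_combine(A, y, x_gen):
    """Combine an observation y = A x with a generated sample x_gen.

    The range-space component is taken from the observation (A^+ y) and
    the null-space component from x_gen ((I - A^+ A) x_gen), so that
    A @ result equals y when y is in the range of A.
    """
    A_pinv = np.linalg.pinv(A)
    return A_pinv @ y + x_gen - A_pinv @ (A @ x_gen)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))      # hypothetical linear transformation
x_true = rng.normal(size=8)      # unknown clean signal
y = A @ x_true                   # observation after the transformation
x_gen = rng.normal(size=8)       # stand-in for a generative model's output

x_hat = range_null_combine(A, y, x_gen)
print(np.allclose(A @ x_hat, y))
```

The generated sample only fills in the part of the signal that A discards, which is why the observed information is retained regardless of what the generator produces.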

Abstract:
Referring Expression Segmentation (RES), which aims to identify and segment objects based on natural language expressions, is garnering increased research attention. While substantial progress has been made in RES, the emergence of Generalized Referring Expression Segmentation (GRES) introduces new challenges by allowing the expressions to describe multiple objects or lack specific object references. Existing RES methods usually rely on sophisticated encoder-decoder and feature fusion modules, and have difficulty generating class prototypes that match each instance individually when confronted with the complex referents and binary labels of GRES. In this paper, reevaluating the differences between RES and GRES, we propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region. It enables different query vectors to match instances of different categories, or different parts of the same instance, significantly expanding the decoder's flexibility, dispersing global pressure across all the queries, and easing the demands on the encoder. The experimental results demonstrate that MABP significantly outperforms the state-of-the-art methods on all three splits of the gRefCOCO dataset. Moreover, MABP outperforms the state-of-the-art methods on the RefCOCO+ and G-Ref datasets, and achieves very competitive results on RefCOCO.

Abstract:
Many XR applications require the delivery of volumetric video to users. Point cloud has become a popular volumetric video format. A dense point cloud consumes much higher bandwidth than a 2D/360° video frame. User Field of View (FoV) is more dynamic with 6-DoF movement than with 3-DoF movement. To save bandwidth, FoV-adaptive streaming predicts a user's FoV and only downloads point cloud data falling in the predicted FoV. However, it is vulnerable to FoV prediction errors, which can be significant when a long buffer is utilized for smoothed streaming. In this work, we propose a multi-round progressive refinement framework for point cloud video streaming. Instead of sequentially downloading point cloud frames, our solution simultaneously downloads/patches multiple frames falling into a sliding time window, leveraging the inherent scalability of octree-based point cloud coding. The optimal rate allocation among all tiles of active frames is solved numerically using the heterogeneous tile rate-quality functions calibrated by the predicted user FoV. Multi-frame downloading/patching simultaneously takes advantage of the streaming smoothness resulting from a long buffer and the FoV prediction accuracy at short buffer length. We evaluate our streaming solution using simulations driven by real point cloud videos, real bandwidth traces, and 6-DoF FoV traces of real users. Our solution is robust against bandwidth/FoV prediction errors, and can deliver high and smooth view quality in the face of bandwidth variations and dynamic user and point cloud movements.

Abstract:
Class-incremental 3D object detection demands a 3D detector to locate and recognize novel categories in a stream fashion while preserving its base detection ability. However, existing methods require delicate 3D annotations for learning novel categories, resulting in significant labeling costs. To this end, we explore a label-efficient approach called Weakly Incremental 3D Detection (WI3D), which teaches a 3D detector to learn incrementally with off-the-shelf vision foundation models. We propose a novel dual-teaching framework incorporating both intra-modal and inter-modal knowledge from pseudo labels and feature space. Specifically, our framework features a class-agnostic pseudo-label refinement module, designed for the generation of high-quality 3D pseudo labels. This module is built on a lightweight transformer that models the spatial relationships between pseudo labels and their interactions with rich contextual information in point clouds. Additionally, we introduce a cross-modal knowledge transfer module to enhance the representation learning of novel classes, along with a reweighting knowledge distillation strategy that dynamically assesses and distills knowledge from previously learned categories. Extensive experiments show that our approach can efficiently learn novel concepts while preserving knowledge of base classes in WI3D scenarios, and surpass baseline approaches on both SUN-RGBD and ScanNet.

Abstract:
Audio-driven talking head generation has drawn growing attention. To produce talking head videos with desired facial expressions, previous methods rely on extra reference videos to provide expression information, which may be difficult to find and hence limits their usage. In this work, we propose TalkCLIP, a framework that can generate talking heads where the expressions are specified by natural language, hence allowing for specifying expressions more conveniently. To model the mapping from text to expressions, we first construct a text-video paired talking head dataset where each video has diverse text descriptions that depict both coarse-grained emotions and fine-grained facial movements. Leveraging the proposed dataset, we introduce a CLIP-based style encoder that projects natural language-based descriptions to the representations of expressions. TalkCLIP can even infer expressions for descriptions unseen during training. TalkCLIP can also use text to modulate expression intensity and edit expressions. Extensive experiments demonstrate that TalkCLIP achieves the advanced capability of generating photo-realistic talking heads with vivid facial expressions guided by text descriptions.

Abstract:
How to learn multi-scale and multi-level representations is crucial for robust tracking. However, most current one-stream structure based trackers with visual transformers (dubbed ViTs) cannot effectively capture multi-scale representations because the structure of their adopted ViTs is non-hierarchical. Meanwhile, they often use only the output features from the final layer for predicting results (i.e., ignoring the low-level features from the shallow layers), which may limit their multi-level representation learning ability. To address these issues, we propose a robust multi-stage tracker that effectively combines the advantages of both hierarchical and one-stream structured ViTs as a tracking backbone to improve the multi-scale and multi-level representation learning abilities. Specifically, we first design a hierarchical tracker with a three-stage backbone. In the first two stages of our tracker, we utilize a dual-branch structure to obtain multi-scale features of the template and search region separately. In particular, we design local scale awareness modules based on simple MLP layers to capture multi-scale features. These modules remove complex operations such as convolutions or shifted window attention, thus avoiding the performance degradation caused by traditional hierarchical ViTs. In the third stage (i.e., the main stage), we construct a global encoder based on the one-stream ViT to achieve efficient feature extraction and feature interaction for our tracker. Then, we design a multi-level feature integration module in the main stage to explicitly utilize the representation information learned from the shallow layers and fuse it with the features of the final layer to obtain multi-level representation information. Lastly, benefiting from these designs, our tracker can effectively capture more multi-scale and multi-level representations for robust tracking.
Comprehensive experiments on the GOT-10k, LaSOT, LaSOT_ext, TNL2K, UAV123, TrackingNet, and VOT2020 benchmarks validate the effectiveness and robustness of our method.

Abstract:
In recent years, point cloud analysis methods based on the Transformer architecture have made significant progress, particularly in the context of multimedia applications such as 3D modeling, virtual reality, and autonomous systems. However, the Transformer architecture’s high computational demands limit its scalability and deployment on resource-constrained platforms, hindering its practical applications in on-device multimedia processing. To address this challenge, we propose an efficient point cloud analysis architecture, Point MLP-Transformer (PointMT). This study tackles the quadratic complexity of the self-attention mechanism by introducing a linear complexity local attention mechanism for effective feature aggregation. Additionally, to counter the Transformer’s focus on token differences while neglecting channel differences, we introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel, enhancing the precision of feature aggregation. To improve the Transformer’s slow convergence speed due to the limited scale of point cloud datasets, we propose an MLP-Transformer hybrid module, which significantly enhances the model’s convergence speed. Furthermore, to boost the feature representation capability of point tokens, we refine the classification head, enabling point tokens to directly participate in prediction. Experimental results on multiple evaluation benchmarks demonstrate that PointMT achieves performance comparable to state-of-the-art methods while maintaining an optimal balance between performance and accuracy. This research provides an innovative solution for efficient point cloud analysis, offering significant potential for multimedia applications and other domains.

Abstract:
In this paper, we propose a novel approach for segmenting and tokenizing a video scene recording into a sequence of cascade units, known as visual segment units and modeled with visual segment models (VSMs) for video scene classification (VSC). Specifically, the proposed VSM framework takes deep visual features extracted from pre-trained encoders as inputs and models the temporal interactions between segment units by hidden Markov models. Next, we use unit co-occurrence statistics to introduce relationships between VSM units within a video scene recording. Furthermore, the VSM approach is extended to an acoustic-visual variant, subsequently integrating itself into a deep learning-based multi-modal scene classification system. This combination serves to further exploit the complementary nature of audio and video data. By incorporating a set of visual segment units into modeling a video scene class, it captures both inter-class similarity and intra-class diversity, facilitating improved scene classification, especially within categories prone to confusion. Extensive experimental results on a benchmark published by the DCASE (Detection and Classification of Acoustic Scenes and Events) 2021 Challenge show that the proposed framework can effectively handle the confusion issue among similar video scenes. In addition, our multi-modal integration system achieves state-of-the-art performance in the audio-visual scene classification task in the DCASE 2021 Challenge, thereby demonstrating the effectiveness of our proposed approach.

Abstract:
In cloud environments, privacy-preserving content-based image retrieval (PPCBIR) enables users to retrieve images, while protecting image privacy. Existing PPCBIR systems often use a single image key, which causes low efficiency and makes it difficult to achieve fine-grained access control over images. This paper proposes a lightweight and controllable privacy-preserving image retrieval in multi-user settings (named LCPIRM) to improve time efficiency and access control performance. A one-time image encryption method based on reversible embedding is proposed to balance the contradiction between complexity and security without increasing the difficulty of key management. A robust hash generation method is designed by combining piece-wise mean quantization and encryption image features, which can effectively improve retrieval efficiency because the robust hashes embedded in the encrypted images can be extracted and establish inverted indexing in the cloud. When dealing with authorized encrypted images, the cloud server uses proxy re-encryption to convert the image keys embedded within themselves from the owner’s public key protection to the authorized user’s public key protection, achieving fine-grained access control over images in a multi-user setting. Theoretical analysis and experimental results show that LCPIRM has better performance in terms of retrieval accuracy, consumption, and search efficiency while meeting security requirements. In the real datasets Caltech256 and Caltech101, the search efficiency has increased by 74% and 58% respectively compared to the existing schemes.

Abstract:
Low-light Image Enhancement (LLIE) aims to rectify inadequate illumination conditions and achieve superior visual quality in images, which plays a pivotal role in the domain of low-level computer vision. Due to poor illumination in images, many high-frequency details are obscured, which leads to an uneven distribution of low- and high-frequency information. However, most existing LLIE methods do not pay special attention to the restoration of high-frequency detail information and some challenging-to-recover areas in images. To address this issue, we propose a novel progressive prompt-driven LLIE framework with frequency aware learning, through a two-stage coarse-to-fine learning mechanism. Specifically, the proposed method fully utilizes both the specially designed brightness-aware prompt and detail-aware prompt on the prior trained model, to achieve an excellent enhanced image that exhibits more natural brightness and richer detail information. Furthermore, the proposed frequency aware learning objective can adaptively adjust the contribution of individual pixels for image reconstruction based on the statistics of high- and low-frequency features, which enables the network to focus on learning intricate details and other challenging areas in low-light images. Extensive experimental results demonstrate the effectiveness of the proposed method, achieving superior performances to state-of-the-art methods on representative real-world and synthetic datasets.

Abstract:
Point cloud sampling aims to derive a sparse point cloud from a relatively dense point cloud, which is essential for efficient data transmission and storage. While existing deep sampling methods prioritize preserving the perception of sampled point clouds for downstream networks, few studies have critically examined the rationale behind this goal. Specifically, we observe that sampling can lead to a perceptual degradation phenomenon in many influential downstream networks, impairing their ability to effectively process sampled point clouds. We theoretically reveal the nature of the phenomenon and attempt to construct a novel sampling target by uniting upsampling and perceptual reconstruction. Accordingly, we propose a Maximum A Posteriori (MAP) sampling framework named Sample for Reconstruct (S4R), which impels the sampling stage to infer upsampling-guided perception. In S4R, we design very simple but effective sampling and upsampling networks using residual-based graph convolutions and incorporate a pseudo-residual connection to introduce prior knowledge. This architecture takes advantage of reconstruction properties and allows the sampling network to be trained in an unsupervised manner. Extensive experiments on classical networks demonstrate the excellent performance of S4R compared with previous sampling schemes and reveal its advantages on different point cloud downstream tasks, i.e., classification, reconstruction, and segmentation.

Abstract:
Few-shot segmentation (FSS) aims at training a model on base classes with sufficient annotations and then tasking the model with predicting a binary mask to identify novel class pixels with limited labeled images. Mainstream FSS methods adopt a support-query matching paradigm that activates target regions of the query image according to their similarity with a single support class prototype. However, this prototype vector is inclined to overfit the support images, leading to potential under-matching in latent query object regions and incorrect mismatches with base class features in the query image. To address these issues, this study reformulates conventional single foreground prototype matching to a multi-prototype matching paradigm. In this paradigm, query features exhibiting high confidence with non-target prototypes will be categorized as background. Specifically, the target query features are drawn closer to the novel class prototype through a Masked Cross-Image Encoding (MCE) module and a Semantic Multi-prototype Matching (SMM) module is incorporated to collaboratively filter unexpected base class regions on multi-scale features. Furthermore, we devise an adaptive class activation map, termed target-aware class activation map (TCAM) to preserve semantically coherent regions that might be inadvertently suppressed under pixel-wise matching guidance. Experimental results on PASCAL-5^i and COCO-20^i datasets demonstrate the advantage of the proposed novel modules, with the holistic approach outperforming compared state-of-the-art methods.

Abstract:
The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at different levels, including disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a strategy for sampling hard negatives, where images and texts with the same disease type but differing in details such as disease locations and severity are considered as hard negatives. This strategy helps our approach to better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of our learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words. This approach effectively aligns the different levels of granularity between the image and language modalities. To assess the efficacy of our MGVLA model, we conduct comprehensive experiments on the image-text retrieval and phrase grounding tasks.

Abstract:
Multimodal sentiment analysis aims at exploiting complementary information from multiple modalities or data sources to enhance the understanding and interpretation of sentiment. While existing multi-modal fusion techniques offer significant improvements in sentiment analysis, real-world scenarios often involve missing modalities, introducing complexity due to uncertainty about which modalities may be absent. To tackle the challenge of incomplete modality-specific feature extraction caused by missing modalities, this paper proposes a Cosine Margin-Aware Network (CMANet), which centers on the Cosine Margin-Aware Distillation (CMAD) module. The core module measures the distance between samples and the classification boundary, enabling CMANet to focus on samples near the boundary. As a result, it effectively captures the unique features of different modal combinations. To address the issue of modality imbalance during modality-specific feature extraction, this paper proposes a Weak Modality Regularization (WMR) strategy, which aligns the feature distributions between strong and weak modalities at the dataset level, while also enhancing the prediction loss of samples at the sample level. This dual mechanism improves the recognition robustness of weak modality combinations. Extensive experiments demonstrate that the proposed method outperforms the previous best model, MMIN, with a 3.82% improvement in unweighted accuracy. These results underscore the robustness of the approach under conditions of uncertain and missing modalities.

Abstract:
Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation, and multi-round conversation. To facilitate FoodLMM in dealing with tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruct-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks.

Abstract:
Video summarization and captioning condense content by selecting keyframes and generating language descriptions, integrating both visual and textual perspectives. Existing video-and-language learning models typically select multiple frames as proxies rather than analyzing all frames, which improves computational efficiency but may not adequately represent the original content without redundancy. In this paper, we propose an adaptive dual video summarization framework and demonstrate its effectiveness within the context of video captioning. Given the video frames, we extract visual representations using a video-domain fine-tuned ViT model to narrow the domain shift. The keyframes are summarized based on the frame-level scores. To minimize the number of keyframes while ensuring captioning quality, we introduce a cross-modal video summarizer that selects the most semantically consistent frames according to pseudo score labels. Furthermore, we incorporate an adaptive keyframe selector that determines the optimal number of keyframes based on the video’s complexity and content, enhancing the framework’s adaptability and generalization. The proposed adaptive keyframe selector enables the framework to handle diverse video content, making it more generalizable and applicable to real-world scenarios. We design a ranking scheme to assess the video’s static appearance and temporal dynamics from score-based and time-based perspectives. Finally, we use a lightweight LSTM decoder to generate descriptions. Experimental results on the MSR-VTT, MSVD, and VATEX benchmarks demonstrate that our adaptive dual video summarization framework can effectively convey the same semantic information as the original video while using a significantly reduced number of keyframes, leading to improved video captioning performance.

Abstract:
Local sampling plays a key role in modeling 3D point clouds. Due to the disordered and unstructured nature of point cloud data, conventional 3D deep models such as PointNet++ and its variants usually employ random or fixed rules to sample local neighborhoods, leading to considerable redundancy in the feature aggregation process. In this paper, we propose a self-supervised method for learning to adaptively select effective neighbors. First, we observe that only a subset of the sampled points contributes to the aggregated features after the max-pooling operation in existing point cloud models. Then, based on this observation, we propose a simple and task-oriented metric to evaluate the sampling efficiency by measuring the effective neighbors in the feature aggregation process. The metric is also used to supervise a lightweight neighborhood scoring module (NSM), which is designed to efficiently select effective neighboring points from a wider range of neighbors, reducing the computational cost while maintaining superior performance. To further improve the performance, we introduce Neighborhood Attention in the feature aggregation process according to the importance scores of neighborhood points predicted by NSM. Experimental results show that our method is simple and efficient, and can be applied to most tasks and models to reduce the computational cost while preserving their performance advantage.
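The observation about max-pooling can be illustrated with a small sketch. The abstract does not give the exact metric, so the ratio below (the fraction of neighbors that supply the maximum in at least one channel) and the name `effective_neighbor_ratio` are assumptions made only for illustration.

```python
def effective_neighbor_ratio(features):
    """Fraction of neighbors that win at least one channel under max-pooling.

    features: list of K neighbor feature vectors, each a list of C floats.
    Hypothetical metric for illustration; not the paper's definition.
    """
    num_neighbors = len(features)
    num_channels = len(features[0])
    winners = set()
    for c in range(num_channels):
        # Index of the neighbor whose value survives max-pooling in channel c
        # (ties go to the first such neighbor).
        best = max(range(num_neighbors), key=lambda k: features[k][c])
        winners.add(best)
    return len(winners) / num_neighbors


# Four sampled neighbors, three channels: the last neighbor never wins.
neighborhood = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.5, 0.1],
    [0.0, 0.0, 0.0],
]
ratio = effective_neighbor_ratio(neighborhood)
```

A ratio well below 1 means many sampled neighbors are redundant for the pooled feature, which is the kind of redundancy the abstract describes.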

Abstract:
Most collection-based style transfer methods require training a separate model for each individual collection of styles, making the extension to multiple collections of styles less flexible. Besides, the existing collection-based methods are also less flexible in extending to new style collections in a continual manner. To address these issues, we propose a novel MultI-Dictionary Generative Adversarial Network framework (MID-GAN) for multi-collection style transfer. Specifically, we design a multi-dictionary architecture within a GAN, with each dictionary consisting of a set of local style codes for a specific style collection. Benefiting from the local style codes used in the dictionary, a stylization module with aligned skip connections is further proposed, which can better preserve both the local details and the overall image structure. The dictionary design allows a flexible extension to new style collections by readily adding new dictionaries and we propose a continual training strategy that can both preserve the style transfer ability of old styles and achieve good transfer results for newly added styles. Extensive experiments are performed to show that the proposed method is better than existing collection-based style transfer methods. We also demonstrate the proposed method can generate diverse meaningful style transfer results of the same style collection.

Abstract:
Solely based on given prompts, text-guided diffusion models have enjoyed a unique capability in generating diverse and creative images. Nevertheless, the conveyance of image information through text presents a series of challenges, particularly in controlling the positioning of objects in synthesized images. Despite recent efforts to explore alternative conditions, such as bounding box/mask-image pairs, the requirement of a substantial amount of paired data and time-consuming fine-tuning emerge as new issues. Given the observations that not only do prompt-related cross-attention maps reveal the spatial arrangement and centroid positions of the objects, but out-of-prompt markers also carry rich semantic information, we engineer a weighted optimization loss. Specifically, three spatial sub-losses, namely inner box reinforcement loss, outer box attenuation loss, and centroid loss, are devised and seamlessly integrated into the sampling step of current vanilla diffusion models. Without any annotations of layout data required, the final approach runs in a training-free fashion. Extensive experiments with new performance scores demonstrate that our proposal not only successfully addresses the issue of object positioning but also boosts the capabilities of most current models, such as Stable Diffusion and GLIGEN, in high-quality synthesis and coverage of various concepts. Moreover, the proposed mechanism plays a plug-and-play role.
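A rough sketch of how three such spatial terms could be scored on a single cross-attention map is given below. The abstract does not state the formulas, so the definitions here (mean attention inside and outside the box, and a normalized centroid distance), the weights, and the name `spatial_losses` are all assumptions; the sketch also only evaluates the loss and does not show the gradient-guided sampling step.

```python
def spatial_losses(attn, box, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three illustrative spatial terms.

    attn: 2D list of attention values in [0, 1] with a positive total.
    box:  (row0, col0, row1, col1), end-exclusive, non-empty target region.
    All three definitions are assumed for illustration only.
    """
    r0, c0, r1, c1 = box
    h, w = len(attn), len(attn[0])
    inside, outside = [], []
    total = row_moment = col_moment = 0.0
    for r in range(h):
        for c in range(w):
            a = attn[r][c]
            (inside if r0 <= r < r1 and c0 <= c < c1 else outside).append(a)
            total += a
            row_moment += a * r
            col_moment += a * c
    # Inner box reinforcement: low when attention inside the box is high.
    inner = 1.0 - sum(inside) / len(inside)
    # Outer box attenuation: low when attention outside the box is low.
    outer = sum(outside) / len(outside) if outside else 0.0
    # Centroid term: squared distance between attention centroid and box center.
    cr, cc = row_moment / total, col_moment / total
    br, bc = (r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2
    centroid = ((cr - br) ** 2 + (cc - bc) ** 2) / (h * h + w * w)
    w_in, w_out, w_cen = weights
    return w_in * inner + w_out * outer + w_cen * centroid


box = (0, 0, 2, 2)
aligned = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
misplaced = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
loss_aligned = spatial_losses(aligned, box)
loss_misplaced = spatial_losses(misplaced, box)
```

In a training-free setting, a score of this kind would be differentiated with respect to the latent at each sampling step; that part depends on the diffusion framework and is left out here.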

Abstract:
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Language Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.

Abstract:
Due to the requirement of target domain data in existing unsupervised domain adaptation (UDA) techniques, researchers have shifted their focus to a more practical and challenging scenario, i.e., zero-shot domain adaptation (ZSDA). However, ZSDA remains a significant challenge, with existing approaches in ZSDA often relying heavily on a carefully crafted and highly compatible auxiliary domain. This is impractical in real-world applications. To address the mentioned problems, we propose conditional prompt-induced style reconstruction with contrastive language-image pre-training (CPSR-CLIP), which leverages the rich semantic embedding of CLIP to synthesize target-like features, effectively bypassing the need for auxiliary dual-domain samples. CPSR-CLIP adopts a multi-phase optimization strategy and every optimization phase is a prerequisite for the next phase. Firstly, we propose dynamic prompt disentanglement to facilitate the model in differentiating the discrepancy between the source and target prompts, thus paving the way for conditional prompt-induced style reconstruction phase. This phase meticulously strips away domain-specific styles to reserve domain-invariant features and injects target style characteristics through target domain prompts. Finally, with the target-like features in hand, we adaptively adjust the learnable part of target prompts for further fitting. Extensive experiments have been conducted on several datasets and the results demonstrate the superiority of CPSR-CLIP over the state-of-the-art methods.

Abstract:
Fusing complementary information in visible-infrared images offers a promising approach to enhance the performance of downstream computer vision tasks (e.g., object detection, segmentation) in complicated imaging conditions (e.g., low illumination). However, due to the robust imaging capacity of the infrared sensor, most existing methods primarily rely on the salient object intensity information in the infrared modality, while the visible information (e.g., color, texture) is not adequately utilized, thus limiting their generalization capacity. In this study, we present a novel image fusion framework, i.e., MCInet, which attempts to Maximize and merge the Complementary Information across visible-infrared modalities for more informative image fusion. To this end, we first introduce a modality-specific processing module to improve the information representation of each modality. For visible images, a pretrained low-light enhancement module is adopted to enhance their color and texture. For infrared images, a nonlinear mapping module is constructed to suppress excessive salient object intensity. Then we establish a reusable MCI block that embeds a cross-image mutual information minimization scheme into an input-aware fusion module. This empowers us to dynamically maximize and merge complementary information between two input images according to their feature representations. In addition, we introduce a cycle reconstruction loss to regularize the fusion results in a self-supervised manner. Experiments on image fusion, object detection, and segmentation demonstrate that the proposed framework can produce more informative fusion results and exhibit better performance in downstream tasks.

Abstract:
Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook the importance of prioritizing the video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments that align with viewer interests. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP’s visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the ongoing viewer conversation. It employs an efficient weighted sum of frames to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder uses a cross-attention mechanism to attend to each modality, ensuring that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
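The weighted frame aggregation described above can be sketched as follows. The abstract does not specify how relevance is scored, so the use of cosine similarity with a softmax, the `temperature` parameter, and the name `aggregate_frames` are assumptions; the toy vectors stand in for CLIP-style frame and chat embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate_frames(frame_embs, chat_emb, temperature=0.1):
    """Weight frames by similarity to the chat context and sum them.

    Hypothetical scoring rule for illustration; not the paper's formulation.
    """
    scores = [cosine(f, chat_emb) / temperature for f in frame_embs]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(frame_embs[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frame_embs)) for d in range(dim)]
    return weights, pooled


# The first frame matches the chat direction, so it dominates the pooled vector.
weights, pooled = aggregate_frames([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

A single pooled vector keeps the cost of the downstream decoder independent of the number of frames, which is one plausible reading of why the abstract calls the weighted sum efficient.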

Abstract:
As a crucial component of object detectors, current detection heads often lack the capability to effectively utilize contextual information, adapt to deformable objects, and align features and tasks. However, most existing methods prioritize a single capability, lacking comprehensive approaches to introduce them simultaneously. In this paper, we propose the Enhanced Head to integrate the above three capabilities into the detectors concurrently. Specifically, we propose three attention blocks with linear complexity: Global Concentrated Attention (GCA), Local Deformable Cross-Task Attention (LDCA), and Boundary-Aware Cross-Task Attention (BACA). The GCA captures long-range dependencies efficiently by employing Spatial Information Concentration (SIC). The LDCA improves feature alignment and deformation adaptability by enabling local deformable cross-task feature interactions. The BACA aligns classification features with localization results, enhancing task alignment and further improving deformation adaptability through a region-deformable interaction scheme. We implement Enhanced Head as a plug-and-play detection head and evaluate its effectiveness through extensive experiments on the MS COCO and VisDrone datasets. For instance, on the COCO detection benchmark, our Enhanced Head achieves +3.6 AP gain for FSAF, +3.3 AP for RetinaNet, and +2.9 AP for ATSS while reducing the FLOPs.

Abstract:
Input diversity is an effective technique for crafting transferable adversarial examples that can deceive unknown AI models. Existing input-diversity-based methods typically use a single input transformation, limiting targeted transferability and defense robustness. Combining different transformation types is challenging, as continually adding types degrades semantic information and targeted transferability. This paper proposes a quality-aware transformation combination attack (TCA) that selects high-quality transformation combinations. The quality-aware selection enables expansion of transformation types, enhances input diversity, and hence improves targeted transferability and defense robustness. We first design a quality-evaluation framework to quantify the effectiveness of transformation combinations, which jointly considers convergence, transferability, and robustness. Only a small group of images (up to 10) is required for computation-efficient quality evaluation. Experiments validate TCA’s superiority over state-of-the-art baselines in adversarial transferability and robustness. When defenses are secured, the average targeted success rate of TCA with four transformation types (i.e., TCA-t4) outperforms the best baseline by 26% to 42% on ImageNet.

Abstract:
Image splicing is a common technique used in image forgery. With the rapid development of digital image processing technology, detecting image splicing forgery has become increasingly challenging. Existing splicing forgery localization methods lack exploration in effectively utilizing tampered region boundary information. To address this issue, we propose a novel model for detecting image splicing forgery called boundary-assisted network (BASNet). We introduce a boundary-motivated module (BMM) to explore valuable and additional boundary features related to tampered regions, enhancing representation learning for detecting tampered regions. Additionally, we present a boundary-enhanced module (BEM) to enhance boundary information using the cross-channel attention mechanism. To efficiently merge features from various levels with boundary features, we further present the feature fusion module (FFM). To optimize performance, BASNet incorporates weighted binary cross-entropy loss, dice loss, and boundary loss, which can effectively leverage edge supervision while mitigating the imbalance between positive and negative samples. Evaluation on five widely used forgery detection datasets demonstrates the state-of-the-art performance of BASNet. Robustness experiments verify that BASNet is robust enough to detect image splicing forgery across various common attacks.
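Two of the loss terms named above have standard forms, sketched here on flattened masks. The positive-class weight, the smoothing constant, and the function names are assumptions, and the boundary loss is omitted because the abstract does not define it.

```python
import math


def weighted_bce(pred, target, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy with extra weight on positive (tampered) pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """One minus the smoothed Dice overlap between prediction and mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)


# One tampered pixel out of four, and a prediction that misses it.
mask = [1.0, 0.0, 0.0, 0.0]
missed = [0.1, 0.1, 0.1, 0.1]
plain = weighted_bce(missed, mask, pos_weight=1.0)
reweighted = weighted_bce(missed, mask, pos_weight=5.0)
overlap_perfect = dice_loss(mask, mask)
```

Raising `pos_weight` increases the penalty for missing the rare tampered pixels, and the Dice term depends on overlap rather than pixel counts, which is how such a combination can counter the positive/negative imbalance the abstract mentions.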

Abstract:
Underwater images are typically characterized by color cast, haze, blurring, and uneven illumination due to selective absorption and scattering as light propagates through water, which limits their practical applications. Underwater image enhancement and restoration (UIER) is a crucial means of improving the visual quality of underwater images. However, most existing UIER methods concentrate on enhancing contrast and dehazing and rarely pay attention to the local illumination differences in the image caused by illumination variations, thus introducing undesirable artifacts and unnatural color. To address this issue, an effective variational framework is proposed based on an extended underwater image formation model (UIFM). Technically, dual high-order regularizations are integrated into the variational model to acquire smoothed local ambient illuminance and structure-revealed reflectance in a unified manner. In our proposed framework, weight-factor-based color compensation is combined with color balance to compensate for the attenuated color channels and remove the color cast. In particular, the local ambient illuminance with strong robustness is acquired by performing local patch brightest pixel estimation and an improved gamma correction. Additionally, we design an iterative optimization algorithm relying on the alternating direction method of multipliers (ADMM) to accelerate the solution of the proposed variational model. Extensive experiments conducted on three real-world underwater image datasets demonstrate that the proposed method outperforms several state-of-the-art methods with regard to visual quality and quantitative assessments. In the quantitative assessments, the proposed method achieves average scores of 0.205 FADE, 7.688 Entropy, 0.628 UCIQE, and 0.775 FDUM across the UIEB and UIQS datasets. Moreover, the proposed method can also be extended to outdoor image dehazing and low-light image enhancement tasks.

Affiliations: Department of Ultrasound, the Second Affiliated Hospital of Xi’an Jiaotong University, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China; School of Software Engineering, Xi’an Jiaotong University, Xi’an, China; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, China; Internet of Things Thrust and the Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China

Abstract:
Self-supervised Learning (SSL), including mainstream contrastive learning, has achieved significant success in learning visual representations without the need for data annotations in 3D vision. While most contrastive learning methods focus on instance-level information through random affine transformations, they pay limited attention to the intrinsic structures within point clouds. In this work, we propose a novel SSL paradigm for point cloud representation learning, called CCPoint, which incorporates a novel form of data corruption as a negative augmentation strategy. Specifically, we degrade the input point cloud with various corruptions and conduct contrastive learning among the augmented, raw, and corrupted points to learn robust and discriminative representations. To preserve the semantic structure of the point cloud even under heavy degradation, an auxiliary reconstruction decoder is introduced into the corruption branch to provide an additional supervision signal. We explore four families of corruptions—affine, noise, masking, and combined transformations. Different from previous methods that rely on multi-modal data or complex network architectures, CCPoint achieves state-of-the-art performance on three widely used datasets (ModelNet40, ScanObjectNN, and ShapeNetPart) with a lightweight and efficient structure, reaching top linear accuracies of 92.4% and 86.2% on ModelNet40 and ScanObjectNN, respectively.

Abstract:
Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well considered and studied. However, given the significant quality discrepancies among various text-to-3D assets, there is a pressing need for quality assessment models aligned with human subjective judgments. To tackle this challenge, we conduct a comprehensive study to explore the T23DA quality assessment (T23DAQA) problem in this work from both subjective and objective perspectives. Given the absence of corresponding databases, we first establish the largest text-to-3D asset quality assessment database to date, termed the AIGC-T23DAQA database. This database encompasses 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, and corresponding subjective quality ratings for these assets from the perspectives of quality, authenticity, and text-asset correspondence, respectively. Subsequently, we establish a comprehensive benchmark based on the AIGC-T23DAQA database, and devise an effective T23DAQA model to evaluate the generated 3D assets from the aforementioned three perspectives, respectively. Specifically, the proposed method utilizes the projection videos of text-to-3D assets to extract 3D shape, texture, and text-asset correspondence features, then fuses them to calculate the final three preference scores, respectively. Extensive experimental results demonstrate the effectiveness of the proposed T23DAQA method in evaluating the quality of AI-generated 3D assets, which is more consistent with human perception. This is the first work that studies the problem of text-guided 3D generation quality assessment. The database is released at https://github.com/ZedFu/T23DAQA.

Abstract:
Vision Graph Neural Network (ViG) is the first graph neural network model capable of directly processing image data. The community primarily focuses on the model structures to improve ViG’s performance but pays little attention to its graph construction method. To avoid quadratic computational complexity, ViG uses clustering algorithms (K-nearest neighbor) to construct graph structures. Nevertheless, clustering algorithms introduce biases, which limit ViG’s ability to obtain global information. To address this problem, we propose RandomViG, which abandons clustering algorithms and obtains relationships between nodes in a random manner. Our RandomViG is sparse in computation and can approximate a complete graph, enabling ViG to gain global interaction capability. In order to obtain the local dependence, we design a local feature extraction module for RandomViG. In addition, to alleviate the over-smoothing problem, we propose a novel method called MRN (maintaining relationships among nodes). Considering that increased feature diversity does not necessarily lead to better performance, MRN does not aim to maximize the feature diversity of the model but instead strives to maintain consistency between the feature similarity and the inherent similarity of the original image. We validate our proposal on three major computer vision tasks, including image classification, object detection, and instance segmentation. Without extra data, RandomViG-Ti achieves 79.4% ImageNet-1K top-1 accuracy, outperforming the baseline (ViG) by 1.2%. Under the same model scale, our RandomViG performs better with fewer FLOPs compared with existing state-of-the-art models.
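The contrast between a complete graph and a randomly sampled sparse graph can be sketched as below. The abstract does not describe the sampling procedure, so the fixed per-node neighbor count `k`, the seeding, and the name `random_neighbors` are assumptions for illustration only.

```python
import random


def random_neighbors(num_nodes, k, seed=0):
    """Give every node k distinct, randomly chosen neighbors (no self-loops).

    Hypothetical construction for illustration; not the paper's procedure.
    """
    rng = random.Random(seed)
    graph = {}
    for i in range(num_nodes):
        candidates = [j for j in range(num_nodes) if j != i]
        graph[i] = rng.sample(candidates, k)
    return graph


num_nodes, k = 8, 3
graph = random_neighbors(num_nodes, k)
sparse_edges = sum(len(nbrs) for nbrs in graph.values())  # N * k directed edges
complete_edges = num_nodes * (num_nodes - 1)              # N * (N - 1) directed edges
```

Because the neighbors are not chosen by feature distance, any pair of nodes can be connected, so resampling across layers or iterations lets information reach distant patches while each layer still touches only N * k edges.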

Abstract:
The perception-distortion tradeoff reveals the limitation of current low-level deep learning paradigms, i.e., minimizing reconstruction distortion does not guarantee improved perceptual quality. Acknowledging the lack of a reliable perception-oriented optimization function, we are motivated to explore a flexible approach for enhancing perceptual quality by steering the tradeoff to prioritize perception. To this end, we reconsider the perception-distortion function by incorporating the Just-Noticeable-Distortion (JND) mechanism. We mathematically demonstrate that in the common image restoration process, altering the optimization target from natural images to distorted images (where the distortion intensity is constrained by the JND threshold and the distortion type aligns with that arising from the restorer itself) yields improved perception indices without any changes to the restorer or optimization function. Accordingly, to facilitate various low-level learning models, we are motivated to construct the first large-scale CNN-oriented JND image dataset. Our dataset comprises 500 natural images and 4,500 degraded versions generated by a series of autoencoders, as well as the actual JND judgment results collected through rigorous subjective testing with twenty volunteers. Finally, a learning-based JND inference model is established on the proposed dataset and employed in the proposed JND-based adaptation scheme, where the inferred JND images serve as pseudo-ground truth for the training or fine-tuning processes of low-level vision models. Extensive experiments on image super-resolution and end-to-end image compression across multiple models have shown encouraging improvements in perceptual quality, demonstrating the effectiveness of the proposed scheme.

Abstract:
Evidence-based deep learning represents a burgeoning paradigm for uncertainty estimation, offering reliable predictions with negligible extra computational overhead. Existing methods usually adopt Kullback-Leibler divergence to estimate the uncertainty of network predictions, ignoring domain gaps among various modalities. To tackle this issue, this paper introduces a novel algorithm based on Hölder Divergence (HD) to enhance the reliability of multi-view learning by addressing inherent uncertainty challenges from incomplete or noisy data. Generally, our method extracts the representations of multiple modalities through parallel network branches, and then employs HD to estimate the prediction uncertainties. Through the Dempster-Shafer theory, the uncertainties from different modalities are integrated, thereby generating a comprehensive result that considers all available representations. Mathematically, HD is shown to better measure the “distance” between the real data distribution and the predictive distribution of the model and to improve the performance of multi-class recognition tasks. Specifically, our method surpasses the existing state-of-the-art counterparts on all evaluated benchmarks. We further conduct extensive experiments on different backbones to verify its superior robustness. It is demonstrated that our method successfully pushes the corresponding performance boundaries. Finally, we perform experiments on more challenging scenarios, i.e., learning with incomplete or noisy data, revealing that our method exhibits a high tolerance to such corrupted data.
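The Dempster-Shafer fusion step can be sketched as follows. The abstract does not state which combination rule the paper uses, so this shows the reduced Dempster rule commonly used in evidential multi-view classification, where each view provides per-class belief masses plus one uncertainty mass that together sum to 1. The function name `dempster_combine` is hypothetical, and the Hölder divergence itself is not modelled here.

```python
def dempster_combine(b1, u1, b2, u2):
    """Fuse two opinions with the reduced Dempster combination rule.

    b1, b2: per-class belief masses of the two views.
    u1, u2: uncertainty masses, with sum(b) + u == 1 for each view.
    Assumption: the conflict is strictly below 1, which holds whenever
    both uncertainty masses are positive.
    """
    # Conflict: belief that the two views assign to different classes.
    conflict = sum(
        b1[i] * b2[j]
        for i in range(len(b1))
        for j in range(len(b2))
        if i != j
    )
    scale = 1.0 / (1.0 - conflict)
    fused_b = [scale * (x * y + x * u2 + y * u1) for x, y in zip(b1, b2)]
    fused_u = scale * u1 * u2
    return fused_b, fused_u


# Two views that agree on class 0, each with uncertainty 0.3.
beliefs, uncertainty = dempster_combine([0.6, 0.1], 0.3, [0.5, 0.2], 0.3)
```

When the views agree, the fused uncertainty falls below that of either view, and the fused masses still sum to 1.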

Abstract:
The state-of-the-art YOLO detection algorithms still suffer from the issue of redundant extraction of similar features during feature propagation, and the simplistic stacking approach of connecting different features limits the flexibility of feature fusion. We propose a new feature recombination mechanism involving refining feature extraction and flexible concatenation. It includes the HFConv (Hybrid Flexibility Convolution) module, the MFD (Multivariate Flexibility Downsampling) module, and the DFSPP (Deformable and Flexible Spatial Pyramid Pooling) module. Specifically, the HFConv module employs feature refinement and flexible connection strategies to optimize feature representation and reduce redundancy in a dynamic way, acquiring diverse feature information from local and surrounding regions. The MFD module leverages multiple downsampling methods to address the issue of feature redundancy that may arise from a single downsampling method, thereby enhancing feature diversity. The DFSPP module learns an offset corresponding to the pooling kernel size, allowing for the extraction of the most critical information in a dynamic manner. By incorporating these modules into the YOLO architecture, we develop a more robust network called FRFCNet, and the experimental results show a notable 4.1% and 2.8% improvement in AP values on the VOC2012 and COCO2017 datasets, respectively, compared to the baseline (YOLOV7-Tiny-SiLu), outperforming current one-stage detectors.

Affiliations: State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Security Department of Alibaba Group, Hangzhou, China; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China

Abstract:
In recent years, there has been a growing interest in multimodal recommendation systems due to the rapid growth of multimedia and the explosion of information. Despite notable advancements, current models often fuse multimodal embeddings with ID (name or concept) embeddings in a weighted or concatenated manner for items. Under this circumstance, they may overlook the heterogeneity problem between different modalities, and lack theoretical guarantees, potentially leading to suboptimal item representations. To overcome this challenge, we introduce a novel model named OTRec, which employs optimal transport (OT) to align heterogeneous multimodal embeddings with ID embeddings. Specifically, OTRec captures co-occurrence features across modalities and distinctive features within modalities, enabling the formation of the unified representation from both modal-invariant and modal-specific perspectives. This dual strategy ensures a comprehensive alignment of heterogeneous multimodal data, significantly improving the accuracy of capturing user preferences. Additionally, traditional recommendation models typically match an item’s ID with its multimodal data as positive samples for contrastive learning, neglecting the potential complementary information from other items’ multimodal data. To address this issue, we introduce a semantic-enhanced contrastive learning module, which can learn latent semantic correlations across items by a semantic-similarity weighting matrix. It can be integrated as a plug-in for other models to effectively explore latent semantics. On top of this, we provide theoretical guarantees that demonstrate the effectiveness of OTRec in aligning multimodal and ID information and in enhancing the mutual information between them. Extensive evaluations on three public datasets illustrate OTRec’s effectiveness and achieve state-of-the-art performance.

Abstract:
Guided depth super-resolution, which enhances low-resolution (LR) depth maps using high-resolution (HR) RGB images of the same scene, is essential in many applications. However, the challenge lies in avoiding the texture-copy artifacts caused by structural inconsistencies between the two modalities. To mitigate this issue, we propose a cross-modality and cross-scale guided depth super-resolution network (D2CNet). We first design a novel two-stage feature integration module to effectively fuse multi-modal RGB and depth features while minimizing texture-copy artifacts. That is, a cross-modality fusion stage transfers consistent structures from RGB to depth in a multi-scale manner, and a cross-scale refinement stage mitigates inconsistent structures across modalities. In addition, we design a convolution group as the basic module to effectively extract high-frequency features, and an LR and HR domain projection strategy to enrich features between the fusion and refinement stages. We then develop a new network architecture by progressively repeating the feature integration module and the convolution group, which is flexibly controllable to strike a balance between accuracy and cost for easy implementation in the real world. Extensive experiments on multiple benchmarks demonstrate that our D2CNet consistently achieves superior accuracy and generalization ability across sampling scales in both qualitative and quantitative evaluations when compared to state-of-the-art baselines.

Abstract:
Open-vocabulary object detection (OVD) aims to detect objects beyond the training annotations, where detectors are usually aligned to a pre-trained vision-language model, e.g., CLIP, to inherit its generalizable recognition ability so that detectors can recognize new or novel objects. However, previous works directly align the feature space with CLIP and fail to learn the semantic knowledge effectively. In this work, we propose a hierarchical semantic distillation framework named HD-OVD to construct a comprehensive distillation process, which exploits generalizable knowledge from the CLIP model in three aspects. In the first hierarchy of HD-OVD, the detector learns fine-grained instance-wise semantics from the CLIP image encoder by modeling relations among single objects in the visual space. Besides, we introduce text space novel-class-aware classification to help the detector assimilate the highly generalizable class-wise semantics from the CLIP text encoder, representing the second hierarchy. Lastly, abundant image-wise semantics containing multi-object and their contexts are also distilled by an image-wise contrastive distillation. Benefiting from the elaborated semantic distillation in triple hierarchies, our HD-OVD inherits generalizable recognition ability from CLIP in instance, class, and image levels. Thus, we boost the novel AP on the OV-COCO dataset to 46.4% with a ResNet50 backbone, which outperforms others by a clear margin. We also conduct extensive ablation studies to analyze how each component works.

Abstract:
As a crucial representation of 3D data, a point cloud (PC) can accurately capture the geometry, structure, and color information of objects. However, various quality problems arise owing to device noise, data acquisition errors, and compression algorithms, limiting the application of PCs. Therefore, assessing PC quality to determine its suitability for applications is a challenging task. In this work, a local and global structure-guided feature extraction and attention network (LGS-Net) is introduced for no-reference PC quality assessment (PCQA). This approach incorporates cluster construction (CC), local structure-guided cluster feature extraction (LSFE), and global structure-guided attention (GSA) modules. First, owing to the heightened sensitivity of the human visual system (HVS) to structural information, a graph filter is employed to identify high-frequency clusters. Within the LSFE module, a multiscale strategy is employed to ensure that structural information effectively influences both the geometry and color information. Simultaneously, the multiscale features within the cluster are dynamically fine-tuned using feature channel weight reassignment. To account for the impact of interclusters on overall quality, a GSA module is introduced to establish global dependencies between local clusters. This approach enables the extraction of final geometry, color, and structure information, which are ultimately used for accurate quality assessment. Extensive experimental results show that the proposed method outperforms the existing state-of-the-art PCQA methods using two publicly available subjective datasets.

Abstract:
Weakly supervised temporal action localization (WTAL) aims to identify action instances in untrimmed videos with only video-level supervision. Despite recent advances in WTAL methods, achieving accurate boundary localization remains a significant challenge. A key reason is that WTAL networks following a localization-by-classification pipeline tend to focus on the most discriminative features, neglecting some ambiguous features that may contain action instances. To make the WTAL model focus on low-discriminative features that include action instances, we propose an action-to-action diffusion (ActionDiff) network. This network leverages the smoothness of data generated by the diffusion model, using the diffusion model to output smooth and high-quality features that weaken the discriminative action features from the base branch, thereby enhancing the performance of the WTAL task. First, we develop a topk-based masking strategy to generate binary masks that serve as pseudo-labels for diffusion model learning. Then, we propose a diffusion branch to generate high-quality latent action space by iteratively removing noise guided by the designed pseudo-labels and conditional information. To enhance the diffusion branch’s capability to generate human behavioral features, we design an action-related conditional strategy to obtain conditional information and use it to guide the modeling of human behavior knowledge by the diffusion branch. Our comprehensive experiments demonstrate that the proposed method achieves a promising performance on three benchmark datasets: THUMOS14, ActivityNet v1.2, and v1.3.
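A top-k masking step of the kind mentioned above can be sketched in a few lines. The abstract does not say which scores are ranked or how k is chosen, so the per-snippet scores and the fixed k below are assumptions, and `topk_mask` is a hypothetical name.

```python
def topk_mask(scores, k):
    """Return a binary mask with 1 at the k highest-scoring positions.

    Assumption: `scores` holds one value per video snippet, for example
    a class activation score. Ties keep the earlier snippet.
    """
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    mask = [0] * len(scores)
    for t in ranked[:k]:
        mask[t] = 1
    return mask


# Five snippets; the two with the highest scores become pseudo-positives.
mask = topk_mask([0.1, 0.9, 0.4, 0.8, 0.2], 2)
```

In this sketch the resulting mask marks snippets 1 and 3 as pseudo-labels for the diffusion branch.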

Abstract:
Color-guided Depth map Super-Resolution (DSR) based on Convolutional Neural Networks (CNN) is a crucial technology to remedy the defects of mainstream commercial depth cameras and has made significant progress in recent years. Nevertheless, this technology still faces several major challenges. Firstly, existing CNN-based DSR methods are designed as black-box network architectures. Secondly, few approaches study a single model that achieves arbitrary-scale DSR. Thirdly, due to structural inconsistency between the two modalities, color-guided DSR methods often suffer from the texture-copying issue. To this end, we propose a novel joint DSR and high-low frequency decomposition optimization model, which is unfolded into the Deep Arbitrary-Scale Unfolding Network (DASU-Net). DASU-Net achieves a robust continuous representation ability by alternately and iteratively updating high-low frequencies and depth features. More importantly, an Arbitrary-scale Up-sampling Fusion (AUF) module is proposed to achieve arbitrary-scale up-sampling and dual-modality feature fusion. Specifically, the AUF module has two core components: an arbitrary-scale up-sampling block and Feature Enhancement and Multiple Strategies Fusion (FEMSF) blocks. In the FEMSF block, color features are first enhanced to highlight their inherently correlated structures with the guidance of depth features, and then the enhanced features are modulated according to different fusion strategies. Furthermore, a fast version of DASU-Net, named FDASU-Net, is proposed to fit real-time scenes; it reduces the runtime severalfold for a depth map with a size of 640 × 480 during inference. Extensive experiments demonstrate that our DASU-Net and FDASU-Net outperform many state-of-the-art DSR methods in terms of several quantitative and qualitative indexes.

Abstract:
Infrared small target detection (IRSTD) based on deep learning has received extensive research and application. However, deep learning models require a large amount of data to perform well, and the collection and standardization of infrared small target data is challenging, limiting the applicability of such models. To address this issue, this study proposes a data augmentation scheme for infrared small targets based on Generative Adversarial Networks (GANs). The proposed method is a two-step approach: the first step is the generation of clean backgrounds, and the second is the adaptive fusion of targets and backgrounds. In the background generation stage, we first use the Fast Marching Method (FMM) to fill background targets and obtain clean backgrounds. Then, we design a multi-generator and multi-discriminator GAN model (MGD-GAN) to generate high-quality and diverse background images. In the adaptive target-background fusion stage, we propose a dual-discriminator GAN network (FusionGAN), which allows the target mask to be adaptively fused with the background pixels. By combining real targets with generated backgrounds, new infrared small target images are generated, achieving the goal of data augmentation. Experiments conducted across three different scenarios demonstrate that the proposed data augmentation scheme effectively enhances the performance of both traditional and advanced detection models.

Abstract:
Multi-image steganography refers to the technique of embedding multiple secret images into a single cover image while ensuring that the secret images remain imperceptible and can be perfectly recovered by the recipient. Traditional single-image-based steganography often leads to noticeable contour shadows or color distortions in the cover image, making the hidden image more detectable. In contrast, cascaded invertible neural network-based steganography introduces a large number of parameters, complicating the network structure and resulting in a time-consuming learning and training process. To address the above problems, this paper proposes a novel flow-based, end-to-end multi-image invertible steganography framework (StegFlow), which effectively integrates forward and backward data flows for image hiding and recovery. The framework employs cascading operations to enable deep hiding of multiple secret images. To enhance the coupling capabilities, we introduce an invertible permutation layer that disrupts the channel arrangement order, allowing the coupling layer to more accurately guide the embedding of secret information into regions of the image that are easy to hide and recover. In addition, a high-frequency distribution mapping (HFDM) is designed to model the lost high-frequency information during image hiding process, significantly improving the recovery performance of the secret images. Extensive experiments are performed over multiple classical datasets, and the results demonstrate that compared to state-of-the-art (SOTA) models, the proposed framework can achieve a superior overall performance in terms of visual quality and anti-steganalysis capability. Specifically, our scheme can improve the hiding accuracy (measured by PSNR) by over 3 dB and the recovery accuracy by over 1 dB when hiding two secret images.
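The reason a permutation layer followed by a coupling layer remains exactly invertible can be shown with a toy sketch. It is not the StegFlow architecture: each channel is reduced to a single number, the coupling is a plain additive one, and `shift` is a hypothetical stand-in for a learned sub-network.

```python
def permute(channels, perm):
    """Reorder channels so that output position i takes input perm[i]."""
    return [channels[p] for p in perm]


def inverse_permutation(perm):
    """Build the permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for out_pos, in_pos in enumerate(perm):
        inv[in_pos] = out_pos
    return inv


def shift(half):
    """Hypothetical stand-in for a learned network; it need not be invertible."""
    return [0.5 * v + 1.0 for v in half]


def coupling_forward(x):
    """Additive coupling: keep the first half, shift the second half.

    Assumption: an even number of channels.
    """
    mid = len(x) // 2
    x1, x2 = x[:mid], x[mid:]
    return x1 + [b + s for b, s in zip(x2, shift(x1))]


def coupling_inverse(y):
    """Undo the coupling by recomputing the same shift from the kept half."""
    mid = len(y) // 2
    y1, y2 = y[:mid], y[mid:]
    return y1 + [b - s for b, s in zip(y2, shift(y1))]


x = [1.0, 2.0, 3.0, 4.0]
perm = [2, 0, 3, 1]
hidden = coupling_forward(permute(x, perm))
recovered = permute(coupling_inverse(hidden), inverse_permutation(perm))
```

Because the first half passes through unchanged, the inverse can recompute the same shift, and the permutation only changes which channels land in that unchanged half.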

Abstract:
Most deep learning-based rain removal methods consider rain streaks as a part of the high-frequency information in the image to extract and remove rain streak features. This usually leads to the problem of excessive or insufficient removal of rain streaks, leading to blurred edges or residual rain streaks in the rain removal results. To resolve this problem, a multi-frequency feature joint learning network (MFJLN) is proposed, which is constructed as a U-shaped structure including an encoder and a decoder. At each scale layer, a full-frequency feature fusion module (3FM) consisting of a spatial domain branch and a frequency domain branch is constructed to achieve accurate feature extraction at different scales. In the spatial domain branch, considering that rain streaks not only exist in high-frequency components, a multi-frequency feature extraction block (MFEB) is constructed to extract rich rain streak features from multiple frequency layers, namely low-frequency, mid-frequency, and high-frequency layers. In addition, in the low-frequency layer, a masked self-attention block (MSAB) is designed by defining an adaptive position weight mask to learn accurate long-distance features. To achieve the reuse of encoding features, a feature fusion block (FFB) is constructed to fuse the features from all encoding layers with the features of each decoding layer through dense links for feature reconstruction. Numerous experimental results on both synthetic and real-world datasets have shown that our MFJLN is superior to some state-of-the-art (SOTA) rain removal methods in terms of visual effects and quantitative metrics. In addition, MFJLN can not only remove rain streaks from rainy images, but also effectively remove raindrops and snow marks.

Abstract:
While current skeleton action recognition models demonstrate impressive performance on large-scale datasets, their adaptation to new application scenarios remains challenging. These challenges are particularly pronounced when facing new action categories, diverse performers, and varied skeleton layouts, leading to significant performance degeneration. Additionally, the high cost and difficulty of collecting skeleton data make large-scale data collection impractical. This paper studies one-shot and limited-scale learning settings to enable efficient adaptation with minimal data. Existing approaches often overlook the rich mutual information between labeled samples, resulting in sub-optimal performance in low-data scenarios. To boost the utility of labeled data, we identify the variability among performers and the commonality within each action as two key attributes. We present SkeletonX, a lightweight training pipeline that integrates seamlessly with existing GCN-based skeleton action recognizers, promoting effective training under limited labeled data. First, we propose a tailored sample pair construction strategy on two key attributes to form and aggregate sample pairs. Next, we develop a concise and effective feature aggregation module to process these pairs. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and PKU-MMD with various GCN backbones, demonstrating that the pipeline effectively improves performance when trained from scratch with limited data. Moreover, it surpasses previous state-of-the-art methods in the one-shot setting, with only 1/10 of the parameters and much fewer FLOPs.

Abstract:
As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols—cross-subject (cs), cross-view (cv), and cross-subject-view (csv)—demonstrate that the PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.

Abstract:
Referring video object segmentation (RVOS) aims to segment the object instances referred to by linguistic expressions in video frames. The prevailing approaches mainly rely on simplistic fusion strategies, wherein textual features are directly interacted with video features without considering the impact of textual semantics at different levels. These coarse-grained fusion strategies hinder the model’s ability to perceive changes in object appearance and movement, resulting in performance degradation. To mitigate this issue, we propose a Progressive Perspective Mining (P^2M) framework, which leverages a coarse-to-fine perspective to mine latent information from text and video, enabling precise segmentation of referred objects. P^2M consists of two key components: Progressive Vision-Language Interaction (PVLI) and Vision-Language Synergistic Fusion (VLSF). Specifically, PVLI leverages language features across subject, word, and sentence levels to mine textual information, enabling a progressive interaction with video features within an integrated representational space. Concurrently, VLSF focuses on generating semantically rich object queries for segmentation by employing slot attention mechanisms to mine and integrate relevant visual features with linguistic semantics. Furthermore, we introduce two query optimization losses: (1) the Matching Optimization Loss constrains the best queries between frame-level and video-level, effectively preventing the queries of the tracking target from drifting along the temporal dimension during the inference phase; (2) the Vision-Language Semantic Alignment Loss performs a word-by-word matching between object queries and expression, aligning the multi-modal joint space and enhancing the framework’s understanding of the textual description. We conducted various experiments on the RVOS task, achieving new state-of-the-art results across all benchmarks, thereby demonstrating the effectiveness of P^2M.

Abstract:
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency.

Abstract:
Graph contrastive learning (GCL) has garnered significant attention for its self-supervised graph representation learning without label information and excellent generalization to downstream tasks. However, data augmentation for graph-structured data is more challenging than that for images. We argue that simple data augmentations for GCL may risk damaging the intrinsic structure of the graph or creating views that are not diverse enough. Additionally, typical layer-by-layer feature propagation processes compress or discard pretext task-irrelevant feature information, resulting in unstable and suboptimal performance for unaligned downstream tasks. In this paper, we propose a novel framework termed Rev-GCL, which aims to maintain multi-level graph semantics without information loss via reversible column disentangled model augmentation tricks. Specifically, we propose a multi-column network with reversible connections as our encoder, where all columns share the same structure and receive a copy of the input graph. The reversible connections between columns ensure lossless transmission, allowing representations to be gradually disentangled from low-level to high-level semantics. Based on this, we introduce two model augmentation tricks, random propagation and asymmetric column, to construct different sibling encoders. These methods generate diverse graph views that can filter out high-frequency noise in contrastive learning, thereby yielding more generalizable node feature representations. Extensive experiments on eight commonly used benchmark datasets demonstrate that Rev-GCL consistently outperforms existing state-of-the-art methods in node classification, clustering, and link prediction tasks.

Abstract:
Adversarial attacks have become a critical focus in visual object tracking (VOT) research. Small, carefully crafted adversarial perturbations to video frames can easily disrupt the visual object tracker, leading to tracking failure. Therefore, studying adversarial attacks contributes to the development of more robust and reliable trackers. Considering that trackers are agnostic in real-world scenarios, research on decision-based black-box attacks is straightforward and practical. However, existing decision-based black-box attacks neither comprehensively analyze the unique characteristics of object tracking nor sufficiently consider the imperceptibility of adversarial perturbations. In this paper, we propose invisible local attack (ILA), a novel decision-based adversarial attack specifically for VOT with imperceptible perturbations. We assume that a significant number of pixels in a frame, irrelevant to the tracked object, do not substantially contribute to the functioning mechanism of a deep tracker. Based on this consideration, we propose a search algorithm to identify the pixel set focused on by the tracker during object tracking. The adversarial noise is then confined to these pixels and iteratively optimized through a heuristic algorithm of ILA. By perturbing only the key pixels, ILA significantly enhances both the attack performance and imperceptibility when it is applied to visual object trackers. Extensive experiments demonstrate that our ILA method achieves a 121% increase in the robustness metric and a 137% improvement in the structural similarity index measure (SSIM) across multiple datasets for various trackers compared with the state-of-the-art (SOTA) method.

Abstract:
Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges in this domain. Firstly, they have limitations in capturing nuanced and high-level semantics underlying complex in-distribution (ID) multimodal intents. Secondly, they exhibit poor generalization when confronted with unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations effectively. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations that are conducive to both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and engage in multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between ID and OOD binary classes, while the fine-grained perspective enhances the understanding of ID data, achieving a progressive learning process that addresses tasks of increasing complexity. Additionally, the fine-grained perspective captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines for three multimodal intent datasets and build an OOD benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance with a 3-10% increase in AUROC scores while achieving new state-of-the-art results in ID classification.
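Synthesizing pseudo-OOD samples from convex combinations of ID data can be sketched as below. The abstract does not say how samples are paired or how the mixing weight is drawn, so mixing two ID feature vectors from different classes with a uniform weight is an assumption, and both function names are hypothetical.

```python
import random


def convex_combination(x_a, x_b, lam):
    """Return lam * x_a + (1 - lam) * x_b for a weight lam in [0, 1]."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def synthesize_pseudo_ood(features, labels, num_samples, seed=0):
    """Mix pairs of ID feature vectors taken from different classes.

    Assumptions: pairs come from different classes and the weight is
    uniform on [0, 1]. The abstract specifies neither choice.
    """
    if len(set(labels)) < 2:
        raise ValueError("need at least two distinct classes to mix")
    rng = random.Random(seed)
    samples = []
    while len(samples) < num_samples:
        i, j = rng.sample(range(len(features)), 2)
        if labels[i] == labels[j]:
            continue  # same-class mixtures stay close to ID data
        lam = rng.uniform(0.0, 1.0)
        samples.append(convex_combination(features[i], features[j], lam))
    return samples


id_features = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
id_labels = [0, 1, 2]
pseudo_ood = synthesize_pseudo_ood(id_features, id_labels, num_samples=4)
```

Each synthetic sample lies on the segment between two ID points of different classes and can be labelled as OOD for the coarse-grained binary task.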

Affiliations: Department of Data Science and Artificial Intelligence, Chang’an University, Xi’an, China; Department of Information Engineering, Chang’an University, Xi’an, China; Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, Joint International Research Laboratory of Intelligent Perception and Computation, School of Artificial Intelligence, Xidian University, Xi’an, China

Abstract:
Restoring rain-hazy images is vital for intelligent decision-making in autonomous driving and outdoor surveillance systems, yet it is a challenging ill-posed problem due to the irreversible nature of image degradation. Despite the remarkable success achieved through deep learning, current algorithms are primarily evaluated on a given kind of image, and the texture details and frequency domain information are insufficiently explored in most approaches, which greatly limits the performance of the model. To alleviate the above challenges, the frequency-aware and uncertainty-guiding network (FUNet) is proposed for rain-hazy image restoration. The FUNet consists of an end-to-end encoder-decoder architecture with the uncertainty-guided feature refinement (UGFR) and the confidence feature feedback module (CFF). First, the UGFR is designed with the uncertainty estimation (UE), the uncertainty local global feature extraction module (ULG), and the frequency component decomposition and fusion (FCDF), which learns the abundant intermediate information in detail for clear image restoration. Second, in order to adequately learn rich semantic features, the CFF module is proposed to provide feedback and guidance on the learning process of the decoder. Third, the frequency-based loss function is designed to ensure training stability, which effectively preserves the spatial and spectral details of images. Experiments on seven synthetic outdoor datasets and the real-world dataset DQA demonstrate the superiority of the proposed model quantitatively and qualitatively.

Abstract:
Existing High Efficiency Video Coding (HEVC) selective encryption algorithms only consider the encoding characteristics of syntax elements to keep format compliance, but ignore the semantic features of video content, which may lead to unnecessary computational and bit rate costs. To tackle this problem, we present a content-aware tunable selective encryption (CATSE) scheme for HEVC. First, a deep hashing network is adopted to retrieve groups of pictures (GOPs) containing sensitive objects. Then, the retrieved sensitive GOPs and the remaining insensitive ones are encrypted with different encryption strengths. For the former, multiple syntax elements are encrypted to ensure security, whereas for the latter, only a few bypass-coded syntax elements are encrypted to improve the encryption efficiency and reduce the bit rate overhead. The keystream sequence used is extracted from the time series of a new improved logistic map with complex dynamic behavior, which is generated by our proposed sine-modular chaotification model. Finally, a reversible steganography is applied to embed the flag bits of the GOP type into the encrypted bitstream, so that the decoder can distinguish the encrypted syntax elements that need to be decrypted in different GOPs. Experimental results indicate that the proposed HEVC CATSE scheme not only provides high encryption speed and low bit rate overhead, but also has superior encryption strength than other state-of-the-art HEVC selective encryption algorithms.
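
The idea of drawing a keystream from the time series of a chaotic map can be sketched as follows. This is only an illustration under loud assumptions: it uses the classic logistic map x -> r*x*(1-x), not the improved map produced by the sine-modular chaotification model (whose definition is not given in the abstract), and the names `logistic_keystream`, `xor_bytes`, `x0`, `r`, and `burn_in` are hypothetical. The classic map on its own is not a secure cipher.

```python
def logistic_keystream(x0, r, n_bytes, burn_in=100):
    """Derive n_bytes keystream bytes from the classic logistic map.

    Assumption: x0 lies in (0, 1) and r in (3.57, 4.0), the chaotic regime.
    The abstract's improved map is NOT reproduced here.
    """
    x = x0
    for _ in range(burn_in):          # discard the transient iterations
        x = r * x * (1.0 - x)
    out = bytearray()
    for _ in range(n_bytes):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)  # quantize the state to one byte
    return bytes(out)


def xor_bytes(data, keystream):
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(d ^ k for d, k in zip(data, keystream))


# Stand-in for a run of bypass-coded bins selected for encryption.
bins = bytes([0x12, 0x34, 0x56, 0x78])
ks = logistic_keystream(x0=0.3141, r=3.99, n_bytes=len(bins))
cipher = xor_bytes(bins, ks)
plain = xor_bytes(cipher, ks)
```

Because XOR is its own inverse, a decoder that regenerates the same keystream from the same key (`x0`, `r`) recovers the original bins.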

Abstract:
Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder forming meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.

Abstract:
Image dehazing under harsh weather conditions remains a challenging and ill-posed problem. In addition, acquiring real-time haze-free counterparts of hazy images poses difficulties. Existing approaches commonly synthesize hazy data by relying on estimated depth information, which is prone to errors due to its physical unreliability. While generative networks can transfer some hazy features to clear images, the resulting hazy images still exhibit an artificial appearance. In this paper, we introduce polarization cues to propose a haze simulation strategy to synthesize hazy data, ensuring visually pleasing results that adhere to physical laws. Leveraging the simulated Polar-Haze dataset, we present a polarization state attention dehazing network (PSADNet), which consists of a polarization extraction module and a polarization dehazing module. The proposed polarization extraction module incorporates an attention mechanism to capture high-level image features related to polarization and chromaticity. The polarization dehazing module utilizes these features derived from the polarization analysis to enhance image dehazing capabilities while preserving the accuracy of the polarization information. Promising results are observed in both qualitative and quantitative experiments, supporting the effectiveness of the proposed PSADNet and the validity of the polarization-based haze simulation strategy.

Abstract:
Self-training has been shown to achieve remarkable gains in semi-supervised semantic segmentation by creating pseudo-labels using unlabeled data. This approach, however, suffers from the quality of the generated pseudo-labels, and generating higher quality pseudo-labels is the main challenge that needs to be addressed. In this paper, we propose a novel method for semi-supervised semantic segmentation based on Multi-perspective pseudo-label Generation and Confidence-weighted Training (MGCT). First, we present a multi-perspective pseudo-label generation strategy that considers both global and local semantic perspectives. This strategy prioritizes pixels in all images by the global and local predictions, and subsequently generates pseudo-labels for different pixels in stages according to the ranking results. Our pseudo-label generation method shows superior suitability for semi-supervised semantic segmentation compared to other approaches. Second, we propose a confidence-weighted training method to alleviate performance degradation caused by unstable pixels. Our training method assigns confident weights to unstable pixels, which reduces the interference of unstable pixels during training and facilitates the efficient training of the model. Finally, we validate our approach on the PASCAL VOC 2012 and Cityscapes datasets, and the results indicate that we achieve new state-of-the-art performance on both datasets in all settings.
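
A minimal sketch of confidence-weighted training on pseudo-labels, assuming a per-pixel cross-entropy loss in which each pixel carries a weight in [0, 1]. The abstract does not specify the weighting rule or the normalization, so the function name `confidence_weighted_ce`, the division by the weight sum, and the `eps` guard are all assumptions made for illustration.

```python
import math


def confidence_weighted_ce(probs, pseudo_labels, weights, eps=1e-12):
    """Cross-entropy over pixels, each scaled by a confidence weight.

    probs:         list of per-pixel class-probability lists
    pseudo_labels: list of pseudo-label class indices, one per pixel
    weights:       list of confidence weights (low for unstable pixels)
    """
    weighted_sum = 0.0
    for pixel_probs, label, weight in zip(probs, pseudo_labels, weights):
        weighted_sum += -weight * math.log(pixel_probs[label] + eps)
    # Normalizing by the total weight is an assumption, not from the source.
    return weighted_sum / max(sum(weights), eps)


probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 1]
uniform = confidence_weighted_ce(probs, labels, [1.0, 1.0])
damped = confidence_weighted_ce(probs, labels, [1.0, 0.0])
```

With the second (less certain) pixel given zero weight, only the first pixel contributes, so the unstable pixel no longer influences the loss.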

Abstract:
The reconstruction from three to dozens of spectral bands, known as spectral super resolution (SSR), has achieved remarkable progress with the continuous development of deep learning. However, the reconstructed hyperspectral images (HSIs) still suffer from spatial degeneration due to the insufficient retention of high-frequency (HF) information during the SSR process. To remedy this issue, a novel Wavelet-based Hybrid Asymmetric Network (WHANet) is proposed to establish an RGB-to-HSI translation in the wavelet domain, thus preserving and emphasizing the HF features in hyperspectral space. Basically, the backbone is designed in a hybrid asymmetric structure that learns the exact representations of decomposed wavelet coefficients in the hyperspectral domain in a parallel way. Innovatively, a CNN-based HF reconstruction module (HFRM) and a transformer-based low-frequency (LF) reconstruction module (LFRM) are delicately devised to perform the SSR process individually, which are able to process the discriminative wavelet coefficients contrapuntally. Furthermore, a hybrid loss function incorporating the Fast Fourier loss (FFL) is proposed to directly regularize and emphasize the missing HF components. Finally, experimental results over three benchmark datasets and one remote sensing dataset demonstrate that our WHANet is able to reach state-of-the-art performance quantitatively and qualitatively.
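
The split into low-frequency and high-frequency wavelet coefficients can be illustrated with a one-level Haar transform on a 1D signal. This is a sketch under assumptions: the abstract does not say which wavelet WHANet uses or how it is applied to images, and `haar_decompose` and `haar_reconstruct` are hypothetical helper names.

```python
import math


def haar_decompose(signal):
    """One-level orthonormal Haar transform of an even-length signal.

    Returns (low, high): pairwise scaled sums (LF) and differences (HF).
    """
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose, interleaving the recovered sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out


row = [4.0, 4.0, 2.0, 6.0]
lf, hf = haar_decompose(row)
restored = haar_reconstruct(lf, hf)
```

The flat pair `(4, 4)` yields a zero HF coefficient, while the edge `(2, 6)` yields a nonzero one, which is why losing HF coefficients blurs spatial detail.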

Abstract:
Synthetic faces have been extensively researched and applied in various fields, such as face parsing and recognition. Compared to real face images, synthetic faces engender more controllable and consistent experimental stimuli due to the ability to precisely merge expression animations onto the facial skeleton. Accordingly, we establish an eye-tracking database with 780 synthetic face images and fixation data collected from 22 participants. The use of synthetic images with consistent expressions ensures reliable data support for exploring the database and determining the following findings: (1) A correlation study between saliency intensity and facial movement reveals that the variation of attention distribution within facial regions is mainly attributed to the movement of the mouth. (2) A categorized analysis of different demographic factors demonstrates that the bias towards salient regions aligns with differences in some demographic categories of synthetic characters. In practice, inference of facial saliency distribution is commonly used to predict the regions of interest for facial video-related applications. Therefore, we propose a benchmark model that accurately predicts saliency maps, closely matching the ground truth annotations. This achievement is made possible by utilizing channel alignment and progressive summation for feature fusion, along with the incorporation of Sinusoidal Position Encoding. The ablation experiment also demonstrates the effectiveness of our proposed model. We hope that this paper will contribute to advancing the photorealism of generative digital humans.

Abstract:
Existing multi-dataset detection works mainly focus on the performance of the detector on each of the datasets, with different label spaces. However, in real-world applications, a unified label space across multiple datasets is usually required. To address this gap, we propose a progressive pseudo labeling (PPL) approach to detect objects across different datasets, over a unified label space. Specifically, we employ the widely used architecture of a teacher-student model pair to jointly refine pseudo labels and train the unified object detector. The student model learns from both annotated labels and pseudo labels from the teacher model, which is updated by the exponential moving average (EMA) of the student. Three modules, i.e., Entropy-guided Adaptive Threshold (EAT), Global Classification Module (GCM), and Scene-Aware Fusion (SAF) strategy, are proposed to handle the noise of pseudo labels and fit the overall distribution. Extensive experiments are conducted on different multi-dataset benchmarks. The results demonstrate that our proposed method significantly outperforms the state of the art and is even comparable with supervised methods trained using annotations of all labels.
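
The EMA teacher update mentioned above can be sketched as follows. The dictionary-of-floats parameter representation, the function name `ema_update`, and the `decay` value are assumptions for illustration; the abstract does not state the decay used.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to float values.
    new_teacher = decay * teacher + (1 - decay) * student
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
# A small decay is used here only to make the movement visible.
teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason a teacher-student pair yields steadier pseudo labels than the student alone.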

Abstract:
In this work, we observe that indoor 3D object detection across varied scene domains encompasses both universal attributes and specific features. Based on this insight, we propose SOFW, a synergistic optimization framework that investigates the feasibility of optimizing 3D object detection tasks concurrently spanning several dataset domains. The core of SOFW is identifying domain-shared parameters to encode universal scene attributes, while employing domain-specific parameters to delve into the particularities of each scene domain. Technically, we introduce a set abstraction alteration strategy (SAAS) that embeds learnable domain-specific features into set abstraction layers, thus empowering the network with a refined comprehension of each scene domain. Besides, we develop an element-wise sharing strategy (ESS) to facilitate fine-grained adaptive discernment between domain-shared and domain-specific parameters for network layers. Benefiting from the proposed techniques, SOFW crafts feature representations for each scene domain by learning domain-specific parameters, whilst encoding generic attributes and contextual interdependencies via domain-shared parameters. Built upon the classical detection framework VoteNet without any complicated modules, SOFW delivers impressive performance on multiple benchmarks with a much smaller total storage footprint. Additionally, we demonstrate that the proposed ESS is a universal strategy and that applying it to the voxel-based approach TR3D can realize cutting-edge detection accuracy on all of the S3DIS, ScanNet, and SUN RGB-D datasets.

Abstract:
Surface reconstruction from raw point clouds has been studied for decades in the computer graphics community and is in high demand for modeling and rendering applications today. Classic solutions, such as Poisson surface reconstruction, require point normals as extra input to produce reasonable results. Modern transformer-based methods can work without normals, but the results are less fine-grained due to limited encoding performance in local fusion from discrete points. We introduce a novel normalized matrix attention transformer (Tensorformer) to perform high-quality reconstruction. The proposed matrix attention allows for simultaneous point-wise and channel-wise message passing, while the previous vector attention loses neighbor point information across different channels. It brings more degrees of freedom in feature learning and thus facilitates better modeling of local geometries. Our method achieves state-of-the-art results on two commonly used datasets, ShapeNetCore and ABC, and attains a 4% improvement in IoU on ShapeNet. Our implementation will be released upon acceptance.

Abstract:
Object detection through LiDAR-based point clouds has recently become important in autonomous driving. Although achieving high accuracy on public benchmarks, state-of-the-art detectors may still go wrong and cause heavy losses due to widespread corruptions in the real world such as rain, snow, and sensor noise. Nevertheless, there is a lack of a large-scale dataset covering diverse scenes and realistic corruption types with different severities to develop practical and robust point cloud detectors, which is challenging due to the heavy collection costs. To alleviate the challenge and take the first step toward robust point cloud detection, we propose physical-aware simulation methods to generate degraded point clouds under different real-world common corruptions. Then, as a first attempt, we construct a benchmark based on the physical-aware common corruptions for point cloud detectors, which contains a total of 1,122,150 examples covering 7,481 scenes, 25 common corruption types, and 6 severities. With this novel benchmark, we conduct extensive empirical studies on 12 state-of-the-art detectors that span 6 different detection frameworks. From these we obtain several insightful observations revealing the vulnerabilities of the detectors and indicating directions for enhancement. Moreover, we further study the effectiveness of existing robustness enhancement methods based on data augmentation, data denoising, and test-time adaptation. The benchmark can potentially be a new platform for evaluating point cloud detectors, opening a door for developing novel robustness enhancement methods.

Abstract:
Point cloud is one of the most widely used digital representation formats for three-dimensional (3D) contents, the visual quality of which may suffer from noise and geometric shift distortions during the production procedure as well as compression and downsampling distortions during the transmission process. To tackle the challenge of point cloud quality assessment (PCQA), many PCQA methods have been proposed to evaluate the visual quality levels of point clouds by assessing the rendered static 2D projections. Although such projection-based PCQA methods achieve competitive performance with the assistance of mature image quality assessment (IQA) methods, they neglect that the 3D model is also perceived in a dynamic viewing manner, where the viewpoint is continually changed according to the feedback of the rendering device. Therefore, in this paper, we evaluate the point clouds from moving camera videos and explore the way of dealing with PCQA tasks via using video quality assessment (VQA) methods. First, we generate the captured videos by rotating the camera around the point clouds through several circular pathways. Then we extract both spatial and temporal quality-aware features from the selected key frames and the video clips through using trainable 2D-CNN and pre-trained 3D-CNN models respectively. Finally, the visual quality of point clouds is represented by the video quality values. The experimental results reveal that the proposed method is effective for predicting the visual quality levels of the point clouds and even competitive with full-reference (FR) PCQA methods. The ablation studies further verify the rationality of the proposed framework and confirm the contributions made by the quality-aware features extracted via the dynamic viewing manner.
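
Generating camera positions along a circular pathway around a point cloud can be sketched as below. The abstract only states that the camera rotates around the point cloud through several circular pathways; the choice of orbit planes, the radius, the frame count, and the name `circular_path` are assumptions made for illustration.

```python
import math


def circular_path(center, radius, n_frames, plane="xy"):
    """Return n_frames camera positions on a circle around `center`.

    plane selects which coordinate plane contains the orbit:
    "xy", "xz", or "yz". The camera is assumed to look at `center`.
    """
    cx, cy, cz = center
    positions = []
    for i in range(n_frames):
        angle = 2.0 * math.pi * i / n_frames
        a, b = radius * math.cos(angle), radius * math.sin(angle)
        if plane == "xy":
            positions.append((cx + a, cy + b, cz))
        elif plane == "xz":
            positions.append((cx + a, cy, cz + b))
        elif plane == "yz":
            positions.append((cx, cy + a, cz + b))
        else:
            raise ValueError("plane must be 'xy', 'xz', or 'yz'")
    return positions


# Three orbits, one per coordinate plane, around a cloud centered at origin.
orbits = {p: circular_path((0.0, 0.0, 0.0), 2.0, 8, plane=p)
          for p in ("xy", "xz", "yz")}
```

Rendering one frame per position yields one video per orbit, from which key frames and clips can then be sampled.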

Abstract:
Multi-view learning is a machine learning paradigm that utilizes multiple feature sets or data sources to improve learning performance and generalization. However, existing multi-view learning methods often do not capture and utilize information from different views very well, especially when the relationships between views are complex and of varying quality. In this paper, we propose a novel multi-view learning framework for the multi-view classification task, called Gated Cross-Correlation Network (GCCNet), which addresses these challenges by integrating the three key operational levels in multi-view learning: representation, fusion, and decision. Specifically, GCCNet contains a novel component called the Multi-View Gated Information Distributor (MVGID) to enhance noise filtering and optimize the retention of critical information. In addition, GCCNet uses cross-correlation analysis to reveal dependencies and interactions between different views, and integrates an adaptive weighted joint decision strategy to mitigate the interference of low-quality views. Thus, GCCNet can not only comprehensively capture and utilize information from different views, but also facilitate information exchange and synergy between views, ultimately improving the overall performance of the model. Extensive experimental results on ten benchmark datasets show that GCCNet outperforms state-of-the-art methods on eight out of ten datasets, validating its effectiveness and superiority in multi-view learning.

Abstract:
View synthesis of aerial scenes has gained attention in the recent development of applications such as urban planning, navigation, and disaster assessment. This development is closely connected to the recent advancement of the Neural Radiance Field (NeRF). However, when autonomous aerial vehicles (AAVs) encounter constraints such as limited perspectives or energy limitations, NeRF degrades with sparsely sampled views in complex aerial scenes. On this basis, we aim to solve this problem in a few-shot manner. In this paper, we propose Uncertainty Guided Perception NeRF (UPNeRF), an uncertainty-guided perceptual learning framework that focuses on applying and improving NeRF in few-shot aerial view synthesis (FSAVS). First, simply optimizing NeRF in complex aerial scenes with sparse input can lead to overfitting in training views, resulting in a collapsed model. To address this, we propose a progressive learning strategy that utilizes the uncertainty present in sparsely sampled views, enabling a gradual transition from easy to hard learning. Second, to take advantage of the inherent inductive bias in the data, we introduce an uncertainty-aware discriminator. This discriminator leverages convolutional capabilities to capture intricate patterns in the rendered patches associated with uncertainty. Third, direct optimization of NeRF lacks prior knowledge of the scene. This, coupled with a reduction in training views, can result in unrealistic rendering. To overcome this, we present a perceptual regularizer that incorporates prior knowledge through prompt tuning of a self-supervised pre-trained vision transformer. In addition, we adopt a sampled scene annealing strategy to enhance training stability. Finally, we conducted experiments with two public datasets, and the positive results indicate our method is effective.

Abstract:
The effective extraction of spatial-angular features plays a crucial role in light field image super-resolution (LFSR) tasks, and the introduction of convolution and Transformers leads to significant improvement in this area. Nevertheless, due to the large 4D data volume of light field images, many existing methods opted to decompose the data into a number of lower-dimensional subspaces and perform Transformers in each sub-space individually. As a side effect, these methods inadvertently restrict the self-attention mechanisms to a One-to-One scheme accessing only a limited subset of LF data, explicitly preventing comprehensive optimization on all spatial and angular cues. In this paper, we identify this limitation as subspace isolation and introduce a novel Many-to-Many Transformer (M2MT) to address it. M2MT aggregates angular information in the spatial subspace before performing the self-attention mechanism. It enables complete access to all information across all sub-aperture images (SAIs) in a light field image. Consequently, M2MT is enabled to comprehensively capture long-range correlation dependencies. With M2MT as the foundational component, we develop a simple yet effective M2MT network for LFSR. Our experimental results demonstrate that M2MT achieves state-of-the-art performance across various public datasets, and it offers a favorable balance between model performance and efficiency, yielding higher-quality LFSR results with substantially lower demand for memory and computation. We further conduct in-depth analysis using local attribution maps (LAM) to obtain visual interpretability, and the results validate that M2MT is empowered with a truly non-local context in both spatial and angular subspaces to mitigate subspace isolation and acquire effective spatial-angular representation.

Abstract:
Recent years have seen a surge of interest in anomaly detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and sparse latent space. In contrast, anomaly detectors demonstrate superior performance in the language modality due to the unimodal nature of the data. This paper tackles the aforementioned challenges for the vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), comprising Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the issues of redundant information and sparse latent space, respectively. CMER involves masking portions of the raw image and computing the matching score with the corresponding text. Essentially, CMER eliminates irrelevant pixels to direct the detector's focus towards critical content. To learn a more compact latent space for vision anomaly detection, CMLE learns a correlation structure matrix from the language modality. Then, the acquired matrix compels the distribution of images to resemble that of texts in the latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. Particularly, compared to the baseline that only utilizes images, the performance of CMG has been improved by 16.81%. Ablation experiments further confirm the synergy between the proposed CMER and CMLE, as each component depends on the other to achieve optimal performance.

Abstract:
Compared to static anchor selection, existing dynamic anchor learning can automatically learn more flexible anchors to improve the performance of large-scale multi-view clustering. Despite improving the flexibility of anchors, these methods do not pay sufficient attention to the alignment and fairness of learned anchors. Specifically, within each cluster, the positions and quantities of cross-view anchors may not align, and anchors may even be absent from some clusters, leading to severe anchor misalignment and imbalance issues. These issues result in inaccurate graph fusion and a reduction in clustering performance. Besides, in practical applications, missing information caused by sensor malfunctions or data losses could further exacerbate anchor misalignment and imbalance. To overcome such challenges, a novel Incomplete Multi-view Clustering with Paired and Balanced Dynamic Anchor Learning (PBDAL) is proposed to ensure the alignment and fairness of anchors. Unlike existing unsupervised anchor learning, we first design a paired and balanced dynamic anchor learning scheme to supervise dynamic anchors to be aligned and fair in each cluster. Meanwhile, we develop an enhanced bipartite graph tensor learning to refine paired and balanced anchors. The superiority, effectiveness, and efficiency of our method are all validated by extensive experiments on multiple public datasets.

Abstract:
Networked 360° video has become increasingly popular. Despite the immersive experience for users, its sheer data volume, even with the latest H.266 coding and viewport adaptation, remains a significant challenge to today's networks. Recent studies have shown that integrating deep learning into video coding can significantly enhance compression efficiency, providing new opportunities for high-quality video streaming. In this work, we conduct a comprehensive analysis of the potential and issues in applying neural codecs to 360° video streaming. We accordingly present NETA, a synergistic streaming scheme that merges neural compression with traditional coding techniques, seamlessly implemented within an edge intelligence framework. To address the non-trivial challenges in the short viewport prediction window and time-varying viewing directions, we propose implicit-explicit buffer-based prefetching grounded in content visual saliency and bitrate adaptation with smart model switching around viewports. A novel Lyapunov-guided deep reinforcement learning algorithm is developed to maximize user experience and ensure long-term system stability. We further discuss the concerns towards practical development and deployment and have built a working prototype that verifies NETA’s excellent performance. For instance, it achieves a 27% increment in viewing quality, a 90% reduction in rebuffering time, and a 64% decrease in quality variation on average, compared to state-of-the-art approaches.

Abstract:
Diverse human motion prediction (HMP) aims to predict multiple plausible future motions given an observed human motion sequence. It is a challenging task because it must cover the diversity of potential human motions while ensuring an accurate description of future human motions. Current solutions are either low-diversity or limited in expressiveness. Recent denoising diffusion probabilistic models (DDPM) demonstrate promising performance in various generative tasks. However, introducing DDPM directly into diverse HMP incurs some issues. While DDPM can enhance the diversity of potential human motion patterns, the predicted human motions gradually become implausible over time due to significant noise disturbances in the forward process of DDPM. This phenomenon leads to the predicted human motions being unrealistic, seriously impacting the quality of predicted motions and restricting their practical applicability in real-world scenarios. To alleviate this, we propose a novel conditional diffusion-based generative model, called DivDiff, to predict more diverse and realistic human motions. Specifically, DivDiff employs DDPM as its backbone and incorporates Discrete Cosine Transform (DCT) and Transformer mechanisms to encode the observed human motion sequence as a condition to instruct the reverse process of DDPM. More importantly, we design a diversified reinforcement sampling function (DRSF) to enforce human skeletal constraints on the predicted human motions. DRSF utilizes the acquired information from the human skeleton as prior knowledge, thereby reducing significant disturbances introduced during the forward process. Extensive experimental results on two widely used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.
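
Encoding an observed trajectory with the Discrete Cosine Transform can be sketched as follows, using the unnormalized DCT-II applied along time to a single joint coordinate. How DivDiff normalizes, truncates, or feeds the coefficients to its Transformer is not stated in the abstract, so `dct_ii`, `encode_trajectory`, and the `keep` parameter are hypothetical.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi/N * (t + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5) * k)
            for t, x in enumerate(signal))
        for k in range(n)
    ]


def encode_trajectory(coords_over_time, keep):
    """Keep only the first `keep` (lowest-frequency) DCT coefficients."""
    return dct_ii(coords_over_time)[:keep]


# One coordinate of one joint over 4 observed frames (a static joint).
static_joint = [1.0, 1.0, 1.0, 1.0]
coeffs = dct_ii(static_joint)
compact = encode_trajectory(static_joint, keep=2)
```

For a static joint, all the energy lands in the zero-frequency coefficient; smooth motions likewise concentrate in the low frequencies, which is why a truncated DCT is a common compact motion encoding.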

Abstract:
Compared to single-source to single-target (1S1T) domain adaptation, single-source to multi-target (1SmT) domain adaptation is more practical but also more challenging. In 1SmT scenarios, the significant differences in feature distributions between various target domains increase the difficulty for models to adapt to multiple domains. Moreover, 1SmT requires effective transfer to each target domain while maintaining performance in the source domain, demanding higher generalization capabilities from the model. In 1S1T scenarios, active domain adaptation methods improve generalization by incorporating a few target domain samples, but these methods are rarely applied in 1SmT due to potential sampling bias and outlier interference. To address this, we propose Progressive Prototype Refinement (PPR), an active multi-target domain adaptation method combining 1SmT with active learning to enhance cross-domain knowledge transfer. Specifically, an uncertainty assessment strategy is used to select representative samples from multiple target domains, forming a candidate set for model training. Based on the Lindeberg--Levy central limit theorem, we sample from a Gaussian distribution using corrected prototype statistics to augment the classifier's feature input, allowing the model to learn transitional information between domains. Finally, a mapping matrix is used for cross-domain alignment, addressing incomplete class coverage and outlier interference. Extensive experiments on multiple benchmark datasets demonstrate PPR's superior performance, with a 6.35% improvement on the PACS dataset and a 17.32% improvement on the Remote Sensing dataset.
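
Sampling augmented features from a Gaussian defined by prototype statistics can be sketched as below. The per-dimension independent Gaussian, the fixed seed, and the name `sample_from_prototype` are assumptions; the abstract does not describe how the prototype statistics are corrected, so the mean and standard deviation are taken as given inputs here.

```python
import random


def sample_from_prototype(mean, std, n_samples, seed=0):
    """Draw n_samples feature vectors from N(mean, diag(std^2)).

    mean, std: per-dimension prototype statistics for one class.
    A seeded generator keeps the sketch reproducible.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(m, s) for m, s in zip(mean, std)]
        for _ in range(n_samples)
    ]


class_mean = [0.5, -1.0, 2.0]
class_std = [0.1, 0.2, 0.05]
augmented = sample_from_prototype(class_mean, class_std, n_samples=4)
```

The sampled vectors would be fed to the classifier alongside real features of the same class.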

Abstract:
The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of geographically close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to restructure texts. However, the text kernel is an artificial object with incomplete semantic features that are prone to incorrect or missing detection. In addition, unlike general objects, the geometric features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes it difficult to detect them accurately. To address the above problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features for scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.

Abstract:
Scene observation from multiple perspectives brings a more comprehensive visual experience. However, acquiring multiple views in the dark leaves highly correlated views alienated from one another, making it challenging to improve scene understanding with auxiliary views. Recent single image-based enhancement methods may not provide consistently desirable restoration performance for all views because they ignore potential feature correspondence among views. To alleviate this issue, we make the first attempt to investigate multi-view low-light image enhancement. First, we construct a new dataset called Multi-View Low-light Triplets (MVLT), including 1,860 pairs of triple images with large illumination ranges and wide noise distribution. Each triplet is equipped with three viewpoints towards the same scene. Second, we propose a multi-view enhancement framework based on the Recurrent Collaborative Network (RCNet). To benefit from similar texture correspondence across views, we design the recurrent feature enhancement, alignment, and fusion (ReEAF) module, where intra-view feature enhancement (Intra-view EN) followed by inter-view feature alignment and fusion (Inter-view AF) is performed to model intra-view and inter-view feature propagation via multi-view collaboration. Additionally, two modules from enhancement to alignment (E2A) and alignment to enhancement (A2E) are developed to enable interactions between Intra-view EN and Inter-view AF, utilizing attentive feature weighting and sampling for enhancement and alignment. Experimental results demonstrate that our RCNet significantly outperforms other state-of-the-art methods.

Abstract:
Motion blur estimation is a critical and fundamental task in scene analysis and image restoration. While most state-of-the-art deep learning-based methods for single-image motion image deblurring focus on constructing deep networks or developing training strategies, the characterization of motion blur has received less attention. In this paper, we innovatively propose a non-parametric Variational Bayesian Kernel Generation Network (VB-KGN) for characterizing motion blur in a single image. To solve this model, we employ the variational inference framework to approximate the expected statistical distribution of motion blur images in a data-driven manner. The qualitative and quantitative evaluations of our experimental results demonstrate that our proposed model can generate highly accurate motion blur kernels, significantly improving motion image deblurring performance and substantially reducing the need for extensive training sample preprocessing for deblurring tasks.

Abstract:
Deep neural networks (DNNs) have shown great potential in no-reference image quality assessment (NR-IQA). However, the annotation of NR-IQA is labor-intensive and time-consuming, which severely limits its application, especially for authentic images. To relieve the dependence on quality annotation, some works have applied unsupervised domain adaptation (UDA) to NR-IQA. However, the above methods ignore the fact that the alignment space used in classification is sub-optimal, since the space is not elaborately designed for perception. To solve this challenge, we propose an effective perception-oriented unsupervised domain adaptation method StyleAM (Style Alignment and Mixup) for NR-IQA, which transfers sufficient knowledge from label-rich source domain data to label-free target domain images. Specifically, we find a more compact and reliable space i.e., feature style space for perception-oriented UDA based on an interesting observation, that the feature style (i.e., the mean and variance) of the deep layer in DNNs is exactly associated with the quality score in NR-IQA. Therefore, we propose to align the source and target domains in a more perceptual-oriented space i.e., the feature style space, to reduce the intervention from other quality-irrelevant feature factors. Furthermore, to increase the consistency (i.e., ordinal/continuous characteristics) between quality score and its feature style, we also propose a novel feature augmentation strategy Style Mixup, which mixes the feature styles (i.e., the mean and variance) before the last layer of DNNs together with mixing their labels. Extensive experimental results on many cross-domain settings (e.g., synthetic to authentic, and multiple distortions to one distortion) have demonstrated the effectiveness of our proposed StyleAM on NR-IQA.
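
The Style Mixup idea described above (mixing feature styles, i.e., the mean and variance, together with the labels) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `style` and `style_mixup`, the use of a flat 1-D feature vector, the linear interpolation of standard deviations, and the fixed mixing weight `lam` are all assumptions made for illustration.

```python
import math

def style(feat, eps=1e-6):
    """Return the (mean, std) of a 1-D feature vector, i.e. its 'feature style'."""
    mean = sum(feat) / len(feat)
    var = sum((x - mean) ** 2 for x in feat) / len(feat)
    return mean, math.sqrt(var + eps)

def style_mixup(feat_a, label_a, feat_b, label_b, lam):
    """Mix the styles of two features and their quality labels with weight lam.

    Assumption: styles are interpolated linearly and re-applied to the
    normalized content of feat_a; the source does not specify these details.
    """
    mu_a, sd_a = style(feat_a)
    mu_b, sd_b = style(feat_b)
    mu_mix = lam * mu_a + (1 - lam) * mu_b
    sd_mix = lam * sd_a + (1 - lam) * sd_b
    # Remove the style of feat_a, then apply the mixed style.
    normalized = [(x - mu_a) / sd_a for x in feat_a]
    mixed_feat = [sd_mix * x + mu_mix for x in normalized]
    # Labels (quality scores) are mixed with the same weight.
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_feat, mixed_label

mixed, score = style_mixup([1.0, 2.0, 3.0], 0.2, [10.0, 20.0, 30.0], 0.8, lam=0.5)
```

In this toy call the mixed feature takes a mean halfway between the two inputs and the mixed quality score is the average of the two labels, which reflects the ordinal consistency between style and score that the abstract aims for.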

Abstract:
The preservation and enhancement of texture information are crucial for the fusion of visible and infrared images. However, most current deep neural network (DNN)-based methods ignore the differences between texture and content, leading to unsatisfactory fusion results. To further enhance the quality of fused images, we propose a texture-content dual guided network (TCDG-Net), which produces the fused image by the guidance inferred from source images. Specifically, a texture map is first estimated jointly by combining the gradient information of visible and infrared images. Then, the features learned by the shallow feature extraction (SFE) module are enhanced with the guidance of the texture map. To effectively model the texture information in the long-range dependencies, we design the texture-guided enhancement (TGE) module, in which the texture-guided attention mechanism is utilized to capture the global similarity of the texture regions in source images. Meanwhile, we employ the content-guided enhancement (CGE) module to refine the content regions in the fused result by utilizing the complement of the texture map. Finally, the fused image is generated by adaptively integrating the enhanced texture and content information. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed TCDG-Net in terms of qualitative and quantitative evaluations. Besides, the fused images generated by our proposed TCDG-Net also show better performance in downstream tasks, such as object detection and semantic segmentation.

Abstract:
With the advancement of deepfake generation techniques, the importance of deepfake detection in protecting multimedia content integrity has become increasingly obvious. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos in terms of motion information usually exhibit quite distinct inconsistency patterns along horizontal and vertical directions, which could be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is adopted accordingly, where direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn the inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting to nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method could effectively identify directional forgery clues and achieve state-of-the-art performance.

Abstract:
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations. However, there is limited research on learning video-text representations for general video multimodal tasks based on these powerful features. Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks. Specifically, VLAB is founded on two key strategies: feature adapting and feature blending. In the former, we introduce a new video adapter module to address CLIP's deficiency in modeling temporal information and extend the model's capability to encompass both contrastive and generative tasks. In the latter, we propose an end-to-end training method that further enhances the model's performance by exploiting the complementarity of image and video features. We validate the effectiveness and versatility of VLAB through extensive experiments on highly competitive video multimodal tasks, including video text retrieval, video captioning, and video question answering. Remarkably, VLAB outperforms competing methods significantly and sets new records in video question answering on MSRVTT, MSVD, and TGIF datasets. It achieves an accuracy of 49.6, 60.9, and 79.0, respectively.

Abstract:
In recent years, multimodal sentiment analysis (MSA) has gained prominence with the proliferation of social media. However, prior studies have often disregarded the possibility of spurious correlations between multimodal data and sentiment labels. Neglecting these factors often results in significant performance degradation, hampering the model's ability to generalize in out-of-distribution (OOD) scenarios. To gain a comprehensive understanding of multimodal knowledge and enhance the model's generalization across diverse distribution scenarios, we present the Multimodal Debiasing Framework (MulDeF). This model-agnostic framework addresses label bias through causal intervention and tackles multimodal biases using counterfactual reasoning. During the training phase, MulDeF rectifies multimodal representations through frontdoor adjustment in causal intervention, effectively eliminating label bias. In order to model conditional expectation calculations within the context of frontdoor adjustment, we introduce multimodal causal attention (MCA). In the inference phase, it employs counterfactual reasoning to eliminate multimodal biases. To further refine our debiasing strategy, we categorize multimodal biases into two distinct types: nonverbal bias and verbal bias. Nonverbal bias is addressed at the utterance level, involving the establishment of unimodal models for audio and visual modalities to estimate their biases concerning sentiment labels. Conversely, verbal bias mitigation occurs at the word level. Here, we mask “harmless” words to generate corresponding counterfactual texts, which are then assessed by the text model to identify word-level bias. Experimental results validate the effectiveness of MulDeF, showcasing its superior performance in OOD settings compared to state-of-the-art methods, while also achieving competitive results in independent and identically distributed (IID) settings.

Abstract:
In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task as audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio (mainly ambient sound), visual, and speech (predominantly spoken language) modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition.

Abstract:
Speech-driven 3D facial animation has improved considerably in recent years, yet most related works utilize only the acoustic modality and neglect the influence of visual and textual cues, leading to unsatisfactory results in terms of precision and coherence. We argue that visual and textual cues are not trivial information. Therefore, we present a novel framework, namely PMMTalk, using complementary Pseudo Multi-Modal features for improving the accuracy of facial animation. The framework entails three modules: PMMTalk encoder, cross-modal alignment module, and PMMTalk decoder. Specifically, the PMMTalk encoder employs the off-the-shelf talking head generation architecture and speech recognition technology to extract visual and textual information from speech, respectively. Following this, the cross-modal alignment module aligns the audio-image-text features at temporal and semantic levels. Subsequently, the PMMTalk decoder is employed to predict lip-syncing facial blendshape coefficients. In contrast to prior methods, PMMTalk requires only an additional random reference face image yet yields more accurate results. Additionally, it is artist-friendly as it seamlessly integrates into standard animation production workflows by introducing facial blendshape coefficients. Finally, given the scarcity of 3D talking face datasets, we introduce a large-scale 3D Chinese Audio-Visual Facial Animation (3D-CAVFA) dataset. Extensive experiments and user studies show that our approach outperforms the state of the art. Codes and datasets are available at PMMTalk.

Abstract:
To accurately perform crowd counting, utilizing the complementary relationship between RGB and thermal images to analyze the crowd has become the focus of current research. Due to different imaging principles, multi-modal images often contain different contents, which constitute their modality-specific information. For example, RGB images contain more texture and color details, while thermal images contain thermal radiation information. Meanwhile, they also describe the same target content, e.g., crowds, which is modality-invariant. However, existing methods only design different modules to directly fuse RGB and thermal image features, which does not fully consider the above facts. In this paper, by analyzing the similarities and differences between multi-modal images, we propose a Modality-Invariant and -Specific Fusion Network (MISF-Net) for RGB-T Crowd Counting. Specifically, we design a modality decomposition and fusion module (MDFM), which decomposes RGB and thermal image features into modality-invariant and -specific features by using the similarity and difference supervision between multi-modal features. Besides, reconstruction supervision is also used to prevent network learning from generating bias. After that, different fusion strategies are applied to the invariant and specific features, respectively. In addition, to adapt to the variations in size of different pedestrians, we design a modality-invariant fusion module (MIFM). Finally, after the fusion decoder, MISF-Net can obtain a more accurate crowd density map. Comprehensive experiments on the RGB-T crowd counting dataset show that our MISF-Net can achieve competitive performance.

Abstract:
Single image scene relighting aims to generate a realistic new version of an input image so that it appears to be illuminated by a new target light condition. Although existing works have explored this problem from various perspectives, generating relit images under arbitrary light conditions remains highly challenging, and related datasets are scarce. Our work addresses this problem from both the dataset and methodological perspectives. We propose two new datasets: a synthetic dataset with the ground truth of intrinsic components and a real dataset collected under laboratory conditions. These datasets alleviate the scarcity of existing datasets. To incorporate physical consistency in the relighting pipeline, we establish a two-stage network based on intrinsic decomposition, giving outputs at intermediate steps, thereby introducing physical constraints. When the training set lacks ground truth for intrinsic decomposition, we introduce an unsupervised module to ensure that the intrinsic outputs are satisfactory. Our method outperforms the state-of-the-art methods in performance, as tested on both existing datasets and our newly developed datasets. Furthermore, pretraining our method or other prior methods using our synthetic dataset can enhance their performance on other datasets. Since our method can accommodate any light conditions, it is capable of producing animated results.

Abstract:
In real-world physiological and psychological scenarios, there often exists a robust complementary correlation between audio and visual signals. Audio-Visual Event Localization (AVEL) aims to identify segments with Audio-Visual Events (AVEs) that contain both audio and visual tracks in unconstrained videos. Prior studies have predominantly focused on audio-visual cross-modal fusion methods, overlooking the fine-grained exploration of the cross-modal information fusion mechanism. Moreover, due to the inherent heterogeneity of multi-modal data, new noise is inevitably introduced during the audio-visual fusion process. To address these challenges, we propose a novel Cross-modal Contrastive Learning Network (CCLN) for AVEL, comprising a backbone network and a branch network. In the backbone network, drawing inspiration from physiological theories of sensory integration, we elucidate the process of audio-visual information fusion, interaction, and integration from an information-flow perspective. Notably, the Self-constrained Bi-modal Interaction (SBI) module is a bi-modal attention structure integrated with audio-visual fusion information, and through gated processing of the audio-visual correlation matrix, it effectively captures inter-modal correlation. The Foreground Event Enhancement (FEE) module emphasizes the significance of event-level boundaries by elongating the distance between scene events during training through adaptive weights. Furthermore, we introduce weak video-level labels to constrain the cross-modal semantic alignment of audio-visual events and design a weakly supervised cross-modal contrastive learning loss (WCCL Loss) function, which enhances the quality of fusion representation in the dual-branch contrastive learning framework. Extensive experiments conducted on the AVE dataset for both fully supervised and weakly supervised event localization, as well as Cross-Modal Localization (CML) tasks, demonstrate the superior performance of our model compared to state-of-the-art approaches.

Abstract:
Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and has achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the “same amount” of noise, while ignoring the non-uniform distortion issue, i.e., partial regions undergo a “different amount” of perturbation from the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we elaborately explore this topic from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera lenses. Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods.

Abstract:
Knowledge graph construction is aimed at storing and representing the knowledge of the objective world in a structured form. Existing methods for automatic construction of knowledge graphs have problems such as difficulty in understanding potential semantics and low precision. The emergence of Large Language Models (LLMs) provides an effective way for automatic knowledge graph construction. However, using LLMs as automatic knowledge graph construction engines relies on the embedding of schema layers, which brings challenges to the input length of LLMs. In this paper, we present a framework for Adaptive Construction of Knowledge Graph by leveraging the exceptional generation capabilities of LLMs and the latent relational semantic information of triples, named ACKG-LLM. Our proposed framework divides the knowledge graph construction task into three subtasks within a unified pipeline: triple extraction of open information, additional relational semantic information embedding and knowledge graph normalization based on schema-level embedding. The framework can construct knowledge graphs in different domains, making up for the defects of existing frameworks that need to retrain and fine-tune the internal model. Extensive experiments demonstrate that our proposed ACKG-LLM performs favorably against representative methods on the REBEL and WiKi-NRE datasets.

Abstract:
Open-vocabulary multi-object tracking (OVMOT) is a cutting-edge research direction within the multi-object tracking field. It employs large multi-modal models to effectively address the challenge of tracking unseen objects within dynamic visual scenes. While models require robust domain generalization and temporal adaptability, OVTrack, the only existing open-vocabulary multi-object tracker, relies solely on static appearance information and lacks these crucial adaptive capabilities. In this paper, we propose OVSORT, a new framework designed to improve domain generalization and temporal information processing. Specifically, we first propose the Adaptive Contextual Normalization (ACN) technique in OVSORT, which dynamically adjusts the feature maps based on the dataset's statistical properties, thereby fine-tuning our model to improve domain generalization. Then, we introduce motion cues for the first time. Using our Joint Motion and Appearance Tracking (JMAT) strategy, we obtain a joint similarity measure and subsequently apply the Hungarian algorithm for data association. Finally, our Hierarchical Adaptive Feature Update (HAFU) strategy adaptively adjusts feature updates according to the current state of each trajectory, which greatly improves the utilization of temporal information. Extensive experiments on the TAO validation set and test set confirm the superiority of OVSORT, which significantly improves the handling of novel and base classes. It surpasses existing methods in terms of accuracy and generalization, setting a new state-of-the-art for OVMOT.
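
As a rough illustration of data association with a joint motion and appearance similarity, the sketch below scores each track-detection pair and picks the assignment with the highest total similarity. Several choices are assumptions that the abstract does not state: box IoU stands in for the motion cue, cosine similarity for the appearance cue, and the weight `alpha` and the function names are hypothetical. For brevity the optimal assignment is found by exhaustive search rather than the Hungarian algorithm; both return an optimal assignment, but exhaustive search scales factorially and is only usable on tiny inputs.

```python
import math
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def associate(tracks, detections, alpha=0.5):
    """Match tracks to detections by maximizing the summed joint similarity.

    tracks, detections: lists of (box, embedding) pairs.
    Assumes len(tracks) <= len(detections); alpha is a hypothetical weight.
    Returns a list of (track_index, detection_index) pairs.
    """
    sim = [[alpha * iou(tb, db) + (1 - alpha) * cosine(te, de)
            for db, de in detections]
           for tb, te in tracks]
    # Exhaustive search over assignments (stand-in for the Hungarian algorithm).
    best = max(permutations(range(len(detections)), len(tracks)),
               key=lambda cols: sum(sim[r][c] for r, c in enumerate(cols)))
    return list(enumerate(best))

tracks = [((0, 0, 10, 10), [1.0, 0.0]), ((20, 20, 30, 30), [0.0, 1.0])]
detections = [((21, 21, 31, 31), [0.0, 1.0]), ((1, 1, 11, 11), [1.0, 0.0])]
matches = associate(tracks, detections)
```

A practical tracker would also reject matches whose similarity falls below a threshold and would handle unmatched tracks and detections, both of which this sketch omits.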

Abstract:
Extending image-based Large Multimodal Models (LMMs) to video-based LMMs always requires temporal modeling in the pre-training. However, training the temporal modules gradually erases the knowledge of visual features learned from various image-text-based scenarios, leading to degradation in some downstream tasks. To address this issue, in this paper, we introduce a novel, efficient transfer approach termed MTransLLAMA, which employs transfer learning from pre-trained image LMMs for fine-grained video tasks with only small-scale training sets. Our method enables fewer trainable parameters and achieves faster adaptation and higher accuracy than pre-training video-based LMM models. Specifically, our method adopts early fusion between textual and visual features to capture fine-grained information, reuses spatial attention weights in temporal attentions for cyclical spatial-temporal reasoning, and introduces dynamic attention routing to capture both global and local information in spatial-temporal attentions. Experiments demonstrate that across multiple datasets and tasks, without relying on video pre-training, our model achieves state-of-the-art performance, enabling lightweight and efficient transfer from image-based LMMs to fine-grained video tasks.

Abstract:
With the rapid advancement of large language models, there has been a growing interest in their capabilities in mathematical reasoning. However, existing research has primarily focused on text-based algebra problems, neglecting the study of geometry due to the lack of high-quality geometric datasets. To address this gap, this paper introduces AutoGeo, a novel approach for automatically generating mathematical geometric images to fulfill the demand for large-scale and diverse geometric datasets. AutoGeo facilitates the creation of AutoGeo-100k, an extensive repository comprising 100k high-quality geometry image-text pairs. By leveraging precisely defined geometric clauses, AutoGeo-100k contains a wide variety of geometric shapes, including lines, polygons, circles, and complex spatial relationships. Furthermore, this paper demonstrates the efficacy of AutoGeo-100k in enhancing the performance of multimodal large language models through fine-tuning. Experimental results indicate significant improvements in the model's ability to handle geometric images, as evidenced by enhanced accuracy in tasks such as geometric captioning and mathematical reasoning. This research not only fills a critical gap in the availability of geometric datasets but also paves the way for the advancement of sophisticated AI-driven tools in education and research.

Abstract:
Interpretable visual recognition is essential for decision-making in high-stakes situations. Recent advancements have automated the construction of interpretable models by leveraging Visual Language Models (VLMs) and Large Language Models (LLMs) with Concept Bottleneck Models (CBMs), which process a bottleneck layer associated with human-understandable concepts. However, existing methods suffer from two main problems: a) the concepts collected from LLMs could be redundant with task-irrelevant descriptions, resulting in an inferior concept space with potential mismatch; b) VLMs directly map global deterministic image embeddings to fine-grained concepts, resulting in an ambiguous process with imprecise mapping results. To address the above two issues, we propose a novel solution for CBMs with Concise Concept and Probabilistic Modeling (CCPM) that can achieve superior classification performance via high-quality concepts and a precise mapping strategy. First, we leverage in-context examples as category-related clues to guide the LLM concept generation process. To mitigate redundancy in the concept space, we propose a Relation-Aware Selection (RAS) module to obtain a concise concept set that is discriminative and relevant based on image-concept and inter-concept relationships. Second, for precise mapping, we employ a Probabilistic Distribution Adapter (PDA) that estimates the inherent ambiguity of the image embeddings of pre-trained VLMs to capture the complex relationships with concepts. Extensive experiments indicate that our model achieves state-of-the-art results with a 6.18% improvement in classification accuracy on eight mainstream recognition benchmarks as well as reliable explainability through interpretable analysis.

Abstract:
Few-shot 3D point cloud semantic segmentation is a challenging task due to the lack of labeled point clouds (support set). To segment unlabeled query point clouds, existing prototype-based methods learn 3D prototypes from point features of the support set and then measure their distances to the query points. However, such homogeneous 3D prototypes are often of low quality because they overlook the valuable heterogeneous information buried in the support set, such as semantic labels and projected 2D depth maps. To address this issue, in this paper, we propose a novel Relation Consistency-guided Heterogeneous Prototype learning framework (RCHP), which improves prototype quality by integrating heterogeneous information using large multi-modal models (e.g. CLIP). RCHP achieves this through two core components: Heterogeneous Prototype Generation module which collaborates with 3D networks and CLIP to generate heterogeneous prototypes, and Heterogeneous Prototype Fusion module which effectively fuses heterogeneous prototypes to obtain high-quality prototypes. Furthermore, to bridge the gap between heterogeneous prototypes, we introduce a Heterogeneous Relation Consistency loss, which transfers more reliable inter-class relations (i.e., inter-prototype relations) from refined prototypes to heterogeneous ones. Extensive experiments conducted on five point cloud segmentation datasets, including four indoor datasets (S3DIS, ScanNet, SceneNN, NYU Depth V2) and one outdoor dataset (Semantic3D), demonstrate the superiority and generalization capability of our method, outperforming state-of-the-art approaches across all datasets.
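
The prototype-based baseline that the abstract builds on (learn prototypes from support point features, then measure their distances to query points) can be sketched as follows. This is a generic illustration, not RCHP itself: taking the per-class mean as the prototype, using Euclidean distance, and the function names `class_prototypes` and `segment_query` are assumptions.

```python
import math
from collections import defaultdict

def class_prototypes(support_feats, support_labels):
    """Average the support point features of each class into one prototype."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(support_feats, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def segment_query(query_feats, prototypes):
    """Label each query point with the class of its nearest prototype."""
    return [min(prototypes, key=lambda c: math.dist(feat, prototypes[c]))
            for feat in query_feats]

support = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["floor", "floor", "chair", "chair"]
protos = class_prototypes(support, labels)
pred = segment_query([[1.0, 1.0], [10.0, 9.0]], protos)
```

Because each prototype here is built from 3D point features alone, it is the kind of homogeneous prototype the abstract describes as low quality; RCHP is said to add heterogeneous information on top of this scheme.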

Abstract:
Learning-based online monocular 3D reconstruction has emerged with great potential recently. Most state-of-the-art methods focus on two key questions, namely 1) how to exploit accurate voxel features and 2) how to preserve detailed voxels in the sparsification process. However, 1) most methods adopt the same receptive field to extract features for both informative and uninformative regions, which struggle to capture geometric details. Furthermore, 2) they mainly utilize a fixed threshold or a straightforward ray-based algorithm to discard voxels in the sparsification process. However, some detailed regions (especially thin regions) may be discarded incorrectly. To tackle these challenges, we present a novel method named DetailRecon to focus on detailed regions that contain more geometric information. Specifically, we first propose an Adaptive Hybrid Fusion (AHF) module and a Connectivity-Aware Sparsification (CAS) module for voxel feature learning and voxel sparsification, respectively. 1) The AHF receives multiple feature maps with different receptive fields as input, and adaptively adopts a smaller receptive field for regions with fine structures to exploit accurate geometric details. 2) The CAS updates the occupancy value of voxels based on the connected voxels within its neighbor space, which could expand the radiation range of reliable voxels in detailed regions and eventually reduce their probability of being discarded. Moreover, 3) we introduce a lightweight yet effective pipeline named Focus On Fine (FOF) to accelerate our DetailRecon. In addition, 4) we propose a Hierarchical Consistency Loss (HCL) to align multi-level volume features, which assists in exploring accurate volume features for recovering more details. Extensive experiments conducted on the ScanNet (V2) and 7-Scenes datasets demonstrate the superiority of our DetailRecon.

Abstract:
Online task-free continual learning (OTFCL) is a more challenging variant of continual learning that emphasizes the gradual shift of task boundaries and learning in an online mode. Existing methods rely on a memory buffer of old samples to prevent forgetting. However, the use of memory buffers not only raises privacy concerns but also hinders the efficient learning of new samples. To address this problem, we propose a novel framework called I^2CANSAY that removes the dependence on memory buffers and efficiently learns the knowledge of new data from one-shot samples. Concretely, our framework comprises two main modules. Firstly, the Inter-Class Analogical Augmentation (ICAN) module generates diverse pseudo-features for old classes based on the inter-class analogy of feature distributions for different new classes, serving as a substitute for the memory buffer. Secondly, the Intra-Class Significance Analysis (ISAY) module analyzes the significance of attributes for each class via its distribution standard deviation, and generates an importance vector as a correction bias for the linear classifier, thereby enhancing the capability of learning from new samples. Experiments on four popular image classification datasets (CoRe50, CIFAR-10, CIFAR-100, and CUB-200) show that our approach outperforms the prior state-of-the-art by a large margin.

Abstract:
Watermarking is a tool for actively identifying and attributing the images generated by latent diffusion models. Existing methods face a dilemma between image quality and watermark robustness. Watermarks with superior image quality usually have inferior robustness against attacks such as blurring and JPEG compression, while watermarks with superior robustness usually significantly damage image quality. This dilemma stems from the traditional paradigm where watermarks are injected and detected in pixel space, relying on pixel perturbation for watermark detection and resilience against attacks. In this paper, we highlight that an effective solution to the problem is to both inject and detect watermarks in the latent diffusion space, and propose Latent Watermark (LW) with a progressive training strategy. It weakens the direct connection between quality and robustness and thus alleviates their contradiction. We conduct evaluations on two datasets and against 10 watermark attacks. Six metrics measure the image quality and watermark robustness. Results show that compared to recently proposed methods such as StableSignature, StegaStamp, RoSteALS, LaWa, TreeRing, and DiffuseTrace, LW not only surpasses them in terms of robustness but also offers superior image quality.

Abstract:
In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing has emerged as a focal point of research, taking advantage of the multifaceted and dynamic nature of multimedia data. Existing unsupervised cross-modal hashing methods rely mainly on initial pre-trained correlations among cross-modal features, and the inaccurate neighborhood correlations impact the presentation of common semantics throughout the optimization. To address these issues, we propose Ensemble Prototype Networks (EPNet), which delineates class attributes of cross-modal instances through an ensemble clustering methodology. EPNet seeks to extract correlation information between instances by leveraging local correlation aggregation and ensemble clustering from multiple perspectives, aiming to reduce initialization effects and enhance cross-modal representations. Specifically, the local correlation aggregation is first proposed within a batch of semantic affinity relationships to generate a precise and compact hash code among cross-modal instances. Secondly, the ensemble prototype module is employed to discern the class attributes of deep features, thereby aiding the model in extracting more universally applicable feature representations. Thirdly, in an early attempt to constrain the representational congruity between local semantic affinity relationships and deep feature ensemble prototype correlations, a cross-task consistency loss is used to enhance the representation of cross-modal common semantic features. Finally, EPNet outperforms several state-of-the-art cross-modal retrieval methods on three real-world image-text datasets in extensive experiments.

Abstract:
Group features have significant effects on pedestrian movement and constitute a focal point in pedestrian trajectory prediction research. In reality, pedestrians within a group exhibit notable consistency features due to their compact spatial positions, close destinations, and factors such as coordination within the group. In contrast, owing to the dispersed destinations among groups and the lack of coordination, there are significant differences in velocity and direction between the groups, leading to strong conflicts. However, existing pedestrian trajectory prediction models based on group features lack sufficient quantification of both within-group and between-group features. To address this problem, we propose Group-PTP, a novel pedestrian trajectory prediction model based on group features. Specifically, we first propose a group graph attention network-based group features aggregation method (Group-GAT). By quantifying and aggregating the intra-consistency and inter-conflict features exhibited by the groups, our method can better capture the features and interactions both within and between groups. Second, we propose a group multi-feature information representation model that fuses captured group aggregate features, pedestrian coordinates, surrounding pedestrian features, and obstacle features through fusion concatenation. Finally, we propose a multi-feature temporal convolutional network (MF-TCN) that embeds the impact weights of multi-feature information into pedestrian coordinates to obtain feature outputs and conducts temporal operations on feature outputs to predict future trajectories. The experimental results demonstrate that our proposed Group-PTP achieves state-of-the-art performance on several different trajectory prediction benchmarks.

Abstract:
Group convolution networks have shown great potential in hyperspectral image (HSI) classification because of their ability to divide total spectral bands into multiple groups and focus on fine discrimination within different spectral ranges. Most group convolution networks process parallel spectral groups independently; however, they neglect the important relevance of nearby spectral ranges. Moreover, the feature maps from different spectral groups are not considered for recalibration in the existing attention. To address these issues, we propose a novel group interactive threshold attention network (GITANet). In the network, a stratified-split-concatenation strategy, which not only splits all bands into multiple groups for intragroup convolution but also propagates the intergroup information via the stratified concatenation operation between different groups, is designed for bandwise group convolution. Relying on the high dependencies among nearby spectra, the cross-group interactive attention block is designed to encourage significant spectral features. Subsequently, from different spectral ranges, a learnable threshold generation block is built to estimate the information validity of each pixel. On the basis of this threshold, soft threshold spatial attention is developed in the bandwise encoder-decoder architecture, which emphasizes high-value spatial areas during the fusion of group convolutional features. Therefore, complementary and discriminative spectral-spatial features are obtained to improve the performance of HSI classification. The experimental results on three HSI datasets illustrate that GITANet is superior to several state-of-the-art networks.

Abstract:
Mask-guided matting networks have achieved significant improvements and have shown great potential in practical applications in recent years. However, because they learn matting representations only from synthetic matting data that lacks real-world diversity, these approaches tend to overfit low-level details in wrong regions, generalize poorly to objects with complex structures and to real-world scenes such as shadows, and suffer from interference of background lines or textures. To address these challenges, in this paper, we propose a novel auxiliary learning framework for mask-guided matting models, incorporating three auxiliary tasks besides matting (semantic segmentation, edge detection, and background line detection) to learn different and effective auxiliary representations from different types of data and annotations. Our framework and model introduce the following key aspects: 1) to learn real-world adaptive semantic representation for objects with diverse and complex structures under real-world scenes, we introduce extra semantic segmentation and edge detection tasks on more diverse real-world data with segmentation annotations; 2) to avoid overfitting on low-level details, we propose a module to utilize the inconsistency between learned segmentation and matting representations to regularize detail refinement; 3) we introduce a novel background line detection task into our auxiliary learning framework, to suppress interference of background lines or textures. In addition, we propose a high-quality matting benchmark, Plant-Mat, to evaluate matting methods on complex structures. Extensive quantitative and qualitative results show that our approach outperforms state-of-the-art mask-guided methods.

Abstract:
In this article, different from previous traditional multi-exposure image fusion (MEF) algorithms that use hand-designed feature extraction approaches or deep learning-based algorithms that utilize convolutional neural networks for information preservation, we propose a novel multi-Exposure image fusion method via Adversarial learning and focal Transformer, named EAT. In our framework, a Focal Transformer is proposed to focus on more remarkable regions and construct long-range multi-exposure relationships, with which the fusion model can simultaneously extract local and global multi-exposure properties and therefore generate promising fusion results. To further improve the fusion performance, we introduce adversarial learning to train the proposed method in an adversarial manner with the guidance of ground truth. By doing so, the fused images exhibit better visual perception and color fidelity. Extensive experiments conducted on publicly available databases provide compelling evidence that EAT surpasses other state-of-the-art approaches on both quantitative and qualitative evaluations. Furthermore, we directly employ our trained model to address another benchmark MEF dataset. The impressive fusion performance serves as evidence of the credible generalization ability of EAT.

Abstract:
Human-centric Emotional Video Captioning (H-EVC) aims to generate fine-grained, emotion-related sentences for human-based videos, enhancing the understanding of human emotions and facilitating human-computer emotional interaction. However, existing video captioning methods often overlook subtle emotional clues and interactions in videos. As a result, the generated captions frequently lack emotional information. To address this, we propose Emotion-oriented Cross-modal Prompting and Alignment (ECPA), which improves H-EVC accuracy by modeling fine-grained visual-textual emotion clues. Using large foundation models, ECPA introduces two learnable prompting strategies: visual emotion prompting (VEP) and textual emotion prompting (TEP), along with an emotion-oriented cross-modal alignment (ECA) module. VEP uses two levels of visual prompts, i.e., emotion recognition (ER) and action unit (AU), to focus on both coarse and fine visual emotional features. TEP devises two-level learnable textual prompts, i.e., sentence-level emotional tokens and word-level masked tokens, to capture global and local textual emotion representations. ECA introduces another two levels of emotion-oriented prompt alignment learning mechanisms: the ER-sentence level and the AU-word level alignment losses. Both enhance the model's ability to capture and integrate global and local cross-modal emotion semantics, thereby enabling the generation of fine-grained emotional linguistic descriptions in video captioning. Experiments show ECPA significantly outperforms state-of-the-art methods on various H-EVC datasets (relative improvements of 9.98%, 5.72%, 4.46%, 24.52% on MAFW, and 12.82%, 20.27%, 4.22%, 5.01% on EmVidCap across four evaluation metrics) and supports zero-shot tasks on MSVD and MSRVTT, demonstrating strong applicability and generalization.

Abstract:
The core of image-text retrieval is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more semantic variations. Therefore, images are usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward outputting the average representation of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through dynamic orthogonal constraint loss. To encourage the generated candidate image embeddings to capture various semantic variations, we construct a mixed distribution and employ a variance-aware weighting loss to assign different weights to the optimization process. In addition, we develop a Fast Re-ranking strategy (FR) to efficiently evaluate the retrieval results and enhance the performance. We compare the performance with existing set-based methods using five image feature encoders and three text feature encoders on three benchmark datasets: MSCOCO, Flickr30K and CUB Captions. We also show the role of different components by ablation studies and perform a sensitivity analysis of the hyperparameters. The qualitative analysis of visualized bidirectional retrieval and attention maps further demonstrates the ability of our method to encode semantic variations.

Abstract:
Pre-trained models are extensively embraced in deep learning, facilitating efficient fine-tuning for downstream user-specific tasks and yielding substantial computational savings. However, backdoor attacks present a significant security threat to downstream models constructed on corrupted pre-trained models, necessitating the implementation of effective countermeasures to mitigate this threat prior to deploying the models in safety-critical applications. This paper introduces Purifier and its advanced version Purifier^+. The former mitigates backdoors in pre-trained models by aligning anomaly activation to normal activation, and the latter builds on this by rating the importance of activation patterns, boosting important activation patterns and suppressing unimportant ones. Purifier and Purifier^+ draw inspiration from the observation that anomaly activation patterns for backdoor triggers manifest across various perspectives such as channel-wise, cube-wise, and feature-wise, each exhibiting distinct levels of granularity. Crucially, the choice of alignment granularity plays a pivotal role in ensuring robustness and accuracy. In addressing this challenge, Purifier and Purifier^+ demonstrate the ability to effectively thwart various categories of backdoor triggers without requiring prior information about the specific backdoor attacks. Additionally, they offer a convenient and flexible deployment feature, namely, plug-and-play capability. The comprehensive experimental results demonstrate that Purifier and Purifier^+ outperform current methodologies regarding defense efficacy and accuracy in model inference with uncontaminated samples when subjected to a series of state-of-the-art mainstream attacks.

Abstract:
Open-set single-source domain generalization aims to use a single-source domain to learn a robust model that can be generalized to unknown target domains with both domain shifts and label shifts. The scarcity of the source domain and the unknown data distribution of the target domain pose a great challenge for domain-invariant feature learning and unknown class recognition. In this article, we propose a novel learning approach based on domain expansion and boundary growth to expand the scarce source samples and enlarge the boundaries across the known classes that indirectly broaden the boundary between the known and unknown classes. Specifically, we achieve domain expansion by employing both background suppression and style augmentation on the source data to synthesize new samples. Then we force the model to distill consistent knowledge from the synthesized samples so that the model can learn domain-invariant information. Furthermore, we realize boundary growth across classes by using edge maps as an additional modality of samples when training multi-binary classifiers. In this way, it enlarges the boundary between the inliers and outliers, and consequently improves the unknown class recognition during open-set generalization. Extensive experiments show that our approach can achieve significant improvements and reach state-of-the-art performance on several cross-domain image classification datasets.

Abstract:
Cross-modality based on remote sensing (RS) text-image retrieval has gained increasing attention in recent years due to its ability to leverage the rich semantics of images and the understandability of text to provide a more comprehensive description. Existing cross-modal retrieval methods typically apply self-attention or cross-attention mechanisms to identify important information in RS data, but they ignore the multi-view perception characteristic of geographical space in RS images. As a result, these retrieval models fail to locate the correct perspective in images according to the query text, ultimately leading to incorrect matching. In this work, a Cross-modal Progressive Perspective Matching Network (CPPMN) is proposed for remote sensing image-text retrieval by establishing a progressive perspective matching mechanism and semantic alignment to further improve the performance of the retrieval model. Specifically, the CPPMN framework consists of three core modules: the Compensation Network for Full Perspective Modeling (CN_FPM), the Graph Transformation for Individual Perspective Modeling (GT_IPM), and the Cascaded Transformer for Cross-modal Semantic Alignment (CT_CSA). The CN_FPM module utilizes all positive text samples as supervision signals to guide the feature extraction training process, aiming to capture full perspective information from images. Subsequently, the GT_IPM module transforms implicit-perspective feature representations into explicit-perspective cross-modal relationship graphs. This transformation enables the identification of specific perspective locations within the image according to the query sentence by analyzing graph density and connectivity. 
Finally, the CT_CSA module comprises a cascaded Transformer network that aligns features at the semantic level between cross-modal data. Quantitative and qualitative experiments are conducted on four large-scale remote sensing cross-modal retrieval datasets to demonstrate the significant performance gains from adopting the progressive perspective matching mechanism and semantic alignment strategy.

Abstract:
The advent of large language models, which enable flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly in comprehensively handling 3D shapes with other modalities, are still under-explored. By achieving instruction-based shape generation, versatile multi-modal generative shape models can significantly benefit various fields, such as 3D virtual construction and network-aided design. In this article, we present ShapeGPT, a shape-included multi-modal framework to leverage strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a “word-sentence-paragraph” framework to discretize continuous shapes into shape words, further assembles these words into shape sentences, and integrates shape with instructional text for multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, including shape representation, multi-modal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.

Abstract:
Deep learning-based compressed sensing (CS) technology attracts widespread attention owing to its remarkable reconstruction with only a few sampling measurements and low computational complexity. However, the existing video compressive sampling approaches cannot fully exploit the inherent interframe and intraframe correlations and sparsity of video sequences. To address this limitation, a novel sampling and reconstruction method for video CS (called WRDD) is proposed, which exploits the advantages of wavelet residual sampling and dual-domain fusion optimization. Specifically, in order to capture high-frequency details and achieve efficient and high-quality measurements, we propose a wavelet residual (WR) sampling strategy for nonkeyframe sampling, which is achieved by the wavelet residuals between nonkeyframes and keyframes. Furthermore, a dual-domain (DD) fusion strategy is proposed, which fully combines intraframe and interframe information to improve the reconstruction quality of nonkeyframes both in the pixel domain and in multilevel feature domains. Extensive experiments demonstrate that our WRDD surpasses the state-of-the-art video and image CS methods in both subjective and objective evaluations. Besides, it exhibits outstanding antinoise capability and computational efficiency.

Abstract:
Empirical risk minimization (ERM) is a celebrated induction principle for developing data-driven models. However, ERM has been reported to both succeed and fail at domain generalization (DG). This paper therefore studies the success and failure of ERM on supervised DG classification tasks, both theoretically and empirically, from a causal perspective. In the theoretical aspect, we first explore different properties of a causal metric termed information flow, followed by a discussion of the relationships between the information flow and the mutual information in the proposed causal graph. Next, we analyze the roles of the transformed causal feature and the transformed spurious feature in modeling performance. The analysis reveals that the interaction between the spurious influencer and the transformed causal feature is the key factor determining the failure or success of ERM on DG. In the empirical study, we first simulate various DG settings based on the MNIST, Fashion MNIST, and CIFAR10 datasets. Next, we verify the developed theories by testing three different neural network configurations in designed experiments. In addition, experiments based on real-world datasets are conducted to further consolidate key points of the proposed theories. To extend the application benefits of the theoretical discoveries, a new risk minimization framework with a novel feature intervention for regulating ERM is proposed. It achieves DG improvements over ERM on real-world datasets of image segmentation, image classification, and text classification.

Abstract:
Emerging learning-based video compression suffers from error propagation in long group of pictures (GOP), yielding limited coding performance. To address this problem, a novel end-to-end Deep Video Compression method based on Hierarchical Temporal Context Learning (DVCH) is proposed in this paper. DVCH aims to fully exploit temporal contexts and suppress error propagation for better coding performance. It first divides video frames into several hierarchies with different compression qualities. The frames in lower hierarchies have high compression quality, and serve as reference frames. To mine high-quality reference information, we propose a Hierarchical Temporal Context Learning (HTCL) network as the fundamental module of our DVCH. The informative temporal context features from hierarchical prediction structure can be extracted by the network. Motion vectors (MVs) between the to-be-coded frame and its reference frames are estimated by the MV Learning module and used to align the extracted contexts. The contexts are fed into Context Coding module to generate the prediction of the decoded frame. Moreover, a multi-stage training strategy is developed to solve the imbalanced training challenge. Experimental results demonstrate that the proposed DVCH exceeds x264 and other end-to-end video compression methods, regardless of objective, subjective, error propagation suppression, GOP sizes, and sequence length evaluations. As much as 49.27% bitrate savings and 2.52 dB PSNR gains can be achieved in large GOP.

Abstract:
In recent years, the lookup tables (LUTs) with deep learning for image enhancement have achieved remarkable results with extremely high inference efficiency. However, when dealing with severely degraded low-light images, lookup-table-based methods tend to exhibit poor enhancement results due to the lack of contextual and global information. To address the limitations of current lookup-table-based methods in the low-light image enhancement task, we propose the novel Wide Vision Lookup Tables (WV-LUT) by introducing Complementary-Hierarchical 4D-LUTs into 3D-LUT, which allows 3D-LUT to have a wider range of vision. Specifically, the 4D-LUTs are used to expand the receptive field and process local information on a single channel, while a 3D-LUT is used for sRGB channel post-processing. Additionally, we propose a lightweight Global Adjustment Module that further enhances the performance and generalization of WV-LUT by obtaining global adjustment parameters for gamma and color correction matrix to adaptively process images. Experimental results demonstrate that our method outperforms other state-of-the-art methods in low-light image enhancement with the highest average ranking and superior inference efficiency. Furthermore, deployment experiments on mobile devices demonstrate that our WV-LUT achieves superior results and inference efficiency, showcasing promising application prospects for edge devices.
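
As a rough illustration of what a global gamma and color-correction-matrix adjustment does to a pixel, the Python sketch below applies a gamma curve followed by a 3×3 matrix to one normalized RGB value. The function name `adjust_pixel`, the order of the two operations, and the clamping to [0, 1] are assumptions made for illustration; the abstract does not specify how the Global Adjustment Module predicts or applies its parameters.

```python
def adjust_pixel(rgb, gamma, ccm):
    """Apply a gamma curve, then a 3x3 color correction matrix, to one RGB pixel.

    rgb   -- three floats in [0, 1]
    gamma -- exponent of the gamma curve (values below 1 brighten dark pixels)
    ccm   -- 3x3 matrix given as a list of three rows
    Hypothetical helper: the order (gamma first, matrix second) is an assumption.
    """
    # Clamp the input, then apply the per-channel gamma curve.
    curved = [max(0.0, min(1.0, c)) ** gamma for c in rgb]
    # Mix channels with the color correction matrix (row-by-column product).
    mixed = [sum(ccm[r][k] * curved[k] for k in range(3)) for r in range(3)]
    # Clamp the result back into the displayable range.
    return [max(0.0, min(1.0, v)) for v in mixed]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
brightened = adjust_pixel([0.25, 0.25, 0.25], 0.5, IDENTITY)
```

With an identity matrix and gamma 0.5, a dark gray value of 0.25 is lifted to 0.5, which is the kind of global brightening a low-light enhancer needs before finer local processing.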

Abstract:
For the prediction unit partition modes (PUPM)-based steganography, a mainstream branch of high efficiency video coding (HEVC) video steganography, striking a balance between embedding performance and security is very challenging. Including the 2N × 2N PUPMs, which are the most numerous PUPMs, in data embedding is indeed an effective way of enlarging the embedding capacity, but it necessarily causes a significant decline in security. Therefore, a multi-factor-involved cost function (MFICF) is proposed in this paper to evaluate the embedding cost for modifying each PUPM by comprehensively considering four different aspects affecting the embedding performance and security. With the assistance of MFICF, the 7-ary notational system is used so that all 7 types of PUPMs, including 2N × 2N, take part in data embedding, thus enlarging the embedding capacity as well as enhancing the embedding efficiency. The syndrome-trellis code driven by MFICF, named CFSTC, is designed to preferentially select PUPMs with low embedding costs for data embedding, so that the embedding efficiency is largely enhanced. The security is effectively guaranteed by allocating a large embedding cost for modifying 2N × 2N to another type of PUPM. Finally, a lightweight convolutional neural network in combination with gated channel transformation, called GSCNet, is proposed to replace the in-loop filter in HEVC, further reducing the visual distortion and bitrate increase caused by data embedding. Combining the components above, we design a PUPM-based steganography algorithm, GSAPM. Experimental results show that GSAPM effectively enhances the embedding performance while maintaining high security.
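
The capacity gain from a 7-ary notational system can be seen with a small radix conversion: each base-7 digit can select one of seven partition modes, so one symbol carries about log2(7) ≈ 2.81 bits. The Python sketch below only shows the conversion between a bit string and base-7 digits; the function names are hypothetical, and the mapping from digits to PUPM types, the MFICF costs, and the syndrome-trellis coding are not specified in the abstract and are not modeled here.

```python
def bits_to_base7(bits):
    """Convert a bit string such as '101101' to base-7 digits, most significant first."""
    value = int(bits, 2) if bits else 0
    # Smallest digit count k such that every len(bits)-bit value fits: 7**k >= 2**len(bits).
    n_digits = 1
    while 7 ** n_digits < 2 ** len(bits):
        n_digits += 1
    digits = []
    for _ in range(n_digits):
        value, digit = divmod(value, 7)
        digits.append(digit)
    return digits[::-1]


def base7_to_bits(digits, n_bits):
    """Inverse of bits_to_base7: rebuild the original n_bits-long bit string."""
    value = 0
    for digit in digits:
        value = value * 7 + digit
    return format(value, "0{}b".format(n_bits))


payload = "101101"
symbols = bits_to_base7(payload)            # one digit in 0..6 per carrier symbol
recovered = base7_to_bits(symbols, len(payload))
```

Here six payload bits fit into three base-7 symbols, whereas a binary scheme with one bit per symbol would need six.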

Abstract:
Like deep neural network (DNN)-based classifiers, DNN-based trackers are also vulnerable to adversarial attacks that degrade the tracking performance by adding adversarial perturbations to the input videos. This paper proposes a detection method for the first time to assist the tracker in detecting adversarial attacks. The adversarial perturbations in the visual object tracking task are invisible but are effective at attacking trackers. This naturally creates challenges in detecting attacks in the spatial pixel domain. To this end, we innovatively transfer the detection of adversarial attacks from the spatial domain to the frequency domain. Specifically, we first theoretically prove that the perturbations are added mainly to the high-frequency band of the video. Then, from the empirical studies, we conclude that the low-frequency band contributes most to the tracking performance and is most robust against adversarial attacks. According to the theoretical proof and empirical conclusion, we finally design an unsupervised adversarial detection framework, which mainly contains a frequency decomposition module (FDM), a target tracker (TT) with its mirror tracker (MT), and a discriminant module (DM). For an input video, the TT is fed the full-frequency video, whereas the MT takes as input the low-frequency video that is decomposed by the FDM. The DM discriminates the input video as adversarial or natural by comparing the tracking performance differences between the two trackers. The whole detection process is performed along with the tracking phase, and all the modules in the framework require no training on adversarial examples. Extensive experiments demonstrate that our adversarial detection framework can effectively detect mainstream adversarial attacks in the tracking field. It can also be flexibly integrated with many trackers, including anchor-based and anchor-free trackers.
More importantly, the trackers integrated with the detection framework can still maintain near-original tracking performance.
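
The idea of feeding a tracker only the low-frequency band can be illustrated with a plain discrete Fourier transform. The Python sketch below works on a 1-D signal for brevity, keeps the frequency bins within a cutoff of the zero frequency, and transforms back. It is a generic low-pass filter under assumed names (`dft`, `low_frequency_part`, `cutoff`), not the paper's frequency decomposition module, whose transform and band boundaries the abstract does not state.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def low_frequency_part(signal, cutoff):
    """Keep only the frequency bins whose distance from bin 0 is at most `cutoff`."""
    n = len(signal)
    spectrum = dft(signal)
    # Bin k and bin n - k hold the same absolute frequency, so keep both or neither.
    kept = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    return [v.real for v in dft(kept, inverse=True)]


smooth = low_frequency_part([3.0] * 8, 0)              # constant signal: only bin 0
filtered = low_frequency_part([1.0, -1.0] * 4, 1)      # fastest oscillation: removed
```

A constant signal passes through unchanged, while a sample-to-sample oscillation, which is where a small high-frequency perturbation would live, is suppressed.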

Abstract:
We present Centra-Net, a centralized network that concurrently optimizes visual localization over numerous scenes under heterogeneous dataset domains. Centra-Net exemplifies storage efficiency by amalgamating multiple models with task-shared parameters into a singular cohesive structure. Technically, we develop a basic feature extraction unit (BFEU) with two parallel branches: one dedicated to local feature extraction and the other adept at adaptively generating a task-specific attention mask for feature calibration, thus bolstering its feature extraction capability across diverse scenes. Based on the BFEU, we introduce a filter-wise sharing mechanism (FSM) that adaptively determines parameter sharing within the unit, thus facilitating fine-grained parameter allocation. The key insight of FSM resides in reconceptualizing the parameter sharing of the unit as a learnable paradigm, enabling the determination of shared parameters to be made post-training. Finally, we suggest a complexity-prioritized gradient algorithm (CPGA) that capitalizes on task complexity to attain a harmonious learning space for various tasks, thus safeguarding optimal performances across all tasks. Through rigorous experiments on numerous benchmarks, Centra-Net demonstrates a notable edge over existing state-of-the-art works while operating with a significantly reduced parameter footprint.

Abstract:
Semantic segmentation is a fundamental task in multimedia processing, which can be used for analyzing, understanding, and editing the contents of images and videos, among other uses. To accelerate the analysis of multimedia data, existing segmentation research tends to extract semantic information by progressively reducing the spatial resolutions of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets (Cityscapes, BDD100K, and ADE20K), demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.

Abstract:
Multi-view clustering, which identifies shared semantics from different perspectives and classifies data samples into distinct categories using unsupervised methods, is gaining increasing interest. This task primarily focuses on learning consistent multi-view feature representations and clustering labels. Current approaches for achieving consistent multi-view feature representations often use techniques such as cascading, weight fusion, and attention mechanism fusion. These methods reconstruct features via an encoder-decoder from original low-level features, which often contain view-private visual information, leading to misleading feature representations. Furthermore, in the clustering label learning process, many methods use a two-stage approach: first, they achieve consistent feature representations, and then they apply hard labeling methods like K-means or spectral clustering to obtain clustering labels. Single-stage methods typically derive consistent labels through a linear coding layer based on consistent representation learning. These methods do not fully utilize the multi-view semantic information, and consistent representation learning may be impaired when some low-quality views are present, leading to the generation of inaccurate semantic labels. To address these issues, we propose a Self-supervised Semantic Soft Label Learning Network for Deep Multi-view Clustering. Specifically, we introduce a consensus high-level feature learning module that uses a shared MLP layer to transform low-level features into a high-level feature space. To enhance the consistency between high-level features from different views, we maximize mutual information between these features and introduce the U-Projection module, which improves the expressive power of the consensus feature via resampling the features and concatenating the fused features before and after sampling operations.
Additionally, we propose a self-supervised semantic label learning module that employs a dual-branch approach to independently learn consistent view-specific semantic labels through contrastive learning, while deriving view-consensus semantic labels from shared high-level features extracted from multiple views. Finally, KL divergence is used to align the view-consensus labels with the view-specific labels. A series of extensive experiments have shown that our approach yields superior clustering results compared to existing techniques.
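The KL-divergence alignment step mentioned above can be illustrated with a minimal sketch. The abstract does not state the direction of the divergence or how it is averaged, so the choice of KL(view-specific || consensus), the plain mean over views and samples, and the names kl_divergence and alignment_loss are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def alignment_loss(consensus_labels, view_specific_labels):
    """Mean KL(view-specific || consensus) over all views and samples.

    consensus_labels: one soft-label distribution per sample.
    view_specific_labels: one list of per-sample distributions per view.
    The divergence direction is an assumption, not taken from the paper.
    """
    total, count = 0.0, 0
    for view in view_specific_labels:
        for p_view, p_consensus in zip(view, consensus_labels):
            total += kl_divergence(p_view, p_consensus)
            count += 1
    return total / count

# Toy soft labels: two samples, three clusters, two views.
consensus = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
views = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # view 1 agrees with the consensus
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],  # view 2 deviates slightly
]
loss = alignment_loss(consensus, views)
```

The loss is zero when every view-specific distribution equals the consensus distribution and grows as they diverge.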

Abstract:
Chest X-ray images are widely used in clinical diagnosis and treatment planning for thoracic disease. Medical image processing has attracted great attention in the machine learning community. However, labeled medical images are limited, and lesion regions usually occupy only a small part of the image. Most existing methods are prone to learning spurious correlations for classification, resulting in poor generalization. In this paper, we propose a medical generation transformer network based on self-supervised learning and an adversarial strategy to capture the discriminative label-relevant regions with lesions in the images by extending the Chest X-ray images. In the proposed method, we first localize the label-relevant regions in each transformer layer. Then we keep the label-relevant regions to mask the image and construct the masked image with self-supervised learning. Thus we can generate more images to fine-tune the classification network with masked images that keep the label-relevant regions. Since the generated images are usually noisy when used to fine-tune the classification network, we adopt adversarial probabilities to weight the importance of each generated image for training. Experimental results on two large-scale and popular chest X-ray datasets show that the proposed method can efficiently leverage the location of lesions to improve classification performance.

Abstract:
Haptic data compression has gradually become a key issue for emerging real-time haptic communications in the Tactile Internet (TI). However, it is challenging to achieve a trade-off between high perceptual quality and compression ratio in a haptic data compression scheme. Inspired by the perspective of embodied AI, we propose a cross-modal haptic compression scheme for haptic communications to improve the perception quality on TI devices. Since multimodal fusion is routinely employed to improve the cognitive ability of a system, we assume that the haptic codec is guided by visual semantics to optimize parameter settings in the coding process. First, we design a multi-dimensional tactile feature fusion network (MTFFN) relying on a multi-head attention mechanism. The MTFFN extracts multi-dimensional features from the material surface and maps them to infer the coding parameters. Second, we use second-order differences and linear interpolation to establish a criterion for determining the optimal codec parameters, which are customized by material category to give high robustness. Finally, the simulation results reveal that our compression scheme can efficiently produce a personalized codec procedure for different materials, obtaining more than 17% improvement in compression ratio while maintaining high perceptual quality.

Abstract:
Learning from experience is a fundamental capability of intelligent agents. Autonomous systems rely on sensors that provide data about the environment and internal situations to their perception systems for learning and inference mechanisms. These systems can also learn Self-Aware and Situation-Aware generative modules from these data to localize themselves and interact with the environment. In this paper, we propose a self-aware cognitive architecture capable of performing tasks where the interactions between the self-state of an agent and the surrounding environment are explicitly and dynamically represented. We specifically develop a Deep Learning (DL) based Self-Aware interaction model, empowered by learning from Multi-Modal Perception (MMP) and World Models using multi-sensory data in a novel Multi-Agent Self-Awareness Architecture (MASAA). Two sub-modules are developed, the Situation Model (SM) and the First-Person Model (FPM), that address different and interrelated aspects of the World Model (WM). The MMP model, instead, aims at learning the mapping of different sensory perceptions into Exteroceptive (EI) and Proprioceptive (PI) latent information. The WM then uses the learned MMP model as experience to predict dynamic self-behaviors and interaction patterns within the experienced environment. The WM and MMP models are learned in a data-driven way, starting from the lower-dimensional odometry data used to guide the learning of higher-dimensional video data, thus generating coupled Generalized State Hierarchical Dynamic Bayesian Networks (GS-HDBNs). We test our model on the KITTI, CARLA, and iCab datasets, achieving high performance and a low average localization error (RMSE) of 2.897% when considering two interacting agents.

Abstract:
The most significant characteristic of long-tailed classification is that severe sample imbalance causes the model to be biased towards the head categories. While the long-tailed distribution of a multimedia dataset remains fixed, we can enhance the acquisition of balanced training samples and corresponding features during the learning process. This paper designs a sample provider that constructs balanced training samples to enhance the acquisition of comprehensive features, and proposes a Siamese-based parameter-sharing framework to handle data with long-tailed distributions. Specifically, one branch of the Siamese network classifies samples obtained with conventional random cropping and sampling, while the other branch integrates the advantages of the constructed balanced samples and hybrid optimization to capture balanced features and identify more precise category boundaries. This combination not only facilitates the learning of long-tailed distributions but also strengthens the model's extraction of balanced features through the incorporation of contrastive learning. Extensive experiments on the CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist 2018 datasets demonstrate that our model not only achieves superior performance but also retains the benefits of end-to-end training. Specifically, our method achieves 60.7% accuracy on ImageNet-LT with an end-to-end ResNeXt-50 backbone.
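The abstract does not describe how the sample provider builds balanced samples. As a rough illustration of the general idea only, the sketch below uses standard class-balanced resampling (pick a class uniformly, then pick an instance within it), which is a swapped-in technique rather than the paper's method. The names build_class_index and balanced_batch and the toy head/tail labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_class_index(labels):
    """Map each class label to the list of sample indices that carry it."""
    index = defaultdict(list)
    for i, y in enumerate(labels):
        index[y].append(i)
    return index

def balanced_batch(class_index, batch_size, rng):
    """Draw sample indices so that every class is equally likely.

    A class is chosen uniformly at random first, then one of its samples,
    so tail classes appear as often as head classes in expectation.
    """
    classes = sorted(class_index)
    return [rng.choice(class_index[rng.choice(classes)]) for _ in range(batch_size)]

# Toy long-tailed dataset: 950 head samples and 50 tail samples.
labels = ["head"] * 950 + ["tail"] * 50
rng = random.Random(0)
index = build_class_index(labels)
batch = balanced_batch(index, 2000, rng)
counts = Counter(labels[i] for i in batch)
```

With plain random sampling the tail class would make up about 5% of a batch, whereas this sampler draws it about half of the time.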

Abstract:
Temporal action detection aims to recognize the action category and determine each action instance's starting and ending time in untrimmed videos. The mixed method has demonstrated notable performance by integrating both anchor-based and anchor-free approaches. However, while it leverages the strengths of each method, it also retains their respective limitations. For instance, the anchor-based approach depends on manually crafted anchors tailored to specific datasets, while the anchor-free approach predicts potential action instances at each temporal position, resulting in a significant number of false positives in category prediction. The inclusion of these limitations undermines the potential benefits of the mixed method. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses the issues above by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) elegantly merges anchor-based and anchor-free approaches in the form of boundary discretization, eliminating the need for the traditional handcrafted anchor design. Furthermore, the reliable classification module (RCM) predicts reliable global action categories to reduce false positives. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves competitive detection performance.

Abstract:
User-generated cinematic creations are gaining popularity as our daily entertainment, yet it is a challenge to master cinematography for producing immersive contents. Many existing automatic methods focus on roughly controlling predefined shot types or movement patterns, which struggle to engage viewers with the actor's circumstances. Real-world cinematographic rules show that directors can create immersion by comprehensively synchronizing the camera with the actor. Inspired by this strategy, we propose a deep camera control framework that enables actor-camera synchronization in three aspects, considering frame aesthetics, spatial action, and emotional status in the 3D virtual stage. Following rule-of-thirds, our framework first modifies the initial camera placement to position the actor aesthetically. This adjustment is facilitated by a weakly-supervised adjustor that analyzes frame composition via camera projection. We then design a GAN model that can adversarially synthesize fine-grained camera movement based on the actor's action and psychological state, using an encoder-decoder generator to map kinematics and emotional variables into camera trajectories. Moreover, we incorporate a regularizer to align the generated stylistic variances with specific emotional categories and intensities. The experimental results show that our proposed method yields immersive cinematic videos of high quality, both quantitatively and qualitatively. Live examples can be found in the supplementary video.
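The rule-of-thirds placement mentioned above can be illustrated geometrically. This is only a sketch of the composition rule itself, not of the weakly-supervised adjustor described in the abstract; the function names and the assumption that the actor is summarized by a single projected pixel position are hypothetical.

```python
def nearest_thirds_point(x, y, width, height):
    """Return the rule-of-thirds intersection closest to pixel (x, y)."""
    points = [(width * i / 3, height * j / 3) for i in (1, 2) for j in (1, 2)]
    return min(points, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)

def composition_offset(x, y, width, height):
    """Pixel shift that would move the actor onto the nearest thirds point."""
    tx, ty = nearest_thirds_point(x, y, width, height)
    return tx - x, ty - y

# Actor projected at (900, 300) in a 1920x1080 frame.
target = nearest_thirds_point(900, 300, 1920, 1080)
offset = composition_offset(900, 300, 1920, 1080)
```

A camera adjustment would then aim to reduce this screen-space offset, for example by panning or translating until the projected actor sits near the target point.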

Abstract:
Class Incremental Learning (CIL) for image classification aims to address real-world scenarios by allowing a model to learn new categories while retaining the knowledge of old categories. It is more challenging than Task Incremental Learning (TIL) as task ID is not provided during testing. Therefore, transitioning from CIL to TIL is an intuitive approach to handling CIL problems for image classification. Currently, the main challenge of this approach lies in improving the accuracy of task identification. To address this issue, we propose to use a large-scale image-text pre-training model (i.e. CLIP) as the backbone, training and saving different classifiers for different tasks. Each classifier not only includes the classes of the current task, but also an Out-of-distribution (OOD) class corresponding to the classes encountered in all previous tasks. At test time, we iterate through classifiers from the last task to find the correct task ID of the test image, and perform classification in a TIL way. In addition, to tackle the issue of early-stop termination in iterative prediction due to model bias toward later tasks, we propose using CLIP zero-shot ability to assist learned OOD detection. Experiments show that our method achieves state-of-the-art performance on the traditional many-shot and the more challenging few-shot settings of CIFAR-100 and ImageNet-Subset datasets.
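The test-time search over task classifiers can be sketched as follows. This is a minimal illustration under stated assumptions: each classifier is modeled as a plain function returning either a class label or an OOD marker, the toy classifiers task0 and task1 are hypothetical, and the CLIP zero-shot assistance for OOD detection is not modeled.

```python
OOD = "ood"  # marker for "belongs to a class from an earlier task"

def predict_with_task_search(x, classifiers):
    """Walk the classifiers from the last task to the first.

    classifiers: list ordered by task ID; each maps an input to a class label
    of its own task or to OOD. The first classifier that does not answer OOD
    fixes the task ID, and its label is the final prediction.
    """
    for task_id in range(len(classifiers) - 1, -1, -1):
        label = classifiers[task_id](x)
        if label != OOD:
            return task_id, label
    return None, None  # every classifier rejected the input

# Toy stand-ins: task 0 covers inputs below 10, task 1 covers the rest.
def task0(x):
    return "cat" if x < 5 else "dog"  # first task has no OOD class

def task1(x):
    if x < 10:
        return OOD  # looks like data from a previous task
    return "car" if x < 15 else "bus"

result_old = predict_with_task_search(3, [task0, task1])
result_new = predict_with_task_search(12, [task0, task1])
```

If a later classifier wrongly accepts an old-task sample, the loop stops early, which is the early-stop termination issue the abstract attributes to model bias toward later tasks.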

Abstract:
The goal of Few-Shot Continual Learning (FSCL) is to incrementally learn novel tasks with limited labeled samples while preserving previous capabilities. However, current FSCL works lack research on domain increment and domain generalization ability, and thus cannot cope with changes in the visual perception environment. In this paper, we set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental scenarios together with domain generalization assessment. Firstly, two benchmark datasets and protocols are newly arranged, and detailed baselines are provided for this unexplored configuration. Furthermore, we find that common continual learning methods have poor generalization ability on unseen domains and cannot adequately tackle the catastrophic forgetting issue in cross-incremental tasks. Hence, we propose a rehearsal-free framework based on Vision Transformer (ViT) named Contrastive Mixture of Adapters (CMoA). It contains two non-conflicting parts: (1) By exploiting the fast-adaptation characteristic of adapter-embedded ViT, the Mixture of Adapters (MoA) module is incorporated into ViT. For stability, cosine similarity regularization and dynamic weighting are designed to make each adapter learn specific knowledge and concentrate on particular classes. (2) To further enhance domain generalization ability, we alleviate the intra-class variation by prototype-calibrated contrastive learning to improve domain-invariant representation learning. Finally, six evaluation indicators showing the overall performance and forgetting are compared through comprehensive experiments on two benchmark datasets to validate the efficacy of CMoA, and the results illustrate that CMoA can achieve performance comparable to that of rehearsal-based continual learning methods.

Abstract:
Zero-shot sketch-based image retrieval (ZS-SBIR) is a challenging task that hinges on overcoming the cross-domain differences between sketches and images. Previous methods primarily address cross-domain differences by creating a common embedding space, improving final retrieval results. However, most previous approaches have overlooked a critical aspect: sketch-based image retrieval task actually requires only the cross-domain invariant information relevant to the retrieval. Irrelevant information (such as posture, expression, background, and specificity) may detract from retrieval accuracy. In addition, most previous methods perform well on traditional SBIR datasets but lack corresponding research on generalization and extensibility in the face of more diverse and complex data. To address these issues, we propose a Dual Causal Disentangled Learning (DCDL) for ZS-SBIR. This approach can mitigate the negative impact of irrelevant features by separating retrieval-relevant features in the latent variable space. Specifically, we constructed a causal disentanglement model using two Variational Autoencoders (VAE), each applied to the sketch and image domains, to obtain disentangled variables with exchangeable attributes. Our framework effectively integrates causal intervention with disentangled representation learning, enabling a clearer separation of cross-domain retrieval-relevant and intra-class irrelevant features, which can be recombined into new reconstructed samples. Concurrently, we designed a Dual Alignment Module (DAM), leveraging the accurate and comprehensive semantic features provided by a text encoder pre-trained on large-scale datasets to supplement semantic associations and align disentangled retrieval-relevant features. The Dual Alignment Module enhances the model's ability to generalize across diverse datasets by effectively aligning retrieval-relevant information from different domains. 
Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on the Sketchy and TU-Berlin datasets. Additionally, experiments on the larger-scale QuickDraw dataset, the fine-grained Shoe-V2 and Chair-V2 datasets, and an inter-dataset setting further validate the generalization and extensibility of DCDL.

Abstract:
Conventional audio steganography methods typically require embedding secret information into the carrier, making them vulnerable to steganalysis. To address this issue, we propose a novel coverless audio steganography method that hides information by generating carriers and establishing mapping rules rather than embedding data directly. Our approach leverages a differential privacy clustering algorithm to cluster audio data and select representative audio files, thereby enhancing the security of the steganography. Additionally, we introduce an improved audio feature extraction method that combines traditional Mel-frequency cepstral coefficients (MFCC) with global statistical information, significantly boosting the robustness of the secret information against common audio attacks, particularly time-stretching attacks. Experimental results show that our method achieves a robustness rate of up to 95% against time-stretching and maintains an average security accuracy rate exceeding 97% across various attack scenarios. The proposed method ensures that the audio carrier remains unaltered, thus effectively resisting detection by steganalysis tools. This innovative approach provides a practical and efficient solution for the secure transmission of information in the digital era.

Abstract:
Visible and infrared image fusion (VIF) provides a more comprehensive understanding of a scene and can facilitate subsequent processing. Although the frequency domain contains valuable global information in the low frequencies and rapid pixel intensity variations in the high frequencies of images, existing fusion methods mainly focus on the spatial domain. To close this gap, a novel VIF method in the frequency domain is proposed. First, a frequency-domain feature extraction module is developed for source images. Then, a frequency-domain transformer fusion method is designed to merge the extracted features. Finally, a residual reconstruction module is introduced to obtain the final fused images. To the best of our knowledge, this is the first time that image fusion has been studied from a frequency-domain perspective. Comprehensive experiments on three datasets, i.e., MSRS, TNO, and Roadscene, demonstrate that the proposed approach obtains superior fusion performance over several state-of-the-art fusion methods, indicating its great potential as a generic backbone for VIF tasks.

Abstract:
Video Paragraph Grounding (VPG) aims to precisely locate the most appropriate moments within a video that are relevant to a given textual paragraph query. However, existing methods typically rely on large-scale annotated temporal labels and assume that the correspondence between videos and paragraphs is known. This is impractical in real-world applications, as constructing temporal labels requires significant labor costs, and the correspondence is often unknown. To address this issue, we propose a Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding method (DMR-JRG). In this method, retrieval and grounding tasks are mutually reinforced rather than being treated as separate issues. DMR-JRG mainly consists of two branches: a retrieval branch and a grounding branch. The retrieval branch uses inter-video contrastive learning to roughly align the global features of paragraphs and videos, reducing modality differences and constructing a coarse-grained feature space to break free from the need for correspondence between paragraphs and videos. Additionally, this coarse-grained feature space further facilitates the grounding branch in extracting fine-grained contextual representations. In the grounding branch, we achieve precise cross-modal matching and grounding by exploring the consistency between local, global, and temporal dimensions of video segments and textual paragraphs. By synergizing these dimensions, we construct a fine-grained feature space for video and textual features, greatly reducing the need for large-scale annotated temporal labels. Meanwhile, we design a grounding reinforcement retrieval module (GRRM) that brings the coarse-grained feature space of the retrieval branch closer to the fine-grained feature space of the grounding branch, thereby reinforcing retrieval branch through grounding branch, and finally achieving mutual reinforcement between tasks. 
Extensive experiments on three challenging datasets demonstrate the effectiveness of our proposed method.

Abstract:
Data-free knowledge distillation (DFKD) enables knowledge transfer from a pre-trained teacher to a student network without accessing the real dataset. However, generator-based DFKD methods struggle to ensure that the synthetic images accurately reflect the real dataset distribution. The update of the generator network relies heavily on teacher category guidance, but varying teacher prediction accuracy across categories leads to inconsistent synthetic image quality. Such variations introduce a distribution shift between synthetic and real datasets, negatively impacting student network performance during knowledge distillation. To address this challenge, we propose the SRIF, comprising two components: Student-Driven Flexible Filtering (SDFF) and Re-weighting for Independent Regularization (RIR). SDFF filters out synthetic images affected by the category distribution shift during data generation, producing a more reliable dataset. RIR, applied during distillation, encourages the student to learn stable causal relationships through sample reweighting. Both components flexibly integrate into existing DFKD frameworks, improving performance while reducing training costs.

Abstract:
The growth of 3D point cloud applications requires efficient compression techniques for high-quality and low-latency services. Recently, learning-based point cloud compression models have made significant progress. However, geometric distortion resulting from downsampling limits the feature depth within large-scale point clouds, thereby constraining the receptive field and suppressing the redundant removal. Moreover, the issues of computational efficiency and reconstruction quality still persist in the compression of large-scale point clouds. To address these challenges, we propose a hierarchical distortion learning framework for end-to-end lossy compression of point clouds. First, we design a feature residual compression module to efficiently transmit shallow semantics between the encoder and the decoder, which enables a lightweight design of our framework. Second, we introduce a geometry residual compression module to progressively complement the reconstruction distortion, avoiding the accumulation of geometric distortion. By integrating these two modules and employing sufficient downsampling processes, we develop a high-performance framework with a significantly enlarged receptive field and low computational cost. Extensive experiments demonstrate that our method achieves state-of-the-art performance in geometry lossy compression, while delivering competitive performance in joint geometry and color lossy compression with fast running speed.

Abstract:
Today, the family of latent diffusion models (LDMs) has gained prominence for its high-quality outputs and scalability. This has also raised security concerns on social media, as malicious users can create and disseminate harmful content. Existing approaches typically involve training specific components or entire generative models to embed a watermark in generated images for traceability and responsibility. However, in the fast-evolving era of AI-generated content (AIGC), the rapid iteration and modification of LDMs make retraining with watermark models costly. To address this problem, we propose MarkPlugger, a generalizable plug-and-play watermark framework that requires no LDM retraining. In particular, to reduce the disturbance of the watermark on the semantics of the generated image, we identify a watermark representation that is nearly orthogonal to the semantics in latent space, and apply an additive fusion strategy to the watermark and the semantics. Without modifying any components of the LDMs, we embed diverse watermarks in latent space, adapting to the denoising process. Our experimental findings reveal that our method effectively balances image quality and watermark recovery rate. We have also validated that our method generalizes to multiple official versions and modified variants of LDMs, even without retraining the watermark model. Furthermore, it performs robustly under various attacks of different intensities.
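The idea of adding a watermark that is orthogonal to the semantic content in latent space can be illustrated with a small vector example. The abstract does not say how the near-orthogonal representation is obtained, so the explicit projection step below (one Gram-Schmidt step) is a swapped-in technique used only for illustration. The function names, the strength parameter, and the three-dimensional toy latents are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(watermark, semantic):
    """Remove from `watermark` its component along `semantic`."""
    scale = dot(watermark, semantic) / dot(semantic, semantic)
    return [w - scale * s for w, s in zip(watermark, semantic)]

def fuse(semantic, watermark, strength=0.1):
    """Additive fusion: semantic latent plus a scaled, orthogonalized watermark."""
    w_perp = orthogonalize(watermark, semantic)
    return [s + strength * w for s, w in zip(semantic, w_perp)]

# Toy latents (real latents are high-dimensional tensors).
semantic = [1.0, 2.0, 2.0]
watermark = [3.0, 0.0, 1.0]
w_perp = orthogonalize(watermark, semantic)
fused = fuse(semantic, watermark)
```

Because the added component is orthogonal to the semantic vector, the projection of the fused latent onto the semantic direction is unchanged, which is one way to read the abstract's goal of limiting the watermark's disturbance of image semantics.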

Abstract:
Input encoding has proven crucial to the success of methods based on neural radiance fields. Compared to the literature on general static scene modeling, input encoding for dynamic hand modeling has been less explored. However, this aspect is critical to the modeling of deformation and rendering, as it maps a sampled point in space to a representation containing all the information associated with the dynamic hand for inferring the geometry and appearance properties of this point. The design of the input encoding determines how well the neural network can learn photo-realistic hand rendering. We offer an in-depth examination of this key component and introduce DEHand, a new representation utilizing Deformable Encoding for photo-realistic free-view and free-pose Hand rendering. DEHand leverages deformable encoding with a latent code map to achieve high-quality, pose-controlled rendering. Deformable encoding is achieved by adapting static input encoding techniques for the view synthesis of dynamic hands, using a parametric hand mesh model as a proxy to construct encodings that map sampled points into a space capable of integrating over different poses and providing rich information for hand modeling. Our findings demonstrate that with our deformable encoding, a single Multilayer Perceptron (MLP) can achieve high-quality dynamic hand rendering, learning solely from images. Extensive experiments on InterHand2.6M validate the superior rendering quality of our method and the effectiveness of each component in our design.

Abstract:
Temporal Sentence Grounding (TSG) requires a thorough understanding of the complex cross-modal semantic relationships between videos and text. However, existing methods fail to accurately capture content at diverse granularity levels with distinct semantics, making it difficult to achieve fine alignment of visuals and text. To overcome this issue, we attempt to mine for rich semantic clues by utilizing the hierarchical correspondence structure and multi-granularity visual-to-text reconstruction, achieving fine-grained reasoning. Specifically, for the TSG task, we propose a novel Hierarchical Cross-modal Fine-grained Mining Network (HCFMN), which utilizes an attention mechanism based on temporal hierarchical relationships to extract temporal features corresponding to the text of different granularities. We leverage the reconstructability of visual-to-text, recovering multi-granularity textual content from coarse to fine by focusing on temporal features at different layers, hierarchically extracting temporal features and the dependencies related to the text, and implementing fine-grained cross-modal semantic alignment. Furthermore, HCFMN introduces a novel partitioned efficient attention mechanism, which significantly enhances the model’s efficiency through a two-stage attention based on sequence and channel compression. Extensive experimental results on three public datasets (ActivityNet-Captions, TACoS, and Charades-STA) demonstrate that the proposed method achieves state-of-the-art performance.

Abstract:
Cross-view image geo-localization is a technique to determine the geographic location of a query image by matching it with geo-tagged aerial images. However, when the query image is captured at nighttime, existing methods cannot effectively extract geographic-related information from areas with low and uneven illumination, and therefore geo-localize nighttime ground images poorly. In this work, we propose a cross-view and cross-day-night image geo-localization method (CCIGeo), which contains three branches, taking the query nighttime ground image, the supervision daytime ground image, and the reference satellite image as inputs, respectively. Inspired by knowledge distillation, the proposed method takes the daytime ground image branch as the teacher model, which supervises the nighttime ground image branch to overcome the interference of uneven and low illumination and to pay more attention to the areas containing rich geographic-related information. To better adapt to the cross-day-night environment, a dual-constraint loss function is designed, also inspired by the concept of knowledge distillation. Extensive experimental results show that CCIGeo significantly improves the performance of nighttime image geo-localization, exceeding the state-of-the-art (SOTA) methods by 1.83%, 3.84%, and 1.64% on three datasets.

Affiliations: School of Public Security and Emergency Management, Anhui University of Science and Technology, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei, China; Beijing Institute of Technology, Beijing, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Hefei, China; Peng Cheng Laboratory, Shenzhen, China

Abstract:
Existing datasets for RGB-DVS tracking are collected with the DVS346 camera, and their resolution (346 × 260) is low for practical applications. In fact, only visible cameras are deployed in many practical systems, and newly designed neuromorphic cameras may have different resolutions. The latest neuromorphic sensors can output high-definition event streams, but it is very difficult to achieve strict alignment between events and frames in both the spatial and temporal views. Therefore, how to achieve accurate tracking with unaligned neuromorphic and visible sensors is a valuable but unresearched problem. In this work, we formally propose the task of object tracking using unaligned neuromorphic and visible cameras. We build the first unaligned frame-event dataset, CRSOT, collected with a specially built data acquisition system, which contains 1,030 high-definition RGB-Event video pairs (304,974 video frames). In addition, we propose a novel unaligned object tracking framework that can realize robust tracking even with loosely aligned RGB-Event data. The proposed method utilizes uncertainty perception techniques, which can effectively reduce the negative impact of noise (especially noise in event data) on tracking performance. Specifically, we extract the template and search regions of the RGB and Event data and feed them into a unified ViT backbone for feature embedding. Next, we propose uncertainty perception modules to encode the RGB and Event features, respectively. We then propose a modality uncertainty fusion module to aggregate the two modalities. These three branches are jointly optimized in the training phase. Extensive experiments demonstrate that our tracker can combine the two modalities for high-performance tracking even without strict temporal and spatial alignment.

Abstract:
Human pose estimation (HPE) models underperform in recognizing rare poses because they suffer from data imbalance problems (i.e., there are few image samples for rare poses) in their training datasets. From a data perspective, the most intuitive solution is to synthesize data for rare poses. Specifically, rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the data constrains the model. An alternative method is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we aim to address the above two issues, i.e., rare poses and complex scenes, for person image generation. We propose a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is specifically trained on augmented pose data, and each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator. It is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of the diversity of poses and the complexity of scenes. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. We compare our method against traditional pose data augmentation and person image generation methods, and it demonstrates superior performance both quantitatively and qualitatively.

Abstract:
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.

Abstract:
Single-Domain Generalization Object Detection (Single-DGOD) refers to training a model with only one source domain, enabling the model to generalize to any unseen domain. For instance, a detector trained on a sunny daytime dataset should also perform well in scenarios such as rainy nighttime. The main challenge is to improve the detector’s ability to learn the domain-invariant representation (DIR) while removing domain-specific information. Recent progress in Single-DGOD has demonstrated the efficacy of removing domain-specific information by adjusting feature distributions. Nonetheless, simply adjusting the global feature distribution in the Single-DGOD task is insufficient to learn the potential relationship from sunny to adverse weather, as it ignores the significant domain gaps between instances across different weathers. In this paper, we propose a novel object detection method for more robust single-domain generalization. In particular, it mainly consists of a frequency-aware selective whitening module (FSW) for removing redundant domain-specific information and a contrastive feature alignment module (CFA) for enhancing domain-invariant information among instances. Specifically, FSW extracts the magnitude spectrum of the feature and uses a group whitening loss to selectively eliminate redundant domain-specific information in the magnitude. To further eliminate domain differences among instances, we apply a style transfer method for data augmentation and use the augmented data in the CFA module. CFA formulates both the original and the augmented RoI features into a series of groups with different categories, and utilizes contrastive learning across them to facilitate the learning of DIR in various categories. Experiments show that our method achieves favorable performance on existing standard benchmarks.

Abstract:
Self-similarity techniques are booming in no-reference super-resolution (SR) due to accurate estimation of the degradation types involved in low-resolution images. However, the high-dimensional matrix multiplication within self-similarity computation incurs a prohibitive computational cost. We find that the high-dimensional attention map is derived from the matrix multiplication between query and key, followed by a softmax function. This softmax makes the matrix multiplication inseparable, posing a great challenge in simplifying computational complexity. To address this issue, we first propose a second-order Taylor expansion approximation (STEA) to separate the matrix multiplication of query and key, reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$. Then, we design a multi-scale large field reception (MLFR) to compensate for the performance degradation caused by STEA. Finally, we apply these two core designs to laboratory and real-world scenarios by constructing LabNet and RealNet, respectively. Extensive experimental results on five synthetic datasets demonstrate that our LabNet sets a new benchmark in qualitative and quantitative evaluations. Tested on the real-world dataset, our RealNet achieves superior visual quality over existing methods. Ablation studies further verify the contributions of STEA and MLFR to both the LabNet and RealNet frameworks.
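As a rough illustration of why a second-order Taylor expansion makes the query-key product separable, the sketch below replaces exp(q·k) with 1 + q·k + (q·k)^2/2 and rewrites that kernel as an inner product of feature vectors. It is a generic textbook-style construction, not the authors' STEA: the function names (`taylor_features`, `quadratic_attention`, `linear_attention`) are hypothetical, and softmax scaling and any normalization used in the paper are omitted.

```python
import math
from itertools import product


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def taylor_features(x):
    """Map x to phi(x) such that phi(q).phi(k) = 1 + q.k + (q.k)^2 / 2."""
    second = [a * b / math.sqrt(2.0) for a, b in product(x, repeat=2)]
    return [1.0] + list(x) + second


def quadratic_attention(Q, K, V):
    """Reference version: builds every query-key weight, O(N^2) in tokens."""
    out = []
    for q in Q:
        w = [1.0 + dot(q, k) + dot(q, k) ** 2 / 2.0 for k in K]
        z = sum(w)  # always positive: 1 + s + s^2/2 has no real root
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Separable version: summarise keys/values once, O(N) in tokens."""
    phi_K = [taylor_features(k) for k in K]
    n_feat, dv = len(phi_K[0]), len(V[0])
    # S[f][j] = sum_n phi(k_n)[f] * v_n[j]; z[f] = sum_n phi(k_n)[f]
    S = [[sum(pk[f] * v[j] for pk, v in zip(phi_K, V)) for j in range(dv)]
         for f in range(n_feat)]
    z = [sum(pk[f] for pk in phi_K) for f in range(n_feat)]
    out = []
    for q in Q:
        pq = taylor_features(q)
        denom = dot(pq, z)
        out.append([sum(pq[f] * S[f][j] for f in range(n_feat)) / denom
                    for j in range(dv)])
    return out
```

Both functions compute the same output; the second never forms the N × N attention map, which is where the linear complexity in the number of tokens comes from. The price is a feature dimension that grows quadratically with the channel dimension.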

Abstract:
Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label Student Action Video (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes.

Abstract:
Blind image inpainting is a challenging task aimed at reconstructing corrupted regions without relying on mask information. Due to the lack of mask priors, previous methods usually integrate a mask prediction network in the initial phase, followed by an inpainting backbone. However, this multi-stage generation process may result in feature misalignment. While recent end-to-end generative methods bypass the mask prediction step, they typically struggle with weak perception of contaminated regions and introduce structural distortions. This study presents a novel mask region perception strategy for blind image inpainting by combining adversarial training with forgery detection. To implement this strategy, we propose an attention-driven forgery adversarial network (AFAN), which leverages adaptive contextual attention (ACA) blocks for effective feature modulation. Specifically, within the generator, ACA employs self-attention to enhance content reconstruction by utilizing the rich contextual information of adjacent tokens. In the discriminator, ACA utilizes cross-attention with noise priors to guide adversarial learning for forgery detection. Moreover, we design a high-frequency omni-dimensional dynamic convolution (HODC) based on edge feature enhancement to improve detail representation. Extensive evaluations across multiple datasets demonstrate that the proposed AFAN model outperforms existing generative methods in blind image inpainting, particularly in terms of quality and texture fidelity.

Abstract:
In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is: https://xmusic-project.github.io.

Abstract:
In recent years, automatic depression detection (ADD) technology has developed rapidly to support objective and assistive diagnosis of major depressive disorder (MDD) with the help of artificial intelligence technology and various physiological and psychological data. Although emotion is an important reflection of mental status and is frequently related to depression symptoms, few recent multi-modal ADD methods take emotional information into account. To address this issue, we propose to explore emotional distribution information in interviews to assist the multi-modal ADD model. On one hand, we use large language models (LLMs) to automatically recognize the emotion of text data and re-organize the data guided by the valence attribute of emotion, which helps our model become aware of differences in emotion distribution. On the other hand, we design an emotion encoding that enables the proposed model to consider emotional distribution information in its decision-making process. Extensive experiments are conducted, comparing with state-of-the-art ADD methods and including an ablation study on the different modules of the proposed method. More importantly, our experimental results confirm research findings in the psychology field, where more attention to negative emotion information is required to distinguish different depressive states.

Abstract:
3D object detection has garnered significant attention within the academic community, primarily due to its broad utility in domains such as autonomous driving and robotics. Prior research efforts have predominantly concentrated on leveraging temporal contextual information embedded within sequential data to enhance the current feature representations. However, a notable limitation of these endeavors lies in their inadequate treatment of the inherent noise present within historical sequences, thereby constraining the efficiency of fusion methods. In this paper, we propose a new temporal feedback network, named TFNet, to model and correct the temporal noise by designing a coupling-decoupling mechanism. Central to our approach are two distinct modules: (i) Foreground Feature Enhancement, which amplifies sparse instance details across temporal frames, thereby furnishing essential local information priors for subsequent fusion; and (ii) Coupling-Decoupling Feature Interaction, designed to first aggregate temporal contextual information and then disentangle fusion features into frame-specific representations. Leveraging a feedback strategy, this module can adaptively enhance useful information and eliminate noise within individual frame features. Empirical evaluations conducted on the nuScenes benchmark demonstrate the effectiveness of TFNet, achieving the new state-of-the-art performance without any bells and whistles.

Abstract:
Despite the recent advancements in deep learning techniques, existing unsupervised low-light image enhancement methods fail to improve global brightness and restore colour due to the lack of high-quality training targets. Moreover, real-world low-light images inevitably contain noise, which significantly reduces image visibility and quality, further complicating the enhancement process. However, current unsupervised approaches tend to oversimplify or ignore the noise in low-light images. To address these issues, we first revise the traditional Retinex decomposition to better integrate with unsupervised deep learning frameworks. Then, we design a Local and Global Illumination-Guided Network for removing corruption from the reflectance component, which improves enhancement quality by not only investigating multi-feature similarity and attention mechanism based on the Retinex theory but also leveraging local details and long-range dependencies. Furthermore, by analysing the attributes of corruption within the reflectance component, we introduce a novel reflectance enhancement loss to effectively remove noise without using ground truth.

Abstract:
Existing 3D cross-modal retrieval (3CMR) methods heavily rely on prior knowledge of training categories, which leads to the problem of modality shift and unseen center deviation when encountering unseen categories under the open-set environment. Aiming at the open-set 3CMR, this paper introduces the Hypergraph-Based Residual Fuzzy Alignment (ReFA) framework, which revisits the open-set retrieval task and navigates uncertainty of it through the lens of Fuzzy Theory. Facing the challenges of boundaryless space caused by uncertain unseen categories, we explore the representation and measurement in the fuzzy membership space as an alternative to fixed close-set category space. Specifically, to address the problem of modality shift caused by unseen categories, we utilize the Residual Sampling Generation (RSG) module to generate modality sampling embeddings that are independent of seen categories under the guidance of fuzzy representation, which residually decouples the entangled interactions of seen categories and modalities. To overcome the problem of unseen center deviation, we propose the Center Fuzzy Alignment (CFA) module to leverage the high-order fuzzy correlations for generalized metric, by constructing a fuzzy hypergraph based on the inherent and fuzzy correlations among both modalities and categories. The comprehensive evaluations of comparison and ablation studies on the four benchmarks demonstrate the superiority of our proposed framework compared to state-of-the-art methods.

Abstract:
In the era of Artificial Intelligence, visual data gathered by edge devices could be primarily utilized for machine vision tasks. The prominent coding frameworks accomplish this by extracting and compressing features from input data. As such, the quality of these features is vital, as it reflects the performance of the coding framework. However, much less work has been dedicated to quality assessment of features, impeding the optimization of the coding system. In this work, we pioneer the exploration of feature quality assessment by creating a novel database tailored for features, with the quality ground truth for each feature. Then, we propose a lightweight feature quality assessment method, called Lightweight Feature Quality Assessment (LFQA). We thoroughly analyze the feature characteristics from the spatial and channel perspectives, and the framework of LFQA is designed based on the analysis results. Experimental results demonstrate that LFQA accurately evaluates the quality of features, reaching a notable Spearman Rank-Order Correlation Coefficient of 85.38%, and exhibits competitive performance in improving the performance of a video coding for machines system. Furthermore, LFQA has fewer model parameters and faster inference speed, ensuring a wide range of promising applications.
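For readers unfamiliar with the reported metric, the Spearman Rank-Order Correlation Coefficient is the Pearson correlation computed on ranks rather than raw values. The following is a minimal standard-library sketch of that definition; the function names are hypothetical, the inputs are illustrative only, and nothing here reproduces the LFQA model or its data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srocc(predicted, ground_truth):
    """Spearman rank-order correlation between predictions and labels."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))
```

A value of 1.0 means the predicted quality scores order the samples exactly as the ground truth does, regardless of the scores' absolute scale; a value reported as 85.38% corresponds to 0.8538 on this scale.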

Abstract:
Despite demonstrating impressive capabilities in comprehending multi-modal contexts, large vision-language models (LVLMs) are invariably prone to generating unreliable answers, i.e., hallucinations. Existing methods mainly mitigate this hallucination by introducing specifically designed datasets or employing contrastive decoding techniques. However, these methods heavily rely on the quality of constructed datasets and negative samples, overlooking the inherent ambiguity in reasoning caused by over-reliance on linguistic priors and data complexity, termed reasoning uncertainty. This oversight hinders the models from effectively identifying the causal relationships behind each token, increasing their susceptibility to hallucinations. To address this issue, we propose a novel framework named Reasoning Uncertainty-guided Refinement (RUR) for mitigating hallucinations in LVLMs from an uncertainty perspective. Specifically, unlike conventional uncertainty quantification methods, we first extract the causal reasoning relationships between tokens by exploiting the link between structural causal models and the Transformer architecture. Based on this relationship, we then employ the Subjective Logic principle to model the reasoning uncertainty at both token and sentence levels, which reflects the unreliability degree of generated tokens and sentences. Finally, guided by reasoning uncertainty, we develop multi-level uncertainty-based adjustment to eliminate deceptive tokens exhibiting severe uncertainty and mitigate potential hallucinations in sentences. Extensive experiments demonstrate that our RUR method consistently achieves state-of-the-art performance on five benchmarks.

Abstract:
Composed Image Retrieval (CIR) aims to search an image of interest using a combination of a reference image and modification text as the query. Despite recent advancements, this task remains challenging due to limited training data and laborious triplet annotation processes. To address this issue, this paper proposes to synthesize the training triplets to augment the training resource for the CIR problem. Specifically, we commence by training a modification text generator exploiting large-scale multimodal models and scale up the CIR learning throughout both the pretraining and fine-tuning stages. During pretraining, we leverage the trained generator to directly create Modification Text-oriented Synthetic Triplets (MTST) conditioned on pairs of images. For fine-tuning, we first synthesize reverse modification text to connect the target image back to the reference image. Subsequently, we devise a two-hop alignment strategy to incrementally close the semantic gap between the multimodal pair and the target image. We initially learn an implicit prototype utilizing both the original triplet and its reversed version in a cycle manner, followed by combining the implicit prototype feature with the modification text to facilitate accurate alignment with the target image. Extensive experiments validate the efficacy of the generated triplets and confirm that our proposed methodology attains competitive recall on both the CIRR and FashionIQ benchmarks.

Abstract:
We introduce NeuV-SLAM, a novel dense simultaneous localization and mapping pipeline based on neural multiresolution voxels, characterized by ultra-fast convergence and incremental expansion capabilities. This pipeline utilizes RGBD images as input to construct multiresolution neural voxels, achieving rapid convergence while maintaining robust incremental scene reconstruction and camera tracking. Central to our methodology is a novel implicit representation, termed VDF, that combines the implementation of neural signed distance field (SDF) voxels with an SDF activation strategy. This approach entails the direct optimization of color features and SDF values anchored within the voxels, substantially enhancing the rate of scene convergence. To ensure clear edge delineation, we design the SDF activation, which maintains exemplary scene representation fidelity even under constraints of voxel resolution. Furthermore, in pursuit of rapid incremental expansion with low computational overhead, we developed hashMV, a novel hash-based multiresolution voxel management structure. This architecture is complemented by a strategically designed voxel generation technique that synergizes with a two-dimensional scene prior. Our empirical evaluations, conducted on the Replica and ScanNet datasets, substantiate NeuV-SLAM’s exceptional efficacy in terms of convergence speed, tracking accuracy, scene reconstruction, and rendering quality.

Abstract:
Vision-language models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage the potential of VLMs in adapting to downstream tasks, context optimization methods such as prompt tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose context optimization with multi-knowledge representation (CoKnow), a framework that enhances prompt learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we train lightweight semantic knowledge mappers, which are capable of generating multi-knowledge representations for an input image without requiring additional priors. We conduct extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods.

Abstract:
Multiple object tracking based on the tracking-by-detection paradigm relies on appearance information and motion information for trajectory association. Employing global re-identification features and two-stage association strategies can improve the utilization of both types of information for detections with different confidence scores. However, when targets are occluded, coarse-grained global representations can lead to false positive detections. Additionally, two-stage association strategies tend to prioritize matching high-confidence detections over more accurate low-confidence detections, leading to identity switch problems. To address these issues, we propose the OAFTracker framework, which focuses on local representations and a one-stage association strategy. Firstly, a Fine-grained Representation Orthogonal Fusion (FROF) network is designed to adaptively integrate local and global representations. Secondly, we propose a One-stage Association Matching (OAM) strategy. This strategy combines multiple distance constraints to ensure fairness in matching detections with different confidence scores to predicted trajectories. Additionally, we propose an Adaptive Variable Noise (AVN) Kalman filtering algorithm to dynamically update the state of predicted trajectories. Finally, extensive experiments conducted on two public datasets demonstrate the effectiveness of the OAFTracker method.

Abstract:
The objective of blind image quality assessment (BIQA) is to develop a model capable of automatically evaluating image quality without requiring any reference knowledge. While multi-task learning has been widely utilized in BIQA, it has predominantly remained unimodal. This paper delves into the Visual-Language multi-task BIQA model, where distortion knowledge can be captured through image-text contrastive learning. Specifically, Visual-Language auxiliary tasks targeting distortion type and quality level are introduced, respectively, where both positive and negative image-text pairs are constructed for the target distorted image. Subsequently, image-text correspondences are learned in the embedding space while simultaneously evaluating image quality. Notably, in the auxiliary task learning, the proposed method not only brings the image and its corresponding positive text prompt closer but also pushes away the image from its negative text prompts, thereby facilitating the extraction of pertinent distortion features. In the quality assessment task, a patch-wise strategy is employed during the training phase. Differing from conventional BIQA methods, a novel NSS-guided quality weighting is introduced to gauge the correlation between patch quality and global quality, thereby enabling precise quality prediction. Extensive experiments are conducted on six IQA datasets, and the experimental results verify the superiority of the proposed method.
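The pull-together and push-apart behaviour described above can be sketched with a generic InfoNCE-style objective over one image embedding and a set of text-prompt embeddings. This is an assumption-laden illustration, not the loss used in the paper: the function names, the cosine similarity, and the temperature value of 0.07 are all hypothetical choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb)


def image_text_contrastive_loss(image_emb, text_embs, positive_index,
                                temperature=0.07):
    """Cross-entropy over image-to-text similarities.

    Minimising it raises the similarity to the positive prompt and lowers
    the similarity to every other (negative) prompt.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]
```

In a multi-task setting, one such term could be computed per auxiliary task (for example, one over distortion-type prompts and one over quality-level prompts) and added to the quality regression loss; how the paper weights these terms is not stated in the abstract.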

Abstract:
The proliferation of AI-generated content brings significant concerns about forensic and security issues such as source tracing, copyright protection, etc., highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network.

Abstract:
Deepfake attribution (DFA) aims to perform multiclassification on different facial manipulation techniques, thereby mitigating the detrimental effects of forgery content on the social order and personal reputations. However, previous methods focus only on method-specific clues, which easily leads to overfitting, while overlooking the crucial role of common forgery features. Additionally, they struggle to distinguish between uncertain novel classes in more practical open-world scenarios. To address these issues, in this paper we propose an innovative multi-DisentAnglement based conTrastive leArning framework, DATA, to enhance the generalization ability on novel classes for the open-world semi-supervised deepfake attribution (OSS-DFA) task. Specifically, since all generation techniques can be abstracted into a similar architecture, DATA defines the concept of ‘Orthonormal Deepfake Basis’ for the first time and utilizes it to disentangle method-specific features, thereby reducing the overfitting on forgery-irrelevant information. Furthermore, an augmented-memory mechanism is designed to assist in novel class discovery and contrastive learning, which aims to obtain clear class boundaries for the novel classes through instance-level disentanglements. Additionally, to enhance the standardization and discrimination of features, DATA uses bases contrastive loss and center contrastive loss as auxiliaries for the aforementioned modules. Extensive experimental evaluations show that DATA achieves state-of-the-art performance on the OSS-DFA benchmark, e.g., there are notable accuracy improvements of 2.55% / 5.7% under different settings, compared with the existing methods.

Abstract:
Multispectral pedestrian detection has been shown to be effective in improving performance in complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time compared to single-stream networks utilizing only one feature extraction branch. This increased inference time has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems. To efficiently compress multispectral object detection networks, we propose a novel distillation method, the Adaptive Modal Fusion Distillation (AMFD) framework. Unlike traditional distillation methods, the AMFD framework fully leverages the original modal features from the teacher network, thereby significantly enhancing the performance of the student network. Specifically, a Modal Extraction Alignment (MEA) module is utilized to derive learning weights for student networks, integrating focal and global attention mechanisms. This methodology enables the student network to acquire optimal fusion strategies independent from that of teacher network without necessitating an additional feature fusion module. Furthermore, we present the SMOD dataset, a well-aligned challenging multispectral dataset for detection. Extensive experiments on the challenging KAIST, LLVIP, SUNRGB-D and SMOD datasets are conducted to validate the effectiveness of AMFD. The results demonstrate that our method outperforms existing state-of-the-art methods in both reducing log-average Miss Rate and improving mean Average Precision.

Abstract:
Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates.

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China; University of Macau, Macao, China; Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, OH, USA; Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore; Sea-NExT Joint Center, School of Computing, National University of Singapore, Singapore

Abstract:
Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.

Abstract:
The distribution of SAR targets generally conforms to a long-tailed distribution. Due to the existence of sample distribution bias and sample selection bias, training classifiers on this distribution of data often introduces spurious correlations between samples and classes. To address this issue, we propose a two-stage causal intervention framework. The core is that structural causality allows for independent interventions on multiple biases, thereby ensuring high-quality tail class predictions while maintaining unbiased performance for head classes. Firstly, we construct a structural causal graph for the long-tailed recognition task from a causal perspective. Based on this graph, the causal paths underlying the two types of biases are identified. Secondly, we design a data augmentation method named DiagPatch-M, which identifies causal features within samples. In this process, the generated patches randomly integrate causal and non-causal features from two different samples, disrupting the original recognition process and effectively eliminating biases induced by sample selection. Thirdly, we design an unbiased structural risk minimization (USRM) optimization strategy, which eliminates the “head preference” of conventional models and the “tail preference” of modified models. This strategy reduces the bias introduced by the model’s dependence on the original sample distribution, and achieves stable recognition under different sample distributions. Experimental results on two long-tailed and two balanced datasets demonstrate that our model surpasses state-of-the-art (SOTA) methods, indicating the efficacy of our proposed framework in tackling the challenges posed by the long-tailed distribution in SAR target recognition.

Abstract:
Class-incremental learning (Class-IL) aims to continuously learn a model from a sequence of tasks, which suffers from the issue of catastrophic forgetting. Recently, a few transformer-based methods have been proposed to address this issue by transferring self-attention into task-specific attention. However, these methods utilize shared task-specific attention modules across the whole incremental learning process, and are unable to achieve the balance between consolidation and plasticity, i.e., to remember the knowledge learned from previous tasks and absorb the knowledge from the current task simultaneously. Motivated by the mechanism of LSTM and hippocampus memory, we point out that dual attention on long and short-term memories can handle the consolidation-plasticity dilemma of Class-IL. Specifically, we propose Dual-Attention Transformers (DAFormer) to learn external attention and internal attention. The former utilizes sample-dependent keys that focus exclusively on the new tasks, while the latter consolidates the knowledge from previous tasks by using sample-agnostic keys. We present two editions of DAFormer, DAFormer-S and DAFormer-M: the former utilizes shared external keys and maintains a small parameter size, while the latter utilizes multiple external keys and enhances the long-term memory. Furthermore, we propose the K-nearest neighbor invariant based distillation scheme, which distills knowledge from previous tasks to the current task by maintaining the same neighborhood relationship of each sample over old and new models. Experimental results on CIFAR-100, ImageNet-subset and ImageNet-full demonstrate that DAFormer significantly outperforms all the state-of-the-art parameter-static and parameter-growing methods.

Abstract:
Cooperation between temporal convolutional networks (TCN) and graph convolutional networks (GCN) as a processing module has shown promising results in skeleton-based video anomaly detection (SVAD). However, to maintain a lightweight model with low computational and storage complexity, shallow GCN and TCN blocks are constrained by small receptive fields and a lack of cross-dimension interaction capture. To tackle this limitation, we propose a lightweight module called the Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data. It employs the frame attention mechanism to identify the most significant frames and the skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and total Floating Point Operations (FLOPs). Furthermore, the proposed Dual Attention Normalizing Flow (DA-Flow) integrates the DAM as a post-processing unit after GCN within the normalizing flow framework. Simulations show that the proposed model is robust against noise and negative samples. Experimental results show that DA-Flow reaches competitive or better performance than the existing state-of-the-art (SOTA) methods in terms of the micro AUC metric with the fewest parameters and FLOPs. Moreover, we found that even without training, simply using random projection without dimensionality reduction on skeleton data enables substantial anomaly detection capabilities.

Abstract:
Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging to current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8% and 80.4% rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
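
The idea of preserving information during spatial downsampling can be illustrated with a single-level 2D Haar transform, which turns each 2x2 block into four coefficients at half resolution. This is a minimal sketch assuming an orthonormal Haar basis and even image dimensions; the function names and the band ordering are illustrative and not taken from TrackletGait.

```python
def haar_downsample(img):
    """Split a 2D list (even height and width) into four half-resolution sub-bands."""
    h, w = len(img), len(img[0])
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            bands[0][r][s] = (a + b + c + d) / 2  # local average (low-pass)
            bands[1][r][s] = (a - b + c - d) / 2  # difference between columns
            bands[2][r][s] = (a + b - c - d) / 2  # difference between rows
            bands[3][r][s] = (a - b - c + d) / 2  # diagonal difference
    return bands


def haar_upsample(bands):
    """Invert haar_downsample, recovering the full-resolution image exactly."""
    ll, lh, hl, hh = bands
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for s in range(w):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return img


# Toy binary silhouette; a real pipeline would treat the four bands as channels.
silhouette = [[0, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0]]
bands = haar_downsample(silhouette)
print(bands[0])
print(haar_upsample(bands) == silhouette)
```

Unlike strided convolution or pooling, which discard three of every four values, the four sub-bands together allow the original silhouette to be reconstructed.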

Abstract:
Compositional zero-shot learning (CZSL) aims to identify novel compositions formed by known primitives (attributes and objects). Motivated by recent advancements in pre-trained vision-language models such as CLIP, many methods attempt to fine-tune CLIP for CZSL and achieve remarkable performance. However, the existing CLIP-based CZSL methods focus mainly on text prompt tuning, which lacks the flexibility to dynamically adapt both modalities. To solve this issue, an intuitive solution is to additionally introduce visual prompt tuning. This insight is not trivial to achieve because effectively learning prompts for CZSL involves the challenge of entanglement between visual primitives as well as appearance shifts in different compositions. In this paper, we propose a novel Synergetic Prompts as Disentanglement Queries (SPDQ) framework for CZSL. It can disentangle primitive features based on synergetic prompts to jointly alleviate these challenges. Specifically, we first design a low-rank primitive modulator to produce synergetic adaptive attribute and object prompts based on prior knowledge of each instance for model adaptation. Then, we additionally utilize text prefix prompts to construct synergetic prompt queries, which are used to resample corresponding visual features from local visual patches. Comprehensive experiments conducted on three benchmarks demonstrate that our SPDQ approach achieves state-of-the-art results.

Abstract:
With high embedding capacity and security, transform coefficient-based video steganography has become an important branch of video steganography. However, existing steganalysis methods against transform coefficient-based steganography give insufficient consideration to the prediction process of HEVC compression, which results in steganalysis that is not straightforward and fails to effectively detect adaptive steganography methods in low embedding rate scenarios. In this paper, an HEVC video steganalysis method based on centralized error and attention mechanism against transform coefficient-based steganography is proposed. Firstly, the centralized error phenomenon brought by distortion compensation-based steganography is analyzed, and prediction error maps are constructed for steganalysis to achieve a higher signal-to-noise ratio (SNR). Secondly, a video steganalysis network called CESNet (Centralized Error Steganalysis Network) is proposed. The network takes the prediction error maps as input, and four types of convolutional modules are designed to adapt to different stages of feature extraction. To address the intra-frame sparsity of adaptive steganography, CEA (Centralized Error Attention) modules based on spatial and channel attention mechanisms are proposed to adaptively enhance the steganographic region. Finally, after extracting the feature vectors of each frame, the detection of steganographic video is completed using the self-attention mechanism. Experimental results show that compared with the existing transform coefficient-based video steganalysis methods, the proposed method can effectively detect multiple transform coefficient-based steganography algorithms and achieve higher detection performance in low payload scenarios.

Abstract:
Recent progress in weakly-supervised object detection (WSOD) is characterized by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, since most WSOD methods only use image-level annotations, the serial stacking of convolutional blocks in MIDN cannot effectively model multi-channel information, often emphasizing only the most prominent parts of the target while ignoring the entire object, thus affecting detection performance. In this paper, we investigate how to effectively use multi-channel data to improve the model’s ability to detect long-range dependencies, introducing CC-DETR (Cross-Channel DETR), a new weakly-supervised object detection framework. Specifically, we propose Cross-Channel Adaptive Convolution (CCAC), a module that captures different spatial features at multiple scales, increases the receptive field, and adaptively weights each important feature to guide the model to focus on long-range dependencies. Moreover, we design a new attention mechanism called Dual-Stream Self-Attention (DSSA). This mechanism uses convolutions with adaptive sizes to capture multi-scale information, preserving long-range dependencies while supporting local feature responses. Extensive experiments demonstrate that our proposed method outperforms the current end-to-end state of the art (+2.3% mAP in VOC, +2.3% AP_50 in COCO). Moreover, our method can be easily integrated into various DETR and ViT models with minimal modifications.

Abstract:
Video virtual try-on aims to generate realistic sequences where garments maintain their identity and adapt accurately to a person’s pose and body shape in the source video. This task can be regarded as video inpainting, whereas previous methods focus primarily on the specific try-on region while simply “copying” the remaining parts of the person. However, this approach limits the degrees of freedom and heavily relies on precise human parsing. In complex in-the-wild scenarios, dynamic blurring and limb occlusions can introduce errors and discontinuities in the inpainting regions, adversely affecting the video try-on results. Our solution, VidClothEditor, adopts a relaxed editing approach that allows for full-body inpainting and treats non-edited regions as a reconstruction task. It utilizes multiple garment alignment with a proposed region guidance to enhance the naturalness of video try-on results. Additionally, we employ garment-augmented video consistency learning, which significantly reduces the inference time and increases the practical potential for video editing. Comprehensive experiments on the VITON-HD and TikTok datasets confirm VidClothEditor’s ability to generate high-quality images and smooth videos.

Abstract:
Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. Additionally, existing multi-modal fusion methods generally require customized designs based on the specific sensor combinations or setups, which limits the flexibility and generality of these methods. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to the influence of noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens and by leveraging the inherent flexibility of Transformer models through our handcrafted modality sampling module, AdaptiveFusion is able to cope with arbitrary numbers of inputs and accommodate noisy modalities with only a single training network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods.

Abstract:
Video-based point cloud compression (V-PCC) developed by MPEG has achieved remarkable compression efficiency for dynamic point clouds. However, the point clouds compressed by V-PCC still suffer from serious artifacts due to the lossy compression and lose a large number of points. In this paper, we propose a new geometry quality enhancement method for V-PCC compressed point clouds that can effectively recover the lost points. Our method is applied to the 2D projected near and far frames rather than 3D point clouds. It is designed to enhance the quality of 2D frames, guided by the predicted difference information between them. More specifically, we first construct a gradient-based difference prediction network (G-DPnet) to predict the difference between near and far frames. This difference is introduced in the enhancement of 2D frames, for the recovery of the lost 3D points. Meanwhile, we propose the single-frame quality enhancement network (SFQEnet) to separately enhance near and far frames. The enhanced frames are then used to produce the near-far frame difference with G-DPnet. After obtaining the difference, we feed it into a dual-frame quality enhancement network (DFQEnet) to guide the further enhancement of near and far frames. Experimental results demonstrate that our method can effectively recover a large number of lost points and improve the quality of point clouds compressed by V-PCC.

Abstract:
Point cloud completion aims to infer complete point clouds from incomplete ones. In real-world scenarios, where paired data is absent, self-supervised methods have emerged as a promising solution. Although existing self-supervised methods perform well at relatively low resolutions, they suffer significant performance degradation at higher resolutions, primarily because they focus on point cloud reconstruction at the patch level or point level. In this paper, we propose a self-supervised method based on Geometric Continuity and Consistency Learning (GCCL) at the multi-scale level to improve the accuracy of predicting local details and global shapes of point clouds. Specifically, to capture local details, we employ a patch-to-point strategy and a coarse-to-fine manner for geometric continuity learning. To constrain the global shapes, we construct multiple branches for mutual supervision and utilize class priors to build a memory queue for contrasting current features, enhancing the network’s focus on geometric consistency learning. We evaluate GCCL on multiple datasets, and the results show that our method outperforms existing self-supervised methods, improving CD-ℓ₂ by 4.4 on the synthetic PCN dataset, and can generate more uniformly distributed completion results on real-world datasets.
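
For readers unfamiliar with the metric, the ℓ₂ Chamfer distance (CD) between a predicted and a reference point cloud can be sketched as follows. This is a brute-force illustration under one common convention (mean squared nearest-neighbour distance, summed over both directions); papers differ in averaging and scaling, and the exact convention used for the number reported above is not stated in the abstract.

```python
def squared_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_l2(pred, gt):
    """Symmetric Chamfer distance using squared nearest-neighbour distances."""
    pred_to_gt = sum(min(squared_dist(p, q) for q in gt) for p in pred) / len(pred)
    gt_to_pred = sum(min(squared_dist(q, p) for p in pred) for q in gt) / len(gt)
    return pred_to_gt + gt_to_pred


# Toy example: the prediction misses one of the three reference points.
pred = [(0, 0, 0), (1, 0, 0)]
gt = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
print(chamfer_l2(pred, gt))
```

The first term penalizes predicted points that lie far from the reference shape, and the second penalizes reference regions that the prediction fails to cover, which is why the metric is sensitive to incomplete outputs.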

Abstract:
Incomplete Multi-view Clustering (IMVC) endeavors to harness information from multiple incomplete views to partition multi-view data into their respective clusters. How to recover missing information with lossless fidelity is the core of IMVC, which is of vital importance but challenging. Most of the existing methods include a feature recovery step to mitigate the negative impact of missing samples on the feature graph; however, these IMVC algorithms simply utilize the correlation between samples to recover the relationship between the unmissing instances and the missing instances while ignoring the consistency between views, which often leads to unsatisfactory recovery results. In addition, previous IMVC algorithms focus more on the recovery of incomplete data, ignoring the effect of the error term on incomplete graphs. This can mislead the recovery process of the IMVC algorithm, and the feature graph can be affected by anomalous information, which leads to degradation of clustering performance. To address this gap, this paper introduces the Tensor Completion Framework by Graph Refinement for Incomplete Multi-view Clustering (IMVC-TGR). IMVC-TGR separates the redundant information in each affine graph by a graph refinement operation, aiming to mitigate the negative impact of error terms and redundant information on the feature graph during the recovery process. Meanwhile, IMVC-TGR stacks the feature graphs into tensors to explore intra-view correlation and inter-view consistency, so as to recover the relationship between missing samples and non-missing samples, and improve the quality of the feature graphs. Finally, IMVC-TGR introduces semantic consistency constraints and self-weighted fusion strategies into the high-quality feature graphs, aiming at preserving the complementary information between different views while balancing the contributions of the refined representation matrices of different views. The experimental results on multiple different datasets indicate that IMVC-TGR can achieve state-of-the-art performance.

Abstract:
Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model–based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model’s feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image’s brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges.

Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing, China; Information Management and Information Systems, Xian Jiaotong Liverpool University, Suzhou, China; State Key Laboratory of Millimeter Waves, School of Information Science and Engineering, Southeast University, Nanjing, China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Computer and Information Science, University of Macau, Macau, China

Abstract:
Learning-based methods for underwater image enhancement (UWIE) have undergone extensive exploration. However, learning-based models are usually vulnerable to adversarial examples, and UWIE models are no exception. To the best of our knowledge, there is no comprehensive study on the adversarial robustness of UWIE models, which indicates that UWIE models are potentially under the threat of adversarial attacks. In this paper, we propose a general adversarial attack protocol. We make a first attempt to conduct adversarial attacks on five well-designed UWIE models on three common underwater image benchmark datasets. Considering the scattering and absorption of light in the underwater environment, there exists a strong correlation between color correction and underwater image enhancement. On the basis of that, we also design two effective UWIE-oriented adversarial attack methods, Pixel Attack and Color Shift Attack, targeting different color spaces. The results show that the five models exhibit varying degrees of vulnerability to adversarial attacks, and that well-designed small perturbations on degraded images are capable of preventing UWIE models from generating enhanced results. In addition, we conduct adversarial training on these models and successfully mitigate the effectiveness of adversarial attacks. In summary, we reveal the adversarial vulnerability of UWIE models and propose a new evaluation dimension of UWIE models.

Abstract:
Image-text matching aims to retrieve images from the guidance of textual queries or retrieve text expressions with the help of images. Existing Transformer-based methods compute attention for all tokens and thus suffer from redundant information, resulting in inadequate focus on salient features. On the other hand, the widely adopted bidirectional ranking loss overlooks the importance of expanding the distance between positive and negative samples, leading to the misclassification of negative samples as positive ones. In this work, we propose a similarity shuffled criss-cross Transformer (SSCT) with angle loss for image-text matching. Specifically, a grouping-shuffling operation is introduced to better distinguish salient features from redundant information, bypassing the need for fully connected mapping. The grouping-shuffling operation establishes channel dependencies across different groups of feature representations, enhancing salient features while suppressing unimportant ones. Then, a criss-cross attention mechanism that equips self-attention with a novel criss-cross convolution is designed to make isolated information cooperatively express integral semantics. Moreover, a novel angle loss is introduced to expand the distances between positive and negative samples. Extensive experiments on the benchmark datasets of MSCOCO and Flickr30K demonstrate that the proposed method achieves superior performance compared to state-of-the-art methods.
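
One common way to establish dependencies across channel groups without a fully connected mapping is the channel shuffle used alongside grouped convolutions. The sketch below shows that generic permutation; whether SSCT uses exactly this form of grouping-shuffling is an assumption, and the function name `channel_shuffle` is illustrative.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each output group mixes channels from every input group."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]


# Two groups of three channels, labelled by group for readability.
features = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(features, groups=2)
print(shuffled)
```

After the shuffle, any subsequent per-group operation sees channels that originated in different groups, which gives cross-group interaction at the cost of a simple permutation.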

Abstract:
The unpaired point cloud completion task aims to complete a partial point cloud by using models trained with no ground truth. Existing unpaired point cloud completion methods are class-aware, i.e., a separate model is needed for each object class. Since they have limited generalization capabilities, these methods perform poorly in real-world scenarios when confronted with a wide range of point clouds of generic 3D objects. In this paper, we propose a novel unpaired point cloud completion framework, namely the Reference-guided Completion (RefComp) framework, which attains strong performance in both the class-aware and class-agnostic training settings. The RefComp framework transforms the unpaired completion problem into a shape translation problem, which is solved in the latent feature space of the partial point clouds. To this end, we introduce the use of partial-complete point cloud pairs, which are retrieved by using the partial point cloud to be completed as a template. These point cloud pairs are used as reference data to guide the completion process. Our RefComp framework uses a reference branch and a target branch with shared parameters for shape fusion and shape translation via a Latent Shape Fusion Module (LSFM) to enhance the structural features along the completion pipeline. Extensive experiments demonstrate that the RefComp framework achieves not only state-of-the-art performance in the class-aware training setting but also competitive results in the class-agnostic training setting on both virtual scans and real-world datasets.

Abstract:
Kernel image regression methods have demonstrated excellent efficiency in various image processing tasks, including image and light-field compression, Gaussian Splatting, denoising and super-resolution. The estimation of parameters for these methods commonly employs gradient descent iterative optimization, which poses a significant computational burden for many applications. In this paper, we introduce a novel adaptive segmentation-based initialization method targeted for optimizing Steered-Mixture-of-Experts (SMoE) gating networks and Radial-Basis-Function (RBF) networks with steering kernels. The novel initialization method allocates kernels into pre-calculated image segments. The optimal number of kernels, kernel positions, and steering parameters are derived per segment in an iterative optimization and kernel sparsification procedure. The kernel information from local segments is then transferred into a global initialization, ready for use in iterative optimization of SMoE, RBF, and related kernel image regression methods. Results demonstrate significant improvements in both objective and subjective quality compared to regular grid, K-Means, deep-learning-based, and previous segmentation-based initialization methods. The proposed initialization method reduces kernel usage by 70% compared to other initialization methods while maintaining the same reconstruction quality. Furthermore, by generating initial parameters closer to optimized results, convergence time is reduced, achieving overall runtime savings of up to 50% compared to prior methods. Additionally, the method supports parallel computation, with initialization time halved when using four GPUs compared to one.

Abstract:
Coded aperture snapshot spectral imaging (CASSI) captures 3D hyperspectral images (HSIs) in a single shot by encoding incident light into 2D measurements. However, recovering the original hyperspectral data from these measurements is a severely ill-posed inverse problem due to significant information loss during compression. Recent deep learning methods, especially deep unfolding networks, have demonstrated promising reconstruction results by embedding learnable priors into iterative optimization frameworks. However, most existing approaches use a single network to jointly estimate spatial and spectral priors, limiting their ability to handle the distinct properties of HSIs. To overcome this limitation, we propose the Spatial-Spectral Prior Decoupling Model (SSPD), which reformulates HSI reconstruction as a prior absorption problem, enabling independent modeling of spatial and spectral priors with specialized network architectures. To achieve this, we design two attention mechanisms tailored for hyperspectral data: one for capturing spatial correlations and another for preserving spectral signatures. Additionally, we develop a hybrid loss function that combines convergence constraints and cross-prior interactions, ensuring accurate prior fusion and stable reconstruction. Experiments on synthetic and real-world datasets confirm that SSPD outperforms existing methods in spectral snapshot compressive imaging.
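
The compression that makes this inverse problem ill-posed can be seen in a simplified forward model of a single-disperser CASSI system: each spectral band is modulated by the coded aperture, shifted along one axis in proportion to its band index, and summed on the sensor. The sketch below assumes an integer shift of `step` pixels per band and ignores noise; the function name and parameters are illustrative, not part of SSPD.

```python
def cassi_measure(cube, mask, step=1):
    """Simulate a 2D measurement from cube[b][i][j] (B bands of an H x W scene).

    mask[i][j] is the coded aperture; band b is shifted right by step * b pixels.
    Returns an H x (W + step * (B - 1)) measurement.
    """
    bands, h, w = len(cube), len(cube[0]), len(cube[0][0])
    meas = [[0.0] * (w + step * (bands - 1)) for _ in range(h)]
    for b in range(bands):
        for i in range(h):
            for j in range(w):
                meas[i][j + step * b] += mask[i][j] * cube[b][i][j]
    return meas


# Toy scene: 2 bands of 2 x 2 pixels (8 unknowns) and a binary mask.
cube = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
mask = [[1, 0],
        [0, 1]]
measurement = cassi_measure(cube, mask)
print(measurement)
```

Here eight unknown values collapse into six sensor readings, several of which are zeroed by the mask, so the scene cannot be recovered without prior knowledge of spatial and spectral structure.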

Abstract:
Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task consisting of answering the visual question and generating multimodal explanations for the reasoning processes. Unlike the traditional Visual Question Answering (VQA) task, which only aims at predicting answers for visual questions, EVQA also aims to generate user-friendly explanations to improve the explainability and credibility of reasoning models. To date, existing methods for VQA and EVQA ignore the prompt in the question and force the model to predict the probabilities of all answers. Moreover, existing EVQA methods ignore the complex relationships among question words, visual regions, and explanation tokens. Therefore, in this work, we propose a Logic Integrated Neural Inference Network (LININ) to restrict the range of candidate answers based on first-order logic (FOL) and capture cross-modal relationships to generate rational explanations. Firstly, we design a FOL-based question analysis program to fetch a small number of candidate answers. Secondly, we utilize a multimodal transformer encoder to extract visual and question features, and conduct the prediction on candidate answers. Finally, we design a multimodal explanation transformer to construct cross-modal relationships and generate rational explanations. Comprehensive experiments on benchmark datasets demonstrate the superiority of LININ compared with the state-of-the-art methods for EVQA.

Abstract:
Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition by exploiting the adjacency topology of body representation. However, the adaptive strategy adopted by previous methods to construct the adjacency matrix does not balance performance against computational cost. We call this the Adaptive Trap and assume that the adaptive strategy can be replaced by multiple autonomous submodules, thereby simultaneously enhancing the dynamic joint representation and effectively reducing network resources. To effectuate the substitution of the adaptive model, we unveil two distinct strategies, both yielding comparable effects. (1) Optimization. Individuality and Commonality GCNs (IC-GCNs) are proposed to specifically optimize the construction method of the associativity adjacency matrix for adaptive processing. The uniqueness and co-occurrence between different joint points and frames in the skeleton topology are effectively captured through methodologies like preferential fusion of physical information, extreme compression of multi-dimensional channels, and simplification of the self-attention mechanism. (2) Replacement. Auto-Learning GCNs (AL-GCNs) are proposed to remove popular adaptive modules and utilize human key points as motion compensation to provide dynamic correlation support. AL-GCNs construct a fully learnable group adjacency matrix in both spatial and temporal dimensions, resulting in an elegant and efficient GCN-based model. In addition, three effective tricks for skeleton-based action recognition (Skip-Block, Bayesian Weight Selection Algorithm, and Simplified Dimensional Attention) are presented and analyzed in this paper. Finally, we employ the variable channel and grouping method to explore the hardware resource bound of the two proposed models. IC-GCN and AL-GCN exhibit impressive performance across NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human datasets, with an exceptional parameter-cost ratio.

Abstract:
To provide an image compression method with better compression performance and lower computational complexity, a new image compression algorithm is proposed in this paper. First, a double-layer non-uniform partition algorithm is proposed, which analyzes the texture complexity of image blocks and performs partitioning and merging of the image blocks at different scales to provide a priori information that helps to reduce the spatial redundancy for subsequent compression of the blocks. Next, by considering the multi-transform cores, we propose an adaptive U transform scheme, which performs more specific coding for different types of image blocks to enhance the coding performance. Finally, to make the bit allocation more flexible and accurate, a fully adaptive quantization technique is proposed. It not only formulates the quantization coefficient relationship between image blocks of different sizes but also further refines the quantization coefficient relationship between image blocks under different topologies. Extensive experiments indicate that the compression performance of the proposed algorithm not only significantly surpasses JPEG but also surpasses some state-of-the-art compression algorithms with similar computational complexity. In addition, compared with the JPEG2000 compression algorithm, which has higher computational complexity, its compression performance also has certain advantages.

Abstract:
The task of chest X-ray report generation, which aims to simulate the diagnosis process of doctors, has received widespread attention. Compared with the image caption task, chest X-ray report generation is more challenging since it needs to generate a longer and more accurate description of each diagnostic part in chest X-ray images. Most existing works focus on how to extract better visual features or more accurate text expression based on existing reports. However, they ignore the interactions between visual and text modalities and are thus not in line with human thinking. A small number of works explore the interactions of visual and text modalities, but data-driven learning of cross-modal information mapping cannot bridge the semantic gap between different modalities. In this work, we propose a novel approach called Knowledge-guided Cross-modal Alignment and Progressive fusion (KCAP), which takes the knowledge words from a created medical knowledge dictionary as the bridge to guide the cross-modal feature alignment and fusion, for accurate chest X-ray report generation. In particular, we create the medical knowledge dictionary by extracting medical phrases from the training set and then selecting some phrases with substantive meanings as knowledge words based on their frequency of occurrence. Based on the knowledge words from the medical knowledge dictionary, the visual and text modalities interact through a mapping layer that enhances the features of both modalities, and then the alignment fusion module is introduced to mitigate the semantic gap between visual and text modalities. To retain the important details of the original information, we design a progressive fusion scheme to integrate the advantages of both salient fused and original features to generate better medical reports. The experimental results on IU-Xray and MIMIC datasets demonstrate the effectiveness of the proposed KCAP.

Abstract:
Lidars and cameras are critical sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, accurate and robust fusion methods are still under exploration due to non-homogenous representations. In this paper, we find that the complementary roles of point clouds and images vary with depth. An important reason is that the point cloud appearance changes significantly with increasing distance from the Lidar, while the image's edge, color, and texture information are not sensitive to depth. To address this, we propose a fusion module based on the Depth Attention Mechanism (DAM), which mainly consists of two operations: gated feature generation and point cloud division. The former adaptively learns the importance of bimodal features without additional annotations, while the latter divides point clouds to achieve differential fusion of multi-modal features at different depths. This fusion module can enhance the representation ability of original features for different point sets and provide more comprehensive features by using the dual splicing strategy of concatenation and index connection. Additionally, considering point density as a feature and its negative correlation with depth, we build an Adaptive Threshold Generation Network (ATGN) to generate the depth threshold by extracting density information, which can divide point clouds more reasonably. Experiments on the KITTI dataset demonstrate the effectiveness and competitiveness of our proposed models.

Abstract:
Downsampling is a crucial task for processing large-scale and/or dense point clouds with limited resources. Owing to the development of deep learning, approaches to task-oriented point cloud downsampling have achieved significant performance gains in preserving geometric information. However, most downsampling methods are limited by the disordered and unstructured point cloud data, making it difficult to continually improve the performance. To address this issue, we propose a light-weight Transformer network (LighTN) for task-oriented point cloud downsampling as an end-to-end solution. In LighTN, we design an energy-efficient and permutation-invariant single-head self-correlation module to extract refined global geometric features. Moreover, we present a novel sampling loss function to guide LighTN to focus on critical point cloud regions with more uniform distributions and prominent point coverage. Extensive experiments on classification, registration, and reconstruction tasks demonstrate that LighTN can achieve the state-of-the-art performance-overhead tradeoff and high-quality qualitative results.

Abstract:
Many point cloud completion methods typically rely on two steps: coarse generation and 2D grid deformed fine output. However, in the fine generation, the expansion range (2D grid scale) required by each point cloud sample may be vastly different. For example, if the expansion range for a vessel shape is applied to a table shape, the final output may be blurry or sparse. To this end, we propose the RLGrid, Reinforcement Learning Controlled Grid Deformation. In detail, we firstly obtain two point cloud skeletons by two branches. One is to use an autoencoder, and the other is to convert the randomly generated normal distribution to coarse point cloud by GAN. We choose the one with smaller Chamfer Distance between coarse output and incomplete input as the input of the second stage. Then, a Reinforcement Learning (RL) agent is designed to select the appropriate expansion range based on the feature of each point cloud, and generate a 2D grid. Finally, all the features are concatenated and sent into a Multilayer Perceptron to obtain the detailed complete point cloud. Experimental results show that RLGrid achieves state-of-the-art performance on various datasets. To the best of our knowledge, RL is not widely used in point cloud completion task due to lack of custom environment, and the proposed RLGrid provides an insight on how to formulate 2D grid deformation as a sequential decision making problem. Further, it can also be plug-and-play on any 2D grid features.
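
The branch-selection step described above (keep the coarse output with the smaller Chamfer Distance to the incomplete input) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `chamfer_distance` and `select_coarse`, the toy point sets, and the use of unsquared Euclidean distances averaged in both directions are all assumptions made for illustration.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets (lists of xyz tuples).

    Assumed variant: mean nearest-neighbour Euclidean distance, summed over
    both directions. Other variants use squared distances.
    """
    def one_way(src, dst):
        # Mean distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def select_coarse(partial_input, candidates):
    """Return (name, points) of the candidate closest to the partial input."""
    return min(
        candidates.items(),
        key=lambda item: chamfer_distance(item[1], partial_input),
    )

# Toy data: hypothetical outputs of the two coarse branches.
partial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidates = {
    "autoencoder": [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)],
    "gan": [(0.5, 0.5, 0.0), (2.0, 0.0, 0.0)],
}
best_name, best_points = select_coarse(partial, candidates)
print(best_name)
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is acceptable for a toy example but would normally be replaced by a batched or tree-based search.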

Abstract:
Point cloud analysis, arising from computer graphics, remains a fundamental but challenging problem, mainly due to the non-Euclidean property of the point cloud data modality. With the rapid increase in the amount and breadth of related research in deep learning for graphs, many important works come in the form of graphs representing the point clouds. In this paper, we present a sampling adaptive graph convolutional network that combines the powerful representation ability of random walk subgraph searching and the essential success of the Fisher vector. Extending existing graph representation learning or embedding methods that use multi-hop neighbor random searching, we sample multi-scale walk fields by using a steerable exploration-exploitation second-order random walk, which gives our model more flexibility than the original first-order random walk. To encode the walk field at each scale, which consists of several walk paths, we characterize these paths by Gaussian mixture models (GMMs) so as to better mirror standard CNNs on the Euclidean modality. Each Gaussian component implicitly defines a direction, and together they encode the spatial layout of walk fields after the gradient is projected to the space of Gaussian parameters, i.e., the Fisher vectors. We name our deep graph convolutional network PointFisher. Comprehensive evaluations on several public datasets demonstrate the superiority of our proposed learning method over other state-of-the-art methods for point cloud classification and segmentation.

Abstract:
Keypoint detection and descriptor matching are two vital steps in the 3D feature extraction framework, but they are difficult to learn in an end-to-end fashion due to their inherent discreteness. To tackle the non-differentiable operations, we formulate feature extraction as a decision-making problem: the network is treated as a policy pool that can make probabilistic estimations for keypoint selection and feature matching, supervised by maximizing a reward expectation of actions. In this way, we propose a novel end-to-end training paradigm of 3D feature extraction based on the stochastic policy gradient method, named Reinforced Detectors and Descriptors (RDD). Firstly, we propose a local-to-global probabilistic keypoint selection module that formulates the sampling probabilities of keypoints in a local-and-global mechanism to yield sparse and accurate keypoints. Secondly, we regard feature matching as an optimal transport problem and an efficient Sinkhorn method is leveraged to solve the optimal matching probabilities. In particular, we carefully design a reward function and derive gradients of probabilistic actions, thus overcoming the discreteness and providing reinforced supervision signals. Since our reward function is calculated from sampled keypoints rather than from randomly sampled points as in existing methods, the gap between training and inference is bridged. Experimental results demonstrate that our approach exceeds the quality of state-of-the-art methods and shows strong generalization ability. Remarkably, our approach can achieve significantly higher Registration Recall than other advanced methods when aligning scenes with a small number of keypoints, due to our highly accurate and repeatable detector.
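
The abstract above treats feature matching as an optimal transport problem solved with a Sinkhorn method. The following is a minimal textbook sketch of Sinkhorn iterations, not the RDD implementation: the function name `sinkhorn`, the uniform marginals, the regularisation strength `epsilon`, the iteration count, and the toy cost matrix are all assumptions for illustration.

```python
import math

def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularised optimal transport with uniform marginals.

    cost[i][j] is the matching cost between source keypoint i and target
    keypoint j (for example a descriptor distance). Returns a transport plan
    whose entries can be read as soft matching probabilities.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost maps to high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost: keypoint 0 matches target 0, keypoint 1 matches target 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
for row in plan:
    print([round(p, 4) for p in row])
```

Because every step is a differentiable rescaling, the same iterations can be written in an autodiff framework so that matching probabilities receive gradients. For small `epsilon` or large costs, a log-domain formulation is usually preferred to avoid underflow in the kernel.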

Abstract:
Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. Point cloud semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.

Abstract:
Monocular depth prediction has received significant attention in recent years. However, the impact of illumination variations, which can shift scenes to unseen domains, has often been overlooked. To address this, we introduce the first indoor scene dataset featuring RGB-D images captured under multiple illumination conditions, allowing for a comprehensive exploration of indoor depth prediction. Additionally, we propose a novel method, MI-Transformer, which leverages global illumination understanding through large receptive fields to capture depth-attention contexts. This enables our network to overcome local window limitations and effectively mitigate the influence of changing illumination conditions. To evaluate the performance and robustness, we conduct extensive qualitative and quantitative analyses on both the proposed dataset and existing benchmarks, comparing our method with state-of-the-art approaches. The experimental results demonstrate the superiority of our method across various metrics, making it the first solution to achieve robust monocular depth estimation under diverse illumination conditions.

Affiliations: School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; National Clinical Research Center for Geriatric Diseases, Department of Otolaryngology Head and Neck Surgery, Xiangya Hospital, Central South University, Changsha, China

Abstract:
The spectral shape holds crucial information for Audio Classification (AC), encompassing the spectrum's envelope, details, and dynamic changes over time. Conventional methods utilize cepstral coefficients for spectral shape description but overlook its variation details. Deep-learning approaches capture some dynamics but demand substantial training or fine-tuning resources. The Learning in the Model Space (LMS) framework precisely captures the dynamic information of temporal data by utilizing model fitting, even when computational resources and data are limited. However, applying LMS to audio faces challenges: 1) The high sampling rate of audio hinders efficient data fitting and capturing of dynamic information. 2) The Dynamic Information of Partial Spectral Shapes (DIPSS) may enhance classification, as only specific spectral shapes are relevant for AC. This paper extends an AC framework called Effective Dynamic Information Capture (EDIC) to tackle the above issues. EDIC constructs Mel-Frequency Cepstral Coefficients (MFCC) sequences within different dimensional intervals as the fitted data, which not only reduces the number of sequence sampling points but can also describe the change of the spectral shape in different parts over time. EDIC enables us to implement a topology-based selection algorithm in the model space, selecting effective DIPSS for the current AC task. The performance on three tasks confirms the effectiveness of EDIC.

Abstract:
Few-shot learning is a crucial aspect of modern machine learning that enables models to recognize and classify objects efficiently with limited training data. The shortage of labeled 3D point cloud data calls for innovative solutions, particularly when novel classes emerge more frequently. In this paper, we propose a novel few-shot learning method for recognizing 3D point clouds. More specifically, this paper addresses the challenges of applying few-shot learning to 3D point cloud data, which poses unique difficulties due to the unordered and irregular nature of these data. We propose two new modules for few-shot based 3D point cloud classification, i.e., the Soft Interaction Module (SIM) and Self-Attention Residual Feedforward (SARF) Module. These modules balance and enhance the feature representation by enabling more relevant feature interactions and capturing long-range dependencies between query and support features. To validate the effectiveness of the proposed method, extensive experiments are conducted on benchmark datasets, including ModelNet40, ShapeNetCore, and ScanObjectNN. Our approach demonstrates superior performance in handling abrupt feature changes occurring during the meta-learning process. The results of the experiments indicate the superiority of our proposed method by demonstrating its robust generalization ability and better classification performance for 3D point cloud data with limited training samples.

Abstract:
Text Super-Resolution (SR) technology aims to recover lost information in low-resolution text images. With the proposal of TextZoom, which is the first dataset aiming at text super-resolution in real scenes, more and more scene text super-resolution models have been presented on the basis of it. Although these methods have achieved excellent performance, they do not consider how to make full and efficient use of semantic information. Out of this consideration, a Semantic-aware Trident Network (STNet) for Scene Text Image Super-Resolution is proposed. Specifically, pre-trained text recognition model ASTER (Attentional Scene Text Recognizer) is utilized to assist this process in two ways. Firstly, a novel basic block named Semantic-aware Trident Block (STB) is designed to build the STNet, which incorporates an added branch for semantic distillation to learn semantic information of pre-trained recognition model. Secondly, we expand our model in an adversarial training manner and propose new text perceptual loss based on ASTER to further enhance semantic information in SR images. Extensive experiments on TextZoom dataset show that compared with directly recognizing bicubic images, the proposed STNet boosts the recognition accuracy of ASTER, MORAN (Multi-Object Rectified Attention Network), and CRNN (Convolutional Recurrent Neural Network) by 17.4%, 18.2%, and 24.3%, respectively, which is higher than the performance of several existing state-of-the-art (SOTA) SR network models. Besides, experiments in real scenes (on ICDAR 2015 dataset) and in restricted scenarios (defense against adversarial attacks) validate that addition of semantic information enables the proposed method to achieve promising cross-dataset performance. 
Since the proposed method is trained on cropped images, when applied to real-world scenarios, locations of text in natural images are firstly localized through scene text detection methods, and then cropped text images are obtained based on detected text positions.

Abstract:
Few-shot action recognition aims to identify new action classes with limited training samples. Most existing methods overlook the low information content and diversity of skeleton features, failing to exploit useful information in rare samples during meta-training. This leads to poor feature discriminability and recognition accuracy. To address both issues, we propose a novel Enriched Skeleton Representation and Multi-relational Metrics (ESR-MM) method for skeleton-based few-shot action recognition. First, a Frobenius Norm Diversity Loss is introduced to enrich skeleton representation by maximizing the Frobenius norm of the skeleton feature matrix. This mitigates over-smoothing and boosts information content and diversity. Leveraging these enriched features, we propose a multi-relational metrics strategy exploiting cross-sample task-specific information, intra-sample temporal order, and inter-sample distance. Specifically, Support-Adaptive Attention leverages task-specific cues between samples to generate attention-enhanced features. Then, the Bidirectional Temporal Coherent Mean Hausdorff Metric integrates Temporal Coherence Measure into the Bidirectional Mean Hausdorff Metric for class separation by accounting for temporal order. Finally, Prototype-discriminative Contrastive Loss exploits distances from class prototypes to query samples. ESR-MM demonstrates superior performance on two benchmarks.

Abstract:
Unforeseen appearance variation is a challenging factor for visual tracking. This paper provides a novel solution from semantic data augmentation, which facilitates offline training of trackers for better generalization. We utilize existing samples to obtain knowledge to augment another in terms of diversity and hardness. First, we propose that the similarity matching space in Siamese-like models has class-agnostic transferability. Based on this, we design the Latent Augmentation (LaAug) to transfer relevant variations and suppress irrelevant ones between training similarity embeddings of different classes. Thus the model can generalize across a more diverse semantic distribution. Then, we propose the Semantic Interaction Mix (SIMix), which interacts moments between different feature samples to contaminate structure and texture attributes and retain other semantic attributes. SIMix simulates the occlusion and complements the training distribution with hard cases. The mixed features with adversarial perturbations can empirically enable the model against external environmental disturbances. Experiments on six challenging benchmarks demonstrate that three representative tracking models, i.e., SiamBAN, TransT and OSTrack, can be consistently improved by incorporating the proposed methods without extra parameters and inference cost.

Abstract:
Photo retouching aims to adjust the hue, luminance, contrast, and saturation of the image to make it more human and aesthetically desirable. Based on research into the imaging process and artists' retouching processes, we propose three improvements to existing automatic retouching methods. Firstly, previous retouching methods ignored all the imaging conditions recorded in EXIF. Accordingly, we design a simple module, called ECM (EXIF Condition Module), to introduce these imaging conditions into the network. This module can improve the performance of several existing auto-retouching methods with only a small parameter cost. Secondly, artists' operations were also ignored. By investigating artists' operations in retouching, we propose a two-stage network that brightens images first and then enriches them in the chrominance plane to mimic artists. Finally, we find that there is a color imbalance in the existing retouching datasets; thus, a hue palette loss is designed to resolve the imbalance and make the image more vibrant. Experimental results show that our method is effective on the benchmark MIT-Adobe FiveK and PPR10K datasets, and achieves SOTA performance in both quantitative and qualitative evaluation.

Abstract:
In the past decade, despite significant advancements in Artificial Intelligence (AI) and deep learning technologies, they still fall short of fully replicating the complex functions of the human brain. This highlights the importance of researching human-machine collaborative systems. This study introduces a statistical framework capable of finely modeling integrated performance, breaking it down into the individual performance term and the diversity term, thereby enhancing interpretability and estimation accuracy. Extensive multi-granularity experiments were conducted using this framework on various image classification datasets, revealing the differences between humans and machines in classification tasks from macro to micro levels. This difference is key to improving human-machine collaborative performance, as it allows for complementary strengths. The study found that Human-Machine collaboration (HM) often outperforms individual human (H) or machine (M) performances, but not always. The superiority of performance depends on the interplay between the individual performance term and the diversity term. To further enhance the performance of human-machine collaboration, a novel Human-Adapter-Machine (HAM) model is introduced. Specifically, HAM can adaptively adjust decision weights to enhance the complementarity among individuals. Theoretical analysis and experimental results both demonstrate that HAM outperforms the traditional HM strategy and the individual agent (H or M).

Abstract:
Image enhancement algorithms can facilitate computer vision tasks in real applications. However, various distortions may also be introduced by image enhancement algorithms. Therefore, image quality assessment (IQA) plays a crucial role in accurately evaluating enhanced images to provide dependable feedback. Current enhanced IQA methods are mainly designed for single specific scenarios, resulting in limited performance in other scenarios. Besides, no-reference methods predict quality using enhanced images alone, ignoring the available degraded images that contain valuable information, and are therefore not reliable enough. In this work, we propose a degraded-reference image quality assessment method based on dual residual-guided interactive learning (DRGQA) for the enhanced images in multiple scenarios. Specifically, a global and local feature collaboration module (GLCM) is proposed to imitate the perception of observers to capture comprehensive quality-aware features by using convolutional neural networks (CNN) and Transformers in an interactive manner. Then, we investigate the structure damage and color shift distortions that commonly occur in the enhanced images and propose a dual residual-guided module (DRGM) to make the model concentrate on the distorted regions to which the human visual system (HVS) is sensitive. Furthermore, a distortion-aware feature enhancement module (DEM) is proposed to improve the representation abilities of features in deeper networks. Extensive experimental results demonstrate that our proposed DRGQA achieves superior performance with lower computational complexity compared to the state-of-the-art IQA methods.

Abstract:
Infrared and visible image fusion is currently an important research direction in the field of multimodal image fusion, which aims to utilize the complementary information between infrared images and visible images to generate a new image containing richer information. In recent years, many deep learning-based methods for infrared and visible image fusion have emerged. However, most of these approaches ignore the importance of semantic information in image fusion, resulting in the generation of fused images that do not perform well enough in human visual perception and advanced visual tasks. To address this problem, we propose a semantic prior knowledge-driven infrared and visible image fusion method. The method utilizes a pre-trained semantic segmentation model to acquire semantic information of infrared and visible images, and drives the fusion process of infrared and visible images through semantic feature perception module and semantic feature embedding module. Meanwhile, we divide the fused image into each category block and consider them as components, and utilize the regional semantic adversarial loss to enhance the adversarial network generation ability in different regions, thus improving the quality of the fused image. Through extensive experiments on widely used datasets, the results show that our approach outperforms current leading algorithms in both human eye visualization and advanced visual tasks.

Abstract:
Site selection aims to select optimal locations for new stores, which is crucial in business management and urban computing. The early data-driven models heavily relied on feature engineering, which could not effectively model the complex relationships and diverse influences among different data. To alleviate such issues, the knowledge-driven paradigm is proposed based on urban knowledge graphs (KGs). However, the research on them is at an early stage. They omit extra multi-modal information corresponding to brands and stores due to two main challenges, i.e., (1) building available datasets, and (2) designing effective models. It constrains the expressive ability and practical value of previous models. To this end, we first construct new multi-modal urban KGs for site selection with three extra modal (i.e., visual, textual, and acoustic) attributes. Then, we propose a novel multi-modal knowledge-driven model (MGKsite). Concretely, a graph neural network (GNN) based fusion network is designed to fuse the features based on the attribute K-Nearest Neighbor (KNN) graph, which models both intra and inter-modal correlations among the features. The fused embeddings are further injected into the knowledge-driven backbones for learning and inference. Experiments prove promising capacities of MGKsite from five aspects, i.e., superiority, effectiveness, sensitivity, transferability and complexity.
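
The fusion network above operates on an attribute K-Nearest Neighbor (KNN) graph. As a rough illustration of how such a graph can be built from attribute feature vectors, here is a minimal sketch. It is not the MGKsite implementation: the function name `knn_graph`, the Euclidean distance, the directed adjacency-list output, and the toy features are assumptions.

```python
import math

def knn_graph(features, k):
    """Build a directed KNN graph over feature vectors.

    Returns a dict mapping each node index to the indices of its k nearest
    neighbours under Euclidean distance (the node itself is excluded).
    """
    graph = {}
    for i, feat_i in enumerate(features):
        # Sort all other nodes by distance to node i; ties break on index.
        ranked = sorted(
            (math.dist(feat_i, feat_j), j)
            for j, feat_j in enumerate(features)
            if j != i
        )
        graph[i] = [j for _, j in ranked[:k]]
    return graph

# Toy attribute embeddings: two tight clusters of two nodes each.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(features, k=1)
print(graph)
```

A GNN layer would then aggregate each node's features over these neighbour lists. Whether the graph is symmetrised or built per modality is a design choice this sketch leaves open.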

Abstract:
Multi-view clustering (MVC) can fuse the information of multiple views for robust clustering results, and two fusion strategies, early-fusion and late-fusion, are widely adopted. Although they have given rise to many MVC methods, two crucial problems remain: (1) early-fusion forces multiple views to share a consensus latent representation, which compounds the challenge of excavating view-specific diverse local information, and (2) late-fusion generates view-partitions independently and then integrates them in the following clustering procedure, where the two procedures cannot guide each other and lack necessary negotiation. In view of this, we propose a novel Graph Proxy Fusion (GPF) method to preserve and fuse view-specific local information concertedly in one unified framework. Specifically, we first propose anchor-based local information learning to capture view-specific local structural information in bipartite graphs; meanwhile, a view-consensus graph learned through a self-expressiveness-based proxy graph learning module is deemed a higher-order proxy; subsequently, the novel graph proxy fusion module integrally embeds all lower-order bipartite graphs in the higher-order proxy via higher-order correlation theory. As a novel fusion strategy, the proposed GPF efficiently investigates the valuable consensus and diverse information of multiple views. Experiments on various multi-view datasets demonstrate the superiority of our method.

Abstract:
3D human mesh recovery from single RGB images or monocular videos is a challenging task. The twist representation utilized in existing inverse kinematics-based methods fails to accurately describe the twisting posture when the estimated bone direction is imprecise. Additionally, supervising SMPL shape parameters has the issue of shape estimation overfitting due to limited training data. This often results in compromised bone lengths that subsequently impair the precision of joint positions. To address these issues, we propose a framework that breaks down both human pose and shape into finer components, effectively managing and minimizing errors within each component. The proposed framework integrates two key advancements: the advanced Ortho-Twist and Swing Representation (OTSR) and the Skeleton-Focused Shape Refinement (SFSR). OTSR offers a more sophisticated representation for limb rotations compared to the traditional twist angle and swing representation to enhance the accuracy of twisting posture estimation. SFSR refines the estimated SMPL shape parameters by fitting bone lengths using the estimated joint positions, thereby significantly mitigating shape overfitting and enhancing joint position accuracy in the recovered mesh. We conduct experiments on the Human3.6M and 3DPW datasets. The results demonstrate the superiority of the proposed framework in both single-image and video scenarios. Additionally, the ablation studies confirm the effectiveness of our proposed modules, and further generalizability experiments demonstrate that our two key advancements can serve as plug-and-play modules to enhance existing methods.

Abstract:
In recent studies of 3D shape modelling and reconstruction, the focus has primarily been on the 3D face region. However, accurately creating the entire 3D head opens up a wide range of applications, including headwear design, cranial diagnosis, and avatar design. Therefore, we present our newly developed method of constructing 3D comprehensive morphable models (3DCMM) specifically tailored for human heads, along with a novel 3DCMM-based stepwise pipeline for creating accurate full 3D heads. Within our 3DCMM framework, we constructed a powerful 3D morphable face model with UV-UNet to generate the 3D face and predict the 3D scalp, resulting in a complete representation of the head. Additionally, our 3DCMM-based self-learning approach incorporates novel facial boundary-aware and structure-aware losses for highly accurate overall reconstructions of the entire facial region. Experimental evaluations demonstrate that our 3DCMM exhibits superior face representation power and achieves higher head prediction accuracy than existing models. Consequently, our 3DCMM-based 3D head creation method from a single image demonstrates outstanding performance capability on both face and head benchmarks.

Abstract:
Data augmentation (DA) is an effective approach for enhancing model performance in tasks with limited data, such as light field (LF) image super-resolution (SR). LF images inherently possess rich spatial and angular information. Nonetheless, there is a scarcity of DA methodologies explicitly tailored for LF images, and existing works tend to concentrate solely on either the spatial or angular domain. This paper proposes a novel spatial and angular DA strategy named MaskBlur for LF image SR by concurrently addressing spatial and angular aspects. MaskBlur consists of two components: spatial blur and angular dropout. Spatial blur is governed by a spatial mask, which controls where pixels are blurred, i.e., pasting pixels between the low-resolution and high-resolution domains. The angular mask is responsible for angular dropout, i.e., selecting which views to perform the spatial blur operation on. By doing so, MaskBlur enables the model to treat pixels differently in the spatial and angular domains when super-resolving LF images rather than blindly treating all pixels equally. Extensive experiments demonstrate the efficacy of MaskBlur in significantly enhancing the performance of existing SR methods. We further extend MaskBlur to other LF image tasks such as denoising, deblurring, low-light enhancement, and real-world SR.
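
The two masks described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code. It assumes that each view is a 2D list of pixel values, that `blurred_views` holds the low-resolution views already resized to the high-resolution grid, and that both masks are boolean. All names are hypothetical.

```python
def mask_blur(hr_views, blurred_views, spatial_mask, angular_mask):
    """Paste blurred pixels into selected views at selected positions.

    hr_views      : list of views, each a 2D list of pixel values
    blurred_views : same layout, holding the blurred counterpart of each view
    spatial_mask  : 2D list of booleans, True where the blurred pixel is pasted
    angular_mask  : list of booleans, True for views that receive spatial blur
    """
    out = []
    for hr, blurred, use_view in zip(hr_views, blurred_views, angular_mask):
        if not use_view:
            # Angular dropout: this view is left untouched.
            out.append([row[:] for row in hr])
            continue
        mixed = [
            [b if m else h for h, b, m in zip(hr_row, bl_row, m_row)]
            for hr_row, bl_row, m_row in zip(hr, blurred, spatial_mask)
        ]
        out.append(mixed)
    return out


# Two 2x2 views; only the first view is selected by the angular mask.
hr = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
blurred = [[[9, 9], [9, 9]], [[8, 8], [8, 8]]]
spatial = [[True, False], [False, True]]
angular = [True, False]
print(mask_blur(hr, blurred, spatial, angular))
```

In the toy call, the first view mixes blurred and sharp pixels according to the spatial mask, while the second view passes through unchanged because the angular mask drops it.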

Abstract:
In recent years, we have observed significant advancements in learning-based techniques for image-enhancement tasks. However, most of the existing methods are either purely based on image-to-image convolutional neural networks, which cannot handle high-resolution images in real-time, or resort to 3D Lookup Tables, which fall short of local tone adjustments. In this paper, we rethink affine transform through a color space perspective, and then propose AttnBL (Attentional Bilateral Grid Learning), a novel hybrid image enhancement algorithm to process ultra-high-definition images in real-time. Our algorithm consists of two paths, the low-resolution chroma prediction path that aims to learn the chroma coefficients and the full-resolution luma adaptation path that aims to preserve brightness details. Specifically, we propose a carefully designed hierarchical transformer to capture the global information in an efficient way and introduce a feature extraction module to adaptively learn a luma guidance for bilateral upsampling. Our algorithm can process a 4K-resolution image in 20 milliseconds. This efficiency provides a practical solution for high resolution real-time preview. Without bells and whistles, our model outperforms previous state-of-the-art methods on two well-known datasets in image enhancement tasks both quantitatively and qualitatively. Our analysis also provides some interesting findings that may enlighten further studies.

Abstract:
Open set recognition (OSR) aims to identify whether a test sample belongs to a semantic class in the classifier training set. Existing OSR methods exhibit prominent performance on various image datasets. However, they are primarily designed for general object recognition rather than more complex camouflaged object recognition. When an object is camouflaged, i.e., it exhibits a similar pattern to the background, it is difficult to finely identify it and differentiate between known and unknown categories. To address this problem, we propose a novel multi-granularity context perception network (MCPNet) for OSR of camouflaged objects, which can accurately identify camouflaged objects by fusing coarse-grained and fine-grained context features. In MCPNet, the vision transformer is first utilized to extract coarse-grained context features to locate the approximate location of camouflaged objects. Then, an adaptive local focus module (ALFM) is proposed to pick out the most discriminative regions and learn the fine-grained context of these regions. Finally, multi-granular context features are fused to obtain recognition results. During training, a contrastive clustering module (CCM) is introduced to guide the network to effectively utilize multi-granularity context to generate high-confidence decision boundaries. We also built two camouflaged object classification datasets named ACOC and NCOC, which mainly consist of artificial camouflage and natural camouflage respectively, to facilitate research in OSR of camouflaged objects. Experimental results on the two datasets show that MCPNet outperforms state-of-the-art methods.

Abstract:
RGB-D Salient Object Detection (SOD) aims to segment the most prominent areas and objects in a given pair of RGB and depth images. Most current models adopt a dual-stream structure to extract information from both RGB and depth images. However, this leads to an exponential increase in the number of parameters and computations in the model. Moreover, the discrepancy between the RGB-pretrained encoder and the 3D geometric relationships in depth maps presents a challenge for the encoder in capturing spatial structural details. These issues impact the model's accuracy in locating salient objects and distinguishing edge details. To address these issues, we propose a novel early feature fusion network, named FasterSal, which enables more efficient RGB-D SOD. FasterSal uses a single-stream structure to receive RGB images and depth maps, extracting features based on the 3D geometric relationships in the depth map while fully leveraging the pretrained RGB encoder. This approach effectively avoids the inconsistencies between the depth modality and the RGB-pretrained encoder. It also significantly reduces the number of network parameters while maintaining efficient feature encoding capabilities. To achieve finer edge learning, a detail-aware loss and a texture enhancement module are introduced. These modules are designed to extract latent details in high-frequency component features and to enhance the edge learning capability of the model using distance information. Experimental results on several benchmark datasets confirm the effectiveness and superiority of our method over the state-of-the-art approaches, achieving a good balance between performance and speed with only 3.4 million parameters and a CPU operating speed of 63 FPS.

Abstract:
Current methodologies in distributed source coding have predominantly investigated decoder-focused strategies, emphasizing the alignment and exploitation of side information. This study introduces a paradigm shift by presenting an encoder-centric algorithm that conducts proactive optimization in the frequency domain. This shift is motivated by the current deep learning models' tendency to passively extract high-frequency elements, such as contours and content in the spatial domain at the encoder side, without considering the frequency characteristics of these spatial components. Unlike current trends, the proposed scheme actively selects the essential frequency components directly in the frequency domain by introducing an adaptive self-learning filter, enabling the encoder to discern and retain critical frequency components effectively and precisely. Furthermore, we align the side information in the spatial domain before feature extraction and implement an affine transformation-based alignment strategy to utilize the side information better. By leveraging the shared frequency domain components of the image pairs, the proposed algorithm adeptly learns affine coefficients to accomplish precise spatial alignment. This dual strategy of proactive encoder optimization and decoder alignment via affine transformations is highly efficient, outperforming existing state-of-the-art methods in distributed source coding when tested across two diverse datasets by an average of 0.5 dB in PSNR.

Abstract:
Extensive studies have revealed that deep neural networks (DNNs) are vulnerable to adversarial attacks, especially black-box ones, which can heavily threaten the DNNs deployed in the real world. Many attack techniques have been proposed to explore the vulnerability of DNNs and further help to improve their robustness. Despite the significant progress made recently, existing black-box attack methods still suffer from unsatisfactory performance due to the vast number of queries needed to optimize desired perturbations. Besides, the other critical challenge is that adversarial examples built in a noise-adding manner are abnormal and struggle to successfully attack robust models, whose robustness is enhanced by adversarial training against small perturbations. There is no doubt that these two issues mentioned above will significantly increase the risk of exposure and result in a failure to dig deeply into the vulnerability of DNNs. Hence, it is necessary to evaluate DNNs' fragility sufficiently under query-limited settings in a non-additional way. In this paper, we propose the Spatial Transform Black-box Attack (STBA), a novel framework to craft formidable adversarial examples in the query-limited scenario. Specifically, STBA introduces a flow field to the high-frequency part of clean images to generate adversarial examples and adopts the following two processes to enhance their naturalness and significantly improve the query efficiency: a) we apply an estimated flow field to the high-frequency part of clean images to generate adversarial examples instead of introducing external noise to the benign image, and b) we leverage an efficient gradient estimation method based on a batch of samples to optimize such an ideal flow field under query-limited settings. Compared to existing score-based black-box baselines, extensive experiments indicated that STBA could effectively improve the imperceptibility of the adversarial examples and remarkably boost the attack success rate under query-limited settings.

Abstract:
Alignment between the food images and the corresponding recipes is an emerging cross-modal representation learning task. In this task, the recipes are composed of three components, i.e., food title, ingredient lists, and cooking instructions, which require a fine-grained alignment between the features of the two modalities. Existing methods usually aggregate the recipes into global embeddings and then align them with the global image embeddings. Meanwhile, semantic classification is frequently used in these methods to regularize the embeddings of the two modalities. While these methods are efficient, there remain two problems. (1) Forcing the alignment between the global images and recipes embeddings may result in losing the component-specific information. (2) The high diversity of food appearance leads to high uncertainty in the semantic classification of food images and recipes. To solve these problems, we propose a Fine-grained Prompting and Alignment (FPA) model to enhance the feature extraction and bring more component-specific information for fine-grained alignment. Furthermore, to regularize the semantic information contained in the cross-modal features, we design an Evidential Semantic Consistency (ESC) loss to keep the cross-modal semantic consistency. We have conducted comprehensive experiments on the benchmark dataset Recipe1M and the state-of-the-art results on the cross-modal recipe retrieval task demonstrate the effectiveness of our method.

Abstract:
Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Additionally, direct calculation of the global similarity of image-recipe pairs is not particularly effective in terms of latent alignment, which suffers from mismatch during the mutual image-recipe retrieval process. This paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework to maintain the multi-granularity of food images and the multi-levels of textual recipes to address the aforementioned challenges. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We simultaneously propose a multi-grained relation-aware attention (MRA) embedded in the multi-grained interaction encoder to capture multi-grained food visual features. The multi-grained interaction similarity scores are calculated to better establish the multi-grained correlation between recipe and image entities based on the extracted hierarchical textual and multi-grained visual features. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the whole process of cross-modal training using the multi-grained interaction similarity scores. Extensive qualitative and quantitative experiments on the Recipe1M dataset have demonstrated that the proposed TEI framework achieves multi-grained semantic alignment between image and text modalities and is superior to other state-of-the-art methods in cross-modal food retrieval tasks.

Abstract:
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which results in prolonged training durations and complicates the streamlined framework. To this end, we develop a pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives a 3D point cloud as input and projects 3D features fused with text instructions into a manageable set of tokens. Considering the computation burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs; for instance, it exceeds its counterparts by 7.1% CIDEr on ScanQA while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160K benchmark are available at 3UR-LLM.

Abstract:
Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to a given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions focus mainly on mining one or two of these cues, which leads to inferior performance. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternating way that makes comprehensive exploitation of all three cues possible. During each update, SAE generates a cross-modality and temporal-aware vector that guides the vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature provides the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping the spatial semantics enhancement steps, and employ the variant in the early stages of the vision encoder to further enhance usability. To integrate features of different scales, we propose a Bidirectional Semantic Aggregation decoder (BSA) to obtain the referent mask. The BSA arranges the comprehensively enhanced features into two groups, and then employs a difference awareness step to achieve intra-group feature aggregation bidirectionally and a consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.

Abstract:
Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first involves the identification of active objects interacting with humans from the numerous background objects, while the second challenge is long-tailed distribution among predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block for enhancing object embeddings with motion cues alongside an object temporal encoder to capture temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of long-tailed distribution. With the enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, setting new benchmarks in the field by offering enhanced accuracy and unbiased scene graph generation.

Abstract:
In the tasks of pose estimation and behavior analysis in computer vision, conventional models are often constrained by various factors or complex environments (such as multiple targets, small targets, occluded targets, etc.). To address this problem, this paper proposes a symmetric cascaded additive network (MulAG) to improve the accuracy of posture estimation and behavior analysis in complex environments. MulAG consists of two modules, MulA and MulG. The MulA module is designed based on a cascaded symmetric network structure and incorporates the addition operation. MulA extracts the posture spatial features of the target from a single frame image. The MulG module is designed based on three consecutive GRUs (gated recurrent units). Building on MulA, MulG extracts the posture temporal features from the posture spatial features of the moving target and predicts the posture temporal features of the moving target. The paper first demonstrates the feasibility of addition operations in pose estimation tasks by comparing with MobileNet-v3 in ablation experiments. Second, on the HiEve and CrowdPose datasets, MulA achieves accuracy of 79.6% and 80.4%, respectively, outperforming the PTM model by 12.0% and 21.2%. Detection speed of MulA achieves the best value at 8.6 ms, which is 1 times higher than HDGCN. The result demonstrates the effectiveness of MulA in multi-target pose estimation in complex scenes. Finally, on the HMDB-51 and UCF-101 datasets, MulAG achieves accuracy of 74.8% and 86.3%, respectively, outperforming HDGCN by 9.6% and 9.5%. Compared with SKP and GIST, the fps of MulAG (44.8 s−1) is improved by 8.2% and 8.9%. These experiments highlight the generalizability and superiority of MulAG in behavior analysis and pose estimation tasks.

Abstract:
It is challenging for crowd counting models to generalize to new scenes due to domain shifts in training and test data. Although domain adaptation approaches have made notable progress in bridging the domain gap, they require target domain data. In this paper, we propose a novel framework for cross-scene crowd counting, which unifies domain generalization and adaptation. For domain generalization, we train a model only using single-domain data and the model can be generalized to any scene with satisfying performance. Regarding domain adaptation, we use both source and target domain data to further improve the performance. We first design a generation network that diversifies the generated samples to cover the unseen target domains as much as possible by minimizing mutual information. This approach simulates training data in various domains, thereby enhancing the model's generalization ability. Then we develop a pixel-wise supervised contrastive loss function that pulls the human heads in the source images and generated images closer to each other and pushes them further away from the background. This loss helps extract a domain-invariant feature representation, thus improving the model's generalization ability. Moreover, if information about the target domain is available, our generalization method can be easily applied as an adaptation method by replacing the mutual information minimization loss with the mutual information maximization loss. This can further improve cross-scene crowd counting performance. The experimental results demonstrate the strong generalizability of our method across different datasets.

Abstract:
Multi-label image recognition with convolutional neural networks has achieved remarkable progress in the past few years. However, most existing multi-label image recognition methods suffer from the long-tailed data distribution problem, i.e., head categories occupy most training samples, while tailed classes have few samples. This work firstly studies the influence of long-tailed data distribution on existing multi-label image recognition methods. Based on this, two crucial issues of the existing methods are identified: 1) severe gradient imbalance between head and tailed categories, even though re-balancing strategies are adopted; 2) the lack of diversity of tail category training samples. To tackle the first issue, this paper proposes a group sampling strategy to create group-wise balanced data distribution. Meanwhile, a dynamic gradient balancing loss is proposed to equalize the gradient for all categories. To tackle the second issue, this paper proposes a diversity enhancement module to fuse the information across all categories, preventing the network from overfitting tail classes. Furthermore, it also balances the gradient, promoting the discriminability of learned classifiers. Our method significantly outperforms the baseline method and achieves competitive performance with state-of-the-art methods on VOC-LT and COCO-LT datasets. Extensive ablation studies are conducted to verify the effectiveness of the essential proposals.

Abstract:
Recent advances in text-to-image synthesis have captivated audiences worldwide, drawing considerable attention. Although significant progress has been made in generating photo-realistic images through large pre-trained autoregressive and diffusion models, these models face three critical constraints: (1) the requirement for extensive training data and numerous model parameters; (2) an inefficient, multi-step image generation process; and (3) difficulties in controlling the output visual features, requiring complexly designed prompts to ensure text-image alignment. Addressing these challenges, we introduce the CLIP-GAN model, which innovatively integrates the pretrained CLIP model into both the generator and discriminator of the GAN. Our architecture includes a CLIP-based generator that employs visual concepts derived from CLIP through text prompts in a feature adapter module. We also propose a CLIP-based discriminator, utilizing CLIP's advanced scene understanding capabilities for more precise image quality evaluation. Additionally, our generator applies visual concepts from CLIP via the Text-based Generator Block (TG-Block) and the Polarized Feature Fusion Module (PFFM), enabling better fusion of text and image semantic information. This integration within the generator and discriminator enhances training efficiency, enabling our model to achieve evaluation results not inferior to large pre-trained autoregressive and diffusion models, but with a 94% reduction in learnable parameters. CLIP-GAN aims to achieve the best efficiency-accuracy trade-off in image generation given a limited resource budget. Extensive evaluations validate the superior performance of the model, demonstrating faster image generation speed and the potential for greater stylistic diversity within the GAN model, while still preserving its smooth latent space.

Abstract:
The investigation for incomplete multi-modal 3D shape clustering is evolving as a promising task for the field of recognizing massive unlabeled 3D shapes. As two widely adopted 3D shape modalities, point clouds and multiple views not only exhibit rich intra-modal correlations but also encompass complementary structures and appearances of 3D shapes. By effectively modeling the intra-modal and inter-modal correlations, this paper proposes a novel incomplete multi-modal 3D shape clustering method to reveal the underlying clustering associations from incomplete multi-modal 3D shapes. In detail, a similarity-transferred feature prediction module is presented to recover the features of missing instances within one modality with the assistance of similarity exploring from another modality. Then, an intra-to-inter progressive feature fusion module is designed to mine the correlations within the modality as well as between different modalities, thereby obtaining comprehensive 3D shape features for clustering. Extensive experiments on two public 3D shape datasets have demonstrated that the proposed method has achieved promising clustering results under different missing rates.

Abstract:
Recent research has shown that deep learning networks are vulnerable to adversarial samples. Although there has been great progress in the study of adversarial attacks on images, there is relatively little research on adversarial attacks in the video domain, especially on intrinsic factors of videos, such as motion blur. In this paper, we devise a novel Grad-Weighted based One-step Motion Blur Attack (GWO-MBA) and a Discrete-Fusion based Progressive Motion Blur Attack (DFP-MBA) for video recognition, starting from the idea of integrating global adversarial attacks and adversarial patch attacks. Concretely, we use gradient maps to filter motion blur and fuse it in a weighted manner (termed GWO-MBA) to achieve an attack that matches the motion information in the context of the video. In order to make the generated motion blur attack perturbations more natural and improve the attack success rate, we further introduce a progressive decomposition motion blur strategy (termed DFP-MBA) to progressively fuse more realistic discrete motion blurs. Besides, we propose an Aggressive Motion Blur Generation (AMBG), which generates natural motion blur based on the video context and has a better attack effect. Extensive experiments on the HMDB-51 and UCF-101 datasets demonstrate the effectiveness and superiority of our proposed attack method. In addition, the attack effectiveness against the mainstream denoising defense model and the deblur model further validates the robustness of our attack method.

Abstract:
Human motion transfer aims at animating a static source image with a driving video. While recent advances in one-shot human motion transfer have led to significant improvement in results, it remains challenging for methods with 2D body landmarks, skeleton and semantic mask to accurately capture correspondences between source and driving poses due to the large variation in motion and articulation complexity. In addition, the limited accuracy and precision of DensePose degrade the image quality for neural-rendering-based methods. To address these limitations, and considering the importance of both appearance and geometry for motion transfer, we propose a unified framework that combines multi-scale feature warping and neural texture mapping to recover better 2D appearance and 2.5D geometry, partly by exploiting the information from DensePose, yet adapting to its inherently limited accuracy. Our model takes advantage of multiple modalities by jointly training and fusing them, which allows it to learn robust neural texture features that cope with geometric errors as well as multi-scale dense motion flow that better preserves appearance. Experimental results with full and half-view body video datasets demonstrate that our model can generalize well and achieve competitive results, and that it is particularly effective in handling challenging cases such as those with substantial self-occlusions.

Abstract:
In the field of multi-view multi-label learning, the challenges of incomplete views and missing labels are prevalent due to the complexity of manual labeling and data acquisition errors. These challenges significantly reduce the quality of latent representations and hinder prediction by multi-label classification. To address this issue, we propose a novel Category-driven Semi-supervised Contrastive Recovery (CSCR) framework in this study. Our framework aims to fully integrate existing label information into incomplete representation learning and classification. Specifically, to address the limitations posed by incomplete views and labels, we construct a label coincidence matrix based on existing labels, which serves as a similarity matrix in subsequent semi-supervised contrastive learning and multi-view classification. By leveraging this matrix, we design a semi-supervised multi-view contrastive learning module, which constructs sample pairs on the basis of inter-view correspondences and label similarity. It learns discriminative latent representations without the need for data augmentation. A weighted multi-label classification module is subsequently employed to integrate the predictions from each view to obtain the final classification result. Experimental evaluations on five challenging datasets demonstrate the superiority of our model over existing state-of-the-art methods.
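
One plausible reading of a label coincidence matrix built from existing labels is sketched below. The abstract does not give the exact definition, so this is an assumption: entry (i, j) counts the classes that are observed as positive for both samples, and missing labels (encoded here as `None`) never contribute. The function name and the encoding are hypothetical.

```python
def label_coincidence(labels):
    """Count shared observed-positive labels for every pair of samples.

    labels[i][c] is 1 (positive), 0 (negative), or None (label missing).
    Returns an n x n symmetric matrix as a list of lists.
    """
    n = len(labels)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = sum(
                1 for a, b in zip(labels[i], labels[j]) if a == 1 and b == 1
            )
    return matrix


# Three samples, three classes, with some labels missing.
labels = [
    [1, 0, None],
    [1, 1, 0],
    [None, 1, 1],
]
for row in label_coincidence(labels):
    print(row)
```

A matrix of this kind could serve as the pairwise similarity that decides which sample pairs count as positives in contrastive learning, which matches the role the abstract assigns to it.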

Abstract:
Continual learning (CL) aims to enable deep neural networks (DNNs) to learn new data without forgetting previously learned knowledge. The key to achieving this goal is to avoid confusion at the feature level, i.e., to avoid confusion within old tasks and between new and old tasks. Existing prototype-based CL methods generate pseudo features for old knowledge replay by adding Gaussian noise to the centroids of old classes. However, the distribution in the feature space exhibits anisotropy during the incremental process, which prevents the pseudo features from faithfully reproducing the distribution of old knowledge in the feature space, leading to confusion at the classification boundaries within old tasks. To address this issue, we propose the distribution-level memory recall (DMR) method, which uses a Gaussian mixture model to precisely fit the feature distribution of old knowledge at the distribution level and generate pseudo features in the next stage. Furthermore, resistance to confusion at the distribution level is crucial for multimodal learning. Multimodal imbalance, which refers to uneven optimization processes among encoders of different modalities, results in significant differences in feature responses between modalities; this exacerbates confusion within old tasks in prototype-based CL methods. Therefore, we mitigate the multimodal imbalance problem by using the intermodal guidance and intramodal mining (IGIM) method to guide weaker modalities with prior information from dominant modalities and further explore useful information within modalities. To avoid confusion between new and old tasks, we propose using the confusion index to quantitatively describe a model's ability to distinguish between new and old tasks, and we use the incremental mixup feature enhancement (IMFE) method to enhance pseudo features with new sample features, alleviating classification confusion between new and old knowledge. We conduct extensive experiments on the CIFAR100, ImageNet100, TinyImageNet, ImageNet-1K and UESTC-MMEA-CL datasets and achieve state-of-the-art results.
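
The replay idea above can be illustrated with a small sketch. It is not the authors' DMR or IMFE code. It assumes the mixture parameters for an old class have already been fitted (the fitting step is omitted), uses diagonal covariances for brevity, and applies a standard mixup interpolation with a fixed coefficient. All names and parameter values are hypothetical.

```python
import random


def sample_gmm(weights, means, stds, n, rng):
    """Draw n pseudo features from a diagonal-covariance Gaussian mixture.

    weights : mixing weight of each component
    means   : per-component mean vectors
    stds    : per-component, per-dimension standard deviations
    """
    samples = []
    for _ in range(n):
        # Pick a component according to its weight, then sample each dimension.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return samples


def mixup(pseudo, new, lam):
    """Interpolate a pseudo feature with a new-sample feature."""
    return [lam * p + (1.0 - lam) * q for p, q in zip(pseudo, new)]


rng = random.Random(0)
# Two anisotropic components standing in for one old class.
pseudo = sample_gmm(
    weights=[0.7, 0.3],
    means=[[0.0, 0.0], [3.0, 1.0]],
    stds=[[1.0, 0.1], [0.2, 0.8]],
    n=4,
    rng=rng,
)
new_feature = [5.0, 5.0]
enhanced = [mixup(p, new_feature, lam=0.8) for p in pseudo]
print(enhanced)
```

Compared with adding isotropic noise to a single class centroid, a mixture with per-dimension variances can follow an anisotropic feature distribution, which is the limitation of prototype-based replay that the abstract points out.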

Abstract:
In recent years, point-based methods have achieved promising performance on the 3D object detection task. Although effective, they still suffer from the inherent sparsity of point clouds, which makes it challenging to distinguish objects from backgrounds when relying only on the raw point view. To this end, we propose a straightforward yet effective multi-view fusion network termed RAFDet to alleviate this issue. The core idea of our method lies in combining the merits of raw points and their range view to enhance representation learning for sparse point clouds, thus mitigating the sparsity problem and boosting the detection performance. In particular, we introduce a novel bidirectional attentive fusion module to equip sparse points with interacted fine-grained semantic clues during the feature learning process. Then, we devise the range-view augmented fusion module to fully exploit the supplementary relationship between different perspectives with the aim of enhancing original point-view features. In the end, a single-stage detection head is utilized to predict the final 3D bounding boxes based on the enhanced semantics. We have evaluated our method on the popular KITTI Dataset, DAIR-V2X Dataset and Waymo Open Dataset. Experimental results on the above three datasets demonstrate the effectiveness and robustness of our approach in terms of detection performance and model complexity.

Abstract:
The study adopted a human-centered perspective to research the financial markets, focusing on identifying variations in eye movement patterns between professional and non-professional traders as they analyze a series of stock charts. Eye movement data was selected as the analysis target based on the hypothesis that it represents a behavioral phenotype indicative of stock analysts' cognitive processes during market analysis. Disparities were identified by conducting variance analysis and the Wilcoxon signed-rank test on statistical metrics derived from eye fixations and saccades. Psychological and behavioral economic interpretations were provided to understand the underlying reasons for these observed patterns. To showcase the practical application potential of the human-centered perspective, eye movement data and human visual characteristics were used to construct visual saliency prediction models of professional stock analysts. Leveraging this human-centered model, we developed two practical application demonstrations specifically designed to support and instruct novice traders. Based on the above demonstrations, a training program was designed that demonstrates how, with ongoing training, the non-professional traders' ability to observe stock charts improves progressively.

Abstract:
Depth cues are essential for visual perception tasks like Salient Object Detection (SOD). Due to varying depth reliability across scenes, some researchers propose evaluating the overall quality of the depth maps and discarding the less reliable ones to avoid contamination. However, these methods often fail to fully utilize valuable information in depth maps, leading to sub-optimal performance, particularly when the depth maps are unreliable. Since low-quality depth maps still contain useful information that potentially improves model performance, we propose a Depth Pixel-wise Potential-aware Network to leverage these depth cues effectively. This network includes two novel components: 1) a learning strategy that explicitly models the confidence of each depth pixel to assist the model in locating valid information in the depth map; and 2) a cross-modal adaptive multiple fusion module that fuses features from both RGB and depth modalities. The latter aims to mitigate the contamination effect of unreliable depth maps and fully exploit the benefits of multiple fusion strategies. Experimental results show that on four publicly available datasets, our method outperforms 17 mainstream methods on various evaluation metrics.

Abstract:
Recently, a large number of image compressive sensing (CS) methods with deep unfolding networks (DUNs) have been proposed. However, existing methods either use fixed-scale blocks for sampling that leads to limited insights into the image content or employ a plain convolutional neural network (CNN) in each iteration that weakens the perception of broader contextual prior. In this paper, we propose a novel DUN (dubbed SVASNet) for image compressive sensing, which achieves scale-variable adaptive sampling and hybrid-attention Transformer reconstruction with a single model. Specifically, for scale-variable sampling, a sampling matrix-based calculator is first employed to evaluate the reconstruction distortion, which only requires measurements without access to the ground truth image. Then, a Block Scale Aggregation (BSA) strategy is presented to compute the reconstruction distortion under block divisions at different scales and select the optimal division scale for sampling. To realize hybrid-attention reconstruction, a dual Cross Attention (CA) submodule in the gradient descent step and a Spatial Attention (SA) submodule in the proximal mapping step are developed. The CA submodule introduces inter-phase inertial forces in the gradient descent, which improves the memory effect between adjacent iterations. The SA submodule integrates local and global prior representations of CNN and Transformer, and explores local and global affinities between dense feature representations. Extensive experimental results show that the proposed SVASNet achieves significant improvements over the state-of-the-art methods.

Abstract:
This article proposes a novel multi-view (MV) video coding technique that leverages a four-dimensional (4D) voxel-grid representation to enhance coding efficiency, particularly in novel view synthesis. Although the voxel grid approximation provides a continuous representation for dynamic scenes, its volumetric nature requires substantial storage. The compression of MV videos can be interpreted as the compression of dense features. However, the substantial size of these features poses a significant problem relative to the generation of dynamic scenes at arbitrary viewpoints. To address this challenge, this study introduces a hierarchical coded representation of dynamic volumes based on low-rank tensor decomposition of volumetric features and develops effective coding techniques based on this representation. The proposed method employs a two-level coding strategy to capture the temporal characteristics of the decomposed features. At a higher level, spatial features are encoded, representing 3D structural information, with time-invariant components over short intervals of an MV video sequence. At a lower level, temporal features are encoded to capture the dynamics of current scenes. The spatial features are shared in a group, and temporal features are encoded at each time step. The experimental results demonstrate that the proposed technique outperforms existing MV video coding standards and current state-of-the-art methods, providing superior rate-distortion performance in the novel view synthesis of MV video compression.

Abstract:
Speech-driven 3D facial animation has emerged as a hot topic. During this process, movements in different facial regions are interdependent, influenced by the intricate interactions among facial muscles, and manifest personalized differences. Existing methods typically simplify the facial animation generation task to the deformation of an infinitely thin surface skin without an underlying structure, thereby ignoring the intricate and personalized dynamics of facial muscle activity. These methods tend to produce static or weak upper-face animations with an average facial movement style. In this work, we propose a novel framework, called DCPTalk, to mimic the intricate dynamics of facial muscle activity and portray personalized facial animations. Based on facial dynamic coupling properties, we propose Mouth2Face to simulate the facial muscle control system, yielding realistic and coordinated facial animations evoked by mouth movements. Mouth movements are easily synthesized from speech signals due to their direct correlation with phonetic articulation and vocal tract dynamics. To further enhance the detail of facial movements, we employ surface skin deformation to refine the facial animation derived from Mouth2Face. Furthermore, personal factors, including inherent physical traits and acquired speaking styles, directly determine the uniqueness and realism of facial animations. Inherent physical traits are embedded into Mouth2Face to construct a personalized facial muscle control system, while acquired speaking styles are employed to modulate external driving signals. Extensive qualitative and quantitative experiments as well as a user study indicate that DCPTalk outperforms the existing state-of-the-art methods.

Abstract:
In complex open scenes, multi-modality image fusion and segmentation encounter two challenges: i) Imaging misalignments, manifested as pixel shifts and structural distortions, are perceptible. ii) Human-crafted adversarial attacks, reflected in pixel distribution variations, are imperceptible. They not only degrade the visual quality of fused images (e.g., through noticeable edge ghosts) but, more critically, also undermine semantic perception. However, no existing work considers the coupled effect of these degradations. This paper proposes a One-Stop framework incorporating sequential task flows of “Registration-Fusion-Segmentation”, termed OS-RFS. Registration aims to mitigate the chained impact of misalignment on fusion and segmentation. We follow a coarse-to-fine registration paradigm and develop a Global-Local Incremental Registration (GLoIR) model, where the global shift registration (GSR) is performed initially for long-range pixel shifts, followed by incremental local deformation registration (LDR) for subtle local deformations. To improve segmentation robustness, we introduce auxiliary positive attacks and build a Cancellation Defense Strategy (CDS) in the fusion model. The CDS constrains the fusion model to fit fused images to the distribution of positive attacks, endowing fused images with a robust defense ability against adversarial attacks. This significantly mitigates the impact of adversarial attacks on semantic segmentation. Extensive experimental results reveal that our OS-RFS exhibits remarkable robustness in multi-modality image fusion and semantic segmentation against imaging misalignments and adversarial attacks.

Abstract:
Video moment retrieval (VMR) aims to localize a video segment in an untrimmed video that is semantically relevant to a language query. The challenge of this task lies in effectively aligning the intricate and information-dense video modality with the succinctly summarized textual modality, and further localizing the starting and ending timestamps of the target moments. Previous works have attempted to achieve multi-granularity alignment of video and query in a coarse-to-fine manner, yet these efforts still fall short in addressing the inherent disparities in representation and information density between videos and queries, leading to modal misalignments. In this paper, we propose a progressive video moment retrieval framework, initially retrieving the video clips most relevant and most irrelevant to the query as semantic guidance, thereby bridging the semantic gap between the video modality and the language modality. Furthermore, we introduce a pseudo-clip-guided aggregation module to aggregate densely relevant moment clips closer together and propose a discriminative boundary-enhanced decoder, guided by pseudo clips, to push semantically confusing proposals away. Extensive experiments on the Charades-STA, ActivityNet Captions and TACoS datasets demonstrate that our method outperforms existing methods.

Abstract:
Class-incremental semantic segmentation (CISS) aims to incrementally learn novel classes while retaining the ability to segment old classes, but suffers from catastrophic forgetting because the old-class labels are unavailable. Most existing methods impose strict constraints on the consistency between the extracted features or output logits of each pixel from old and current models in an attempt to prevent forgetting through knowledge distillation (KD), which 1) results in a significant transfer of redundant knowledge while limiting the restoration of old classes (rigidity) due to potentially overlooking essential knowledge extraction, and 2) imposes strong constraints at the pixel level, making it challenging for the model to learn novel classes (plasticity). To address these limitations, we propose a novel Spatial Visual and Statistical Relation Distillation (SVSRD) method that applies multi-scale visual and statistical position relation distillation to CISS and enjoys several merits. First, we introduce a region-based similarity matrix and impose a consistency constraint between current and old models, which preserves the essential visual knowledge to enhance the rigidity. Second, we propose a novel statistical feature calculation algorithm to investigate the distribution of the data and further preserve the rules of statistics through statistical consistency, which also facilitates novel-class learning and thus improves plasticity. Finally, the aforementioned constraints are jointly applied at multiple scales to alleviate old-class forgetting and enhance novel-class learning. Extensive experiments on Pascal-VOC 2012 and ADE20K demonstrate that the proposed approach performs favorably against the state-of-the-art CISS methods.

Affiliations: College of Intelligence and Computing and the Tianjin Key Laboratory of Advanced Network Technology and Application, Tianjin University, Tianjin, China; Shandong Provincial Key Laboratory of Ubiquitous Intelligent Computing and the School of Information Science and Engineering, University of Jinan, Jinan, China; Sports Big-Data Research Center, Wuhan Sports University, Wuhan, China; College of Intelligence and Computing and the Tianjin Key Laboratory of Machine Learning, Tianjin University, Tianjin, China

Abstract:
The inherent complexity of Wi-Fi signals makes video-aided Wi-Fi 3D pose estimation difficult. The challenges include the limited generalizability of the task across diverse environments, its significant signal heterogeneity, and its inadequate ability to analyze local and geometric information. To overcome these challenges, we introduce WiViPose, a video-aided Wi-Fi framework for 3D pose estimation, which attains enhanced cross-environment generalization through cross-layer optimization. Bilinear temporal-spectral fusion (BTSF) is initially used to fuse the time-domain and frequency-domain features derived from Wi-Fi. Video features are derived from a multiresolution convolutional pose machine and enhanced by local self-attention. Cross-modality data fusion is facilitated through an attention-based transformer, with the process further refined under a supervisory mechanism. WiViPose demonstrates effectiveness by achieving an average percentage of correct keypoints (PCK)@50 of 91.01% across three typical indoor environments.
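
For readers unfamiliar with the reported metric, the sketch below computes a percentage of correct keypoints (PCK) score: the fraction of predicted keypoints whose Euclidean distance to the ground truth falls within a threshold. How the threshold behind PCK@50 is defined in the paper is not stated in the abstract, so treating it as 50% of a reference length (the hypothetical `reference_length` below) is an assumption, as are the function name and the toy coordinates.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of keypoints whose Euclidean error is at most `threshold`."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)


if __name__ == "__main__":
    # Toy 3D keypoints; errors are 0, 1, 5 and 2 units respectively.
    pred = [(0, 0, 0), (1, 0, 0), (0, 5, 0), (0, 0, 0)]
    gt = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 2)]
    reference_length = 4.0  # assumption: a normalisation length such as torso size
    print(pck(pred, gt, threshold=0.5 * reference_length))  # prints 0.75
```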

Abstract:
Unsupervised domain adaptive semantic segmentation aims to transfer knowledge from the annotated source domain to the unlabeled target domain. Recently, self-training methods, which leverage high-confidence predictions in the target domain as pseudo labels for supervision, have gained substantial attention. However, limited exploration of intra-class variations across domains, including significant visual differences within each category, has led to misalignment between feature distributions across domains. In this article, we present a unified non-parametric distance-based online clustering method to efficiently maintain multiple centroid-based prototypes within each category subspace instead of one prototype for each category subspace, which enables prototypes to possess the capacity for richer feature representation. Then, considering the variance across different dimensions of a feature representation, we extend the prototypes from centroid-based ones to distribution-based ones. Specifically, each subspace is modeled using a Gaussian mixture model which includes several anisotropic Gaussian distributions, aimed at prioritizing discriminative dimensions and obtaining a finer measurement of the pixel-to-prototype similarity. Meanwhile, a category-aware feature space is achieved through pixel-to-prototype contrastive learning to ensure the compactness of pixel features in the same subcategory and drive the separation between pixel features of different subcategories. Moreover, multi-resolution features are utilized to promote diversity and robustness among intra-class prototypes. Experiments validate the competitiveness of our two prototype-based methods against existing state-of-the-art methods, with an mIoU of 76.8% on GTA → Cityscapes, 68.4% on Synthia → Cityscapes, 54.5% on Cityscapes → DarkZurich and 56.4% on Cityscapes → ACDC. Notably, our method is able to seamlessly integrate with existing UDA methods.
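
The idea of maintaining several centroid-based prototypes per category through distance-based online clustering can be sketched as follows. This is a minimal sketch rather than the paper's algorithm: the fixed `radius` threshold, the running-mean update, and the function name are assumptions, and the distribution-based (Gaussian mixture) extension is not shown.

```python
import math


def online_cluster(features, radius):
    """Distance-based online clustering for the features of one category.

    Each incoming feature joins the nearest centroid if it lies within `radius`;
    otherwise it opens a new centroid. The number of prototypes is therefore not
    fixed in advance. Assumption: centroids are updated as running means.
    """
    centroids, counts = [], []
    for f in features:
        best, best_d = None, float("inf")
        for i, c in enumerate(centroids):
            d = math.dist(f, c)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= radius:
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], f)]
        else:
            centroids.append(list(f))
            counts.append(1)
    return centroids, counts


if __name__ == "__main__":
    # Toy 2-D features forming two visually distinct sub-groups of one category.
    feats = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.0, 5.2)]
    print(online_cluster(feats, radius=1.0))
```

With a single prototype per category, the two sub-groups in the toy data would be averaged into one centroid that sits far from both, which is the kind of intra-class variation the abstract targets.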

Abstract:
A light field image records rich information of a scene from multiple views, thereby providing complementary information for occlusion removal. However, current occlusion removal methods have several issues: 1) inefficient exploitation of spatial and angular complementary information among views; 2) undifferentiated treatment of pixels from foreground occlusion and background; and 3) insufficient exploration of spatial detail supplementation. Therefore, in this article, we propose a mask-aware de-occlusion network (MANet). Specifically, MANet is a joint training network that integrates the occlusion mask predictor (OMP) and the occlusion remover (OR). First, OMP is proposed to provide the location of occluded regions for OR, as the occlusion removal task is ill-posed without occluded region localization. In OR, we introduce gated spatial-angular feature aggregation, which uses a soft gating mechanism to focus on spatial-angular interaction features in non-occluded regions, extracting effective aggregated features specific to de-occlusion. Then, we design a complementary strategy to fully utilize spatial-angular information among views. Finally, we propose texture-semantic attention to improve the performance of detail generation. Experimental results demonstrate the superiority of MANet, with substantial improvements in both PSNR and SSIM metrics. Moreover, MANet stands out with a compact parameter count of 2.4M, making it a promising solution for real-world applications in public safety and security surveillance.

Abstract:
The detection of image tampering, specifically copy detection, is an important problem in many domains such as military, media, and public opinion outlets. Effective means of detecting such tampering are crucial in controlling the dissemination of false information. However, a major challenge in achieving high detection accuracy lies in the variability of the scale of the copied targets. To tackle this problem, we introduce an all-encompassing methodology called Cross-Scale Modeling and Alternating Refinement (CANet) to detect the genuine source and tampered region at the pixel level. CANet consists of three modules: the Cross-Scale Similar Region Detection (CS) module, the Edge-Supervised Tamper Region Detection (ET) module, and the Alternating Refinement (AR) module. The CS module extracts coarse similar region features by cross-scale correlation modeling, which can alleviate the scale gap between the source and tampered region. The obtained coarse similar region feature is refined by the AR module, in which we introduce the source and the tampered region as auxiliary information and employ a two-stage process that sequentially models their global feature representations. The tampered region used in the AR module is obtained from the ET module using edge supervision with a salient edge selection scheme, and the source region is generated by implicit modeling. We conducted experiments on the USC-ISI, CASIA v2.0, CoMoFoD, and MICC-F220 datasets separately. Results show that our method outperforms the state-of-the-art.

Abstract:
Deep video coding has paved a way to break through the performance bottleneck of reigning hybrid video coding. However, unlike hybrid video codecs, existing deep video codecs cannot offer both flexible rates and regulable complexities within one single codec, which limits their applications. In this article, we propose a Regulable Deep Video Codec (RDVC) to address the above issue. First, we propose an Adaptive Feature Compression (AFC) network that generates variable rates while ensuring Rate-Distortion (RD) performance. The network introduces a two-stage coarse-to-fine rate adjustment that can be controlled by a user-specified rate level. Second, we propose a Spatio-Temporal Feature Propagation (STFP) mechanism to provide high-quality reference information for AFC process. Third, we also utilize slimmable convolutional components in our framework to adjust decoding complexity constrained by user configuration. Experimental results demonstrate that RDVC can adjust the codec structure flexibly according to different user configurations while maintaining advanced performance. On average, it reduces the bit-per-pixel (bpp) by 9.35%/58.12% while maintaining the same PSNR/MS-SSIM as the reference software VTM-13.2.

Abstract:
Graph Contrastive Learning (GCL) plays a crucial role in multimedia applications due to its effectiveness in analyzing graph-structured data. Existing GCL methods focus on maximizing the agreement of node representations across different augmentations, which leads to the neglect of unique and complementary information in each augmentation. In this paper, we propose a fusion-based GCL model (FB-GCL) that learns fused representations to effectively capture complementary information from both the graph structure and node attributes. Our model consists of two modules: a graph fusion encoder and a graph contrastive module. The graph fusion encoder adaptively fuses the representations learned from the topology graph and the attribute graph. The graph contrastive module extracts supervision signals from the raw graph by leveraging both the pairwise relationships within the graph structure and the multi-label information from the attributes. Extensive experiments on seven benchmark datasets demonstrate that FB-GCL enhances performance in node classification and link prediction tasks. This improvement is especially valuable for multimedia data analysis, as integrating graph structure and attribute information is crucial for effectively understanding and processing complex datasets.

Abstract:
This study investigates the evaluation of multimedia quality models, focusing on the inherent uncertainties in subjective Mean Opinion Score (MOS) ratings due to factors like rater inconsistency and bias. Traditional statistical measures such as Pearson's Correlation Coefficient (PCC), Spearman's Rank Correlation Coefficient (SRCC), and Kendall's Tau (KTAU) often fail to account for these uncertainties, leading to inaccuracies in model performance assessment. We introduce the Constrained Concordance Index (CCI), a novel metric designed to overcome the limitations of existing metrics by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. Through comprehensive experiments across various domains including speech and image quality assessment, we demonstrate that CCI provides a more robust and accurate evaluation of instrumental quality models, especially in scenarios of low sample sizes, rater group variability, and restriction of range. Our findings suggest that incorporating rater subjectivity and focusing on statistically significant pairs can significantly enhance the evaluation framework for multimedia quality prediction models. This work not only sheds light on the overlooked aspects of subjective rating uncertainties but also proposes a methodological advancement for more reliable and accurate quality model evaluation.
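
The core idea described above, counting only pairs whose MOS confidence intervals do not overlap, can be sketched as a pairwise concordance computation. The abstract does not give the exact CCI formula, so the symmetric interval `[mos - ci, mos + ci]`, the handling of tied predictions (counted as not concordant), and the function name are assumptions made for illustration.

```python
def constrained_concordance(mos, ci, pred):
    """Concordance over stimulus pairs whose MOS confidence intervals do not overlap.

    mos[i]  : mean opinion score of stimulus i
    ci[i]   : half-width of its confidence interval (assumed symmetric)
    pred[i] : score predicted by the quality model under evaluation
    Returns None when no pair is statistically distinguishable.
    """
    concordant, total = 0, 0
    n = len(mos)
    for i in range(n):
        for j in range(i + 1, n):
            # Overlapping intervals: the MOS difference is not significant, skip the pair.
            if abs(mos[i] - mos[j]) <= ci[i] + ci[j]:
                continue
            total += 1
            # Concordant when the model orders the pair the same way as the MOS.
            if (mos[i] - mos[j]) * (pred[i] - pred[j]) > 0:
                concordant += 1
    return concordant / total if total else None


if __name__ == "__main__":
    mos = [1.0, 1.1, 3.0, 4.5]
    ci = [0.2, 0.2, 0.2, 0.2]
    pred = [2.0, 1.0, 3.0, 2.5]
    # The pair (0, 1) is excluded; 4 of the remaining 5 pairs are concordant.
    print(constrained_concordance(mos, ci, pred))  # prints 0.8
```

In the toy data the model ranks stimuli 0 and 1 in the wrong order, but because their MOS intervals overlap that disagreement is not penalised, whereas an unconstrained rank correlation would count it.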

Abstract:
Compressing high-resolution videos under low bitrate constraints is a challenging task. Resampling-based compression, which reduces the resolution before encoding and restores it after decoding, has great potential to improve the rate-distortion performance in such scenarios. In this paper, we propose a learning-based frame-level coding scale control scheme that enhances the coding performance by adjusting the coding scale for each frame. The scheme cooperates with the Reference Picture Resampling of the latest video coding standard Versatile Video Coding (VVC), which allows coding scale variations on each frame. More specifically, a dataset with 5200 videos is created using a greedy rate-distortion optimization algorithm that selects the optimal coding scale for each frame. A neural network-based decision model is further incorporated into VVC, learning to predict the coding scale for each frame in one pass. The scheme is implemented in the Fraunhofer Versatile Video Encoder (VVenC), a fast and efficient VVC encoder, and evaluated on 4K content. Experimental results show that the proposed scheme outperforms GOP-based coding scale adaptation methods, achieving average bitrate savings of 3.06% and 4.14% in terms of PSNR and MS-SSIM.
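
A greedy per-frame scale selection of the kind mentioned above can be sketched with the standard Lagrangian rate-distortion cost J = D + λR. This is a minimal sketch under stated assumptions, not the paper's procedure: each frame is treated independently (a real encoder has inter-frame dependencies through reference pictures), and the candidate scales, distortion and rate values, and function name are hypothetical.

```python
def pick_scales(candidates, lam):
    """Greedy per-frame choice of coding scale.

    candidates[t] maps each candidate scale to a (distortion, rate) pair measured
    for frame t. For every frame, the scale minimising J = D + lam * R is kept.
    Assumption: frames are evaluated independently of one another.
    """
    chosen = []
    for per_frame in candidates:
        best = min(per_frame, key=lambda s: per_frame[s][0] + lam * per_frame[s][1])
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical measurements: scale -> (distortion, rate in bits).
    frames = [
        {1.0: (10.0, 100.0), 0.5: (14.0, 40.0)},  # downscaling pays off at this lambda
        {1.0: (5.0, 50.0), 0.5: (30.0, 20.0)},    # full resolution is cheaper overall
    ]
    print(pick_scales(frames, lam=0.1))  # prints [0.5, 1.0]
```

Labels produced this way could serve as training targets for a decision model that predicts the scale in one pass instead of encoding every candidate.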

Abstract:
Multi-modal hashing achieves low storage costs and high retrieval speeds by using compact hash codes to represent complex and heterogeneous multi-modal data, effectively addressing the inefficiency and resource intensiveness challenges faced by the traditional multi-modal retrieval methods. However, balancing intraclass compactness and interclass separability remains a struggle in existing works due to coarse-grained feature limitations, simplified fusion strategies that overlook semantic complementarity, and neglect of the structural information within the multi-modal data. To address these limitations comprehensively, we propose a Proto-centric Multi-modal Hashing with Pronounced Category Differences (PMH-PCD) model. Specifically, PMH-PCD first learns modality-specific prototypes by deeply exploring within-modality class information, ensuring effective fusion of each modality's unique characteristics. Furthermore, it learns multi-modal integrated class prototypes that seamlessly incorporate semantic information across modalities to effectively capture and represent the intricate relationships and complementary semantic content embedded within the multi-modal data. Additionally, to generate more discriminative and representative binary hash codes, PMH-PCD integrates multifaceted semantic information, encompassing both low-level pairwise relations and high-level structural patterns, holistically capturing intricate data details and leveraging underlying structures. The experimental results demonstrate that, compared with existing advanced methods, PMH-PCD achieves superior and consistent performances in multi-modal retrieval tasks.

Abstract:
Video prediction is an important yet challenging task that generates future frames based on previous observations. Despite recent progress, existing methods still suffer from motion blur, due to weak motion perception capabilities leading to uncertainty in motion direction. To address this, we propose a Motion Direction Awareness (MDA) mechanism inspired by the direction-selective mechanism in animal visual systems. Specifically, MDA can decompose complex motions into horizontal and vertical components, allowing dimension reduction and independent processing, thereby effectively enhancing motion perception and reducing uncertainty in predicted motion directions. Based on MDA, we design a multi-scale feature fusion network named MDANet for video prediction, which incorporates different scales of spatially encoded features in conjunction with the MDA mechanism to extract the temporal evolution information of global and local spatial features. Extensive experiments on representative datasets demonstrate that MDANet can alleviate motion blurring, improving prediction accuracy and temporal consistency over state-of-the-art models. Furthermore, we validate the generalizability and effectiveness of our MDA mechanism by integrating it into other advanced models. The code is available in the supplementary material.

Abstract:
The process of fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) imagery, commonly referred to as pansharpening, is intended to generate high-resolution multispectral (HRMS) imagery. Most existing pansharpening frameworks emphasize the straightforward learning of the mapping from PAN and LRMS images to HRMS images. However, a key limitation of these frameworks is their potential overemphasis on spatial information, particularly the enhancement of low-frequency components. As a result, such an oversight potentially hinders the model's ability to simultaneously restore both spectral and spatial details. To address this issue, we propose a novel pansharpening model based on the denoising diffusion probabilistic model (DDPM), dubbed FrDiff. Specifically, we build a framelet-based conditional diffusion model that leverages the generative power of diffusion models to produce more refined results. Different from conventional methods that directly infer HRMS images, our strategy is designed to predict their framelet coefficients, utilizing the available PAN and LRMS images as resources. This approach enables the separation of high-frequency and low-frequency components through framelet transformation, which are subsequently recombined to create a novel set of conditional embeddings that feed into the diffusion process. At the same time, the predictive power of the diffusion model is exploited to simultaneously recover the high-frequency and low-frequency components of the HRMS image. Moreover, we introduce a framelet-oriented cross-attention module dedicated to honing spectral fidelity. This module is crucial for improving the spectral precision of the HRMS images, ensuring a balanced emphasis on both spatial and spectral enhancements.
Quantitative and qualitative experiments on multiple benchmark datasets demonstrate that the proposed method achieves more robust and higher-quality results than other state-of-the-art pansharpening methods.

Abstract:
Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules effectively balance the efficiency and effectiveness of feature extraction with flexible usage of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet significantly outperforms its variants and existing solutions for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical value for dietary management, with huge potential for transnational research and deployment, and could inspire other applications involving multiple data types with varying degrees of importance.

Abstract:
Three-dimensional reconstruction can help robots and vehicles understand their surroundings for subsequent navigation and manipulation tasks. However, in the case of target occlusion, it is difficult for visual sensors to acquire complete information about objects. In this work, we propose a progressive refinement generative adversarial network (PR-GAN) to recover object shapes guided by transition-awareness. This method directly predicts the missing point cloud from the partial point cloud. Our PR-GAN contains a progressive generation module (PGM) and a discriminator. A self-attention-based encoder is proposed in PGM to capture contextual information between local and global features. To guide encoders in generating accurate point clouds, we further propose a progressive fusion module (PFM) that extracts transition information between point clouds of different scales. Moreover, a part-whole correlation module (PWCM) is designed to extract the transition-awareness between the partial and the whole point clouds to further preserve the details. With the above modules, we enhance the spatial logic perception capability of the network so that PR-GAN can fully extract point cloud features and predict the high-fidelity point cloud. Experimental results show that PR-GAN performs better compared to other methods, evaluated on three public datasets.

Abstract:
The face swapping task has always been attractive for its wide range of applications. However, existing face swapping methods suffer from two main challenges: a) degraded generation fidelity due to insufficient facial texture information; b) inconsistent synthesized face structure due to the lack of effective forms of face structure supervision. To address the above issues, we propose a novel Texture and Structure Consistency Mining (TSCM) framework to achieve high-fidelity face swapping with rich textures and consistent facial structure. For one thing, a Dual-Scale Oriented Identity Transfer module is devised to globally transfer source identity to the well-disentangled target identity features at dual-level feature spaces, which achieves more efficient identity transfer and promotes target attribute texture preservation. Then, to compensate for the local facial textures, a Semantic-Guided Texture Enhancing module is developed by exploiting disentangled identity and attribute semantics to ensure local texture consistency. For another thing, different from previous methods that directly apply abstract 3D coefficients, a Structure-Aware Head Modeling module is designed to provide intuitive face structure supervision, which is adaptively integrated with local facial texture information in a self-learning manner. Moreover, a structure-consistency discriminator is introduced to effectively restrict the synthesized face structure consistency. Comprehensive experiments demonstrate that our TSCM yields a substantial advantage over the state-of-the-art methods in synthesizing texture- and structure-consistent swapped faces.

Abstract:
In satellite video object tracking, the individual frame analysis method is usually used for target localization, ignoring informative cues of the dynamic scene. Temporal information could contribute to identifying the target from distractors. In this work, a novel dynamic scene learning reasoning tracker is proposed for satellite videos, which reasons over temporal dynamic information to derive the target location. It is inspired by the tracking pattern through human perception and reasoning. First, static-dynamic united analysis is designed to construct dynamic scenes by concatenating the static searching results along the temporal dimension. Second, the information of each response object is aggregated by wavelet transforms. Meanwhile, these scenes are projected into low-frequency and high-frequency subspaces, which could imitate different levels of perceptions of humans for scenes. Third, an object-aware reasoning transformer is proposed to utilize the temporal dynamics of input response objects. In each subspace, it models the mutual interactions between dynamic objects and further learns the intrinsic property of each object for target reasoning. Finally, to obtain the current reasoning result, inverse wavelet transforms are utilized to integrate the results of low-frequency and high-frequency subspaces. The effectiveness of the proposed method is validated on three public satellite video datasets, including SV248S, SkySat, and VISO. Qualitative and quantitative experimental results show that the proposed tracker outperforms 22 popular approaches in seven challenging tracking satellite scenarios.
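
The split into low-frequency and high-frequency subspaces, followed by an inverse transform to merge the results, can be illustrated with a one-level Haar transform. This is a minimal sketch only: the abstract does not say which wavelet is used, so Haar is an assumption, and the `responses` sequence (one response score per frame) is hypothetical data.

```python
import math


def haar_forward(signal):
    """One-level Haar transform of an even-length sequence.

    Returns (low, high): scaled pairwise sums (coarse trend) and
    scaled pairwise differences (fine detail).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the original sequence from its low/high Haar coefficients."""
    s = math.sqrt(2.0)
    signal = []
    for l, h in zip(low, high):
        signal.append((l + h) / s)
        signal.append((l - h) / s)
    return signal


# Hypothetical response scores of one candidate object over eight frames.
responses = [0.2, 0.4, 0.9, 0.7, 0.3, 0.1, 0.8, 0.6]
low, high = haar_forward(responses)
restored = haar_inverse(low, high)
```

In this sketch any per-subspace processing would happen between `haar_forward` and `haar_inverse`; with no processing, the inverse returns the input up to floating-point rounding.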

Abstract:
In ophthalmology diagnosis, high-fidelity fundus images are essential for disease diagnosis and intervention. However, many real-world clinical conditions may degrade the quality of the acquired images and thus affect clinical diagnostic accuracy. Traditional convolutional neural network-based retinal fundus image enhancement methods cannot always capture long-range dependencies, which reduces the overall visual quality of images, especially for real retinal fundus images. Furthermore, existing enhancement methods often fail to fully utilize low-resolution structural detail information, which potentially leads to inaccurate pivotal fundus vessel topology or capillary details. In this paper, we propose a novel Structure-Aware Transformer-based attention fusion Network (SAT-Net) for low-quality retinal fundus image enhancement. First, we introduce a Transformer-based attention fusion module which incorporates window-based self-attention and channel self-attention to capture global spatial dependencies and emphasize important feature channels simultaneously. This fusion significantly improves the overall perceptual quality of the image by enhancing both the local details and the uniformity of the non-vessel background regions. Second, we introduce a cross-quality knowledge distillation technique, which bridges the quality gap between high-quality and low-quality fundus images. By designing a high-performing teacher network to guide a lightweight student network, the student network enables to capture detailed features from low-quality fundus images, further preserving critical diagnostic information and fine topology structures. Moreover, we design a structure-aware multi-scale loss function by using a trainable subnetwork to obtain the edge structure from different scales to better constrain pivotal fundus vessel structure and capillary details. 
Comprehensive quantitative and qualitative experiments on both synthetic and real fundus image datasets robustly validate that our proposed SAT-Net outperforms other state-of-the-art methods for fundus image enhancement. In addition, extensive comparative experiments on both the vessel segmentation and Optic Disc/Cup detection tasks further validate the effectiveness and superiority of our proposed method.

Abstract:
Skewed histogram shifting (SHS) is an efficient scheme in reversible data hiding (RDH) research. By employing a pair of symmetric predictors, each of which averages part of the sorted pixels around the to-be-predicted pixel, two skewed histograms are generated. With the embedding and shifting directions toward the short tail of the two histograms, SHS reduces many invalid modifications. However, the design of the symmetric predictor pair is strictly constrained, which seriously degrades both the embedding capacity and the distortion performance of the SHS scheme. In this work, we propose a generalized SHS model to remove the weight and symmetry constraints. With the help of a differential evolution algorithm, the optimized parameters are obtained in a short period of time, avoiding a time-consuming exhaustive search. What is more, adaptive pairwise mapping and embedding bin selection are also realized by adding parameters into the evolutionary process, which greatly improves the embedding performance without adding much computational complexity. Experiments demonstrate the superiority of our method by comparing it with state-of-the-art RDH schemes.
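
To make the idea of a symmetric predictor pair concrete, the sketch below sorts the neighbors of a pixel and averages the `k` smallest and the `k` largest values. The neighborhood size, the choice `k=2`, and the equal weights are all assumptions for illustration; the abstract does not give the actual predictor weights, and the generalized model it proposes removes exactly these constraints.

```python
def skewed_predictions(neighbors, k=2):
    """Return (low_pred, high_pred) from the sorted neighbor values.

    low_pred averages the k smallest neighbors, high_pred the k largest.
    Equal weights and k=2 are illustrative assumptions.
    """
    if not 1 <= k <= len(neighbors):
        raise ValueError("k must be between 1 and the number of neighbors")
    ordered = sorted(neighbors)
    low_pred = sum(ordered[:k]) / k
    high_pred = sum(ordered[-k:]) / k
    return low_pred, high_pred


# Hypothetical 4-neighborhood and center pixel.
neighbors = [118, 121, 120, 125]
pixel = 121
low_pred, high_pred = skewed_predictions(neighbors)
err_low = pixel - low_pred    # tends to be non-negative: right-skewed histogram
err_high = pixel - high_pred  # tends to be non-positive: left-skewed histogram
```

Because the low predictor tends to underestimate and the high predictor tends to overestimate, the two prediction-error histograms are skewed in opposite directions, which is the property the scheme relies on.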

Affiliations: School of Mathematical Sciences, Research Center for Image and Vision Computing, University of Electronic Science and Technology of China, Chengdu, China; School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China; School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China; School of Information Management, Jiangxi University of Finance and Economics, Nanchang, China; Glasgow College, University of Electronic Science and Technology of China, Chengdu, China; Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

Abstract:
Recently, the transform-based tensor representation has attracted increasing attention in multimedia data (e.g., images and videos) recovery problems, which consists of two indispensable components, i.e., the transform and the characterization. Previously, the development of transform-based tensor representation has focused mainly on the transform perspective. Although several attempts have considered shallow matrix factorization (e.g., singular value decomposition and nonnegative matrix factorization) for characterizing the frontal slices of the transformed tensor (termed the latent tensor), the faithful characterization perspective has been underexplored. To address this issue, we propose a unified Deep Tensor Representation (DTR) framework by synergistically combining the deep latent generative module and the deep transform module. Especially, the deep latent generative module can faithfully generate the latent tensor as compared with shallow matrix factorization. The new DTR framework not only allows us to better understand the classical shallow representations but also leads us to explore new representations. To examine the representation capability of the proposed DTR, we consider the representative multidimensional data recovery task and suggest an unsupervised DTR-based multidimensional data recovery model. Extensive experiments demonstrate that DTR achieves superior performance compared to the state-of-the-art methods from both quantitative and qualitative aspects, especially for fine detail recovery.

Abstract:
Multimodal hate detection plays a crucial role in maintaining harmonious online environments by identifying harmful content, such as hateful memes. Although previous research has made significant progress in detecting explicit hate speech, there remains a critical gap in analyzing implicit hate, which is particularly challenging due to the absence of explicit harmful text claims or demographic visual cues. Despite the promising results based on cross-modal attention, previous methods may suffer from the distributional modality gap caused by the non-literal associations between multimodal elements, which lack apparent alignment in implicitly hateful content. In this work, we propose a novel framework, Flexible Optimal Transport (FLOT), to capture the non-literal cross-modal alignment for multimodal hate in the context of memes. FLOT formulates the problem of cross-modal alignment as finding optimal transportation plans, leveraging a kernel method to capture complementary information from multiple modalities. The kernel embeddings in a reproducing kernel Hilbert space (RKHS) serve as a non-linear transformation for alignment, which effectively reduces the distributional modality gap with more interpretability. Moreover, we establish topological structures with contrastive modeling for the aligned representations, which are optimized to achieve comprehensive alignment between different modalities and facilitate local reasoning based on multimodal elements. Experimental results demonstrate that FLOT achieves state-of-the-art performance on three publicly available benchmark datasets. Furthermore, extensive qualitative analysis confirms the superior ability of FLOT in capturing implicit cross-modal alignment.
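
For readers unfamiliar with transport plans, the sketch below computes an entropy-regularized plan with the standard Sinkhorn iteration. It is not FLOT: the kernel embedding and contrastive components are omitted, and the cost matrix, the marginals `a` and `b`, and the regularization strength `reg` are hypothetical values chosen only to show what "finding a transportation plan" between two sets of elements means.

```python
import math


def sinkhorn(cost, a, b, reg=0.5, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost[i][j] is the cost of moving mass from source i to target j;
    a and b are the source and target marginals (each sums to 1).
    Returns the transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs between 2 image regions (rows) and 3 text tokens (columns).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0]]
a = [0.5, 0.5]
b = [1.0 / 3, 1.0 / 3, 1.0 / 3]
plan = sinkhorn(cost, a, b)
```

Each entry `plan[i][j]` is the amount of mass matched between region `i` and token `j`; low-cost pairs receive more mass, while the rows and columns still sum to the prescribed marginals.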

Abstract:
The traditional object detection algorithm in the intelligent vehicle perception system cannot maintain stable recognition performance in unknown and changing road environments. We find that uncertainty quantification is of great significance in detecting unknown complex environments and helps to improve the robustness and safety of autonomous driving systems. Therefore, this paper proposes an Uncertainty-based Transformer (UBT) object detection algorithm. Firstly, the double Gaussian feature map network (DGF) is designed to quantify and utilize the uncertainty of the features derived from the backbone network. Secondly, we propose an RBF-based query filtering model (RBQF), which takes the uncertainty sum as the index for query vector screening. At the same time, this paper proposes an uncertainty detection head (UDH); the final model outputs quantified uncertainty, with improved detection performance and enhanced algorithm reliability. To further demonstrate the detection performance of the proposed method in real driving scenes, we use COCO, Cityscapes, FoggyCityscapes, RainCityscapes and self-made traffic scene datasets for verification, which shows that our algorithm is well suited to large datasets and complex road scenes.

Abstract:
Medical images acquired under suboptimal conditions often suffer from quality degradation, such as low-light, blurring, and artifacts. Such degradations obscure the lesions and anatomical structures in medical images, making it difficult to distinguish key pathological regions. This significantly increases the risk of misdiagnosis by automated medical diagnostic systems or clinicians. To address this challenge, we propose a multi-Color space-based quality enhancement network (MSQNet) that effectively eliminates global low-quality factors while preserving pathology-related characteristics for improved clinical observation and analysis. We first revisit the properties of image quality enhancement in different color spaces, where the V-channel in the HSV space can better represent the contrast and brightness enhancement process, whereas the A/B-channel in the LAB space is more focused on the color change of low-quality images. The proposed framework harnesses the unique properties of different color spaces to optimize the image enhancement process. Specifically, we propose a pathology-preserving transformer, designed to selectively aggregate features across different color spaces and enable comprehensive multiscale feature fusion. Leveraging these capabilities, MSQNet effectively enhances low-quality RGB medical images while preserving key pathological features, thereby establishing a new paradigm in medical image enhancement. Extensive experiments on three public medical image datasets demonstrate that MSQNet outperforms traditional enhancement techniques and state-of-the-art methods, in terms of both quantitative metrics and qualitative visual assessment. MSQNet successfully improves image quality while preserving pathological features and anatomical structures, facilitating accurate diagnosis and analysis by medical professionals and automated systems.
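
The claim that the V channel of HSV tracks brightness and contrast can be checked with the Python standard library. This is only a small illustration of the color-space property, not part of MSQNet; the two RGB triples are hypothetical pixels in which the second is the first scaled by a factor of four.

```python
import colorsys


def to_hsv(r, g, b):
    """Convert an 8-bit RGB triple to (h, s, v), each in [0, 1]."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


# Hypothetical under-exposed pixel and the same pixel four times brighter.
dark = (40, 20, 10)
bright = (160, 80, 40)

h_dark, s_dark, v_dark = to_hsv(*dark)
h_bright, s_bright, v_bright = to_hsv(*bright)
```

Scaling all three RGB values changes only V (here from 40/255 to 160/255), while hue and saturation stay the same, which is why brightness enhancement is naturally expressed on the V channel.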

Abstract:
Zero-Shot Skeleton-Based Action Recognition (ZSSAR) is an emerging research field focused on developing alignment models that connect skeleton movements with action definitions, thus enabling generalization to unobserved actions. Current methods often employ generative models to reconstruct cross-modal features or enhance mutual information across modalities for alignment. However, when applied to unseen action categories, these models often neglect the inherent consistency among basic actions, thereby diminishing their generalization capabilities. Furthermore, imprecise annotations fail to capture the rich semantic details of actions, resulting in misalignment. Inspired by human cognitive processes and chain of thought, we argue that integrating prior information about human actions with intrinsic commonality knowledge of basic actions is essential for ZSSAR. To actualize this, we propose a novel method termed Prompt-guided Prototype-aware Commonality and Discrimination Learning (PP-CDL). This method utilizes the comprehensive world knowledge contained in LLMs, employing tailored prompts to partition seen action categories into distinct, non-overlapping prototype spaces that embody the commonality knowledge of basic actions. Subsequently, we introduce the Inter- and Intra-Prototype Discriminating (I2PD) module and the Intra-Prototype Commonality Mining (IPCM) module. The I2PD amplifies the distinctiveness of knowledge within prototypes, furnishing a personalized search space for the recognition of unseen actions. In contrast, the IPCM models the shared commonality concept within prototypes, bolstering the consistency between skeleton action representations and corresponding text knowledge representations. Experiments on different skeleton action benchmarks demonstrate the significant improvement of our method over existing alternatives.

Abstract:
We present a novel, first-of-its-kind view synthesis method for plenoptic images, which enables the direct manipulation of images in the micro-image array format, thereby bypassing intermediate transformation steps. Current plenoptic imaging approaches typically rely on an initial conversion to dense multiview images, also known as subaperture image extraction. However, the use of subaperture images presents two main limitations that ultimately impact further processing. First, existing subaperture view extraction methods offer limited control over camera parameters, resolutions, and poses of the subaperture views, which are also constrained to a small area around the main lens, thus restricting free navigation. Second, subaperture images are susceptible to artifacts which can propagate to subsequent processes such as calibration, depth estimation and view synthesis. In this paper, we propose a camera model that enables depth image-based rendering with plenoptic cameras, in a way that allows for the direct synthesis of any target viewpoint. In our evaluation, we show that our method expands view synthesis extrapolation to a range that is two to three times greater than that of pipelines requiring a conversion to subaperture images, including generally accepted tools such as depth image-based rendering and learning-based rendering approaches.

Abstract:
Small Object Detection (SOD) poses significant challenges due to limited information and the model’s low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model’s lower class prediction score for small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates a deep MLP network to aggregate initial feature representations with both short- and long-range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations into the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% AP_S on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% AP_S (36.4% vs. 32.0%) with fewer parameters and FLOPs, under a 12-epoch training setting.

Abstract:
Multispectral object detection, which combines RGB visible light and thermal infrared spectral information, has broad applications in complex environments and varying illumination conditions. However, existing methods face challenges in processing multispectral data, such as inconspicuous object features in spectral images and significant discrepancies between input modality spaces and output detection spaces. To address these issues, we propose an innovative multispectral object detection method that combines contrastive learning and a new cross-modal feature fusion module. We introduce a mask feature contrastive loss that maximizes the similarity between the box-level mask features and modal features while suppressing background responses, enabling effective representative alignment between the input and output spaces. Additionally, we propose a mask-guided attention fusion module that uses a predicted pseudo mask to guide the fusion of different modal features, enhancing object responses and reducing background noise interference. Our extensive experiments on several challenging multispectral datasets demonstrate that our proposed COFNet achieves state-of-the-art performance.

Abstract:
Keypoint detection and matching have garnered significant attention, yet remain challenging in low-light environments. Most current studies follow an enhance-then-detect pipeline, which consists of independent enhancers and detectors. While the enhancer focuses on improving the visual quality of low-light images to satisfy human perception standards, the detector prioritizes detection accuracy for machine vision tasks. The unaligned optimization objectives of the enhancer and detector overlook the gap between human and machine vision and lead to sub-optimal performance in low-light keypoint detection. To tackle this problem, a joint enhance-and-detect pipeline is proposed to unify the optimization objectives of enhancement and detection by regarding the improvement of keypoint detection accuracy with enhanced features under machine vision standards. Specifically, we propose a low-light keypoint detection network named DeRFeat, which learns a degradation-equivariant representation between normal and dark domains using AutoEncoding transformation and domain descriptor similarity constraints to indirectly enhance the features from the encoder in the training stage. Then, DeRFeat guides the shared encoder to obtain the degradation-equivariant representations from dark images in the inference stage. With the dark degradation predictions, the encoder is capable of generating equivariant representations between normal and dark domains. The proposed domain descriptor similarity module further aids the encoder in mitigating the impact of dark degradation factors, enabling local descriptors to acquire undisturbed representations. Moreover, a coarse-to-fine point selection strategy is proposed to provide reliable prior keypoints for a globally optimal descriptor construction. Experimental results on four benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods under varying low-light conditions.

Abstract:
Due to the development of text-image multimodal methods, using text to guide the style transfer of images has attracted growing attention. Notably, the existing text-guided image style transfer methods are limited to expressing a specific artistic style through simple text. They can only accept coarse-grained text input such as “Van Gogh” and “White Cloud”, and cannot understand fine-grained text input such as “The Night Café by Vincent van Gogh”. To this end, this paper proposes a novel artistic style transfer network based on fine-grained text guidance and contrastive semantics similarity, named TCStyler. It can accept images or texts as style guidance, which makes it more suitable for stylization that requires fine-grained content understanding. In this network, to address the issue of text-image cross-modal discrepancy, the residual attention feature mapper (RAFM) is introduced to constrain the differences between different modalities in feature space. Then, the global cascading style-sharing module (GCSM) is proposed for performing content-style feature fusion and image-text modality fusion by adopting a global feature-sharing strategy. Furthermore, the contrastive semantics similarity loss is designed to address the problem of multimodal universality. Quantitative and visualization experiments demonstrate that our TCStyler can handle fine-grained artistic text inputs and maintain consistency in the style transfer results guided by different modalities.

Abstract:
The dense representation space in Chinese scene text recognition (STR) makes discriminating between categories highly challenging, because of the large candidate category set. Mainstream STR methods have achieved remarkable advancements by leveraging linguistic knowledge to implicitly address this challenge. In this paper, inspired by the correlation between recognizer performance and the distributional properties of character representations, as well as the inherent consistency between this correlation and supervised contrastive learning (SupCon), we thoroughly investigate how to integrate SupCon with an STR model to alleviate this challenge, and elucidate some dynamic behaviors underlying the performance improvements. Specifically, we analyze the SupCon-STR models instantiated with different projectors and evaluate their distributional properties through metrics, including intra-class compactness, inter-class separability, and feature redundancy, while assessing performances that involve in-domain accuracy and cross-domain recognition generalization. The main results reveal how the temperature \tau and projectors affect the representation distribution, and highlight that suitable intra-class compactness and sufficient inter-class separability are key factors for delivering competitive performances in both in-domain and cross-domain STR scenarios. Moreover, these results also provide valuable insights into the design of SupCon-STR architectures for diverse resource constraints. Taking existing Chinese STR models as baselines, and combining SupCon-STR with them, the average improvements in cross-domain recognition performance are over 5% across 7 testing datasets. A new state-of-the-art accuracy of 77.19% on the Chinese Scene benchmark is also established.
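
The role of the temperature \tau and of intra-class compactness can be seen in a plain implementation of the supervised contrastive (SupCon) loss. This is a minimal sketch of the generic SupCon formulation, not of the SupCon-STR models studied in the paper; the two-dimensional features and labels are hypothetical, and the projector is omitted.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss averaged over anchors that have positives."""
    z = [normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarities to every other sample.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau
                  for a in range(n) if a != i}
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical character features: two tight clusters.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tight = supcon_loss(feats, [0, 0, 1, 1])  # labels agree with the clusters
mixed = supcon_loss(feats, [0, 1, 0, 1])  # labels cut across the clusters
```

When same-class features are compact and classes are well separated (`tight`), the loss is close to zero; when labels cut across clusters (`mixed`), it is large. A smaller `tau` sharpens this contrast.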

Abstract:
Object detection in aerial images remains challenging due to significant variations in object scales and uneven object distributions. Previous methods typically employ coarse-to-fine strategies, focusing initially on prominent objects and then finely detecting smaller ones within sub-regions likely to contain multiple objects. However, two essential factors of sub-regions, positional precision and detection difficulty, deserve further consideration. Moreover, object scale fluctuations within sub-regions produce certain false positives, especially in areas where objects are densely distributed. To address these issues, we propose an object-wise density-informed framework, DSENet++, which includes three consecutive stages termed “Discernment, Selection, Elevation.” Specifically, a sophisticated object-wise density map considers both object scales and angles to discern positionally precise sub-regions. Subsequently, sub-regions rated as having high detection difficulty are selected within the proposed Region Select Module, based jointly on density intensities and coarse detections. Afterward, the fine detector head is fine-tuned using the selected sub-regions in conjunction with a newly inserted adapter module, which enables features generated by the backbone to be more effectively processed by the detector head. To mitigate the impact of false positives, we devise a Train-by-False Positives training strategy. It collects false positives and clusters them adaptively to create novel pseudo-positive categories, which are combined with the original ones for retraining. The final retraining is performed on the enlarged category space to elevate the performance of fine detectors in areas where coarse detections are mediocre. Extensive experiments show that DSENet++ achieves state-of-the-art performance on three popular aerial image datasets: VisDrone, DOTA-V1.5, and SODA-A.

Abstract:
Ultra-high resolution (UHR) image segmentation is a challenging task that requires efficient processing of large images while maintaining high accuracy. Existing approaches usually employ both shallow and deep networks to extract high-resolution details and global context from different-resolution inputs, achieving a balance between performance, memory, and speed. However, these methods still rely on preserving relatively high-resolution features within the deep network, leading to increased time and memory costs. This also indicates that the full potential of the high-resolution information from the shallow network remains underexplored. To address this, we propose a novel framework called the Hierarchical Grafting Network (HGN), wherein the shallow network is hierarchically grafted to the deep network from multiple perspectives, enabling comprehensive utilization of the features from the shallow network. Our framework involves carefully designed global structure aggregated grafting and local structure aligned grafting mechanisms, which progressively integrate semantic details and spatial structure from the shallow network into the deep network. In addition, to enhance the discriminative power of the high-resolution local features extracted by the shallow network, we introduce a shallow-deep contrastive loss to encourage the shallow network to learn features semantically similar to those of the deep network. Extensive experiments on several UHR image segmentation datasets demonstrate that our approach outperforms state-of-the-art UHR methods, with overall improvements in memory efficiency, accuracy, and speed.

Affiliations: Shanghai Key Laboratory of Multidimensional Information Processing, School of Communication and Electronic Engineering, Health Science Center, East China Normal University, Shanghai, China; Department of Emergency and Critical Care, Shanghai Changzheng Hospital, Naval Military Medical University, Shanghai, China; Zhejiang University, Hangzhou, China; Hangzhou EZVIZ Network Company, Ltd., Hangzhou, China; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

Abstract:
In non-contact respiratory monitoring, reducing motion artifacts and selecting the appropriate Region of Interest (ROI) pose significant challenges. Most motion artifact removal methods rely on signal periodicity assumptions, while respiratory signals are usually non-periodic in real-world scenarios. Existing automated ROI selection approaches are mostly affected by the texture of clothing, the absence of chest landmarks, and obstruction of the face. To improve the quality of respiratory signals, in this study, we propose a framework for automatic video-based respiratory ROI selection, namely, Optimizing Video-based Respiration Monitoring (OVRM), which consists of peak-trough adaptive motion artifact removal and characteristic-driven adaptive ROI selection. The motion artifact removal strategy removes motion artifacts by using a dynamic ratio-based judgment mechanism, and reconstructs signals using sinusoidal interpolation. The adaptive ROI method scores signals based on periodicity, similarity, smoothness, and energy, selecting the highest-scoring blocks as the ROIs to match respiratory signals efficiently. Experimental results, validated across four datasets, demonstrate that OVRM effectively reduces signal noise caused by subject movement and outperforms state-of-the-art non-contact respiratory monitoring algorithms.

Abstract:
To counter the security threats posed by the realism of image inpainting generated through diffusion models and GANs, in this paper, we propose a texture enhancement and progressive refinement network (TEPR-Net) for image inpainting localization (IIL). The IIL task is divided into two phases: coarse and fine locating. In the coarse locating phase, we utilize an anomaly texture encoder to capture tampering traces in textures, employ a texture–context feature interaction strategy to effectively integrate texture features with contextual features, and utilize a pixel-level contrastive learning strategy to enhance feature clustering and model generalization. In the fine locating phase, we first enhance the receptive field features in the frequency domain by transforming the features and separately enhancing the low- and high-frequency components. Then, we utilize the coarse localization result to augment the model’s sensitivity to tampered regions. Additionally, we introduce a progressive edge distribution guidance and reconstruction strategy that progressively refines the edges of the tampered regions at each level, ultimately generating refined localization results. To support the research and evaluation of the IIL task, we create the Inpaint32K dataset, which is characterized by its large scale, diversity, comprehensiveness, high quality, and authenticity. Finally, extensive experiments demonstrate that TEPR-Net has significant advantages in terms of localization performance, generalizability, extensibility, and robustness.

Abstract:
The quantification of repetitive actions in videos, a task commonly referred to as Video Action Counting (VAC), is a critical challenge in understanding and analyzing content in sports, fitness, and daily activities. Traditional approaches to VAC have largely overlooked the nuanced irregularities inherent in action repetitions, such as interruptions and variable lengths between cycles. Addressing this gap, our study introduces a novel perspective on VAC, focusing on Irregular Video Action Counting (IVAC), which emphasizes the importance of modeling the irregular repetition priors present in video content. We conceptualize these priors through two key aspects: Inter-cycle Consistency and Cycle-interval Inconsistency. Inter-cycle Consistency ensures that the spatiotemporal representations across all cycle segments in a video remain homogeneous, thereby reflecting the uniformity of actions between different cycle segments. In contrast, Cycle-interval Inconsistency mandates a clear semantic distinction between the representations of cycle segments and intervals, acknowledging the inherent dissimilarities in content. To effectively encapsulate these priors, we introduce a novel methodology consisting of consistency and inconsistency modules, underpinned by a tailored pull-push loss (P²L) mechanism. This approach employs a pull loss to enhance the cohesion among cycle segment features and a push loss to distinctly differentiate between cycle and interval segment features. Empirical evaluations on the RepCount dataset illustrate that our IVAC-P²L model sets a new benchmark in state-of-the-art performance for the VAC task. Moreover, our model demonstrates adaptability and generalization across diverse video content, achieving superior performance on two additional datasets, UCFRep and Countix, without necessitating dataset-specific fine-tuning. These findings not only validate the effectiveness of our approach in addressing the complexities of irregular repetitions in videos but also open new avenues for future research in video understanding and analysis.
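
The pull and push terms described in the abstract above can be illustrated with a small sketch. This is not the paper's formulation: the function name `pull_push_loss`, the use of cosine similarity, and the `margin` parameter are all assumptions made for illustration. The pull term rewards agreement among cycle-segment features, and the push term penalizes similarity between cycle and interval features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pull_push_loss(cycle_feats, interval_feats, margin=0.0):
    """Hypothetical pull-push objective (an assumption, not the paper's loss).

    pull: mean of (1 - cos) over all pairs of cycle segments,
          small when cycle segments look alike.
    push: mean of max(0, cos - margin) over all cycle/interval pairs,
          small when cycles and intervals look different.
    """
    pull_terms = [
        1.0 - cosine(cycle_feats[i], cycle_feats[j])
        for i in range(len(cycle_feats))
        for j in range(i + 1, len(cycle_feats))
    ]
    push_terms = [
        max(0.0, cosine(c, g) - margin)
        for c in cycle_feats
        for g in interval_feats
    ]
    pull = sum(pull_terms) / len(pull_terms) if pull_terms else 0.0
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0
    return pull, push
```

In this sketch, identical cycle features with an orthogonal interval feature give zero for both terms, which is the configuration the two priors favor.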

Abstract:
The rapid development of Uncrewed Aerial Vehicles (UAVs) and their unique vantage points present both new opportunities and challenges for person Re-Identification (ReID). Uncertain rotations and scale variations of targets in UAV images, coupled with complex environmental factors, hinder existing methods from extracting robust feature representations. Some methods either make minor modifications to the traditional model architecture or apply simple image rotations but still fail to effectively address the challenges of UAV person ReID. To overcome these limitations, we propose a novel Data Augmentation and Rotation Invariance (DARI) algorithm. First, rotation-invariant convolution is introduced to adaptively extract features, mitigating the uncertainty caused by target rotation. Second, a refined data augmentation correction strategy is employed to reduce noise interference by increasing the richness of global features at different stages. Additionally, considering that multiple features of the same identity should yield consistent recognition results, invariant constraints are designed to enhance the clustering effect. We conducted extensive experiments on both UAV and fixed-camera datasets. The results on PRAI-1581 demonstrate a 5.6% and 6.1% improvement in mAP and Rank-1, respectively, compared to the baseline. These findings highlight the model’s effectiveness in addressing the challenges of UAV ReID, demonstrating its robustness and superiority.

Abstract:
Oriented objects in images are typically embedded in complex backgrounds and exhibit arbitrary orientations. When using oriented bounding boxes (OBBs) to represent these objects, the periodicity of the angles and associated variations in side lengths lead to discontinuities in the angle loss. This paper fundamentally addresses this problem by proposing a trigonometric loss function in the complex plane. Moreover, a conformer RPN head is designed with convolution and multi-head self-attention, which can dynamically capture angular and classification information. The proposed loss function and conformer RPN head jointly generate high-quality oriented proposals. A category-aware dynamic label assignment based on predicted category feedback is proposed to address the limitations of solely relying on IoU for oriented proposal label assignment. This method makes negative sample selection more representative, ensuring consistency between classification and regression features. Experiments were conducted on five realistic oriented detection datasets, and the results demonstrate superior performance in oriented object detection with minimal parameter tuning and time costs. Specifically, mean average precision (mAP) scores of 82.02%, 71.99%, 69.87%, 46.45%, and 98.77% were achieved on the DOTA-v1.0, DOTA-v1.5, DIOR-R, STAR, and HRSC2016 datasets, respectively.
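
The angle discontinuity mentioned in the abstract above can be made concrete with a short sketch. It is not the loss proposed in the paper: `complex_plane_angle_loss` is a hypothetical name, the period of π is an assumption about the box parameterization, and side-length swaps are ignored. The idea shown is only that mapping an angle to a point on the unit circle makes two nearly identical orientations on opposite sides of the angular boundary produce a small loss, whereas a plain absolute difference does not.

```python
import cmath
import math


def naive_angle_loss(theta_pred, theta_gt):
    """Absolute angle difference; jumps at the boundary of the angle range."""
    return abs(theta_pred - theta_gt)


def complex_plane_angle_loss(theta_pred, theta_gt, period=math.pi):
    """Squared distance between the two angles embedded on the unit circle.

    Each angle theta is mapped to exp(i * k * theta) with k = 2*pi / period,
    so angles that differ by a whole period map to the same point.
    (Illustrative assumption, not the paper's trigonometric loss.)
    """
    k = 2.0 * math.pi / period
    z_pred = cmath.exp(1j * k * theta_pred)
    z_gt = cmath.exp(1j * k * theta_gt)
    return abs(z_pred - z_gt) ** 2


# Two almost identical orientations on either side of the boundary.
theta_gt = -math.pi / 2 + 0.01
theta_pred = math.pi / 2 - 0.01
boundary_naive = naive_angle_loss(theta_pred, theta_gt)
boundary_complex = complex_plane_angle_loss(theta_pred, theta_gt)
```

Here `boundary_naive` is close to π even though the boxes nearly coincide, while `boundary_complex` stays close to zero.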

Abstract:
Recently, generative models such as diffusion models (DMs) have gained prominence in various applications, and there is a growing demand for their deployment on resource-constrained devices. Model pruning provides an effective solution by reducing the model redundancy without significantly impacting performance. However, most existing model pruning methods are designed for classification models and often lead to substantial performance degradation when applied to generative models. To address this issue, we propose Comp-Diff, a novel two-stage framework of pruning and knowledge distillation tailored for diffusion models. In the pruning stage, we propose a new structured content-aware pruning (CaP) method within Comp-Diff to identify and preserve informative units (filters/channels) that actually contribute to the generative capability of the model. Specifically, we introduce input perturbations to the pre-trained model and measure each unit’s importance score using gradients induced by these perturbations. Units with higher importance scores are considered more informative and are retained to maintain the model’s generative power. In the fine-tuning stage of Comp-Diff, we propose the distribution-aware knowledge distillation (DaKD) method, which effectively transfers fine-grained knowledge from the original model to the pruned one on both attention and noise distribution levels. In addition, DaKD includes an adversarial loss to improve the quality and diversity of generated outputs. To verify and evaluate our method, we apply the proposed Comp-Diff on three representative tasks: unconditional image generation, conditional image generation, and text-to-image generation. Extensive experiments on both multi-step and one-step diffusion models demonstrate that the proposed framework consistently yields compact models and outperforms existing pruning techniques by a large margin.

Abstract:
Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data poisoning attacks against deep hashing (PADHASH). Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.

Abstract:
Food image composition requires the use of existing dish images and background images to synthesize a natural new image. Diffusion models have made significant advancements in image generation, enabling the construction of end-to-end architectures that yield promising results. However, existing diffusion models face challenges in processing and fusing information from multiple images and lack access to high-quality publicly available datasets, which prevents the application of diffusion models in food image composition. In this paper, we introduce a large-scale, high-quality food image composite dataset, FC22k, which comprises 22,000 foreground, background, and ground-truth image triplets. Additionally, we propose a novel food image composition method, Foodfusion, which leverages the capabilities of pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information. This fused information aligns the foreground features with the background structure by merging the global structural information at the cross-attention layer of the denoising UNet. To further enhance the content and structure of the background, we also integrate a Content-Structure Control Module. Extensive experiments demonstrate the effectiveness and scalability of our proposed method.

Abstract:
Video anomaly detection is of critical importance in safety-critical scenarios. The key challenge is to effectively capture the spatio-temporal features of videos and learn normal patterns from the training data. However, existing methods often fall short in modelling intra-channel and inter-channel correlations as well as dynamic dependencies between video frames, leading to challenges in model robustness and generalization. To address these issues, we propose DCTFormer, a dual-branch framework that integrates both RGB and optical flow branches for video anomaly detection. Firstly, we design a novel module, TRAECT (Transformer-based Residual Autoencoder with Cloze Tests), which incorporates high-level semantics and temporal context information to improve the ability to learn spatio-temporal relationships by capturing intra-channel and inter-channel correlations. More importantly, conditioned on the RGB branch, we propose a new optical flow completion approach incorporating richer motion dynamics to learn dynamic dependencies between video frames and optical flows through a conditional variational autoencoder. Finally, we introduce an ensemble strategy to compute anomaly scores for both branches, and thus fully exploit the branches' modality information. Experiments on three challenging benchmark datasets demonstrate the efficacy of our framework, which outperforms current state-of-the-art approaches in anomaly detection performance.

Abstract:
Visible-Infrared Person Re-Identification (VI-ReID) plays a crucial role in round-the-clock security surveillance systems, aiming to achieve consistent identity recognition across transitions from day to night. A significant challenge in this field is the variation in the appearance of the same identity across visible and infrared modalities, which often leads to coupled noisy labels, referring to both Noisy Annotation (NA) and Noisy Correspondence (NC). Therefore, learning noise-tolerant and discriminative representations is the primary objective in VI-ReID. However, existing research typically faces two principal limitations: (1) Learning strategies for noisy labeled scenarios usually rely on analyzing the distribution of loss response while ignoring the rich semantic information from neighboring samples. (2) When dealing with identified noisy samples, most previous approaches employ filtering strategies to mitigate the impact of noisy samples but fail to consider the valuable information in the noisy samples. To address these challenges, we propose a Propagation based Recycling Contrastive Learning (PRCL) approach. This method utilizes a label propagation strategy to distinguish clean annotations to learn identity-wise semantic information and recycles filtered noisy samples to capture the geometry-wise representation. Thus, even in the presence of noisy labels, the method can help learn robust representations across visible and infrared modalities. Specifically, we design a Noisy-aware Heterogeneous Graph Propagation module, which identifies noisy samples by aggregating the effects of neighboring labels using a graph propagation strategy. In addition, we develop a Cross Modality Recycling Debiased Contrastive Learning algorithm, which leverages the identity-wise information from clean samples and geometry-wise information from noisy samples. This approach utilizes identity-wise and geometry-wise information to mitigate the effect of noisy labels and retain as much valuable information as possible. Extensive experiments on two VI-ReID benchmark datasets demonstrate that our proposed method achieves highly competitive performance.

Abstract:
Image denoising under adverse weather conditions aims to eliminate multiple weather-related noises and restore bright and clear images. Until now, most methods are task-specific, while all-in-one algorithms often require a large number of parameters, limiting their model efficiency. Our theoretical analysis and statistical experiments reveal that the Hue channel of adverse weather images contains rich contextual information for further processing. With this observation, we propose a novel lightweight HUe-Guided Synergistic Network (HUGS-Net) with multi-scale detail refinement. First, we design a Fourier interaction and evolution module to capture global information from the Hue channel without introducing excessive network parameters. Second, we develop a lightweight residue group convolution block to model local texture features, incorporating them with global information to guide noise removal. Third, we introduce a multi-scale fusion module to enhance high-frequency details at a small feature resolution in RGB color space. With the above design, HUGS-Net further supervises and supplements refined background information. Comprehensive experiments showcase the superiority of HUGS-Net across various adverse weather tasks (e.g., image deraining, desnowing, dehazing) with the smallest parameter size and fast running speed.

Abstract:
Reversible data hiding techniques serve as a cornerstone in the protection of embedded information across diverse media. Traditionally, methods applied to 3D models have primarily focused on modifying vertex coordinates. However, this approach neglects the untapped potential of polygonal faces, which, being more abundant than vertices, offer a scalable and efficient avenue for data embedding. By leveraging polygon indices for reversible data hiding—particularly when integrated with encryption—it becomes possible to randomize the model’s structure, facilitating secure modifications while preserving geometric integrity. This study introduces an innovative reversible data hiding algorithm that embeds messages within the polygon indices of encrypted 3D models. The algorithm harnesses the inherent similarities among vertex indices to conceal additional information. To maintain the consistency of polygon normal vectors, we implement a right circular shifting mechanism that systematically reorganizes the indices, ensuring that the smallest value consistently occupies the initial position. Additionally, we incorporate techniques such as leading zero count and multi-MSB prediction to enhance embedding capacity while keeping index values within permissible ranges. Experimental results demonstrate that our approach significantly outperforms conventional vertex-based methods, yielding substantial improvements in embedding efficiency. Crucially, the reversible nature of the proposed technique ensures the exact restoration of the original 3D model upon data extraction, guaranteeing zero information loss and no compromise in quality. Moreover, the algorithm is designed to integrate with vertex-based reversible data hiding techniques for encrypted 3D models, potentially enhancing data embedding capacity under compatible conditions.
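
The index reordering step described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, and each face is assumed to be a list of distinct vertex indices. A circular shift keeps the cyclic order of the vertices, so the winding direction, and therefore the direction of the face normal, is unchanged while the smallest index moves to the first position.

```python
def right_circular_shift(face, steps=1):
    """Rotate the vertex indices of one face to the right by `steps` positions."""
    face = list(face)
    steps %= len(face)
    if steps == 0:
        return face
    return face[-steps:] + face[:-steps]


def canonicalize_face(face):
    """Right-shift a face until its smallest vertex index comes first.

    Assumes the indices within a face are distinct (hypothetical helper).
    """
    face = list(face)
    position = face.index(min(face))
    steps = (len(face) - position) % len(face)
    return right_circular_shift(face, steps)


triangle = canonicalize_face([7, 2, 5])
quad = canonicalize_face([9, 4, 6, 1])
```

For the triangle `[7, 2, 5]` the cyclic order 7, 2, 5 is preserved in the result `[2, 5, 7]`; a plain sort would also put 2 first but could reverse the winding of other faces.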

Abstract:
Striking a balance between speed and accuracy remains a significant challenge in real-time semantic segmentation. Existing methods typically employ an encoder-decoder or a multi-branch structure to achieve accurate segmentation while maintaining compactness. However, these methods often overlook the different significance of various pixels and suffer from insufficient spatial information during encoding. To address these challenges, we propose an Image Complexity-aware Two-branch Network (ICTNet) with enhanced decoding. Notably, we introduce image complexity into the spatial branch by supervising a part of its parameters with image complexity maps. The generated complexity-aware spatial features are fused with context features in the intermediate stage of the encoder through our Spatial and Context Fusion (SCFusion) module. The rich context information can provide enhanced object context guidance for spatial features and result in accurate spatial details. To fully integrate the two-branch features and recover sufficient spatial detail information, we design an enhanced decoder which integrates a multi-level context module for feature restoration and develop the Image Complexity Prior Guiding (ICPG) module to fuse the two branch features. The ICPG transforms features into attention maps and modulates the summing of spatial and context features. The fused features are then upsampled through PixelShuffle and bilinear interpolation to produce the final segmentation results. Extensive experiments on the Cityscapes, CamVid, and PASCAL VOC2012 datasets demonstrate the effectiveness of our method.

Abstract:
Evaluating image quality without reference images, known as blind image quality assessment (BIQA), is crucial for image communication. Recently, convolutional neural networks (CNNs) have emerged as a prominent BIQA approach due to their feature learning power. Usually, both high-level semantic information and low-level details significantly impact perceived visual quality. However, most existing CNN-based methods focus on high-level semantic information via aggregating features on top of the last convolutional layer into a global descriptor, neglecting the importance of shallow, low-level cues. To address this limitation, this paper proposes a novel approach that exploits local encoding and histogram-based pyramid pooling on cross-layer features produced by a CNN, achieving a joint local and global analysis. Specifically, we introduce a cross-layer pattern encoding model that characterizes features generated along convolutional layers via a soft histogram of local 3D binary patterns. This leads to a highly informative yet compact descriptor for score regression. By building this module into a ResNet backbone, we present an effective BIQA model demonstrating state-of-the-art performance in extensive experiments on synthetic and authentic datasets.

Abstract:
Generalized zero-shot skeleton-based action recognition (GZSSAR) is an emerging and challenging problem in the computer vision community. It requires models to recognize human actions, including some classes that are unseen during training. Previous studies typically rely solely on action labels to bridge the gap between seen and unseen action classes. However, the limited action semantic information hinders the learning of comprehensive semantic prototypes, thereby restricting the model’s ability to generalize to unseen classes. To address this issue, in addition to the original action labels, we explore four types of textual action descriptions (i.e., interpretive and motional descriptions derived from manual expert annotation and a large language model) for each action class. In order to comprehensively utilize multi-view semantic information for zero-shot classification, an Attentional Multi-view Semantic Fusion (AMSF) model is proposed. It effectively integrates the multi-view semantic features and aligns visual and semantic features in a common space, subsequently realizing the recognition of unseen action classes. Furthermore, previous works typically evaluate models in settings that include specific unseen classes, which is insufficient for GZSSAR research. To thoroughly evaluate different models, we introduce two novel experimental settings, termed the “easy setting” and the “hard setting”, based on the semantic similarities between action classes. Extensive experimental results on three large-scale skeleton-based action recognition benchmarks (PKUMMD, NTU-60, and NTU-120) not only validate the advantages of the proposed multi-view action descriptions and the AMSF model but also demonstrate the rationality of the novel experimental settings.

Abstract:
Open-vocabulary multi-label classification (OV-MLC) aims to leverage the rich multi-modal knowledge from vision-language pre-training (VLP) models to further improve the recognition ability for unseen (novel) classes beyond the training set in multi-label scenarios. Existing OV-MLC methods only perform predictions on single hierarchical regions, and aggregate the prediction scores of these regions through simple top-k mean pooling. This fails to unleash the potential of rich hierarchical region clues in multi-label images and does not fully exploit the discriminative information from all regions in the image, resulting in sub-optimal performance. In this work, we propose a novel OV-MLC framework to fully harness the power of multiple hierarchical region clues. Specifically, we first design a hierarchical clue gathering (HCG) module to gather different hierarchical clues, enabling more precise recognition of multiple object categories with different sizes in a multi-label image. Then, by viewing multi-label classification as single-label classification of each region within the image, we present a novel hierarchical score aggregation (HSA) approach, thereby better utilizing the predictions of each image region for each class. We also utilize a well-designed region selection strategy (RSS) to eliminate noise or background regions in an image that are irrelevant to classification, achieving higher multi-label classification accuracy. In addition, we propose a hybrid prompt learning (HPL) strategy to enhance visual-semantic consistency while preserving the generalization capability of label embeddings for unseen classes. Extensive experiments on public benchmark datasets demonstrate that our method significantly outperforms the current state-of-the-art.
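
The top-k mean pooling that the abstract above attributes to existing OV-MLC methods can be sketched in a few lines. This is an illustrative assumption about that baseline, not code from any specific method: `topk_mean_pool` is a hypothetical name, and scores are assumed to be stored as one list of per-class scores per region. The sketch shows how regions outside the top k for a class contribute nothing to that class's image-level score.

```python
def topk_mean_pool(region_scores, k):
    """Aggregate per-region class scores into one image-level score per class.

    region_scores: list of lists, shape [num_regions][num_classes]
    k: number of highest-scoring regions averaged for each class
    """
    num_classes = len(region_scores[0])
    pooled = []
    for class_index in range(num_classes):
        # Sort this class's scores across regions, highest first.
        column = sorted((row[class_index] for row in region_scores), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return pooled


scores = [
    [0.9, 0.1],  # region 0
    [0.5, 0.2],  # region 1
    [0.1, 0.8],  # region 2
]
image_scores = topk_mean_pool(scores, k=2)
```

With `k=2`, region 2 is ignored for the first class and region 0 is ignored for the second, which is the loss of per-region information the abstract points to.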

Abstract:
Existing JPEG encryption approaches pose a security risk due to the difficulty in changing all block-feature values while considering format compatibility and file size expansion. To address these concerns, this paper introduces a novel JPEG image encryption scheme. First, the security of sketch information against chosen-plaintext attacks is improved by increasing the change rate of block-feature values. Second, a classification global permutation approach is designed to encrypt the undivided run/size, value (RSV)-based AC groups to achieve larger changes in the block-feature values. Third, to reduce file size expansion while maintaining format compatibility, the DC coefficients are rotated based on the mapped DC differences in the same category, and the nonzero AC coefficients are mapped in the same category. Extensive experiments demonstrate that the proposed algorithm is superior to existing schemes in terms of security. Notably, the average change rate of block-feature values is increased by at least 20%. Furthermore, the proposed scheme reduces the file size by an average of 2.036% compared to existing JPEG image encryption methods.

Abstract:
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in egocentric tasks such as Zero-Shot Egocentric Action Recognition (ZS-EAR), which requires VLMs to recognize, in a zero-shot manner, actions from first-person videos rich in realistic human-environment interactions. Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this work, we introduce a straightforward yet remarkably potent VLM framework, GPT4Ego, designed to enhance the fine-grained alignment of concept and description between vision and language. Specifically, we first propose a new Ego-oriented Text Prompting (EgoTP♠) scheme, which effectively prompts action-related text-contextual semantics by evolving word-level class names into sentence-level contextual descriptions using ChatGPT with well-designed chain-of-thought textual prompts. Moreover, we design a new Ego-oriented Visual Parsing (EgoVP♣) strategy that learns action-related vision-contextual semantics by refining global-level images into part-level contextual concepts with the help of SAM. Extensive experiments demonstrate that GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, ↑+9.4), EGTEA (39.6%, ↑+5.5), and CharadesEgo (31.5%, ↑+2.6). In addition, benefiting from the novel mechanism of fine-grained concept and description alignment, GPT4Ego can sustainably evolve with the advancement of ever-growing pre-trained foundational models. We hope this work can encourage the egocentric community to conduct more investigation into pre-trained vision-language models.

Abstract:
Directly applying a generic object detection network to copy-move forgery detection (CMFD) inevitably leads to low detection accuracy. Therefore, this paper proposes DCM-Net, a diffusion model-based object detection network that incorporates the characteristics of copy-move forgery, to markedly enhance CMFD performance. DCM-Net, as the first diffusion model-based CMFD network, has the following three improvements. Firstly, the high-similarity box padding strategy pads high-similarity boxes, rather than the random boxes used in the diffusion model, to ground truth boxes, better guiding subsequent dual-attention detection heads (DDHs) to focus more on high-similarity regions. Secondly, different from previous deep learning based CMFD networks that utilize self-correlation calculation to indiscriminately transform all classification features extracted from feature extraction into high-similarity features, an adaptive feature combination strategy is proposed to obtain the optimal feature transformation capable of achieving the best detection performance, enabling DDHs to more effectively distinguish source and target regions. Finally, to give the detection heads more accurate source/target localization and discrimination, DDHs equipped with efficient multi-scale attention and a contextual transformer are proposed to generate tampered features that fuse precise spatial position information with rich global contextual information. Experimental results on three publicly available datasets, USC-ISI, CoMoFoD, and COVERAGE, demonstrate that DCM-Net outperforms several advanced algorithms in terms of similarity detection ability and source/target differentiation ability.

Abstract:
Data is the essential fuel for deep neural networks (DNNs), and its quality affects the practical performance of DNNs. In real-world training scenarios, the successful generalization performance of DNNs is severely challenged by noisy samples with incorrect labels. To combat noisy samples in image classification, numerous methods based on sample selection and semi-supervised learning (SSL) have been developed, where sample selection is used to provide the supervision signal for SSL, achieving great success in resisting noisy samples. Due to the necessary warm-up training on noisy datasets and the basic sample selection mechanism, DNNs are still confronted with the challenge of memorizing noisy samples. However, existing methods do not address the memorization of noisy samples by DNNs explicitly, which hinders the generalization performance of DNNs. To alleviate this issue, we present a new approach to combat noisy samples. First, we propose a memorized noise detection method to detect noisy samples that DNNs have already memorized during the training process. Next, we design a noise-excluded sample selection method and a noise-alleviated MixMatch to alleviate the memorization of DNNs to noisy samples. Finally, we integrate our approach with the established method DivideMix, proposing Modified-DivideMix. The experimental results on CIFAR-10, CIFAR-100, and Clothing1M demonstrate the effectiveness of our approach.

Abstract:
Humans have long fantasized about self-driving vehicles for the sake of luxury, style, safety, and ease. Free road space detection for collision avoidance and path planning is a vital part of autonomous driving vehicles. Despite many researchers focusing on free road space detection, it remains an open and challenging problem for real-world applications. Many studies have attempted to fuse depth and LiDAR features with visual features to improve the overall performance of free road space detection. However, there is no guideline on how such features should be fused to complement the visual features. Additionally, most of the previously proposed methods are computationally expensive and not suitable for real-life applications. The main motivation of this study is to realize a lightweight model that addresses these problems without compromising performance. As the LiDAR and visual features exist in different spaces, the proposed method attempts to learn various transformation and fusion operations from LiDAR features to complement the visual features. To validate the performance of the proposed method, we conduct comprehensive experiments on prominent benchmark datasets. The results of the experiments reveal the superior performance of the proposed model while being lightweight. The proposed model, LRDNet, ranks third overall (with a minor difference) and second among LiDAR-based methods on the KITTI road benchmark dataset. Furthermore, the proposed model is the least computationally expensive among state-of-the-art methods and can be considered an optimal trade-off between speed and accuracy.

Abstract:
We present a novel correspondence-learning model for real-time registration of partially overlapping point clouds, between which the relative translation is large and hence identifying the correspondences is challenging. Our goal is to improve the feature learning for accurate correspondence establishment, which in turn significantly improves registration efficiency. This is realized by two particular designs. The first is a graph-based feature extraction module, which aggregates both inter- and intra-contexts of the input point clouds simultaneously to strengthen the connection between inputs. The second is a feature refinement module, which uses sparse convolutions to further widen the feature differences of dissimilar structures. The two modules reinforce each other to improve correspondence learning for robust and fast point cloud registration with low overlap. We evaluate the method on both synthetic and real-world large-scale datasets. The results in real registration tasks show that our method attains registration accuracy competitive with state-of-the-art methods, and is almost two times faster than these competing methods in some scenarios.

Abstract:
Although recent Siamese network-based trackers have achieved impressive perceptual accuracy for single object tracking in LiDAR point clouds, they usually utilize heavy correlation operations to capture category-level characteristics only, and overlook the inherent merit of arbitrariness in contrast to multiple object tracking. In this work, we propose a radically novel one-stream network with the strength of instance-level encoding, which avoids the correlation operations occurring in previous Siamese networks, thus considerably reducing the computational effort. In particular, the proposed method mainly consists of a Template-aware Transformer Module (TTM) and a Multi-scale Feature Aggregation (MFA) module capable of fusing spatial and semantic information. The TTM stitches the specified template and the search region together and leverages an attention mechanism to establish the information flow, breaking the previous pattern of independent extraction-and-correlation. As a result, this module makes it possible to directly generate template-aware features that are suitable for the arbitrary and continuously changing nature of the target, enabling the model to deal with unseen categories. In addition, the MFA is proposed to make spatial and semantic information complementary to each other, which is characterized by reverse directional feature propagation that aggregates information from shallow to deep layers. Extensive experiments on KITTI and nuScenes demonstrate that our method achieves considerable performance not only for class-specific tracking but also for class-agnostic tracking with less computation and higher efficiency.

Abstract:
Adversarial attacks have been extensively studied in the image field. In recent years, research has shown that video recognition models are also vulnerable to adversarial examples. However, most studies about adversarial attacks for video models have focused on perturbation-based methods, while patch-based black-box attacks have received less attention. Despite the excellent performance of perturbation-based attacks, these attacks are impractical for real-world implementation. Most existing patch-based black-box attacks require occluding larger areas and performing more queries to the target model. In this paper, we propose a hard-sample style guided patch attack with reinforcement learning (RL) enhanced motion patterns for video recognition (HSPA). Specifically, we utilize the style features of video hard samples and transfer their multi-dimensional style features to images to obtain a texture patch set. Then we use reinforcement learning to locate the patch coordinates and obtain a specific adversarial motion pattern of the patch to successfully perform an effective attack on a video recognition model in both the spatial and temporal dimensions. Our experiments on three widely-used video action recognition models (C3D, LRCN, and TDN) and two mainstream datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our method compared to other state-of-the-art approaches.

Abstract:
Vision transformer (ViT) variants have made rapid advances on a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask how these modern architectural developments affect performance under common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, exactly which augmentation strategies make ViTs more robust is worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. In addition, we introduce a novel conditional method for generating dynamic augmentation parameters conditioned on input images, which offers state-of-the-art robustness to common corruptions.

Abstract:
The reuse of 3D CAD models is crucial for industrial manufacturing because it shortens development cycles and reduces costs. Significant progress has been made in deep learning-based 3D model retrieval. There are many representations for 3D models, among which the multi-view representation has demonstrated superior retrieval performance. However, directly applying these 3D model retrieval approaches to 3D CAD model retrieval may result in issues such as the loss of engineering semantics and structural information. In this paper, we find that multiple views and B-rep can complement each other. Therefore, we propose the view graph neural network (VGNet), which effectively combines multiple views and B-rep to accomplish 3D CAD model retrieval. More specifically, based on the characteristics of the regular shape of 3D CAD models, and the richness of the attribute information in the B-rep attribute graph, we separately design two feature extraction networks for each modality. Moreover, to explore the latent relationships between the multiple views and B-rep attribute graphs, a multi-head attention enhancement module is designed. Furthermore, the multimodal fusion module is adopted to make the joint representation of the 3D CAD models more discriminative by using a correlation loss function. Experiments are carried out on a real manufacturing 3D CAD dataset and a public dataset to validate the effectiveness of the proposed approach.

Abstract:
In this paper, we explore the problem of Online Action Detection (OAD), where the task is to detect ongoing actions from streaming videos without access to future video frames. Existing methods achieve good detection performance by capturing long-range temporal structures. However, a major challenge of this task is to detect actions at a specific time that arrive with insufficient observations. In this work, we utilize the additional future frames available at the training phase and propose a novel Knowledge Distillation (KD) framework for OAD, where a teacher network looks at more frames from the future and the student network distills the knowledge from the teacher for detecting ongoing actions from the observation up to the current frame. Conventional KD usually uses a high-level teacher network (i.e., the network after the last training iteration) to guide the student network throughout all training iterations, which may result in poor distillation due to the large knowledge gap between the high-level teacher and the student network at early training iterations. To remedy this, we propose a novel progressive knowledge distillation from different levels of teachers (PKD-DLT) for OAD, where in addition to a high-level teacher, we also generate several low- and middle-level teachers, and progressively transfer the knowledge (in the order of low- to high-level) to the student network throughout training iterations, for effective distillation. Evaluated on two challenging datasets, THUMOS14 and TVSeries, we validate that our PKD-DLT is an effective teacher-student learning paradigm, which can be a plug-in to improve the performance of existing OAD models and achieve state-of-the-art performance.
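
As a rough illustration of the low-to-high ordering described above, the sketch below picks which teacher guides the student at a given training iteration. The uniform split of iterations across teachers and the name `teacher_for_iteration` are assumptions made for this example; the abstract does not specify the actual schedule.

```python
def teacher_for_iteration(iteration, total_iterations, num_teachers):
    """Return the index of the teacher to distill from at this iteration.

    Index 0 is the lowest-level teacher and num_teachers - 1 is the
    high-level teacher. Iterations are split evenly across teachers,
    which is an assumption of this sketch, not a detail from the paper.
    """
    if not 0 <= iteration < total_iterations:
        raise ValueError("iteration must lie in [0, total_iterations)")
    if num_teachers < 1:
        raise ValueError("at least one teacher is required")
    return iteration * num_teachers // total_iterations


# Hypothetical run: 9 training iterations, 3 teachers (low, middle, high).
schedule = [teacher_for_iteration(i, 9, 3) for i in range(9)]
print(schedule)
```

With this schedule the student only meets the high-level teacher in the final third of training, once the knowledge gap has had a chance to narrow.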

Abstract:
In the field of 3D Human Pose Estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation, even in the presence of severe occlusions. Confronting the issue of occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views. Each intermediate view undergoes spatial refinement through a self-refinement schema. Subsequently, these intermediate views are fused to yield the final 3D human pose estimation. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets, our method's performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.

Abstract:
Enabled by hierarchical convolutions and nonlinear mappings, recent action recognition studies have continuously boosted performance with spatiotemporal modelling. In general, motion clues are essential in video-oriented tasks, while existing approaches aggregate the spatial and temporal signatures via specially designed modules in the middle or output stages. To highlight the privilege provided by temporal motions, in this paper, we propose a simple but effective MOTion Estimator (MOTE) to generate the motion patterns from every single frame, avoiding complex dense-frame input. In particular, MOTE follows an encoder-decoder structure, which takes the short-term motion features generated by the pretrained dense-frame network as the learning target. The spatial information of a single frame is utilized to estimate the instantaneous motion appearance. It can support the expression of vulnerable regions, such as the 'hand' in 'waving hands,' which would otherwise be suppressed in the feature maps as the 'hand' suffers from motion blur. The training process of MOTE is independent of the action recognition system. Therefore, the trained MOTE can be transplanted to the input-end of existing action recognition methods to provide instantaneous motion estimation as feature enhancement according to practical requirements. Our experiments performed on Something-Something V1, V2, Kinetics-400, and Diving48 verify the effectiveness of the proposed method.

Abstract:
Fine-grained bird image classification (FBIC) is not only meaningful for endangered bird observation and protection but also a prevalent task for image classification in multimedia processing and computer vision. However, FBIC suffers from several challenges, such as bird molting, complex background, and arbitrary bird posture. To effectively tackle these challenges, we present a novel invariant cues-aware feature concentration Transformer (TransIFC), which learns invariant and core information in bird images. To this end, two novel modules are proposed to leverage the characteristics of bird images, namely, the hierarchy stage feature aggregation (HSFA) module and the feature in feature abstraction (FFA) module. The HSFA module aggregates the multiscale information of bird images by concatenating multilayer features. The FFA module extracts the invariant cues of birds through feature selection based on discrimination scores. Transformer is employed as the backbone to reveal the long-dependent semantic relationships in bird images. Moreover, abundant visualizations are provided to prove the interpretability of the HSFA and FFA modules in TransIFC. Comprehensive experiments demonstrate that TransIFC can achieve state-of-the-art performance on the CUB-200-2011 dataset (91.0%) and the NABirds dataset (90.9%). Finally, extended experiments have been conducted on the Stanford Cars dataset to suggest the potential of generalizing our method on other fine-grained visual classification tasks.

Abstract:
The human skeleton establishment aims to provide accurate localization information of the human body from RGB images and establish a complete human skeleton for many applications, such as action recognition, video surveillance, and human-computer interaction. Considering the inherent human body structure, many recent methods group the relevant body parts and utilize the deep convolutional network to learn the visual context from the part groups. However, the grouping approaches used in these methods heavily rely on prior knowledge of the human body shape but lose important relationships between parts. In this paper, we introduce the Accurate Part Grouping Network (Accurate-PGNet), a novel network for hierarchically grouping body parts in a data-driven manner. In contrast to the previous methods, we use neural architecture search (NAS) to optimize the architecture of Accurate-PGNet and properly group the body parts. The part grouping respects the diverse visual patterns of parts, producing groups containing different body parts. From each group, we learn the visual feature map. It helps to capture the correlation between parts and predict their locations. The feature maps of the part groups are merged hierarchically to capture the higher-order context of parts in larger groups. We extensively evaluated our method on the challenging benchmarks, demonstrating that Accurate-PGNet effectively helps to achieve state-of-the-art results.

Abstract:
When transmission medium and compression degradation are intertwined, new challenges emerge. This study addresses the problem of raindrop removal from compressed images, where raindrops obscure large areas of the background and compression leads to the loss of high-frequency (HF) information. The restoration of the former requires global contextual information, while the latter necessitates guidance for high-frequency details, resulting in a conflict in utilizing these two types of information when designing existing methods. To address this issue, we propose a novel transformer architecture that leverages the advantages of attention mechanism and HF-friendly design to effectively restore the compressed raindrop images at the framework, component, and module levels. Specifically, at the framework level, we integrate relative position multi-head self-attention and convolutional layers into the proposed low-high-frequency transformer (LHFT), where the former captures global contextual information and the latter focuses on high-frequency information. Their combination effectively resolves the issue of mixed degradation. At the component level, we utilize high-frequency depth-wise convolution (HFDC) with zero-mean kernels to improve the capability to extract high-frequency features, drawing inspiration from typical high-frequency filters like Prewitt and Sobel operators. Finally, at the module level, we introduce a low-high-attention module (LHAM) to adaptively allocate the importance of low and high frequencies along channels for effective fusion. We establish the JPEG-compressed raindrop image dataset and conduct extensive experiments on different compression rates. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods without increasing computational costs.
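
To make the zero-mean kernel idea concrete, the following sketch shows why a kernel whose weights sum to zero behaves as a high-frequency filter: it gives no response on flat regions and a nonzero response at edges. It uses a plain single-channel correlation in pure Python rather than the paper's HFDC layer, and the `learned` weights are hypothetical values chosen only for illustration.

```python
def zero_mean(kernel):
    """Subtract the mean weight so that the kernel sums to zero."""
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    return [[w - mean for w in row] for row in kernel]


def correlate_valid(image, kernel):
    """Single-channel 2D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Classic high-frequency filters already have zero-sum weights.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# Hypothetical unconstrained weights, projected onto zero-mean kernels.
learned = [[0.2, 0.5, 0.1], [0.4, 0.9, 0.3], [0.0, 0.6, 0.6]]
hf_kernel = zero_mean(learned)

flat_patch = [[7.0] * 5 for _ in range(5)]                      # constant region
edge_patch = [[0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(5)]      # vertical edge

flat_response = correlate_valid(flat_patch, hf_kernel)
edge_response = correlate_valid(edge_patch, hf_kernel)
```

Because the low-frequency (constant) component is cancelled by construction, a layer built from such kernels can only pass along variations in the signal, which is the property shared with the Prewitt and Sobel operators mentioned above.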

Abstract:
Automated object detection in aerial images is crucial in both civil and military applications. Existing computer vision-based object detection methods are not robust enough to precisely detect dim objects in aerial images due to the cluttered backgrounds, various observing angles, small object scales, and severe occlusions. Recently, electroencephalography (EEG)-based object detection methods have received increasing attention owing to the advanced cognitive capabilities of human vision. However, how to combine the human intelligence with computer intelligence to achieve robust dim object detection is still an open question. In this paper, we propose a novel approach to efficiently fuse and exploit the properties of multi-modal data for dim object detection. Specifically, we first design a brain-computer interface (BCI) paradigm called eye-tracking-based slow serial visual presentation (ESSVP) to simultaneously collect the paired EEG and image data when subjects search for the dim objects in aerial images. Then, we develop an attention-based multi-modal fusion network to selectively aggregate the learned features of EEG and image modalities. Furthermore, we propose an adaptive multi-teacher knowledge distillation method to efficiently train the multi-modal dim object detector for better performance. To evaluate the effectiveness of our method, we conduct extensive experiments on the collected dataset in subject-dependent and subject-independent tasks. The experimental results demonstrate that the proposed dim object detection method exhibits superior effectiveness and robustness compared to the baselines and the state-of-the-art methods.

Abstract:
Underwater imagery often suffers from light attenuation and color distortion, resulting in images with low contrast and blurriness. Enhancing these images is crucial yet challenging due to the complex degradation and noise inherent in underwater environments. In this study, we introduce a novel diffusion model, termed Underwater Image Enhancement (UIE) Diffusion, which leverages a global feature prior for effective underwater image enhancement. To our knowledge, this is the inaugural application of a diffusion model to the task of underwater image enhancement, setting a new benchmark in performance. Our approach begins with the introduction of a global feature prior to augment the diffusion model, mitigating the impact of noise and distortion during training. We then incorporate an underwater image degradation model to facilitate the learning of mappings between high-quality and degraded underwater images. To address over-enhancement caused by high-frequency components, we employ scaling factors to modulate the influence of frequency features during diffusion. Additionally, we enhance the model's stability during inference by integrating a backward diffusion process into its training. Comprehensive evaluations on multiple public datasets demonstrate that UIE Diffusion surpasses existing state-of-the-art methods in both subjective outcomes and objective assessments.

Abstract:
Despite the significant progress in image denoising, it is still challenging to restore fine-scale details while removing noise, especially in extremely low-light environments. Leveraging near-infrared (NIR) images to assist visible RGB image denoising shows the potential to address this issue, becoming a promising technology. Nonetheless, existing works still struggle to take advantage of NIR information effectively for real-world image denoising, due to the content inconsistency between NIR-RGB images and the scarcity of real-world paired datasets. To alleviate the problem, we propose an efficient Selective Fusion Module (SFM), which can be plugged into advanced denoising networks to merge the deep NIR-RGB features. Specifically, we sequentially perform the global and local modulation for NIR and RGB features, and then integrate the two modulated features. Furthermore, we present a Real-world NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse scenarios as well as various noise levels. Extensive experiments on both synthetic and our real-world datasets demonstrate that the proposed method achieves better results than state-of-the-art ones.

Abstract:
Customized text-to-video generation aims to generate text-guided videos with user-given subjects, which has gained increasing attention. However, existing works are primarily limited to single-subject oriented text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, taking the power of the text-to-image model to generate diversified content. The video generator is further customized for multi-subjects, which leverages the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, to tackle the attribute binding problem of multi-subject generation. Additionally, we present a disentangled motion customization strategy to finetune the temporal modules so that we can generate videos with both customized subjects and motions. To evaluate the performance of customized multi-subject text-to-video generation, we introduce the MultiStudioBench benchmark. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects.

Abstract:
Referring Multi-Object Tracking (RMOT) aims to dynamically track an arbitrary number of referred targets in a video sequence according to the language expression. Previous methods mainly focus on cross-modal fusion at the feature level with designed structures. However, insufficient visual-linguistic alignment is prone to causing visual-linguistic mismatches, leading to some targets being tracked but not correctly referred, especially when facing language expressions with complex semantics or motion descriptions. To this end, we propose to conduct visual-linguistic alignment with semantic and kinematic guidance to effectively align the visual features with more diverse language expressions. In this paper, we put forward a novel end-to-end RMOT framework SKTrack, which follows the transformer-based architecture with a Language-Guided Decoder (LGD) and a Motion-Aware Aggregator (MAA). In particular, the LGD performs deep semantic interaction layer-by-layer in a single frame to enhance the alignment ability of the model, while the MAA conducts temporal feature fusion and alignment across multiple frames to enable the alignment between visual targets and language expressions with motion descriptions. Extensive experiments on Refer-KITTI and Refer-KITTI-v2 demonstrate that SKTrack achieves state-of-the-art performance and verify the effectiveness of our framework and its components.

Abstract:
Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and large number of parameters. Mainstream OVD models prioritize the coarse-grained category of objects rather than their fine-grained attributes, e.g., colors or materials, and thus fail to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses an LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.
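
The linear composition step can be sketched as a weighted sum of two text features, one computed over all tokens and one computed only over attribute tokens. In the toy example below, masked mean pooling over hand-written token embeddings stands in for the frozen text encoder, the attribute words are hard-coded instead of being extracted by an LLM, and the weights `w_global` and `w_attr` are hypothetical; none of these details come from the abstract.

```python
def attribute_mask(tokens, attribute_words):
    """Mark tokens that are attribute words with 1, all others with 0."""
    attrs = {w.lower() for w in attribute_words}
    return [1 if t.lower() in attrs else 0 for t in tokens]


def masked_mean(embeddings, mask):
    """Toy stand-in for a text encoder: average the unmasked token vectors."""
    kept = [e for e, m in zip(embeddings, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(e[d] for e in kept) / len(kept) for d in range(dim)]


def compose(global_feat, attr_feat, w_global, w_attr):
    """Linear composition of the global and attribute-specific features."""
    return [w_global * g + w_attr * a for g, a in zip(global_feat, attr_feat)]


tokens = ["a", "red", "wooden", "chair"]
attribute_words = ["red", "wooden"]  # assumed output of the LLM step
embeddings = [[0.0, 1.0], [2.0, 0.0], [4.0, 2.0], [2.0, 1.0]]  # toy vectors

mask = attribute_mask(tokens, attribute_words)
global_feat = masked_mean(embeddings, [1] * len(tokens))
attr_feat = masked_mean(embeddings, mask)
highlighted = compose(global_feat, attr_feat, w_global=1.0, w_attr=0.5)
print(mask, global_feat, attr_feat, highlighted)
```

Since the composition is just two scalars applied to two vectors, the weights can be tuned or reused without touching the detector, which is consistent with the frozen-model setting described above.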

Abstract:
Being able to estimate monocular depth for spherical panoramas is of fundamental importance in 3D scene perception. However, spherical distortion severely limits the effectiveness of vanilla convolutions. To push the envelope of accuracy, recent approaches attempt to utilize Tangent projection (TP) to estimate the depth of 360° images. Yet, these methods still suffer from discrepancies and inconsistencies among patch-wise tangent images, as well as the lack of accurate ground truth depth maps in the supervised setting. In this paper, we propose a geometry-aware self-supervised 360° image depth estimation methodology that explores the complementary advantages of TP and Equirectangular projection (ERP) by an asymmetric dual-domain collaborative learning strategy. Specifically, we first develop a lightweight asymmetric dual-domain depth estimation network, which aggregates depth-related features from a single TP domain and then produces depth distributions of the TP and ERP domains via collaborative learning. This effectively mitigates stitching artifacts and preserves fine details in depth inference without overspending model parameters. In addition, a frequent-spatial feature concentration module is devised to simultaneously capture non-local Fourier features and local spatial features, thereby facilitating the efficient exploration of monocular depth cues. Moreover, we introduce a geometric structural alignment module to further improve geometric structural consistency among tangent images. Extensive experiments illustrate that our designed approach outperforms existing self-supervised 360° depth estimation methods on three publicly available benchmark datasets.

Affiliations: Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Laboratory of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China; JD Explore Academy, Beijing, China; School of Computer Science and Technology, Anhui University, Hefei, China; FIST LAB, School of Information Science and Engineering, Yunnan University, Kunming, China

Abstract:
Recently, Mix-style data augmentation methods (e.g., Mixup and CutMix) have shown promising performance in various visual tasks. However, these methods are primarily designed for single-label images, ignoring the considerable discrepancies between single- and multi-label images, i.e., a multi-label image involves multiple co-occurred categories and fickle object scales. On the other hand, previous multi-label image classification (MLIC) methods tend to design elaborate models, bringing expensive computation. In this article, we introduce a simple but effective augmentation strategy for multi-label image classification, namely SpliceMix. The “splice” in our method is two-fold: 1) Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of images attending to mixing are blended without object deficiencies for alleviating co-occurred bias; 2) We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image with different scales to contribute to training together. Furthermore, such splice in our SpliceMixed mini-batch enables interactions between mixed images and original regular images. We also provide a simple and non-parametric extension based on consistency learning (SpliceMix-CL) to show the potential of extending our SpliceMix. Extensive experiments on various tasks demonstrate that only using SpliceMix with a baseline model (e.g., ResNet) achieves better performance than state-of-the-art methods. Moreover, the generalizability of our SpliceMix is further validated by the improvements in current MLIC methods when married with our SpliceMix.
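
The two splicing steps can be illustrated with a small pure-Python sketch that tiles downsampled images into a grid and then appends the mixed image to the original mini-batch. Strided subsampling as the downsampling operator, the union of multi-hot labels as the mixed label, and the helper names are all assumptions made for this example rather than details taken from the abstract.

```python
def downsample(image, fy, fx):
    """Strided subsampling (an assumed, simple downsampling operator)."""
    return [row[::fx] for row in image[::fy]]


def splice_grid(images, rows, cols):
    """Tile rows * cols downsampled images into one image of the original size."""
    if len(images) != rows * cols:
        raise ValueError("need exactly rows * cols images")
    small = [downsample(img, rows, cols) for img in images]
    mixed = []
    for r in range(rows):
        tiles = small[r * cols:(r + 1) * cols]
        for y in range(len(tiles[0])):
            line = []
            for tile in tiles:
                line.extend(tile[y])
            mixed.append(line)
    return mixed


def union_labels(labels):
    """Multi-hot union: a class is present if any spliced image contains it."""
    return [max(column) for column in zip(*labels)]


def splicemixed_batch(images, labels, rows=2, cols=2):
    """Append one mixed image (and its label) to the original mini-batch."""
    n = rows * cols
    mixed_image = splice_grid(images[:n], rows, cols)
    mixed_label = union_labels(labels[:n])
    return images + [mixed_image], labels + [mixed_label]


# Four 4x4 single-channel images filled with 1, 2, 3 and 4, and 4-class labels.
images = [[[k] * 4 for _ in range(4)] for k in (1, 2, 3, 4)]
labels = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

batch_images, batch_labels = splicemixed_batch(images, labels)
print(batch_images[-1])
print(batch_labels[-1])
```

Every source image survives whole, only at a smaller scale, so no object is cut away as it can be with region-replacement mixing, and the same content appears in the batch at two scales.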

Abstract:
Recently, an increasing number of researchers have been dedicated to transferring the impressive novel view synthesis capability of Neural Radiance Fields (NeRF) to resource-constrained mobile devices. One common solution is to pre-train NeRF and bake it into textured meshes which are well supported by mobile graphics hardware. However, the training process of existing methods often requires several hours even with multiple high-end NVIDIA V100 GPUs. The underlying reason is that these schemes mainly rely on photometric rendering loss, neglecting the geometric relationship between the pre-trained NeRF and the baked results. Based on this observation, we present ATM-NeRF (Accelerating Training for Mobile rendering based on NeRF), which is the first to apply effective geometric regularization constraints during both the pre-training and the baking training stages for faster convergence. Specifically, in the initial NeRF pre-training stage, we enforce consistency of the multi-resolution density grids representing the scene geometry to mitigate the shape-radiance ambiguity problem to some extent, achieving a smooth coarse mesh. In the second stage, we utilize the positions and geometric features of 3D points projected from the pre-trained posed depths to provide geometric supervision for joint refinement of geometry and appearance of the coarse mesh. As a result, our ATM-NeRF achieves comparable rendering quality to MobileNeRF with a training speed that is about 30× to 70× faster while maintaining finer structure details of the exported mesh.

Abstract:
Detecting glass regions is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, requiring the construction of annotated datasets and the design of network architectures. However, the evident drawback of these mainstream solutions lies in the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). Firstly, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labour cost via Stable Diffusion. This dataset is provided at four different scales, comprising 168k images in total with precise masks. Besides, based on the powerful segmentation ability of SAM, we devise a simple Glass surface sEgMentor named GEM, which follows the simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD exhibits remarkable performance in zero-shot and transfer learning settings.

Abstract:
With the prevalence of advanced display devices, many attempts have been successfully made in bit-depth enhancement (BDE) to restore low bit-depth (LBD) images to visually pleasant high bit-depth (HBD) images. However, most methods are still far from satisfactory when addressing real-world LBD images owing to their heavy dependence on LBD-HBD data pairs obtained through direct pixel quantization. Therefore, in this paper, we propose a novel network dubbed RealGAN to generate real-world LBD images by simulating the complex quantization procedure in the camera imaging process. In particular, we design a two-mode differentiable quantization block embedded in the synthesis network, facilitating adaptive simulation of the complicated quantization distortions. Furthermore, a simple residual group network is proposed in order to learn the distribution of degradation and non-linear processing in the Image Signal Processing (ISP) pipeline. In the absence of paired HBD and LBD data, the synthesis model is trained end-to-end within the generative adversarial framework using non-paired LBD and HBD images. Finally, we demonstrate that a series of BDE models can benefit from the proposed synthetic dataset and exhibit improved visual quality with sharper edges and finer textures on real-world scenes compared with the original versions trained on directly quantized LBD-HBD pairs.

Abstract:
The Facial Expression Recognition (FER) technique has increasingly matured over time. However, recognizing facial expressions in wild environments poses great challenges in achieving promising performance. The main obstacles arise from various factors, such as illumination changes, head pose variations, and occlusions. To overcome interferences from external environments and improve recognition accuracy, we propose a novel Quaternion Wavelet TRansformer (QWTR) model for FER in the wild. Specifically, we present a Quaternion Value Transformer (QVT) network that combines quaternion multi-head attention with quaternion CNN to capture emotional cues from global and local perception. To preserve the color structure while enhancing image contrast and brightness, we introduce a Quaternion Histogram Equalization (QHE) representation to transform color images into quaternion matrix representations. After that, to alleviate the impact of head pose and occlusion together with feature redundancy, a Quaternion Wavelet Feature Selection (QWFS) scheme is designed to decompose quaternion features and select the most correlated signals. Extensive experiments have been conducted on four in-the-wild FER datasets and several specific FER benchmarks under various conditions. The qualitative and quantitative results demonstrate that QWTR outperforms other state-of-the-art methods in FER benchmarks, e.g., 68.37% vs. 66.31% accuracy on the AffectNet dataset.

Abstract:
Visual grounding has attracted wide attention thanks to its broad application in various visual language tasks. Although visual grounding has made significant research progress, existing methods ignore the promotion effect of the association between text and image features at different hierarchies on cross-modal matching. This paper proposes a Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction Visual Grounding method. It first generates a mask through decoupled sentence phrases, and a text and image hierarchical matching mechanism is constructed, highlighting the role of association between different hierarchies in cross-modal matching. In addition, a corresponding target object position progressive correction strategy is defined based on the hierarchical matching mechanism to achieve accurate positioning for the target object described in the text. This method can continuously optimize and adjust the bounding box position of the target object as the certainty of the text description of the target object improves. This design explores the association between features at different hierarchies and highlights the role of features related to the target object and its position in target positioning. The proposed method is validated on different datasets through experiments, and its superiority is verified by the performance comparison with the state-of-the-art methods.

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Shandong Provincial Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan, China; School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, China; School of Computer Science and Engineering, Southeast University, China

Abstract:
Reinforcement learning (RL) aims to formulate the recommendation task as a Markov decision process (MDP) and trains an agent to automatically learn the optimal recommendation policy from interaction trajectories through trial-and-error and reward mechanisms. However, most existing RL-based approaches overlook the correlation between items and the dynamics of user interests implied in temporally close interactions. Therefore, in this paper, we propose a reinforcement learning method that incorporates a “recent-k items” distribution to capture users' local preferences. Specifically, we model the output layer as two distinct branches. The “recent-k items” branch, formulated with a Kullback-Leibler divergence loss, learns the recent interests of users, whereas the other branch utilizes a one-step temporal difference error to capture long-term preferences. The proposed structure is integrated into deep Q-learning and actor-critics, resulting in two enhanced methods named RkQ and RkAC, respectively. Furthermore, a novel soft inter-reward is carefully designed to enhance the proposed method, and we theoretically prove the convergence of the proposed algorithm. We perform extensive experiments on two large real-world datasets and conduct further analysis of the influences of different action sequences, time intervals, and enhancement capabilities for state-of-the-art models. The experimental results demonstrate the efficacy of our proposed methods.
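
The two-branch objective described above can be illustrated with a minimal sketch. The abstract does not say how the "recent-k items" distribution is built or how the two losses are combined, so the smoothed count distribution, the squared TD error, and the `kl_weight` parameter below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recent_k_distribution(recent_items, num_items, eps=1e-6):
    """Smoothed empirical distribution over the last k interacted items.

    Assumption: the target is a normalized item count with additive smoothing.
    """
    counts = [eps] * num_items
    for item in recent_items:
        counts[item] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def td_error(q_values, action, reward, next_q_values, gamma=0.9):
    """One-step temporal difference error for the taken action."""
    target = reward + gamma * max(next_q_values)
    return target - q_values[action]


def combined_loss(branch_logits, recent_items, q_values, action, reward,
                  next_q_values, kl_weight=1.0, gamma=0.9):
    """Squared TD error plus a weighted KL term (hypothetical combination)."""
    target = recent_k_distribution(recent_items, len(branch_logits))
    kl = kl_divergence(target, softmax(branch_logits))
    td = td_error(q_values, action, reward, next_q_values, gamma)
    return td ** 2 + kl_weight * kl
```

In this sketch the KL term pulls the "recent-k items" branch toward the user's short-term behavior, while the TD term drives the value branch toward long-term return.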

Abstract:
We address the video prediction task by putting forth a novel model that combines (i) a novel hierarchical residual learning vector quantized variational autoencoder (HR-VQVAE), and (ii) a novel autoregressive spatiotemporal predictive model (AST-PM). We refer to this approach as a sequential hierarchical residual learning vector quantized variational autoencoder (S-HR-VQVAE). By leveraging the intrinsic capabilities of HR-VQVAE at modeling still images with a parsimonious representation, combined with the AST-PM's ability to handle spatiotemporal information, S-HR-VQVAE can better deal with major challenges in video prediction. These include learning spatiotemporal information, handling high-dimensional data, combating blurry predictions, and implicitly modeling physical characteristics. Extensive experimental results on four challenging tasks, namely KTH Human Action, TrafficBJ, Human3.6M, and KITTI, demonstrate that our model compares favorably against state-of-the-art video prediction techniques in both quantitative and qualitative evaluations despite a much smaller model size. Finally, we boost S-HR-VQVAE by proposing a novel training method to jointly estimate the HR-VQVAE and AST-PM parameters.

Abstract:
Pose-guided person image generation aims to synthesize images of humans in various poses, and often encounters issues such as occlusions and texture transfer. Previous methods have utilized attention mechanisms, flow fields, normalization techniques, and diffusion models. Among them, flow fields and attention are the two most commonly used. Flow fields are good at preserving detailed textures, while attention is better at generating reasonable semantic structures. Previous networks often used only one of the two and failed to make full use of their advantages. The two also show complementary behavior in the frequency domain: flow fields are good at preserving the high-frequency information of an image, such as detailed texture, while attention is good at generating the low-frequency information, such as semantic structure. Few networks have exploited this to improve generation quality. Based on these facts, this paper introduces the AplusN network, which innovatively addresses the image generation problem by processing from low to high frequencies. For low-frequency information, a conditional large-kernel convolutional attention mechanism (CLA) is employed to capture the global information of the human body. High-frequency information is refined using a spatial-channel normalization module (SCN) to enhance the body's detailed textures. Additionally, we propose a wavelet loss function to align the frequency domain information of the generated images with the target images. Both qualitative and quantitative experiments demonstrate the superiority of our method over state-of-the-art (SOTA) methods, yielding better-defined overall body contours, local details, and higher-quality image generation.

Abstract:
Multi-modal Emotion Recognition (MER) has demonstrated competitive performance in affective computing, owing to its synthesis of information from diverse modalities. However, many existing approaches still face unresolved challenges, such as: (i) how to learn compact yet representative features from multi-modal data simultaneously and (ii) how to address differences among subjects and enhance the generalization of the emotion recognition model, given the diverse nature of individual biological signals. To this end, we propose a Dynamic Interactive Network with Self-Distillation (DISD-Net) for cross-subject MER. The DISD-Net incorporates a dynamic interactive module to capture the intra- and inter-modal interactions from multi-modal data. Additionally, to enhance compactness in modal representations, we leverage the soft labels generated by the DISD-Net model as supplemental training guidance. This involves incorporating self-distillation, which aims to transfer the knowledge contained in the hard and soft labels of the DISD-Net model to each modality. Finally, domain adaptation (DA) is seamlessly integrated into the dynamic interactive and self-distillation components, forming a unified framework to extract subject-invariant multi-modal emotional features. Experimental results indicate that the proposed model achieves a mean accuracy of 75.00% with a standard deviation of 7.68% for the DEAP dataset and a mean accuracy of 65.65% with a standard deviation of 5.08% for the SEED-IV dataset.

Abstract:
Chroma intra prediction aims to reduce chroma redundancies within a frame, which plays an important role in improving the coding efficiency of intra coding. Existing chroma intra prediction methods typically utilize the spatial relationship between the current luma block and its neighboring reference luma blocks to predict its chroma samples. However, the spatial properties of luma components differ from those of chroma components, which limits the accuracy of chroma intra prediction. To tackle this issue, an efficient Exemplar Colorization Network (ECNet)-based chroma intra prediction method is proposed in this paper, in which the colorization relationship between reference luma and chroma components is exploited to predict the chroma components for the current luma component. Inspired by the principle that semantic information in an image exhibits short-range continuity, a Spatial-consistency-based Colorization Transfer Network (SCTNet) is proposed, which builds and transfers colorization representations of neighboring reference blocks for chroma prediction. To improve the chroma prediction capability of SCTNet, a colorization learning module is developed to learn the robust mapping relationship from the luma component to the chroma component in a region-to-pixel manner, and a weight-adaptive reconstruction module is designed to adaptively utilize reference information from neighboring blocks to generate an initial prediction result. In addition, to further improve the accuracy of chroma intra prediction, a multi-reference-based chroma refinement network is proposed, which simultaneously uses the spatial information of neighboring reference chroma blocks and the current luma block to eliminate blocking and color-bleeding artifacts in the initial prediction result. Experimental results demonstrate that our proposed ECNet outperforms the state-of-the-art chroma intra prediction methods in terms of coding performance.

Abstract:
Emotion-Cause Pair Extraction (ECPE) in conversations aims to identify the emotional utterances (and even their categories) along with their corresponding causal utterances, which is crucial in understanding the cause-effect relationship in dialogues. However, prior studies of ECPE have predominantly focused on purely textual dialogues and neglected the natural scenario of dialogues with multimodal features, i.e., Multimodal Emotion-Cause Pair Extraction (MECPE) in conversations. To address this scenario, we propose a single-stage Generative approach for Multimodal Emotion-Cause pair extraction (GMEC), thus effectively reducing the error propagation and accumulation in MECPE. This approach can not only handle the information of diverse modalities uniformly, but also address all emotion and cause analysis tasks uniformly. Additionally, instead of relying on a fixed commonsense knowledge base as in previous work, we resort to Large Language Models (LLMs), which have a powerful ability to produce new knowledge and thereby act as implicit knowledge engines for MECPE. We refer to this approach as enhanced GMEC. Extensive experimental results and detailed analysis demonstrate a notable improvement from the generative approach. Moreover, the integration of external knowledge from LLMs improves the efficiency of data utilization, particularly in few-shot scenarios. The integration of the generative model with LLMs yields cumulative improvements of 4.94% and 10.90% on MECPE and MECPE-C (with emotion category), respectively.

Abstract:
Driven by rising demands in autonomous driving, robotics, etc., 3D object detection has recently achieved great advancement by fusing optical images and LiDAR point data. However, most existing optical-LiDAR fusion methods directly overlay RGB images and point clouds without adequately exploiting the synergy between them, leading to suboptimal fusion and 3D detection performance. Additionally, they often suffer from limited localization accuracy because global and local object information is not properly balanced. To address these issues, we design a synergistic network (SyNet) that fuses geometric information, semantic information, as well as global and local information of objects for robust and accurate 3D detection. The SyNet captures synergies between optical images and LiDAR point clouds from three perspectives. The first is geometric, which derives high-quality depth by projecting point clouds onto multi-view images, enriching optical RGB images with 3D spatial information for a more accurate interpretation of image semantics. The second is semantic, which voxelizes point clouds and establishes correspondences between the derived voxels and image pixels, enriching 3D point clouds with semantic information for more accurate 3D detection. The third is balancing local and global object information, which introduces deformable self-attention and cross-attention to process the two types of complementary information in parallel for more accurate object localization. Extensive experiments show that SyNet achieves 70.7% mAP and 73.5% NDS on the nuScenes test set, demonstrating its effectiveness and superiority as compared with the state-of-the-art.

Abstract:
Video question answering (Video-QA) has emerged as a core task in the vision-language domain, which requires models to understand a given video and answer textual questions related to it. Compared to conventional image-language tasks, Video-QA is designed to improve the models' capacity for memorizing and integrating multi-frame temporal cues associated with the questions. While significant performance improvements have recently been witnessed on public benchmarks, in this work we rethink whether these improvements truly stem from a better understanding of video temporal context, as expected. To this end, we build a strong single-frame baseline model trained with knowledge distillation. With this model, we surprisingly find that using only a single frame, without incorporating multi-frame and temporal information, is sufficient to achieve state-of-the-art (SOTA) performance on multiple mainstream benchmarks. This finding reveals the prevalence of single-frame bias in current benchmarks for the first time. Focusing on the single-frame bias, we conduct an in-depth analysis on multiple popular benchmarks, which demonstrates that: (i) merely relying on one frame achieves performance comparable to SOTA temporal Video-QA models; (ii) simply ensembling the prediction scores of only 3 separate frames surpasses temporal SOTAs. Furthermore, we observe that most of the benchmarks are biased towards central segments, and even the latest benchmarks tailored for temporal reasoning still suffer from severe single-frame bias. In a case study, we find two key properties of low-bias instances: the question emphasizes temporal dependency and contextual understanding, and the associated video content presents significant variability in scenes, actions or interactions. Through further analysis on compositional reasoning datasets, we find that constructing explicit object/event interactions upon videos to fill in well-designed temporal question templates can effectively reduce the single-frame bias during annotation. We hope our analysis helps facilitate future efforts in the field towards mitigating static bias and highlighting temporal reasoning.
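
Finding (ii) above, ensembling the prediction scores of a few separate frames, can be illustrated with a short sketch. The abstract does not state how the scores are combined, so plain averaging over frames is assumed here, and the function name and score layout (one list of per-answer scores per frame) are hypothetical.

```python
def ensemble_frame_scores(frame_scores):
    """Average per-answer scores over frames and pick the best answer.

    frame_scores: list of lists, one inner list of answer scores per frame.
    Returns (index of the highest averaged score, averaged scores).
    Assumption: scores are combined by an unweighted mean.
    """
    num_frames = len(frame_scores)
    num_answers = len(frame_scores[0])
    averaged = [
        sum(scores[a] for scores in frame_scores) / num_frames
        for a in range(num_answers)
    ]
    best = max(range(num_answers), key=lambda a: averaged[a])
    return best, averaged


# Three single-frame predictions over three candidate answers.
scores = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.7, 0.2],
]
best_answer, mean_scores = ensemble_frame_scores(scores)
```

Note that such an ensemble uses no temporal ordering at all, which is why strong results from it point to single-frame bias in a benchmark.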

Abstract:
Image dehazing is essential to boost the visual quality of images captured in hazy conditions. Recently, many learning-based methods have been proposed to achieve single image dehazing by training on large numbers of paired synthetic hazy/real clean images. Due to the domain gap between real and synthetic scenes, these models cannot generalize well to various real hazy scenes, leading to under-dehazed results. To overcome this problem, we propose a real scene image Dehazing Network with Multi-prior Guidance and Domain Transfer (DNMGDT). Our DNMGDT is based on a parameter-shared architecture trained on synthetic hazy images and real hazy images simultaneously. For real hazy images, multiple prior-based dehazed images are adopted as pseudo clean images. An Image Quality Guided Adaptive Weighting (IQGAW) scheme is proposed to form the supervision by automatically weighting different parts of these prior-based dehazed images and suppressing their negative information. Moreover, to reduce the domain gap between real and synthetic hazy scenes, a Physical Model Guided image level Domain Transfer (PMGDT) mechanism is proposed to regularize the learning process with a consistency constraint. Experiments on various datasets demonstrate the effectiveness of our proposed method, especially for real hazy scenes.

Abstract:
Learning-based Image Quality Assessment (IQA) models have obtained impressive performance with the help of reliable subjective quality labels, where Mean Opinion Score (MOS) is the most popular choice. However, in view of the subjective bias of individual annotators, the Labor-Abundant MOS (LA-MOS) typically requires large collections of opinion scores from multiple annotators for each image, which significantly increases the learning cost. In this paper, we aim to learn robust IQA models from Low-Cost MOS (LC-MOS), which only requires very few opinion scores or even a single opinion score for each image. More specifically, we consider the LC-MOS as a noisy observation of LA-MOS and enforce the IQA model learned from LC-MOS to approach the unbiased estimation of LA-MOS. Thus, we represent the subjective bias between LC-MOS and LA-MOS, and the model bias between IQA predictions learned from LC-MOS and LA-MOS (i.e., dual-bias) as two latent variables with unknown parameters. By means of expectation-maximization-based alternating optimization, we can jointly estimate the parameters of the dual-bias, which suppresses the misleading effect of LC-MOS via a gated dual-bias calibration (GDBC) module. To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels. Theoretical analysis and extensive experiments on four popular IQA datasets show that the proposed method is robust toward different bias rates and annotation numbers and significantly outperforms other learning-based IQA models when only LC-MOS is available. Furthermore, we also achieve comparable performance with respect to the other models learned with LA-MOS.

Abstract:
Traditional fine-grained classification focuses on visible light domains, such as animals and cars. However, these methods often perform poorly when applied to radar images and images of satellites because of challenges such as distinguishing between noise and objects and the significant scale differences among object components. To address these unique scenarios, we propose the scale-shift attention in polarization domain (SAPD) method for fine-grained classification in satellite ISAR images. Specifically, radar emits different types of waves, each with distinct imaging effects. We utilize multipolarization inputs and introduce a polarization domain query module to integrate complementary features from various radar wave types captured from the same viewpoint. This multipolarization learning helps distinguish noise and leverages complementary features from different inputs. Moreover, to handle the substantial scale differences between centimeter-level payloads and the overall meter-level structure of satellites, we propose a scale-shift attention mechanism based on shift kernels. This mechanism extends attention in the direction specified by the shift kernel by incorporating adjacent pixels, allowing for the diffusion of attention. This is beneficial for capturing features of satellite components with varying scales and shapes. Extensive experiments on a novel satellite ISAR image dataset validate the effectiveness and superiority of the SAPD.

Abstract:
Deep learning-based defect inspection methods have achieved promising performance, but they usually rely on a large number of well-labeled training samples. However, it requires much effort to obtain enough annotated samples, especially pixel-level annotations, in practical production. Generative adversarial networks (GANs) can be utilized to generate defect samples. However, training GANs typically requires a large amount of defect data and most of them cannot generate defect samples with pixel-level annotations. In this paper, we present a Mask-Guided Defect image generation method, called MGDefect, which can generate high-quality defect samples with pixel-level annotations and effectively improves the performance of downstream tasks. Specifically, MGDefect consists of a Mask-Guided Defect Generation GAN (MGDG-GAN) and a Defect Mask GAN (DM-GAN). MGDG-GAN generates images containing defects with specific locations, shapes, and sizes via mask guidance and dual discrimination for defects at the region level and image level. DM-GAN aims to generate diverse and rational masks for MGDG-GAN. It also adopts region-level and image-level dual discrimination for masks to generate masks compatible with the target objects. MGDG-GAN mainly focuses on generating local defect regions and DM-GAN specializes in generating masks; both are trained on limited defect samples and abundant normal samples. Experiments conducted on the MVTec AD, DAGM 2007, and KolektorSDD2 benchmark datasets demonstrate that our method achieves promising results compared with other state-of-the-art approaches. Meanwhile, the generated defect samples significantly improve the performance of defect inspection tasks including classification and segmentation. Specifically, our method achieves KID×10^3/IS scores of 48.35/2.27 on MVTec AD, 15.37/2.44 on DAGM 2007, and 19.70/2.01 on KolektorSDD2. Furthermore, our method improves mIoU by 10.59%, 2.20%, and 2.17% on these datasets, respectively, using U-Net as the segmentation model.

Abstract:
Deep learning-based AI models typically require a large amount of high-quality annotated data to achieve optimal performance. However, the label distribution shift caused by noisy annotations can lead to perturbations in the classification boundary, reducing the robustness and generalization capabilities of deep learning models. To mitigate this issue, we transform the problem of learning from noisy labels into a semi-supervised learning problem, and propose a novel Semi-Supervised Distribution Alignment (SSDA) framework that strategically integrates noise-robust distribution alignment within a unified semi-supervised learning paradigm for combating noisy labels. By leveraging the similarity distribution between historical predictions, the proposed SSDA approach benefits from a flexible multi-historical regression modeling strategy, which aims to identify high-confidence samples/pairs and recalibrate the label shift through pseudo-labels. Furthermore, our approach employs a comprehensive multi-granularity distribution adaptation strategy, incorporating both instance-wise and class-aware distribution alignment to quantitatively minimize semantic discrepancies across different mixed feature domains. In this way, our SSDA approach ultimately achieves more resilient and generalizable performance against label noise, even in the presence of substantial noise. Extensive experiments conducted on multiple simulated and real-world noisy benchmark datasets consistently demonstrate the superiority and effectiveness of our SSDA method compared to existing state-of-the-art baselines.

Abstract:
We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. In particular, for the texture generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from a bidirectional LSTM. We demonstrate Programmable-Room’s flexibility in generating and editing 3D room meshes, and show our framework’s superiority to an existing model quantitatively and qualitatively.

Abstract:
Light field salient object detection (LF SOD) methods have made significant progress recently. Most of them explore abundant multi-modal information from the all-focus image and the focal stacks at all focal planes to enrich scene details and depth perception. However, in light-field images, the spatial and depth information varies only slightly across different slices, introducing redundancy within focal stacks. Besides, noise can appear repeatedly in multiple images of the focal stacks, which brings interference. To address these issues, in this work, we propose VMKNet, an effective approach that leverages innovative variance-maximized key slice selection and interacts with the all-focus image, to improve LF SOD. Specifically, we measure consistency differences between the all-focus image and each focal slice in the salient region as saliency scores. Then, we randomly assemble sets of them, where each score corresponds to a certain slice. The set exhibiting the highest variance is singled out to determine the key focal slices, as they reveal the diversity of salient objects. Then, the bidirectional guidance module (BGM) is presented to learn attentive features of the all-focus image and the selected key slices in a mutual guidance manner, thus producing enhanced and holistic features. With hierarchical BGMs, our model can progressively aggregate common salient semantics and meaningful contextual details, generating more discriminative representations. Moreover, we introduce the edge enhancement module in conjunction with BGM to improve the sharpness of saliency maps. Extensive experiments on common light field datasets demonstrate that our method, termed VMKNet, outperforms recent state-of-the-art LF, RGB-D, and RGB methods.
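
The selection step described above (randomly assemble sets of saliency scores, keep the set with the highest variance) can be sketched as follows. This is only an illustration under stated assumptions: the set size, the number of random trials, the use of population variance, and the function name are all hypothetical, and the saliency scores are taken as given rather than computed from images.

```python
import random
from statistics import pvariance


def select_key_slices(saliency_scores, set_size, num_trials, seed=0):
    """Pick the random subset of slices whose scores have maximal variance.

    saliency_scores: one score per focal slice (assumed precomputed).
    set_size: number of slices per candidate set (hypothetical parameter).
    num_trials: how many random sets to assemble (hypothetical parameter).
    Returns (sorted slice indices of the best set, its variance).
    """
    rng = random.Random(seed)
    indices = list(range(len(saliency_scores)))
    best_set, best_var = None, -1.0
    for _ in range(num_trials):
        candidate = sorted(rng.sample(indices, set_size))
        var = pvariance([saliency_scores[i] for i in candidate])
        if var > best_var:
            best_set, best_var = candidate, var
    return best_set, best_var


# Four focal slices; slices 0 and 1 differ the most in saliency score.
key_slices, variance = select_key_slices([0.0, 1.0, 0.5, 0.5], set_size=2, num_trials=200)
```

A high-variance set contains slices whose saliency scores disagree, so it tends to discard near-duplicate slices, which matches the redundancy argument in the abstract.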

Abstract:
Traditional cross-domain tasks, including unsupervised domain adaptation (UDA), domain generalization (DG) and test-time adaptation (TTA), rely heavily on training the model with source domain data, whether for specific or arbitrary target domains. With the recent advance of vision-language models (VLMs), recognized as natural source models that can be transferred to various downstream tasks without any parameter training, we propose a novel cross-domain task directly combining the strengths of both UDA and DG, named Training-Free Adaptive Domain Generalization (TF-ADG). However, current cross-domain datasets have many limitations, such as unrealistic domains, unclear domain definitions, and the inability to support fine-grained domain decomposition, which hinder the real-world application of current cross-domain models due to the lack of accurate and fair evaluation on fine-grained realistic domains. These insights motivate us to establish a novel realistic benchmark for TF-ADG. Benefiting from the introduced hierarchical definition of domain shifts, our proposed dataset DomainVerse addresses these issues by providing about 0.5 million images from 390 realistic, hierarchical, and balanced domains, allowing for decomposition across multiple domains within each image. With the help of the constructed DomainVerse and VLMs, we further propose two algorithms called Domain CLIP and Domain++ CLIP for training-free adaptive domain generalization. Extensive and comprehensive experiments demonstrate the significance of the dataset and the effectiveness of the proposed methods.

Abstract:
Normalizing flows, a category of probabilistic models famed for their capabilities in modeling complex data distributions, have exhibited remarkable efficacy in unsupervised anomaly detection. This paper explores the potential of normalizing flows in multi-class anomaly detection, wherein the normal data is compounded with multiple classes without providing class labels. Through the integration of vector quantization (VQ), we empower the flow models to distinguish different concepts of multi-class normal data in an unsupervised manner, resulting in a novel flow-based unified method, named VQ-Flow. Specifically, our VQ-Flow leverages hierarchical vector quantization to estimate two relative codebooks: a Conceptual Prototype Codebook (CPC) for concept distinction and its concomitant Concept-Specific Pattern Codebook (CSPC) to capture concept-specific normal patterns. The flow models in VQ-Flow are conditioned on the concept-specific patterns captured in CSPC, capable of modeling specific normal patterns associated with different concepts. Moreover, CPC further enables our VQ-Flow for concept-aware distribution modeling, faithfully mimicking the intricate multi-class normal distribution through a mixed Gaussian distribution reparametrized on the conceptual prototypes. Through the introduction of vector quantization, the proposed VQ-Flow advances the state-of-the-art in multi-class anomaly detection within a unified training scheme, yielding the Det./Loc. AUROC of 99.5%/98.3% on MVTec AD.

Affiliations: School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China; Computer Science and the School of Cyber Science and Engineering, Wuhan University, Wuhan, China; Visual Analytics for Knowledge Laboratory (VISKNOW Lab), Department of Applied Artificial Intelligence, School of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul, Republic of Korea; TECNALIA, Basque Research and Technology Alliance (BRTA), Derio, Spain; MOE Key Laboratory of AI, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science and Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Unsupervised person re-identification (ReID) has recently gained significant attention from researchers. ReID matches images of the same person from different camera views in various scenes without any labels. Existing clustering methods primarily rely on a fixed threshold (the maximum distance between sample points and clustering centroids) and overlook the importance of adjusting this threshold during continuous model optimization. This mismatch between clustering thresholds and inter- or intra-class spacing reduces clustering accuracy. To address this issue, this study proposes an Adaptive Clustering and Weighted Regularization Contrastive Learning (ACWRCL) framework for unsupervised person ReID. The ACWRCL framework comprises two main components: (1) the Clustering Threshold Adaptive Adjustment (CTAA) module, and (2) the Weighted Regularization Contrastive Learning (WRCL) module. The CTAA module dynamically adjusts the clustering threshold to align with model optimization, ensuring that the threshold remains within an appropriate range to prevent under- or over-robustness in the clustering model. The WRCL module uses the similarity ratio between the query sample and the clustering centroid relative to the overall similarity of all samples with the same labels as the query sample. This ratio is used as the weight in the loss function to penalize incorrect clustering and improve pseudo-label generation accuracy. Extensive experiments on public ReID datasets—Market-1501, MSMT17, Veri776, CUHK03, and PersonX—demonstrate the effectiveness of the proposed method.

Affiliations: School of Electronics and Communication Engineering, Shenzhen Campus of Sun Yat-Sen University, Sun Yat-Sen University, Shenzhen, China; Institute for Infocomm Research (IR) & Centre for Frontier AI Research (CFAR), A*STAR, Singapore; College of Electronics and Information Engineering, Sichuan University, Chengdu, China; College of Computing and Data Science, Nanyang Technological University, Singapore; School of Electronics and Communication Engineering, the Shenzhen Campus of Sun Yat-Sen University, Sun Yat-Sen University, Shenzhen, China

Abstract:
City-scale point cloud semantic segmentation is an important yet challenging task. Despite progress, existing methods rely heavily on point-wise annotations. An alternative solution is to apply the Unsupervised Domain Adaptation (UDA) approach. Recently, 2D foundation models have achieved significant progress by training on internet-scale images. Therefore, adapting 2D foundation models to 3D city-scale point clouds is an appealing idea. Due to data protection and storage issues, 2D source domain data is typically unavailable. Thus, we focus on Source-Free Domain Adaptation (SFDA) and propose a Source-Free City-scale point cloud semantic segmentation method, namely SF-City. Our method leverages knowledge from 2D pre-trained models to generate point-wise pseudo labels for training a 3D semantic segmentation network. We convert point clouds into remote-sensing-like images using Bird’s-Eye-View (BEV) projection. However, directly using source models for pseudo label generation is hindered by domain gaps such as viewpoint variations, concept divergences, and geometry loss. To tackle these problems, we propose a Multi-scale Content Feature Extractor (MCFE) to extract holistic and contextual feature representations. Then, an Uncertainty-guided Inter-Model Feature Integrator (UIFI) is introduced to integrate inherent knowledge across source models. Furthermore, the Geometric-guided Pseudo Label Generator (GPLG) is leveraged to introduce geometric information to regulate pseudo labels. Through extensive experiments on two public benchmarks, SF-City demonstrates superior performance, achieving an mIoU of 28.8% on the SensatUrban dataset and outperforming the recent state-of-the-art method CLIP-FO3D by about 6.3%.
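
The BEV projection step and the transfer of image-level labels back to points can be illustrated with a small sketch. The abstract does not describe the projection details, so the choices below (a regular grid with a hypothetical `cell_size`, keeping the maximum height per cell, and copying each pixel's label to every point that falls in it) are assumptions for illustration only.

```python
def bev_project(points, cell_size):
    """Project (x, y, z) points onto a top-down grid.

    Returns (height_image, pixel_of_point), where height_image[row][col] is the
    maximum z in that cell (None if empty) and pixel_of_point[i] is the
    (row, col) cell containing point i.
    Assumption: one channel holding the maximum height per cell.
    """
    x_min = min(p[0] for p in points)
    y_min = min(p[1] for p in points)
    x_max = max(p[0] for p in points)
    y_max = max(p[1] for p in points)
    width = int((x_max - x_min) / cell_size) + 1
    height = int((y_max - y_min) / cell_size) + 1
    image = [[None] * width for _ in range(height)]
    pixel_of_point = []
    for x, y, z in points:
        col = int((x - x_min) / cell_size)
        row = int((y - y_min) / cell_size)
        pixel_of_point.append((row, col))
        if image[row][col] is None or z > image[row][col]:
            image[row][col] = z
    return image, pixel_of_point


def labels_to_points(label_image, pixel_of_point):
    """Copy each pixel's label to every point that was projected into it."""
    return [label_image[row][col] for row, col in pixel_of_point]


points = [(0.0, 0.0, 1.0), (0.4, 0.2, 3.0), (1.5, 0.0, 2.0)]
height_image, pixel_of_point = bev_project(points, cell_size=1.0)
point_labels = labels_to_points([["ground", "building"]], pixel_of_point)
```

Points stacked vertically share one pixel and therefore one label in this sketch, which is one way to see the geometry loss the abstract mentions.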

Abstract:
Automatic segmentation of 3D dental models into individual teeth is an important step in orthodontic computer-aided design (CAD) systems. However, most existing methods rely on single-view dental models and ignore the intrinsic relationships between upper and lower dental models, hindering the handling of complex tooth structures. In this paper, a collaborative learning framework with coupling graph Transformers (CGT-CLF) is proposed for automatic tooth segmentation on 3D dental models. The framework collaboratively learns geometric features of both upper and lower dental models, capturing their interactivity and complementarity by facilitating interaction between graph-Transformer encoders to improve segmentation of complex and diverse teeth. Specifically, CGT-CLF consists of three key components as follows: First, a graph embedding-based boundary perception module (GEBPM) is developed to aggregate fine-grained geometric features within the neighborhood graph domain, enhancing the network’s ability to perceive and distinguish intricate tooth boundaries. Then, coupling geometric Transformers are designed to capture the intrinsic relationships of pair-wise dental models by promoting the exchange of relevant information to gain a comprehensive understanding of the overall tooth structure, allowing for better identification of adjacent teeth with similar appearances. Finally, a collaborative cross-scale feature fusion (CCFF) strategy is utilized to obtain interactive and complementary information by modeling the inter-relationships between dual-stream features. Experimental results on a clinical dental model dataset demonstrate that the proposed CGT-CLF framework outperforms state-of-the-art methods, delivering superior segmentation performance.

Abstract:
Ultrasound video-based breast lesion segmentation provides valuable assistance in early breast lesion detection and discrimination. However, this field faces two key challenges: the first is how to simultaneously utilize both intra-frame and inter-frame lesion cues to accurately segment breast lesions, and the second is that the availability of breast ultrasound video datasets is quite limited. In this paper, we propose a novel Spatial-Temporal Progressive Fusion Network (STPFNet) for the video-based breast lesion segmentation problem. The proposed STPFNet comprises three main components. First, we adopt a unified network architecture to jointly capture spatial dependencies within each ultrasound frame and temporal correlations between different frames for feature representation of ultrasound video. Second, we propose a new fusion module called Multi-Granularity Feature Fusion (MGFF) to fuse the extracted information with different granularities for lesion segmentation. MGFF helps alleviate lesion boundary blurring. Third, we take the segmentation result of the previous frame as prior knowledge to suppress the noisy background and learn a more robust representation. To further promote research in this field, we construct a new ultrasound video breast lesion segmentation dataset, called UVBLS200, comprising 200 videos (80 benign and 120 malignant lesions). Experiments on the proposed dataset demonstrate that the proposed STPFNet achieves better breast lesion detection performance than state-of-the-art methods.

Abstract:
Most existing image quality assessment methods need to be retrained when dealing with a new type of task. This approach wastes computing resources and time. Therefore, these methods are ill-suited to application scenarios that require processing multiple types of image quality assessment tasks. In the human visual system, human eyes tend to pay varying degrees of attention to different regions. Inspired by this system, this paper proposes a multi-type image quality assessment method based on multi-region deep feature fusion under meta-learning (MMQA). First, we utilize the differences in the structural information to screen out salient and non-salient regions. Second, a deep multi-stream network is designed to comprehensively consider and fuse different quality-related features in salient regions, non-salient regions, and the entire image. Third, meta-learning is applied to quickly learn and update the parameters of the model when facing new types of images. By summarizing the prior knowledge in the training of one type of task, the model can be quickly fine-tuned for other types of images. The experimental results demonstrate that the proposed method has advantages over the existing methods in generalization and robustness. Furthermore, the proposed method can adapt well to different distortion types and different image types quickly and accurately.

Abstract:
Effective and robust 3D panoptic segmentation is crucial for scene perception in autonomous driving. Modern methods widely adopt multi-modal fusion based on simple feature concatenation to enhance 3D scene understanding, so the generated multi-modal representations typically lack comprehensive semantic and geometric information. Because these methods produce panoptic predictions in a single step, they also cannot progressively refine predictions under varying noise levels, which is essential for enhancing model robustness. To address these limitations, we first utilize BEV space to unify semantic-geometry perceptual representation, allowing for a more effective integration of LiDAR and camera data. Then, we propose PrimePSegter, a progressively combined diffusion 3D panoptic segmentation model that is conditioned on BEV maps to iteratively refine predictions by denoising samples generated from a Gaussian distribution. PrimePSegter adopts a conditional encoder-decoder architecture for fine-grained panoptic predictions. Specifically, a multi-modal conditional encoder is equipped with a BEV fusion network to integrate semantic and geometric information from LiDAR and camera streams into a unified BEV space. Additionally, a diffusion transformer decoder operates on multi-modal BEV features with varying noise levels to guide the training of the diffusion model, progressively refining the BEV panoptic representations enriched with semantics and geometry. PrimePSegter achieves state-of-the-art performance on nuScenes and competitive results on SemanticKITTI. Moreover, PrimePSegter demonstrates superior robustness across various scenarios, outperforming leading methods.

Abstract:
Few-shot learning aims to generalize the recognizer from seen categories to an entirely novel scenario. With only a few support samples, several advanced methods initially introduce class names as prior knowledge for identifying novel classes. However, obstacles still impede achieving a comprehensive understanding of how to harness the mutual advantages of visual and textual knowledge. In this paper, we set out to fill this gap via a coherent Bidirectional Knowledge Permeation strategy called BiKop, which is grounded in human intuition: a class name description offers a more general representation, whereas an image captures the specificity of individuals. BiKop primarily establishes a hierarchical joint general-specific representation through bidirectional knowledge permeation. On the other hand, considering the bias of joint representation towards the base set, we disentangle base-class-relevant semantics during training, thereby alleviating the suppression of potential novel-class-relevant information. Experiments on four challenging benchmarks demonstrate the remarkable superiority of BiKop, particularly outperforming previous methods by a substantial margin in the 1-shot setting (improving the accuracy by 7.58% on miniImageNet).

Abstract:
This work presents a weakly supervised referring image segmentation method, named CollabLearn, that segments objects described by free-form referring expression utilizing solely image-text pairs. Existing methods suffer from incorrect localization of referring expressions due to the lack of high-level semantics in cross-modal alignment or rough segmentation of referenced objects stemming from the absence of low-level details. To address these issues, we propose an innovative framework for generating cross-modal features encompassing both high-level semantics and low-level details via two fusion modules: a semantic awareness module and a detail cognition module. Each of these modules generates an activation map, and they mutually correct each other through a collaborative learning strategy. Specifically, the semantic awareness module performs in-depth cross-modal interaction and achieves accurate localization in a top-down manner. The detail cognition module facilitates the segmentation of entire objects in a bottom-up manner. A collaborative learning strategy is designed to enable interaction between these two modules, enforcing sufficient vision-language alignment. Experiments on three benchmarks demonstrate that CollabLearn consistently outperforms state-of-the-art weakly supervised methods.

Abstract:
Recent years have seen the optimization of quality of experience (QoE) through learning adaptive bitrate (ABR) algorithms from internet video streams. However, the complex nature of the real-world Internet, characterized by heavy-tailed behavior, diversity, and unpredictability, hinders the effective learning of off-the-shelf reinforcement learning (RL)-based ABR algorithms. As a result, existing methods inevitably fail to achieve optimal performance under various network conditions and user QoE objectives. We propose Fortuna, a novel offline meta RL ABR algorithm that can effectively learn from these heavy-tailed internet data features and become more practical. Fortuna is primarily divided into two phases. In the offline phase, Fortuna learns from diverse offline data to reduce the cost of online RL interaction, while in the online phase, we gradually increase the complexity of video streaming sessions through curriculum learning to quickly adapt to specific network conditions. Fortuna then utilizes meta-learning to optimize ABR policies and enhance generalization. Additionally, to better learn network features, Fortuna further optimizes QoE by learning low-level TCP congestion control information. Experimental results from trace-driven and real-world scenarios demonstrate that Fortuna enhances learning efficiency by more than 7.5%–4×, reduces stall time by 4.6%–14.2%, and generalizes to different network conditions and video streams.

Affiliations: School of Optical Engineering, Xi’an Research Institute of Hi-Tech, Xi’an, China; School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan, China; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, China; Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; Computer Science and Engineering Department, State University of New York, Buffalo, NY, USA

Abstract:
Self-supervised hyperspectral image (HSI) clustering remains a fundamental yet challenging task due to the absence of labeled data and the inherent complexity of spatial-spectral interactions. While recent advancements have explored innovative approaches, existing methods face critical limitations in clustering accuracy, feature discriminability, computational efficiency, and robustness to noise, hindering their practical deployment. In this paper, a self-supervised efficient low-pass contrastive graph clustering (SLCGC) is introduced for HSIs. Our approach begins with homogeneous region generation, which aggregates pixels into spectrally consistent regions to preserve local spatial-spectral coherence while drastically reducing graph complexity. We then construct a structural graph using an adjacency matrix A and introduce a low-pass graph denoising mechanism to suppress high-frequency noise in the graph topology, ensuring stable feature propagation. A dual-branch graph contrastive learning module is developed, where Gaussian noise perturbations generate augmented views through two multilayer perceptrons (MLPs), and a cross-view contrastive loss enforces structural consistency between views to learn noise-invariant representations. Finally, latent embeddings optimized by this process are clustered via K-means. Extensive experiments and repeated comparative analysis have verified that our SLCGC achieves high clustering accuracy, low computational complexity, and strong robustness.
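
As a rough illustration of what low-pass graph denoising can look like, the sketch below applies a commonly used low-pass filter, repeated multiplication by (I + Â)/2, where Â is the symmetrically normalized adjacency matrix with self-loops. The abstract does not specify the filter used in SLCGC, so the choice of filter, the function names, and the `steps` parameter are assumptions for illustration only.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def low_pass_filter(adj, features, steps=2):
    """Apply x <- (x + A_hat x) / 2 `steps` times (assumed filter form).

    Components that vary quickly across edges are attenuated, while
    smooth components are kept.
    """
    n = len(adj)
    a_hat = normalized_adjacency(adj)
    x = [row[:] for row in features]
    dim = len(x[0])
    for _ in range(steps):
        ax = [[sum(a_hat[i][k] * x[k][d] for k in range(n))
               for d in range(dim)] for i in range(n)]
        x = [[0.5 * (x[i][d] + ax[i][d]) for d in range(dim)]
             for i in range(n)]
    return x
```

In a pipeline like the one described, the filtered region features would then feed the contrastive encoders, and the learned embeddings would be clustered with K-means.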

Abstract:
Existing online video super-resolution methods utilize implicit memories of previous frames to provide reference information, which have a single memory stream path and are highly dependent on the continuous memory stream. However, video capture in real-world scenes is typically affected by abnormal exposures resulting in sudden changes of lightness thus interrupting the memory stream, while long-term memories suffer from memory vanishing problems during transmission. To address this problem, we propose a novel multi-memory streams based online video super-resolution paradigm that adaptively corrects for abnormal exposures and creates multi-memory streams to accurately converge long-term memories. Specifically, we first propose an exposure detection-correction module, which utilizes optical flow overfitting property and temporal lightness information to detect and correct abnormal exposures to avoid interruption of memory streams. In addition, we propose a dynamic-static decoupled alignment strategy, which can adaptively select the alignment method based on pixel displacement, thus accurately aggregating past long-term memories to create multiple memory streams. Further, we propose an adaptive memory fusion module to mine complementary information between multiple memory streams to solve the memory vanishing problem. Extensive experimental results show that our method outperforms existing video super-resolution methods on complex exposure datasets. We also conduct detailed ablation experiments to analyze and validate our contributions.

Abstract:
Effectively representing and transferring user preferences across various domains presents a significant challenge in cross-domain recommendation (CDR). Some approaches utilize graph neural networks that use interaction behavior to establish relationships between entities, providing a comprehensive understanding of user interests. However, the impact of consistent semantics across various types, fields, and perspectives of social media information on user preferences is overlooked, i.e. the multidimensional consistency of user preferences. This oversight results in graph node representations that inadequately reflect user preferences. To address these limitations, we propose a multi-layer transfer learning network (MTLG) for CDR based on graph node representation enhancement via multi-dimensional consistent user preferences. Firstly, the model introduces a set of globally shared semantic units to perform different-grained semantic alignment of multiple media information without clear alignment boundaries, thereby modeling multi-dimensional consistent user preference features. These features are then seamlessly integrated with the initial high-order graph structure embedding features, thus significantly improving the quality of graph node representation. Secondly, the model innovatively designs a multi-layer transfer learning network that hierarchically aligns the domain distribution differences. It calculates the similarity between domains to derive layer weights for more precise transfer learning, thereby mitigating the possibility of information error accumulation resulting from inaccurate feature aggregation processes. We conducted numerous experiments on 3 scenarios, including 7,954,943 rating information from the Amazon dataset. The results indicate that MTLG’s recommendation accuracy surpasses those of state-of-the-art methods.

Abstract:
Deepfake has recently raised severe public concerns about security issues, such as creating fake news of celebrities. As countermeasures, identity-aware detection methods expose forged videos by measuring identity consistency between the suspicious input and its reference samples. However, existing methods suffer from notable performance degradation due to undesired factors, such as various head poses. In this work, we conduct a statistical analysis to illustrate the influence of different facial regions for forensic purposes, which suggests that more reliable identity information is located in critical face regions. Motivated by this, we propose a graph learning-based identity-aware deepfake detection framework that uses a critical contour prior as guidance. First, feature sampling based on contour landmarks is applied to construct the graph data for our critical contour prior-guided graph attention network (CP-GAT), where a node position prediction task is constructed as auxiliary supervision to explore rich relationships between nodes. To enhance pose invariance, a rotation compensation block is integrated into CP-GAT and trained using pose-calibrated contrastive learning to extract identity features, which takes high-quality front faces as the calibration goal with progressively updated selection. Besides, an adversarial node masking-based training strategy is proposed as feature augmentation to further enhance the reliability. During the inference stage, the similarity between CP-GAT's identity features of the input sample and its reference samples is used to obtain detection results. Extensive experiments are conducted on various face forgery datasets, and state-of-the-art methods are compared to verify the superiority of the proposed method in terms of detection capability and robustness.

Abstract:
Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems’ accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Abstract:
Camouflaged object detection aims to identify objects that blend seamlessly with their background, posing a greater challenge compared to general object detection tasks. Due to its ability to recognize camouflaged objects, such detection models hold significant practical value across various fields. To accurately identify camouflaged targets in various complex environments, we designed a dual-guided camouflaged object detection network based on boundary and texture information (BTDGNet). The process consists of two main stages. The first stage is the localization stage, which leverages a convolutional neural network (CNN) to capture boundary and texture information of objects. These features are then fused to achieve coarse localization of the camouflaged objects. In the second stage, the recognition stage, we employ a Transformer to extract global information from the image, enhancing the differentiation between foreground and background. An interactive fusion module is designed to fully exploit and integrate both global and local features, producing precise prediction images. By leveraging boundary and texture information, the model’s adaptability to different camouflaged objects is improved. The integration of local and global features enhances the model’s detection accuracy from various perspectives, ultimately building a camouflaged object detection model suitable for a wide range of complex scenarios. The proposed method was extensively compared with other state-of-the-art methods across four public datasets, and the results demonstrated superior performance. Furthermore, benefiting from our dual-guidance strategy that leverages both texture and boundary information, our model demonstrates robust performance. We conducted tests on detection tasks across four different domains, and the results confirm that our model can accurately segment camouflaged objects in complex scenes.

Abstract:
In this paper, we propose DIRE (Domain-Invariant Representation Learning for Expression), a novel approach to enhance the generalizability of facial expression recognition (FER) models in unseen domains. Traditional FER models often struggle with distribution shifts between training and test datasets, leading to significant performance drops. Based on the concept of Single-Source Domain Generalization, we introduce a novel domain augmentation technique that applies pixel-level and feature-level perturbations to domain-variant regions while preserving semantic consistency. Additionally, we incorporate semantic alignment regularization and domain information minimization loss so that domain-invariant features effectively represent facial expressions. Extensive experiments on multiple FER datasets demonstrate that our method significantly improves generalization across diverse target domains, even when trained on a single source domain. The proposed DIRE approach offers a robust solution to real-world FER tasks, where unseen domain generalizability is crucial.

Abstract:
The Metaverse is growing rapidly, resulting in thousands of rich virtual universes. This makes the search process difficult for the user, so advanced search tools become a necessity. Existing methods leverage contrastive learning to obtain a function mapping a 3D scene and its textual descriptions into similar representations. However, Metaverse scenarios are complex, multimedia-rich 3D scenes containing many elements, making cross-modal alignment difficult. For instance, a museum dedicated to Van Gogh is unrelated to Warhol, yet it shares similarities with Matisse or Monet. To make the mapping functions aware of these nuances, we propose a novel learning strategy to integrate Adaptive Optimization Constraints, computing data-dependent distances using a language-based method we design and enforcing them between the representations at training time. This novelty sets our approach apart from standard procedures enforcing the same distance. We validate its effectiveness on two datasets, one including 6000 apartments, and a novel dataset of 3000 museums that we collect. We observe consistent improvements compared to existing methods. Moreover, we obtain better generalization when dealing with very complex scenarios, e.g., on the museums dataset our method obtains an average R@1 of 5.2% compared to 1.2% obtained by existing methods.

Abstract:
In HTTP Adaptive Streaming (HAS), a video is encoded at various bitrate-resolution pairs, collectively known as the bitrate ladder, allowing users to select the most suitable representation based on their network conditions. Optimizing this set of pairs to enhance the Quality of Experience (QoE) requires accurately measuring the quality of these representations. VMAF and ITU-T’s P.1204.3 are highly reliable metrics for assessing the quality of representations in HAS. However, in practice, using these metrics for optimization is often impractical for live streaming applications due to their high computational costs and the large number of bitrate-resolution pairs in the bitrate ladder that need to be evaluated. To address their high complexity, our paper introduces a new method called VQM4HAS, which extracts low-complexity features, including (i) video complexity features, (ii) frame-level encoding statistics logged during the encoding process, and (iii) lightweight video quality metrics. These extracted features are then fed into a regression model to predict VMAF or P.1204.3. The VQM4HAS model is designed to operate on a per bitrate-resolution pair, per-resolution, and cross-representation basis, optimizing quality predictions across different scenarios. Our experimental results demonstrate that VQM4HAS achieves a high correlation with VMAF and P.1204.3, with Pearson correlation coefficients (PCC) ranging from 0.95 to 0.96 for VMAF and 0.97 to 0.99 for P.1204.3, depending on the resolution. Despite achieving a high correlation with VMAF and P.1204.3, VQM4HAS exhibits significantly less complexity than both metrics, with 98% and 99% less complexity for VMAF and P.1204.3, respectively, making it suitable for live streaming scenarios. We also conduct a feature importance analysis to further reduce the complexity of the proposed method. Furthermore, we evaluate the effectiveness of our method by using it to predict subjective quality scores. The results show that VQM4HAS achieves a higher correlation with subjective scores at various resolutions despite its minimal complexity.
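
The Pearson correlation coefficient (PCC) used above to compare predicted and reference quality scores can be computed as in the following minimal sketch. The function name and the sample scores in the usage note are illustrative and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson(predicted, reference)` on hypothetical per-representation scores returns a value in [-1, 1]; values near 1 indicate that the lightweight predictor ranks and scales representations much like the reference metric.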

Abstract:
In recent years, massive datasets have significantly driven the advancement of visual learning such as multi-modal large model at the expense of high computational costs and extensive storage requirements. Dataset distillation (DD) aims to address this challenge by learning a small synthetic dataset such that a model trained on it can achieve a test performance comparable to that of the model trained on the original dataset. This task can be formulated as a bi-level learning problem where the outer loop optimizes the learned dataset and the inner loop updates the model parameters based on the distilled data. Different from previous studies that focus primarily on optimizing the inner loop in this bi-level problem, we delve into the task of dataset distillation from the perspective of sample cruciality. We find that discarding easy samples and keeping the hard ones that are difficult to be represented by the learned synthetic samples in the outer loop can be beneficial for DD. Motivated by this observation, we further develop an Infinite Semantic Augmentation (ISA) based dataset distillation algorithm, which discards some easier samples and implicitly enriches harder ones in the semantic space through continuous interpolation between two target feature vectors. Through detailed mathematical derivation, the joint contribution to the training loss of all interpolated feature points is formed into an analytical closed-form solution of an integral that can be optimized with almost no extra computational cost. Experimental results on several benchmark datasets demonstrate the effectiveness of our approach in reducing the dataset size while preserving the accuracy of the model. Furthermore, we show that high-quality distilled data can also benefit downstream applications, such as continual learning and membership inference defense.

Abstract:
Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes, which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples, similar to self-supervised learning. We evaluate the proposed approach on image classification tasks using eight widely accepted large-scale recognition benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed. Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions on the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.

Abstract:
3D scene graph prediction is important for intelligent agents to gather information and perceive semantics of their environments. However, constructing an effective graph is nontrivial given the complexity of natural scenes. Existing solutions for graph representation of 3D scenes still distinguish each detailed discrepancy among all the relationships in a flat manner, ignoring the mechanism used by humans to perform this task. Inspired by the role of the prefrontal cortex in hierarchical reasoning, we analyze this problem from a novel perspective: exploring hierarchical spatial layout cues in 3D space and navigating that hierarchy to make the 3D scene graph more accurate with a vertical-division-to-horizontal-propagation strategy. To this end, we first encode the contextual object features for fine-grained object category classification. Next, we build a bottom-up hierarchical graph to predict remarkably diverse support relationships in a single concept regardless of numerous irrelevant relationships. Finally, equipped with the spatially-true and semantically-meaningful support relationships, we focus on the local region layout to propagate the semantic features to predict the additional non-support relationships under the guidance of the given referred hierarchical graph nodes. Experiments on the challenging 3DSSG benchmark show that our algorithm outperforms existing state-of-the-art methods and can also alleviate the impact of the long-tailed distribution of training data.

Abstract:
Automatic teeth segmentation and labeling on dental models are basic tasks in computer-aided dentistry. Many existing works can achieve promising results in teeth segmentation, but they heavily rely on aligned input dental models, which leads to additional manual intervention. Moreover, tooth labeling is an essential task in digital dentistry for treatment planning (e.g., orthodontic), and is usually ignored in these methods. In this article, we propose AlignNet for automatically aligning dental models of arbitrary sizes and orientations. Meanwhile, a multi-task hybrid learning network is designed that effectively exploits the advantages of semantic segmentation and instance segmentation, and synergistically improves the performance of teeth point cloud segmentation and labeling. In particular, for the teeth-gingival boundaries with large segmentation errors, we utilize the filtered curvature information as a constrained feature to detect the weak boundary more accurately. Finally, we propose a DiffLoss and a postprocessing step based on the dental arch to address the teeth classification problem. Extensive evaluations on oral scanning models show that our method robustly handles dental model point clouds with arbitrary size and orientation, and outperforms state-of-the-art teeth segmentation and labeling methods, demonstrating its full automation and robustness in clinical practice.

Abstract:
In this golden age of multimedia, realistic content is in high demand with users seeking more immersive and interactive experiences. As a result, new image modalities for 3D representations have emerged in recent years, among which point clouds have received special attention. Naturally, with this increase in demand, efficient storage and transmission became a must, with standardization groups such as MPEG and JPEG entering the scene, as it happened before with other types of visual media. In a surprising development, JPEG issued a Call for Proposals on point cloud coding targeting exclusively learning-based solutions, in parallel to a similar call for image coding. This is a natural consequence of the growing popularity of deep learning, which due to its excellent performance is currently dominant in the multimedia processing field, including coding. This article presents the coding solution selected by JPEG as the best-performing response to the Call for Proposals and adopted as the first version of the JPEG Pleno Point Cloud Coding Verification Model, in practice the first step for developing a standard. The proposed solution offers a novel joint geometry and color approach for point cloud coding, in which a single deep learning model processes both geometry and color simultaneously. To maximize the RD performance for a large range of point clouds, the proposed solution uses down-sampling and learning-based super-resolution as pre- and post-processing steps. Compared to the MPEG point cloud coding standards, the proposed coding solution comfortably outperforms G-PCC for geometry, color, and joint quality metrics.

Abstract:
In 3D point cloud-based object detection, the attention mechanism in Group-Free (Liu et al., 2021) learns direct relationships between proposals and all seed points, providing each proposal with a global context in the form of a cross-attention map. However, our analysis and experimental comparison show that the attention mechanism assigns inappropriately large attention weights to certain seed points far from a proposal, which is not conducive to detecting objects correctly. In this work, we alleviate the above problem by proposing a mask method. For an initial proposal, our method first calculates a spatial distance-based mask, which measures the spatial relationship between all seed points and the proposal. Then, we fuse the mask into cross-attention layers in stacked attention modules and get a refined cross-attention map. In essence, our mask gives each proposal a local context; after it is fused with the global context given by the attention mechanism, the refined cross-attention map could suppress the negative impact of some distant seed points on a proposal. We present two alternative strategies to compute the mask: a hard mask and a soft mask. Experimental results demonstrate that the soft mask brings better performance. In the soft mask, for each initial proposal's 3D-box shape, we use a parametric approximate ellipsoid as the basis of the mask's calculation, which has only two learnable parameters. Experimental results show our work could outperform Group-Free by 0.7 mAP@0.25 at the cost of increasing inference time by less than 1%. The performance of our algorithm on the public dataset SUN RGB-D is 63.7 mAP@0.25 and 45.5 mAP@0.5, which is the best performance among algorithms that preserve the irregularity of seed points.
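
To make the idea of fusing a distance-based soft mask into a cross-attention map more concrete, here is a minimal pure-Python sketch. The Gaussian falloff, the per-axis `radii`, and the multiply-then-renormalize fusion are assumptions chosen for illustration; the abstract does not give the exact mask formula or fusion rule, and the function names are hypothetical.

```python
import math

def soft_mask(proposal_center, seeds, radii):
    """Hypothetical ellipsoidal soft mask: 1 at the proposal center,
    decaying with the axis-normalized squared distance of each seed."""
    mask = []
    for seed in seeds:
        d2 = sum(((seed[i] - proposal_center[i]) / radii[i]) ** 2 for i in range(3))
        mask.append(math.exp(-0.5 * d2))
    return mask

def masked_attention(scores, mask):
    """Softmax over raw scores, reweighted by the mask and renormalized
    (an assumed fusion rule, not necessarily the one used in the paper)."""
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]
    weighted = [e * m for e, m in zip(exp_scores, mask)]
    total = sum(weighted)
    return [w / total for w in weighted]

# One nearby seed and one distant seed that receives a large raw score.
seeds = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
scores = [0.0, 2.0]
mask = soft_mask((0.0, 0.0, 0.0), seeds, radii=(1.0, 1.0, 1.0))
plain = masked_attention(scores, [1.0, 1.0])   # ordinary softmax
refined = masked_attention(scores, mask)       # distant seed suppressed
```

In this toy case the distant seed dominates the plain softmax, while the refined map shifts almost all weight to the nearby seed.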

Abstract:
Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.
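
The step of combining previous states with current network estimates can be illustrated with a scalar Kalman filter. This is a minimal sketch under strong assumptions: a constant-state motion model, a single state dimension, and hand-picked noise variances. The abstract does not specify the state-transition model or noise settings, and `kalman_predict` and `kalman_update` are hypothetical names.

```python
def kalman_predict(x, p, q):
    """Predict step for a constant-state model: the mean is unchanged
    and the variance grows by the process noise q."""
    return x, p + q

def kalman_update(x_prior, p_prior, z, r):
    """Update step: blend the prior with measurement z (variance r)."""
    gain = p_prior / (p_prior + r)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post

# Example: prior estimate 0.0 with variance 1.0, network estimate 2.0 with variance 1.0.
x, p = kalman_update(0.0, 1.0, z=2.0, r=1.0)
```

With equal prior and measurement variances the gain is 0.5, so the fused estimate lies halfway between the prior and the new measurement.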

Abstract:
Deep reinforcement learning (DRL) demonstrates its promising potential in adaptive video streaming and has recently received increasing attention. However, existing DRL-based methods for adaptive video streaming mainly use application (APP) layer information, adopt heuristic training methods, and are not robust against continuous network fluctuations. This paper aims to boost the quality of experience (QoE) of adaptive wireless video streaming by using cross-layer information, deriving a rigorous training method, and adopting effective online tuning methods with real-time data. First, we formulate a more comprehensive and accurate adaptive wireless video streaming problem as an infinite stage discounted Markov decision process (MDP) problem by additionally incorporating past and lower-layer information. This formulation allows a flexible tradeoff between QoE and computational and memory costs for solving the problem. In the offline scenario (only with pre-collected data), we propose an enhanced asynchronous advantage actor-critic (eA3C) method by jointly optimizing the parameters of parameterized policy and value function. Specifically, we build an eA3C network consisting of a policy network and a value network that can utilize cross-layer, past, and current information and jointly train the eA3C network using pre-collected samples. In the online scenario (with additional real-time data), we propose two continual learning-based online tuning methods for designing better policies for a specific user with different QoE and training time tradeoffs. The proposed online tuning methods are robust against continuous network fluctuations and more general and flexible than the existing online tuning methods. 
Finally, experimental results show that the proposed offline policy can improve the QoE by 6.8% to 14.4% compared to state-of-the-art methods in the offline scenario, and the proposed online policies can achieve 6.3% to 55.8% gains in QoE over state-of-the-art methods in the online scenario.
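
For readers unfamiliar with the discounted objective behind actor-critic training, the sketch below computes discounted returns and advantages for a short trajectory. It is only an illustration: the rewards stand in for per-chunk QoE values, the trajectory is truncated without bootstrapping, and the function names are hypothetical rather than taken from the eA3C method.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for each step of a finite trajectory."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(returns, values):
    """Advantage estimate: observed return minus the critic's value estimate."""
    return [g - v for g, v in zip(returns, values)]

rets = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
advs = advantages(rets, [1.0, 1.0, 1.0])
```

The policy network is typically pushed toward actions with positive advantage, while the value network is regressed toward the returns.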

Abstract:
Recently, weakly supervised methods for scene text spotting have become increasingly popular with researchers due to their potential to significantly reduce dataset annotation efforts. The latest progress in this field is text spotting based on single or multi-point annotations. However, this method struggles with the sensitivity of text recognition to the precise annotation location and fails to capture the relative positions and shapes of characters, leading to impaired recognition of texts with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first utilizes a few approximate points as text location labels and introduces a learnable position modulation mechanism, easing the accuracy requirements for annotations and enhancing model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to effectively utilize spatial data such as position and shape. This module fuses compound queries and global feature maps, serving as a bias in the SCA module to express text spatial morphology. In order to accurately locate and decode text content, we introduce features containing spatial morphology information and text content into the input features of the text decoder. By introducing features with spatial morphology information as bias terms into the text decoder, ablation experiments demonstrate that this operation enables the model to effectively identify and utilize the relationship between text content and position to enhance the recognition performance of our model. One significant advantage of Cps-STS is its ability to achieve full supervision-level performance with just a few imprecise coarse points at a low cost. Extensive experiments validate the effectiveness and superiority of Cps-STS over existing approaches.

Abstract:
In this article, we propose a position and orientation-aware one-shot learning framework for medical action recognition from signal data. The proposed framework comprises two stages and each stage includes signal-level image generation (SIG), cross-attention (CsA), and dynamic time warping (DTW) modules and the information fusion between the proposed privacy-preserved position and orientation features. The proposed SIG method aims to transform the raw skeleton data into privacy-preserved features for training. The CsA module is developed to guide the network in reducing medical action recognition bias and focusing more on important human body parts for each specific action, with the aim of addressing issues caused by similar medical actions. Moreover, the DTW module is employed to minimize temporal mismatching between instances and further improve model performance. Furthermore, the proposed privacy-preserved orientation-level features are utilized to assist the position-level features in both stages to enhance medical action recognition performance. Extensive experimental results on the widely-used and well-known NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets all demonstrate the effectiveness of the proposed method, which outperforms the other state-of-the-art methods with general dataset partitioning by 2.7%, 6.2% and 4.1%, respectively.

Abstract:
One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.

Abstract:
The virtual surgery engine represents a crucial research domain within biomedical and information sciences. To address the real-time and realistic demands of virtual surgical robots for soft tissue deformation and cutting, the surface information and internal structure of organ models have been redefined. An enhanced Three-Parameter Mass-Ogden model, which incorporates nonlinearity and viscoelasticity in soft tissues, has been developed based on extended position dynamics. Cluster constraints for filling particles were introduced to improve the smoothness of surgical procedures. The relaxation and creep characteristics of real soft tissues were accounted for by evaluating the responses of various biological tissues to external stress and loads using the HY-0580 high-performance mechanical testing machine. Eight experiments were conducted for each tissue type, and five sets of valid data were averaged and fitted using the Three-Parameter Mass-Ogden mixed model. Surgical simulations were conducted using Abaqus, incorporating Young's modulus, stress-strain relationships, cutting depth, pressure distribution, real-time feedback, and comprehensive visualization. The model's effectiveness was further validated. The surgical platform was integrated into a virtual reality-based digital twin robot simulator for minimally invasive surgery, achieving a surgical operation refresh rate of 78.5 Hz, a visual refresh rate of 60 Hz, and a haptic feedback refresh rate of 1000 Hz. Comparative analysis with the Mass-Spring Model (MSM) and Finite Element Method (FEM) shows our model's superior balance of accuracy and efficiency. MSM is fast but imprecise, while FEM is accurate but computationally intensive.

Abstract:
Emotion Recognition in Conversation (ERC) has gained considerable attention due to its importance in human-computer interaction. In the ERC task, the combination of multimodal information and contextual information is necessary since it can help the model understand emotional changes in the context from multiple perspectives. As Graph Neural Networks (GNNs) have shown superiority in relation modeling, many graph-based methods have been proposed to improve the performance of emotion recognition by utilizing the edges to mine the contextual relationship and multimodal relationship in a conversation. However, the existence of numerous redundant edges and excessively complex modality interaction in the graph hinders the model from capturing the truly effective dependency information for emotion recognition. In this paper, we propose a Hypergraph based Contextual Relationship Modeling Method (HyperCRM) to carry out the ERC task. HyperCRM models a conversation as a hypergraph instead of a graph, which defines two types of hyperedges, namely speaker-level hyperedge and sequence-level hyperedge, to represent the contextual relationship within the same speaker and the local sequence of the conversation, respectively. Multimodal information is leveraged here as the node feature representation by the feature concatenation. In addition, an improved hypergraph convolution method is designed to capture the long-range contextual information by three-stage information propagation in the hypergraph, including node-hyperedge, hyperedge-hyperedge and node-hyperedge. Extensive experiments on two public datasets show new state-of-the-art (SOTA) results, further demonstrating that the proposed method can simply make use of the multimodal information and effectively model the complex contextual relationships in the conversation.
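
The two hyperedge types described above can be sketched as plain index lists. This is a minimal illustration, assuming that a speaker-level hyperedge groups all utterances of one speaker and that sequence-level hyperedges are non-overlapping windows of consecutive utterances; the window scheme and the name `build_hyperedges` are assumptions, not details given in the abstract.

```python
def build_hyperedges(speakers, window):
    """Group utterance indices into speaker-level and sequence-level hyperedges.

    speakers: speaker label of each utterance, in conversation order.
    window:   assumed size of each local sequence-level hyperedge.
    """
    by_speaker = {}
    for idx, speaker in enumerate(speakers):
        by_speaker.setdefault(speaker, []).append(idx)
    speaker_edges = list(by_speaker.values())

    sequence_edges = [
        list(range(start, min(start + window, len(speakers))))
        for start in range(0, len(speakers), window)
    ]
    return speaker_edges, sequence_edges

speaker_edges, sequence_edges = build_hyperedges(["A", "B", "A", "B", "A"], window=2)
```

Unlike ordinary graph edges, each hyperedge here connects an arbitrary number of utterance nodes at once, which is what removes the need for many pairwise edges.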

Abstract:
Pre-trained vision-language models (VLMs), equipped with parameter-efficient tuning (PET) methods like prompting, have shown impressive knowledge transferability on new downstream tasks, but they are still prone to catastrophic forgetting and overfitting due to large gaps among tasks. Furthermore, the underlying physical mechanisms of prompt-based tuning methods (especially for visual prompting) remain largely unexplored. It is unclear why these methods work solely based on learnable parameters as prompts for adaptation. To address the above challenges, we present a new prompt-based framework for vision-language models, termed Uni-prompt. Our framework transfers VLMs to downstream tasks by designing visual prompts from an attention perspective that reduces the transfer/solution space, which enables the vision model to focus on task-relevant regions of the input image while also learning task-specific knowledge. Additionally, Uni-prompt further aligns visual-text prompt learning through a pretext task with masked representation modeling interactions, which implicitly learns a global cross-modal matching between visual and language concepts for consistency. We conduct extensive experiments on the few-shot classification task and achieve significant improvement using our Uni-prompt method while requiring minimal extra parameter cost.

Abstract:
Existing zero-shot learning based image classification methods transform the zero-shot learning problem into supervised learning by applying generative adversarial network (GAN) to synthesize visual features of unseen classes. However, the visual features generated by the generator tend to be biased towards seen classes, and the discriminator is too weak to generate high-quality image features. To solve these problems, we propose a novel zero-shot food image classification method based on low dimensional embedding of visual features. Our method applies reinforced semantic guidance to increase the discriminative ability of the model by enhancing the strong distribution of input features. Moreover, the visual space is utilized as the embedding space to reduce the bias towards seen classes by reducing the distance between semantic information and visual features in the embedding space. Finally, the feature distribution of unseen classes is further specified by improving the prototype similarity function. Extensive experiments on three food datasets and four general benchmark datasets demonstrate the effectiveness of the proposed method.

Abstract:
The precise recognition of food categories plays a pivotal role for intelligent health management, attracting significant research attention in recent years. Prominent benchmarks, such as Food-101 and VIREO Food-172, provide abundant food image resources that catalyze the prosperity of research in this field. Nevertheless, these datasets are well-curated from canteen scenarios and thus deviate from food appearances in daily life. This discrepancy poses great challenges in effectively transferring classifiers trained on these canteen datasets to broader daily-life scenarios encountered by humans. Toward this end, we present two new benchmarks, namely DailyFood-172 and DailyFood-16, specifically designed to curate food images from everyday meals. These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain. In addition, we also propose a simple yet effective baseline method named Multi-Cluster Reference Learning (MCRL) to tackle the aforementioned domain gap. MCRL is motivated by the observation that food images in daily-life scenarios exhibit greater intra-class appearance variance compared with those in well-curated benchmarks. Notably, MCRL can be seamlessly coupled with existing approaches, yielding non-trivial performance enhancements. We hope our new benchmarks can inspire the community to explore the transferability of food recognition models trained on well-curated datasets toward practical real-life applications.

Affiliations: State Key Laboratory of Advanced Rail Autonomous Operation, the School of Computer Science and Technology, and Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China; School of Computer Science and Engineering, Central South University, Changsha, China; Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education and the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; School of Mechanical Engineering, Guizhou University, Guiyang, China

Abstract:
3D Region-of-Interest (RoI) Captioning involves translating a model's understanding of specific objects within a complex 3D scene into descriptive captions. Recent advancements in Large Language Models (LLMs) have shown great potential in this area. Existing methods capture the visual information from RoIs as input tokens for LLMs. However, this approach may not provide enough detailed information for LLMs to generate accurate region-specific captions. In this paper, we introduce Self-RoI, a Large Language Model with multi-modal self-perception capabilities for 3D RoI captioning. To ensure LLMs receive more precise and sufficient information, Self-RoI incorporates Implicit Textual Info. Perception to construct multi-modal vision-language information. This module utilizes a simple mapping network to generate textual information about basic properties of the RoI from the vision-following response of LLMs. This textual information is then integrated with the RoI's visual representation to form a comprehensive multi-modal instruction for LLMs. Given the limited availability of 3D RoI-captioning data, we propose a two-stage training strategy to optimize Self-RoI efficiently. In the first stage, we align 3D RoI vision and caption representations. In the second stage, we focus on 3D RoI vision-caption interaction, using a disparate contrastive embedding module to improve the reliability of the implicit textual information and employing language modeling loss to ensure accurate caption generation. Our experiments demonstrate that Self-RoI significantly outperforms previous 3D RoI captioning models. Moreover, the Implicit Textual Info. Perception can be integrated into other multi-modal LLMs for performance enhancement. We will make our code available for further research.

Abstract:
Label biases in facial expression recognition (FER) datasets, caused by annotators' subjectivity, pose challenges in improving the performance of target datasets when auxiliary labeled data are used. Moreover, training with multiple datasets can lead to visible degradations in the target dataset. To address these issues, we propose a novel framework called the AU-aware Vision Transformer (AU-ViT), which leverages unified action unit (AU) information and discards expression annotations of auxiliary data. AU-ViT integrates an elaborately designed AU branch in the middle part of a master ViT to enhance representation learning during training. Through qualitative and quantitative analyses, we demonstrate that AU-ViT effectively captures expression regions and is robust to real-world occlusions. Additionally, we observe that AU-ViT also yields performance improvements on the target dataset, even without auxiliary data, by utilizing pseudo AU labels. Our AU-ViT achieves performances superior to, or comparable to, that of the state-of-the-art methods on FERPlus, RAFDB, AffectNet, LSD and the other three occlusion test datasets.

Abstract:
Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation.

Abstract:
JPEG reversible data hiding (RDH) refers to covert communication technology to accurately extract secret data while also perfectly recovering the original JPEG image. With the development of cloud services, a large number of private JPEG images can be efficiently managed in cloud platforms by embedding user ID or authentication labels. Nevertheless, data embedding operations may inadvertently disrupt the encoding sequence of the original JPEG image, resulting in severe distortion of the host image when it is re-compressed to JPEG format. To address this problem, this paper proposes a new JPEG RDH scheme based on block sorting optimization and dynamic iterative histogram modification. We first design a block ordering optimization strategy by combining the number of zero coefficients and the quantization table values of non-zero coefficients in a DCT block. Subsequently, a dynamic iterative histogram modification scheme is proposed by considering the local features and embedding capability of histograms generated from different texture images. According to the given payloads, we introduce different parameters to control the iterations of the two-dimensional histogram and then adaptively generate the optimal histogram modification mapping, which can realize low JPEG file size increments by keeping as many AC coefficients unchanged as possible. Numerous experiments have shown that our scheme achieves an effective balance among embedding capacity, visual quality, file size increment, and computational complexity, and outperforms state-of-the-art methods in terms of the above metrics.
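
As background for histogram modification, the sketch below shows the classic one-dimensional histogram-shifting idea on quantized AC coefficients: zeros are left alone, coefficients equal to ±1 carry one bit each, and larger magnitudes are shifted outward by one. This is a simplified stand-in, not the two-dimensional dynamic iterative mapping proposed in the paper, and the function names are hypothetical.

```python
def embed(coeffs, bits):
    """Embed bits into coefficients equal to +1 or -1; shift larger magnitudes by 1."""
    payload = list(bits)
    marked = []
    for c in coeffs:
        if c == 0:
            marked.append(0)                      # zeros stay untouched
        elif abs(c) == 1:
            bit = payload.pop(0) if payload else 0
            marked.append(c + bit if c > 0 else c - bit)
        else:
            marked.append(c + 1 if c > 0 else c - 1)
    return marked

def extract(marked):
    """Recover both the embedded bits and the original coefficients."""
    bits, original = [], []
    for c in marked:
        if c == 0:
            original.append(0)
        elif abs(c) <= 2:
            bits.append(abs(c) - 1)               # 1 -> bit 0, 2 -> bit 1
            original.append(1 if c > 0 else -1)
        else:
            original.append(c - 1 if c > 0 else c + 1)
    return bits, original

coeffs = [0, 1, -1, 3, -2, 1]
marked = embed(coeffs, [1, 0, 1])
bits, restored = extract(marked)
```

Because every mapping is invertible, the receiver recovers the payload and the exact original coefficients, which is the defining property of reversible data hiding.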

Abstract:
Emotion recognition in conversations (ERC) is a crucial aspect of human-computer interaction and plays an important role in various domains, including healthcare, entertainment, and education. Since the conversation data in the form of multimodal sequences is well suited to be constructed into graphs, the methods based on graph convolutional network (GCN) show incomparable advantages. However, existing methods attempt to model the highly uncertain emotional relationships between different speakers, which is not an easy task and may even introduce interference information. Therefore, we propose an identity and modality attributes driven multimodality fusion network (dubbed IMDNet) for emotion recognition in conversations. Specifically, we construct a speaker-centric graph that only connects nodes of the same speaker within modalities to each other, reducing the interference between the emotions of different speakers. We also introduce the attribute embedding mechanism, which facilitates the correct calculation of correlations between nodes for better multimodal feature fusion. Considering that the emotional correlation between utterances will decrease over time, we present an utterance distance attention to make the fusion network pay more attention to the adjacent utterances. Furthermore, we explore the solution to the data imbalance problem suitable for conversation scenarios. Given the presence of possible anomalous samples in the dataset, we opt for the BoundaryFocalLoss. Experiments on the IEMOCAP and MELD datasets show that our IMDNet outperforms the state-of-the-art methods.

Abstract:
Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64× speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness.
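
The general shape of combined pruning and merging can be sketched as follows. This is an illustrative simplification: the paper uses trainable merge and reconstruct matrices and gradient-weighted importance scores, whereas this sketch takes the scores as given and merges the low-scoring tokens with a plain average. The name `prune_and_merge` and its parameters are hypothetical.

```python
def prune_and_merge(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens (in original order) and
    average the remaining tokens into one extra merged token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])
    dropped_idx = order[keep:]

    compressed = [list(tokens[i]) for i in kept_idx]
    if dropped_idx:
        dim = len(tokens[0])
        merged = [
            sum(tokens[i][d] for i in dropped_idx) / len(dropped_idx)
            for d in range(dim)
        ]
        compressed.append(merged)
    return compressed

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
compressed = prune_and_merge(tokens, scores, keep=2)
```

Four tokens become three here: the two most important tokens survive unchanged, and the other two are summarized by their mean instead of being discarded outright.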

Abstract:
The rise of the metaverse has driven the rapid development of various applications, such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (a.k.a. VR spherical video streaming and 360° video streaming) can provide users with a 360° immersive experience. Generally, transmitting VR video requires far more bandwidth than regular videos, which greatly strains existing network transmission. Predicting and selectively streaming VR video in the users' viewports in advance can reduce bandwidth consumption and system latency. However, existing methods either consider only historical viewport-based prediction methods or predict viewports by correlations between visual features of video frames, making it hard to adapt to the dynamics of users and video content. In the meantime, spurious correlations between visual features lead to inaccurate and unreliable prediction results. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method to predict viewports in VR video streaming, including user preference-based and video content-based viewport prediction models. The former is designed by a position predictor to predict the future users' viewports based on their historical viewports in multiple video frames to adapt to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor, used to infer the causalities between local and global-local visual features in video frames, thereby helping the model understand the contextual information in the videos. We embed the causalities in the transformer decoder via causal self-attention for predicting the users' viewports, adapting to the dynamic changes of video content. Finally, combining the results of the two aforementioned models yields the final prediction of the users' viewports. 
In addition, the QoE of users is satisfied by assigning different bitrates to the tiles in the viewport through a pyramid-based bitrate allocation. The experimental results verify the effectiveness of the method.
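
One way to picture a pyramid-style bitrate allocation over tiles is to assign the highest bitrate to the predicted viewport-center tile and progressively lower bitrates to rings of tiles around it. The sketch below is only an assumed interpretation: the ring distance, the horizontal wrap-around for 360° video, and the name `pyramid_bitrates` are illustrative choices that the abstract does not specify.

```python
def pyramid_bitrates(rows, cols, center, levels):
    """Assign a bitrate to each tile based on its ring distance from the center tile.

    center: (row, col) of the predicted viewport center.
    levels: bitrates from the innermost ring outward; the last level
            is reused for all farther rings.
    """
    center_row, center_col = center
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            dc = abs(c - center_col)
            dc = min(dc, cols - dc)            # columns wrap around in 360-degree video
            ring = max(abs(r - center_row), dc)
            row.append(levels[min(ring, len(levels) - 1)])
        grid.append(row)
    return grid

grid = pyramid_bitrates(rows=3, cols=4, center=(1, 0), levels=[8, 4, 1])
```

The tile at the viewport center gets the top level, its neighbors (including the one reached by wrapping around the horizontal seam) get the next level, and the rest fall to the lowest level.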

Abstract:
The social relation is a fundamental attribute of human beings in daily life. The ability of humans to form large organizations and institutions stems directly from our complex social networks. Therefore, understanding social relationships in the context of multimedia is crucial for building domain-specific or general artificial intelligence systems. The key to reasoning about social relations lies in understanding the human interactions between individuals through multimodal representations such as action and utterance. However, due to video editing techniques and various narrative sequences in videos, two individuals with social relationships may not appear together in the same frame or clip. Additionally, social relations may manifest in different levels of granularity in video expressions. Previous research has not effectively addressed these challenges. Therefore, this paper proposes a Multi-Granularity Relation Graph Aggregation Framework (MGRG) to enhance the inference ability for social relation reasoning in multimedia content, like video. Different from existing methods, our method considers the paradigm of jointly inferring the relations by constructing a social relation graph. We design a hierarchical multimodal relation graph illustrating the exchange of information between individuals' roles, capturing the complex interactions at multiple levels of granularity from fine to coarse. In MGRG, we propose two aggregation modules to cluster multimodal features across the relation graphs at different granularity layers, considering temporal aspects and importance. Experimental results show that our method generates a logical and coherent social relation graph and improves accuracy.

Abstract:
Medical report generation refers to the automatic creation of accurate and coherent diagnostic reports for medical images. This task can alleviate the workload of radiologists and enhance the efficiency of disease diagnosis, so it is both valuable and challenging. Considering the feature differences between modalities, existing methods primarily focus on facilitating medical report generation through cross-modal alignment of images and texts. However, since medical images are very similar to each other, it is difficult to tag obvious objects, which limits most methods to coarse-grained global image-text alignment. In this paper, we propose a medical report generation model based on adaptive topic learning and fine-grained cross-modal alignment, which aligns images and texts from both the medical topic perspective and the token perspective. From the medical topic perspective, a global-local contrastive loss is introduced to adaptively learn efficient medical topic features, and medical topics are utilized to map images and texts to the same semantic space for fine-grained alignment. From the token perspective, a token prediction module is designed to enable the model to focus on important local information by predicting the key tokens contained in the report. Experimental results on two public datasets (i.e., IU-Xray and MIMIC-CXR) demonstrate that our proposed model outperforms state-of-the-art baselines.

Abstract:
Exploiting the correlation between multimodal data to generate tactile data has become a preferred approach to enhance tactile rendering fidelity. Nevertheless, existing studies have often overlooked the temporal dynamics of force tactile data. To fill this gap in the literature, this paper introduces VA2T, an algorithm that jointly uses visual and audio data to generate temporal tactile data, focusing on the temporal and long-term dependencies of force tactile data. VA2T uses a feature extraction network to extract audio and image features and then uses an attention mechanism and decoder to fuse these features. The tactile reconstructor generates temporal friction and normal force data, with dilated causal convolution capturing the temporal dependencies in the force tactile data. Simulation experiments on the LMT dataset demonstrate that, compared with the transformer and audio-visual-aided haptic signal reconstruction (AVHR) algorithms, the VA2T algorithm reduces the RMSE for generated friction by 29.44% and 32.37%, respectively, and for normal forces by 23.30% and 35.43%, respectively. In addition, we developed a haptic rendering approach that combines electrovibration and mechanical vibration to render the generated friction and normal force. The subjective experimental results showed that the rendering fidelity of the data generated using the VA2T method was significantly higher than that of the data generated using the transformer and AVHR methods.
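
To illustrate the dilated causal convolution mentioned above, the following is a minimal sketch in plain Python. It is not the VA2T implementation: the function name, the two-tap kernel, and the sample force sequence are all assumptions made for illustration. It only shows the defining property that each output sample depends on the current and earlier input samples, spaced by the dilation factor.

```python
def dilated_causal_conv1d(signal, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * signal[t - k * dilation].

    Samples before t = 0 are treated as zero, so the output at time t
    never depends on inputs later than t (the causal property).
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # skip taps that would reach before the start
                acc += weight * signal[idx]
        output.append(acc)
    return output


# Hypothetical force samples; a dilation of 2 averages x[t] and x[t - 2].
force = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = dilated_causal_conv1d(force, kernel=[0.5, 0.5], dilation=2)
```

Stacking such layers with growing dilation factors is a common way to enlarge the receptive field over long sequences without increasing the kernel size.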

Abstract:
The prevalent convolutional neural network (CNN) and Transformer have revolutionized the area of single-image super-resolution (SISR). Although these models have significantly improved performance, they often struggle in real-time applications or on resource-constrained platforms due to their complexity. In this paper, we propose TBag, a lightweight hybrid network that combines the strengths of CNN and Transformer to address these challenges. Our method simplifies the Transformer block with three key optimizations: 1) no projection layer is applied to the value in the original self-attention operation; 2) the number of tokens is rescaled before the self-attention operation and then rescaled back to ease computation; 3) the expansion factor of the original feed-forward network (FFN) is adjusted. These optimizations enable the development of an efficient hybrid network tailored for real-time SISR. Notably, the hybrid design of CNN and Transformer further enhances both local detail recovery and global feature modeling. Extensive experiments show that TBag achieves a competitive trade-off between effectiveness and efficiency compared to previous lightweight SISR methods (e.g., +0.42 dB PSNR with an 86.7% reduction in latency). Moreover, TBag's real-time capabilities make it highly suitable for practical applications, with the TBag-Tiny version achieving up to 59 FPS on hardware devices. Future work will explore the potential of this hybrid approach in other image restoration tasks, such as denoising and deblurring.

Abstract:
Recently, numerous learning-based point cloud compression methods with outstanding performance have been developed. The majority of them concentrate on point cloud geometry compression, and several works have demonstrated advances in color attribute compression for dense point clouds. However, compression of the reflectance attribute attached to the points captured by light detection and ranging (LiDAR) sensors remains a major challenge. In this article, we present a lossless reflectance compression method for LiDAR point clouds (LPCs) that learns reflectance probability distributions with a deep hierarchical k-nearest-neighbors (KNN) context model, namely, the HK-PCRC. We first represent the original LPC with a series of hierarchical layers. Relying on the hierarchical structure, points in the same layer are coded in parallel by referencing the points in the previously coded layers. The approach balances coding efficiency and time complexity while also supporting progressive coding. By introducing the KNN context, the context size is significantly reduced, which eases the computational burden while maintaining the coding performance. To enrich the context information, we further search for enhanced neighbors for each point in the context window. For each enhanced neighbor, in addition to its reflectance value, the relative distance, elevation angle, and local density are collected. Then, a transformer-style sequential model is applied to construct an accurate deep context model. Furthermore, to efficiently fuse context features from different sources, a cross-feature fusion attention mechanism is designed for the transformer network. The comprehensive experimental results on SemanticKITTI, a large-scale LiDAR benchmark, and Ford, an MPEG-specified dataset, demonstrate that our proposed framework achieves state-of-the-art lossless reflectance compression performance, with average bit savings of 11.3% and 9.6% when compared to the state-of-the-art hand-crafted methods.

Affiliations: Metaverse Research Institute, School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China; Biomedical and Multimedia Information Technology Research Group, School of Computer Science, The University of Sydney, Sydney, NSW, Australia; Department of Computing, School of Design, and Research Institute for Sports Science and Technology, The Hong Kong Polytechnic University, Hong Kong

Abstract:
Accurate and high-quality tooth mesh generation from cone-beam computerized tomography (CBCT) is an essential computer-aided technology for digital dentistry. However, existing segmentation-based methods require complicated post-processing and significant manual correction to generate regular tooth meshes. In this paper, we propose a method of continuous bijection supervised pyramid diffeomorphic deformation (PDD) for learning tooth meshes, which can be used to directly generate high-quality tooth meshes from CBCT images. Overall, we adopt a classic two-stage framework. In the first stage, we devise an enhanced detector to accurately locate and crop every tooth. In the second stage, a PDD network is designed to deform a sphere mesh from low resolution to high resolution according to pyramid flows based on diffeomorphic mesh deformations, so that the generated mesh approximates the ground truth arbitrarily closely and efficiently. To achieve that, a novel continuous bijection distance loss on the diffeomorphic sphere is also designed to supervise the deformation learning, which overcomes the shortcoming of losses based on nearest-neighbour mapping and improves the fitting precision. Experiments show that our method outperforms the state-of-the-art methods in terms of both the evaluation metrics and the geometric quality of the reconstructed tooth surfaces.

Abstract:
The increasing commercial value of micro-videos has spurred a rising demand for understanding their contents. The abundant multimodal cues in micro-videos exhibit substantial potential for enhancing content comprehension. However, effectively harnessing the collaborative characteristics across different modalities remains a significant challenge, especially in multi-label scenarios, due to inconsistent behaviors regarding label correlations. To better tackle this issue, in this paper, we introduce a multimodal dual-graph collaborative network with a serial attentive aggregation mechanism (MDGCN) for micro-video multi-label classification. In MDGCN, we exploit an asymmetric encoder-decoder framework, which incorporates multiple parallel encoders with complementary representations and a decoder to ensure the completeness of encoded results. Meanwhile, an adversarial constraint is used to ensure that individual differences are prominently featured within each modality. Furthermore, considering the inconsistency of label correlations across various modalities, we construct a serial attentive graph convolutional network that employs an interactive dual-graph attention paradigm to sequentially integrate multimodal representations and dynamically explore label correlations. The experiments conducted on two datasets demonstrate that our proposed method outperforms state-of-the-art approaches.

Abstract:
Fine-grained 3D shape classification (FGSC) has garnered significant attention recently and has made notable advancements. However, due to high inter-class similarity and intra-class diversity, it is still a challenge for existing methods to capture subtle differences between subcategories for FGSC. On the one hand, one-hot labels in the loss function are too rigid to describe these data characteristics; on the other hand, local details are submerged in the global feature extraction process and final network constraints, impacting classification results. In this paper, we propose a duplex label smoothing-based hierarchical context-aware network for fine-grained 3D shape classification, named DLS-HCAN. Specifically, DLS-HCAN first employs a hierarchical context-aware network (HCAN), in which the intra-view context attention mechanism (intra-ATT) and the inter-view context multilayer perceptron (inter-MLP) are designed to focus on and discern the beneficial local details. Subsequently, we propose a novel duplex label smoothing (DLS) regularization in which shape-level and view-level smooth labels are separately applied in two improved loss functions, adapting to the fine-grained data characteristics and considering the varying uniqueness of different views. Notably, our approach does not require additional annotation information. Experimental results and comparison with state-of-the-art methods demonstrate the superiority of our proposed DLS-HCAN for FGSC. In addition, our approach also achieves comparable performance on the coarse-grained ModelNet40 dataset.
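
For readers unfamiliar with label smoothing, the sketch below shows the standard uniform variant in plain Python. It is an assumption-laden illustration, not the duplex (shape-level and view-level) scheme of DLS-HCAN, whose exact form the abstract does not give; the function names and the smoothing factor `epsilon` are hypothetical.

```python
import math


def smooth_labels(target, num_classes, epsilon=0.1):
    """Standard uniform label smoothing.

    The one-hot target keeps weight (1 - epsilon); the remaining epsilon
    is spread evenly over all classes, so the labels still sum to 1.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if c == target else 0.0) + uniform
        for c in range(num_classes)
    ]


def cross_entropy(probabilities, soft_labels):
    """Cross entropy between predicted probabilities (assumed > 0) and soft labels."""
    return -sum(q * math.log(p) for p, q in zip(probabilities, soft_labels) if q > 0.0)


# Hypothetical 4-class example with the true class at index 2.
labels = smooth_labels(target=2, num_classes=4, epsilon=0.2)
loss = cross_entropy([0.1, 0.1, 0.7, 0.1], labels)
```

Because the smoothed target never assigns probability 1 to a single class, the loss stops rewarding ever more extreme logits, which is one reason such labels can suit classes that are highly similar to each other.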

Abstract:
The integration of visual and textual data in Vision-Language Pre-training (VLP) models is crucial for enhancing vision-language understanding. However, the adversarial robustness of these models, especially in the alignment of image-text features, has not yet been sufficiently explored. In this paper, we introduce a novel gradient-based multimodal adversarial attack method, underpinned by contrastive learning, to improve the transferability of multimodal adversarial samples in VLP models. This method concurrently generates adversarial texts and images with imperceptible perturbations, employing both image-text and intra-modal contrastive losses. We evaluate the effectiveness of our approach on image-text retrieval and visual entailment tasks, using publicly available datasets in a black-box setting. Extensive experiments indicate a significant advancement over existing single-modal transfer-based adversarial attack methods and current multimodal adversarial attack approaches.

Abstract:
Deep learning (DL) networks have recently achieved excellent performance on image compressed sensing. However, most existing methods rely on heavy and complex network structures, resulting in significant computational and storage requirements that defeat the purpose of compressed sensing. This severely hinders their applicability on real-world resource-limited devices. In this paper, a lightweight cycle network driven by physical and deep priors for image compressed sensing is proposed, which integrates the learning of the sensing matrix and compressive image reconstruction. Specifically, the regularization terms and a likelihood term derived from the physical observation model are learned in an end-to-end cycle network, simultaneously estimating the reconstructed image and sensing matrix in the image and feature domains. Moreover, a dual-domain fusion reconstruction module is proposed. It creates simulated measurement residuals for enhancing reconstruction in the compressed domain, which leads to high reconstruction performance and reduces computational load by bonding together the compressed image domains in the cyclic network. Extensive experiments demonstrate that our model delivers superior performance and reduces model complexity, which is of great importance in low-budget applications.

Abstract:
Compared with traditional video steganography, coverless video steganography (CVS) can completely avoid being detected by steganalysis algorithms. Recently, the study of CVS has developed rapidly. However, its capacity is still far from the theoretical maximum, i.e., 2^ℓ for a hash sequence of length ℓ. Besides, most existing CVS methods have considered only limited types of video attacks when evaluating robustness. In this paper, a novel coverless video steganography method based on two-level discrete cosine transform (DCT) features is proposed. First, pre-processing is performed on the public video datasets. Then, two-level DCT features are calculated and the Coverless Video Database (CVD) is constructed by the K-means++ clustering algorithm. After that, the mapping table is established to map the secret segments to the CVD. Finally, each secret segment is matched to a video sequence in the CVD through the mapping table to complete the process of information embedding and extraction. The proposed method is the first to evaluate robustness against the frame swapping attack, which is a common video attack. Experimental results show that the proposed method can achieve the theoretical maximum value in effective capacity and better robustness compared to the state-of-the-art works.
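
The role of the mapping table and the 2^ℓ limit can be sketched with a toy example. Everything here is hypothetical: the video identifiers, the ℓ = 3 hash length, and the assumption that every ℓ-bit hash value is covered by some video. The two-level DCT features and K-means++ clustering that produce the real hashes are not modeled.

```python
def split_into_segments(bits, segment_length):
    """Split a secret bit string into fixed-length segments, zero-padding the tail."""
    padded = bits + "0" * (-len(bits) % segment_length)
    return [padded[i:i + segment_length] for i in range(0, len(padded), segment_length)]


def build_mapping_table(video_hashes):
    """Map each hash sequence to the first video that produces it."""
    table = {}
    for video_id, hash_bits in video_hashes.items():
        table.setdefault(hash_bits, video_id)
    return table


# Toy database (assumption): one video for each of the 2**3 possible hashes.
segment_length = 3
video_hashes = {f"video_{i}": format(i, "03b") for i in range(2 ** segment_length)}
table = build_mapping_table(video_hashes)

# "Embedding" selects existing videos; "extraction" recomputes their hashes.
segments = split_into_segments("1011100", segment_length)
carriers = [table[segment] for segment in segments]
recovered = "".join(video_hashes[video_id] for video_id in carriers)
```

With ℓ-bit hash sequences there are at most 2^ℓ distinct table entries, which is why a database covering all of them reaches the theoretical limit in this simplified picture.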

Abstract:
The objective of Multi-Source Domain Adaptation (MSDA) is to train a neural network on labeled data from multiple joint source distributions (source domains) and unlabeled data from a joint target distribution (target domain), and use the trained network to estimate the target data labels. The challenge in this MSDA problem is that the multiple joint source distributions are relevant to but distinct from the joint target distribution. To address this challenge, we propose a Joint Distribution Weighted Alignment (JDWA) approach to align a weighted joint source distribution to the joint target distribution under the relative entropy. Specifically, the weighted joint source distribution is defined as the weighted sum of the multiple joint source distributions, and is parameterized by the relevance weights. Since the relative entropy is unknown in practice, we propose a Kernel Relative Entropy Estimation (KREE) method to estimate it from data. Our KREE method first reformulates relative entropy as the negative of the minimal value of a functional, then exploits a function from the Reproducing Kernel Hilbert Space (RKHS) as the functional’s input, and finally solves the resultant convex problem with a global optimal solution. We also incorporate entropy regularization to enhance the network’s performance. Together, we minimize cross entropy, relative entropy, and entropy to learn both the relevance weights and the neural network. Experimental results on benchmark image classification datasets demonstrate that our JDWA approach performs better than the comparison methods. Intro video and PyTorch code are available at https://github.com/sentaochen/Joint-Distribution-Weighted-Alignment. Interested readers are also welcome to visit https://github.com/sentaochen for more source code of the domain adaptation, partial domain adaptation, multi-source domain adaptation, and domain generalization approaches.

Affiliations: School of Computer, Wuhan University, Wuhan, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China; School of Computer, National University of Defense Technology, Changsha, China; Shandong Provincial Key Laboratory of Computer Networks, Shandong Computer Science Center (National Supercomputing Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China; School of Information and Communication Engineering, North University of China, Taiyuan, China; National Key Laboratory of Electromagnetic Energy, Naval University of Engineering, Wuhan, China; Hexagon AB, Qingdao, China

Abstract:
Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences between hyperspectral bands and unavoidable interferences, e.g., sensor noise and illumination. To alleviate the inconsistencies and redundancy between hyperspectral bands, we propose a novel network termed Spectral Discrepancy and Cross-Modal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. In addition, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance compared with existing methods.

Abstract:
The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often encounter challenges due to missing modalities caused by various interferences. It is difficult to predict the specific missing situations, such as the number and types of missing modalities. Current approaches to missing modalities typically apply a uniform method to various missing cases, which is insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel robust self-adaptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module based on the encoder-decoder architecture enables ROSA to generate missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module based on a vision-language large model for deep extraction and fusion of textual and visual features. Comprehensive experiments conducted on three widely used public datasets demonstrate the superiority and effectiveness of our model. ROSA outperforms other models in both fixed missing rate and fixed missing modality cases. The ablation studies further highlight the contribution of each designed module.

Abstract:
Palmprint-based biometric recognition has gained widespread attention due to its rich features, contactless acquisition, and low invasiveness. However, most existing methods neglect image quality, making them less effective for low-quality, noisy palmprint images. In this paper, we propose a palm intrinsic features learning selective state space model (PalmMamba) for palmprint image denoising, which consists of shallow feature representation, noise-insensitive palmprint-specific feature learning, and sharp palmprint image restoration modules. First, we convert the degraded noisy palmprint image into a high-dimensional shallow feature representation through a single-layer convolution backbone. Then, we develop parallel learning branches, including a second-order attention-based selective state space model and a mixed difference convolution module, to exploit diverse palmprint-specific features with both global and local details. Finally, we map the fine-grained palmprint-intrinsic feature map into the identity-preserved sharp palmprint image via a commonly used convolution layer. Extensive experimental results on five public palmprint databases demonstrate the encouraging performance of the proposed PalmMamba in palmprint image denoising.

Abstract:
Although multi-modal large language models (MLLMs) have impressive cross-modal reasoning and prediction capabilities, a unified and rigorous evaluation standard is still lacking. In this paper, we propose a future event prediction task to evaluate their cross-modal temporal prediction capability. This task requires the model to generate descriptions of events that may occur in the future based on the input premise video. We build a dataset from existing datasets for model evaluation. This task faces many challenges, including the complexity of processing video data, such as understanding changes in objects, actions, and time dimensions within the video, and the interference of redundant information. To address these challenges, we propose a novel cross-modal prediction framework that introduces cross-modal supplementary learning and template-based reasoning chains based on MLLMs. Cross-modal supplementary learning encourages visual and text information to supplement and mine each other, primarily to capture critical information in videos, relying on the adaptive temporal filter and casual Q-Former. The template-based reasoning chain drives GPT-4 to generate a series of template question pairs through designed prompts, gradually guiding the model to perform hierarchical reasoning to support the final prediction. Experimental evaluation shows that the performance of current MLLMs may not meet the requirements of this task, and that our model outperforms all existing models in predicting future events. This indicates that the capabilities of MLLMs can be further explored.

Affiliations: School of Artificial Intelligence, Henan Engineering Research Center for Industrial Internet of Things, Henan University, Zhengzhou, China; Department of Radiology, Medical Imaging Research Institute, Huaihe Hospital of Henan University, Kaifeng, China; School of Software, Intelligent Data Processing Engineering Research Center of Henan Province, Institute of Intelligent Network System, Henan University, Kaifeng, China; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, China

Abstract:
The popularity of wearable medical data collection and surveillance devices provides real-time guarantees for the whole process of a patient’s medical treatment, in which medical image data plays a key role. However, existing medical images face data leakage, pollution and vulnerability to attacks during transmission over wireless body area networks (WBANs). To address these issues, a privacy protection algorithm based on a Hopfield cross neural network (HCNN) for medical data is proposed. Specifically, the HCNN model is first constructed and its dynamic behavior is analyzed, showing that it is suitable for image encryption. Then, a confusion method based on an NZ fractal curve sorting matrix (NZ-FCSM) is designed to achieve a good encryption effect. Subsequently, the secret image sharing (SIS) technique based on a sharing matrix is introduced to enhance the robustness of the algorithm. Finally, an alignment embedding of double diamond prediction (AEDDP) method is proposed to implement lossless hiding of private information. The present issues in medical image protection include ensuring the security and effectiveness of encryption algorithms while maintaining the robustness and concealment of ciphertext data, and balancing the need for preservation with the limited resources of complex work environments. Experimental results show that the proposed algorithm achieves a PSNR of 53 dB for the cipher image and more than 36 dB for the reconstructed image, that the information entropy of the secret image is over 7.99, and that the algorithm displays good robustness. These findings highlight the validity of the algorithm in medical image data privacy-preserving applications that ensure confidentiality and extend to practical applications of concealed transmission of confidential information and secure multi-party transactions.
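
The two metrics quoted above, PSNR and information entropy, have standard definitions that can be sketched in plain Python. The helper names and toy pixel lists are assumptions for illustration; images are treated as flat lists of 8-bit values, for which the entropy is at most 8 bits per pixel (hence values close to 8, such as 7.99, indicate a nearly uniform histogram).

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits per pixel of a flat list of pixel values."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy data: a perfectly uniform histogram and a constant image.
uniform_entropy = shannon_entropy(list(range(256)))
constant_entropy = shannon_entropy([7] * 64)
small_error_psnr = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```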

Abstract:
One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to build precise correspondences between point clouds, and therefore fails to capture human perceptual features effectively. To overcome these limitations, we propose a novel assessment method called RBFIM, which uses radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into the feature function, we obtain bijective sets of point features. This establishes precise feature correspondences between the distorted and original point clouds and significantly improves the accuracy of quality assessments. Moreover, this method avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
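
The interpolation step described above can be sketched in plain Python as follows. This is an illustrative sketch, not the RBFIM implementation: the Gaussian kernel, the shape parameter `eps`, the small dense solver, and the toy coordinates and features are all assumptions. It fits a continuous feature function on the distorted points and then evaluates it at the original points, giving one feature per original point without any nearest-neighbor search.

```python
import math


def gaussian_rbf(r, eps=1.0):
    """Gaussian radial basis function of the distance r."""
    return math.exp(-((eps * r) ** 2))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a must be non-singular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_rbf(points, features, eps=1.0):
    """Find weights so the RBF sum reproduces the features at the given points."""
    kernel = [[gaussian_rbf(math.dist(p, q), eps) for q in points] for p in points]
    return solve_linear(kernel, features)


def evaluate_rbf(points, weights, query, eps=1.0):
    """Evaluate the continuous feature function at an arbitrary coordinate."""
    return sum(w * gaussian_rbf(math.dist(query, p), eps) for p, w in zip(points, weights))


# Toy distorted cloud with one scalar feature per point (hypothetical values).
distorted_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
distorted_features = [1.0, 2.0, 3.0, 4.0]
original_points = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.1)]

weights = fit_rbf(distorted_points, distorted_features)
matched_features = [evaluate_rbf(distorted_points, weights, q) for q in original_points]
```

A dense solve like this scales cubically with the number of points, so a practical system would presumably work on local neighborhoods or use a sparse or compactly supported kernel.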

Abstract:
Polarization provides valuable physical information, making it beneficial for various computer vision tasks. However, haze reduces both the color and polarization information of a scene. While existing single-image dehazing methods can restore color information, they are poor at recovering polarization information. Furthermore, current polarization-based dehazing approaches neglect the physical mechanisms of polarization degradation, resulting in inaccurate reconstruction of polarization information. In this paper, we propose a novel polarization dehazing algorithm, along with a polarization degradation model, to accurately recover both polarization and color information. First, we combine two key characteristics (the polarization achromatism prior and polarization attenuation prior) with the polarization degradation model to precisely reconstruct the scene’s polarization. Then, we utilize the reconstructed polarization information to recover the color information of the scene. Finally, a multi-scale fusion optimization framework is introduced to further enhance the image quality. Our method shows excellent performance on both real-world indoor and outdoor polarized images, outperforming existing dehazing algorithms in both objective evaluation metrics and subjective visual assessment.

Abstract:
Current event stream-based pattern recognition models typically represent the event stream as a point cloud, voxel, image, and the like, and formulate multiple deep neural networks to acquire their features. Although considerable results can be achieved in simple cases, the performance of the model might be restricted by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this article, we put forward a novel dual-stream framework for event stream-based pattern recognition through differentiated fusion, which is called EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be separately learned by making use of a Transformer and a Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant features, and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features are provided to the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Comprehensive experiments validate that the proposed framework attains state-of-the-art performance on a variety of widely used event stream-based classification datasets. In particular, we achieve a new state-of-the-art result of 90.51% on the Bullying10K dataset, exceeding the runner-up by 2.21%.

Abstract:
Image restoration in adverse weather conditions is a critical research focus in computer vision and autonomous driving. In this work, we introduce COFP, a collaborative optimization framework designed to simultaneously enhance the performance of image de-raining, de-snowing, and de-hazing tasks across diverse datasets. The core of COFP lies in its adaptive optimization of weather-shared and weather-specific parameters, enabling the extraction of polyhedral features that effectively integrate both weather-shared and weather-specific attributes, thus substantially boosting the multi-weather performance of the network. Technically, we design a polyhedral feature extraction module (PFEM) to facilitate the acquisition of weather-shared and weather-specific attributes. In PFEM, we first introduce an element-adaptive sharing strategy (ESS) that dynamically activates either weather-shared or weather-specific parameters for each element of the weight matrix based on learnable scores, thereby adaptively determining which parameters are shared. Second, we develop a feature extraction enhancement strategy (FEES), which extends two pathways in PFEM composed of standard convolutional layers to enhance the capture of polyhedral features, further improving model performance. Furthermore, we propose a gradient balancing algorithm (GBA) that mitigates the unequal competition among tasks for shared parameters during network optimization by adaptively adjusting the direction of task gradients, effectively addressing the negative transfer issue induced by domain variations in multi-task learning. Experimental results demonstrate that COFP delivers state-of-the-art performance across various adverse weather image restoration benchmarks.
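
The element-wise choice between shared and specific parameters can be pictured with the toy sketch below. It is only one possible reading of the description above, not the ESS of COFP: the sigmoid gate, the 0.5 threshold, the hard selection, and all names and values are assumptions, and training details (such as how gradients pass through the hard choice) are not modeled.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_weights(shared, specific, scores, threshold=0.5):
    """Build an effective weight matrix element by element.

    An element takes the weather-shared value when its gated score reaches
    the threshold, and the weather-specific value otherwise.
    """
    return [
        [s if sigmoid(g) >= threshold else t for s, t, g in zip(s_row, t_row, g_row)]
        for s_row, t_row, g_row in zip(shared, specific, scores)
    ]


# Hypothetical 2x2 layer: shared weights, de-raining-specific weights, scores.
shared_weights = [[1.0, 2.0], [3.0, 4.0]]
rain_specific_weights = [[10.0, 20.0], [30.0, 40.0]]
learned_scores = [[2.0, -2.0], [-0.1, 0.3]]

effective = select_weights(shared_weights, rain_specific_weights, learned_scores)
```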

Abstract:
Instance segmentation can help vehicles or robots enhance their understanding of a scene through the pixel-level segmentation of different objects. However, occlusion and boundary blur, especially in cases with similar colors or textures, are still challenges encountered in real-time robust segmentation tasks. To segment a complete instance boundary, the existing 2D approaches fuse local and abstract semantic features derived from the color domain, which leads to homogeneous semantic information, and efficiently separating different objects is difficult in some cases. To address these complicated scenes, inspired by a human prediction processing strategy, where “the brain fills in missing information in advance to help make better decisions”, this study proposes a real-time asymmetric dual-stream instance segmentation algorithm embedding a depth-predictive architecture that provides the covisible depth information of objects. Furthermore, a cross-domain data fusion method and an enhancement-decoupling loss are designed to complement RGB data by utilizing the rich foreground and boundary details of the predicted depth map. In addition, our model can be fine-tuned to integrate it with real depth domain data provided by different input devices. Extensive experiments conducted on the COCO, OCHuman and CityScapes datasets demonstrate the effectiveness of our method. We further deployed our DSDP method on a UAV platform for validation purposes and qualitatively confirmed its validity.

Abstract:
The evolution of individuals’ living standards has transformed clothing preferences, elevating fashion beyond mere utility to a potent means of self-expression. However, outfit selection remains a challenge: traditional methods overlook the combined factors of scene and body shape, place insufficient emphasis on detail-oriented matching, and rely too heavily on rigid hierarchical structures. To tackle these challenges, this article introduces a novel model, termed Global-Local matching network towards Outfit recommendation for diverse body Shapes and Scenes (GLOSS). Specifically, we first introduce a newly compiled fashion dataset, StreetFashion, to capture the combined factors of body shapes and scene characteristics. Additionally, we develop innovative multi-level globality- and locality-aware matching methods to enhance the accuracy of outfit recommendations by comprehensively considering both global and local relationships among clothing items, outfits, users, and scenes. Furthermore, we develop a personalized outfit heterogeneous graph that incorporates historical interactions among fashion entities, enabling effective modeling of nonstrict hierarchical relationships. Evaluation conducted on both our collected dataset and an adapted existing dataset demonstrates the effectiveness of our proposed approach in outfit recommendation.

Abstract:
The popularity of cloud services has provided a new medium for video stream transmission. Cloud desktops, as a representative multimedia application, facilitate interaction between users and the cloud via video streams and have been widely adopted in various fields. The network condition directly affects the transmission. Therefore, accurate throughput prediction helps guide the allocation of network resources, avoiding both a decline in user experience due to insufficient resources and waste caused by excessive resources. Recent works focus more on the temporal characteristics of throughput. However, we believe that the throughput of video streaming is significantly influenced by the usage scenario. In this paper, we propose DeskTransfer, a transfer-based autoencoder framework for throughput prediction in cloud desktop scenarios with frequent switching. Specifically, we construct the Scenario Autoencoder and Throughput Autoencoder to respectively learn the scenario and throughput features from historical usage records. By adopting an adversarial mechanism, we design a transfer algorithm using latent vectors, making the model suitable for multiple scenarios. We collect real-world data from a project in cooperation with Lenovo Research for experiments and compare our solution with leading methods on public datasets to validate its effectiveness.

Affiliations: Guangdong Provincial Key Laboratory of Intelligent Transport System, School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou, China; School of Transportation Science and Engineering, Beihang University, Beijing, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; School of Information Engineering, Changan University, Xi’an, China; School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Abstract:
Existing Unsupervised Domain Adaptation (UDA) methods often fall short in fully leveraging contextual information from the target domain, leading to suboptimal decision boundary separation during source and target domain alignment. To address this, we introduce GrabDAE, an innovative UDA framework designed to tackle domain shift in visual classification tasks. GrabDAE incorporates two key innovations: the Grab-Mask module, which blurs background information in target domain images, enabling the model to focus on essential, domain-relevant features through contrastive learning; and the Denoising Auto-Encoder (DAE), which enhances feature alignment by reconstructing features and filtering noise, ensuring a more robust adaptation to the target domain. These components empower GrabDAE to effectively handle unlabeled target domain data, significantly improving both classification accuracy and robustness. Extensive experiments on benchmark datasets, including VisDA-2017, Office-Home, and Office31, demonstrate that GrabDAE consistently surpasses state-of-the-art UDA methods, setting new performance benchmarks. By tackling UDA’s critical challenges with its novel feature masking and denoising approach, GrabDAE offers both significant theoretical and practical advancements in domain adaptation.

Abstract:
Rotation invariant point cloud analysis is essential for many real-world applications where objects can appear in arbitrary orientations. Traditional local rotation-invariant methods rely on lossy region descriptors, limiting the global comprehension of 3D objects. Conversely, global features derived from pose alignment can capture complementary information. To leverage both local and global consistency for enhanced accuracy, we propose the Global-Local-Consistent Hypergraph Cross-Attention Network (GLC-HCAN). This framework includes the Global Consistent Feature (GCF) representation branch, the Local Consistent Feature (LCF) representation branch, and the Hypergraph Cross-Attention (HyperCA) network to model complex correlations through the global-local-consistent hypergraph representation learning. Specifically, the GCF branch employs a multi-pose grouping and aggregation strategy based on PCA for improved global comprehension. Simultaneously, the LCF branch uses local farthest reference point features to enhance local region descriptions. To capture high-order and complex global-local correlations, we construct hypergraphs that integrate both features, mutually enhancing and fusing the representations. The inductive HyperCA module leverages attention techniques to better utilize these high-order relations for comprehensive understanding. Consequently, GLC-HCAN offers an effective and robust rotation-invariant point cloud analysis network, suitable for object classification and shape retrieval tasks in SO(3). Experimental results on both synthetic and scanned point cloud datasets demonstrate that GLC-HCAN outperforms state-of-the-art methods.

Abstract:
As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images record shape information from different domains of geometric structures and visual appearances. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved through respectively customizing compatible 3D and 2D network architectures. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance in several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity due to the difficulty of extracting discriminative features from irregular geometric signals. In this article, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. Generally, we propose PointMCD, a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between 2D visual and 3D geometric domains, we further investigate visibility-aware feature projection (VAFP), by which point-wise embeddings are reasonably aggregated into view-specific geometric descriptors. By aligning multi-view visual and geometric descriptors pairwise, we can obtain more powerful deep point encoders without exhaustive and complicated network modification. Experiments on 3D shape classification, part segmentation, and unsupervised learning strongly validate the effectiveness of our method.
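
To make the idea of aggregating point-wise embeddings into a view-specific descriptor more concrete, here is a minimal sketch. It assumes a binary visibility mask per view, mean pooling over visible points, and a mean-squared-error alignment loss against the teacher's descriptor; all three choices, and the function names, are illustrative assumptions rather than the operators used in PointMCD.

```python
def view_descriptor(point_embeddings, visible):
    """Average the embeddings of the points visible from one view."""
    selected = [e for e, v in zip(point_embeddings, visible) if v]
    if not selected:
        # No visible points: fall back to a zero descriptor.
        return [0.0] * len(point_embeddings[0])
    dim = len(selected[0])
    return [sum(e[d] for e in selected) / len(selected) for d in range(dim)]


def alignment_loss(student_desc, teacher_desc):
    """Mean squared error between student and teacher descriptors."""
    return sum((s - t) ** 2 for s, t in zip(student_desc, teacher_desc)) / len(student_desc)


points = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
visible_from_view = [True, True, False]   # the third point is occluded

student = view_descriptor(points, visible_from_view)
teacher = [2.0, 3.0]                      # hypothetical image-encoder descriptor
print(student)                            # [2.0, 1.0]
print(alignment_loss(student, teacher))   # 2.0
```

The occluded point does not contribute to the descriptor, so the student is only asked to match what the teacher can actually see from that view.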

Abstract:
Although 3D point cloud classification neural network models have been widely used, the in-depth interpretation of the activation of the neurons and layers is still a challenge. We propose a novel approach, named Relevance Flow, to interpret the hidden semantics of 3D point cloud classification neural networks. It delivers the class Relevance to the activated neurons in the intermediate layers in a back-propagation manner, and associates the activation of neurons with the input points to visualize the hidden semantics of each layer. Specifically, we reveal that the 3D point cloud classification neural network has learned plane-level and part-level hidden semantics in the intermediate layers, and we use the normal and IoU to evaluate the consistency of the hidden semantics at both levels. Besides, by using the hidden semantics, we generate adversarial samples to attack 3D point cloud classifiers. Experiments show that our proposed method reveals the hidden semantics of the 3D point cloud classification neural network on ModelNet40 and ShapeNet, which can be used for unsupervised point cloud part segmentation without labels and for attacking 3D point cloud classifiers.
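
For reference, IoU (intersection over union) between a set of activated points and a reference part can be computed as below. Representing both as sets of point indices is an assumption made for this sketch; the abstract does not describe how the paper forms the two sets.

```python
def iou(predicted, reference):
    """Intersection over union of two collections of point indices."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(predicted & reference) / len(union)


activated_points = {0, 1, 2, 3}   # hypothetical points tied to one neuron
part_points = {2, 3, 4, 5}        # hypothetical ground-truth part
print(iou(activated_points, part_points))  # 2 shared / 6 total
```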

Abstract:
Self-supervised monocular depth estimation has been widely studied for 3D perception, as it can infer depth, pose, and object motion from monocular videos. However, existing single-view and multi-view methods employ separate networks to learn specific representations for these different tasks. This not only results in a cumbersome model architecture but also limits the representation capacity. In this paper, we revisit previous methods and have the following insights: (1) these three tasks are reciprocal and all depend on matching information and (2) different representations carry complementary information. Based on these insights, we propose Uni-DPM, a compact self-supervised framework to complete these three tasks with a shared representation. Specifically, we introduce a U-Net-like model to synchronously complete multiple tasks by leveraging their common dependence on matching information, and iteratively refine the predictions by utilizing the reciprocity among tasks. Furthermore, we design a shared Appearance-Matching-Temporal (AMT) representation for these three tasks by exploiting the complementarity among different types of information. In addition, our Uni-DPM is scalable to downstream tasks, including scene flow, optical flow, and motion segmentation. Comparative experiments demonstrate the competitiveness of our Uni-DPM on these tasks, while ablation experiments also verify our insights.

Abstract:
Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding the skeleton sequences, which can be categorized into two primary training paradigms: supervised learning and self-supervised learning. However, the former one-hot classification requires labor-intensive annotations of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in the pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C^2VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representation from the Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose the intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide the degree of pulling vision-language knowledge prompts and corresponding skeletons closer. These soft instance discrimination and self-knowledge distillation strategies contribute to the learning of better skeleton-based action representations from the noisy skeleton-vision-language pairs. During the inference phase, our method requires only the skeleton data as the input for action recognition and no longer needs vision-language prompts. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms the previous methods and achieves state-of-the-art results.

Abstract:
With the growing popularity of high-resolution (HR) video and the continuous growth of network bandwidth, the challenge of object removal detection in HR videos has attracted significant attention. Expert forgers leverage the rich detail in HR videos for meticulous pixel manipulation and apply sophisticated postprocessing techniques to hide high-frequency artifacts, thereby making forgery detection and localization more difficult when existing schemes are used. Additionally, the end-to-end framework simplifies the detection and localization process, which has not been considered in previous work. To solve the above issues, a spatiotemporal encoder-decoder network (SEDN) is proposed for end-to-end object removal forgery detection in HR videos. In the SEDN, a new model composed of a 3D asymmetric dual-stream network (3D-ADSN) and Transformer is proposed. The 3D-ADSN is utilized as the encoder, which fully integrates the high-frequency and low-frequency spatiotemporal information of videos. Transformer is utilized as the decoder to capture the global structure spatiotemporal information of the long-range feature sequence obtained by the encoder. This network combination successfully achieves simultaneous detection in the temporal and spatial domains without any additional postprocessing calculations. The experimental results demonstrate the better performance of the SEDN at different resolutions.

Abstract:
Generating food images from recipes is a challenging task in food analysis, as recipes contain lengthy texts far beyond the semantic information in food images, making it difficult to align the features of two modalities. Existing studies usually concatenate the representations of ingredients and cooking instructions directly, and use the concatenated representations to generate food images through generative adversarial networks (GANs). However, previous models generally ignore the sequential information contained in complicated procedural instructions, which leads to semantic inconsistency between recipes and generated food images. Furthermore, it is still difficult for current models to distinguish and control fine-grained features, causing the entangled ingredient features in food images. To this end, we propose CookGALIP, which strengthens semantic consistency and controllability for food image generation. Based on the recently proposed text-to-image framework GALIP, two modules are specially designed. 1) To incorporate the sequential relationships into the food image generation process, we propose a Recipe Fusion Module (RFM) to fuse the semantics of cooking instructions, so as to balance the semantic complexity between modalities and improve the semantic consistency of recipes and generated food images. 2) To distinguish and control the fine-grained ingredient features, we introduce the Ingredient Control Module (ICM) to generate sequential ingredient prompts, which enables more refined control over the recipe-to-food synthesis process. Experimental results on Recipe1M and Vireo Food-172 datasets show that the proposed model outperforms the state-of-the-art methods.

Abstract:
The extensive range of food safety standards poses a significant challenge to efficiently accessing specific information within this domain, necessitating innovative solutions to streamline the process. In response, researchers are focusing on constructing a knowledge graph based on food safety standards to facilitate efficient associative querying. Named entity recognition is a pivotal element in this endeavor due to its critical impact on the accuracy and quality of the knowledge graph. To address the nuanced challenges of accurately identifying nested entity boundaries and rectifying entity class imbalances in food safety standards, we present PGD-GP, a novel Chinese named entity recognition model. This model is based on Projected Gradient Descent for adversarial training and Global Pointer. The model innovatively refines the Chinese BERT model at the encoding layer, employing the adversarial training method PGD to iteratively introduce perturbations to character vectors, thereby significantly enhancing the model's robustness and adaptability to texts. The decoding layer leverages Global Pointer to accurately determine dependencies and relative positional relationships between characters, thus facilitating more precise recognition of entity boundaries. To combat the issue of class imbalance, Circle Loss is utilized as the loss function. We developed and annotated the Food Safety Standard Dataset using a specifically tailored ontology rule for food safety standards. Comparative experiments conducted on the Food Safety Standard Dataset and the public Resume dataset demonstrate that PGD-GP surpasses six mainstream baseline models in performance, thereby validating the effectiveness and robustness of PGD-GP. Building upon the foundation of PGD-GP and the Food Safety Standard Dataset, we implemented a prototype system that integrates a food safety standard-based knowledge graph with associated queries. This system serves as an efficient, accurate, and comprehensive intelligent assistant, enabling researchers to effectively acquire food safety standard information.
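
The iterative perturbation step of PGD adversarial training can be sketched generically as follows. This is the textbook form of projected gradient ascent on an embedding perturbation with an L2-ball projection, not the PGD-GP implementation; the step size `alpha`, radius `epsilon`, number of steps, and the toy gradient function are all assumptions chosen for illustration.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def pgd_perturbation(embedding, grad_fn, epsilon=1.0, alpha=0.3, steps=3):
    """Return a perturbation `delta` with ||delta||_2 <= epsilon.

    grad_fn(x) must return the gradient of the loss with respect to x.
    Each step moves along the normalized gradient (ascent on the loss)
    and then projects delta back onto the L2 ball of radius epsilon.
    """
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        perturbed = [e + d for e, d in zip(embedding, delta)]
        grad = grad_fn(perturbed)
        grad_norm = l2_norm(grad)
        if grad_norm == 0.0:
            break
        delta = [d + alpha * g / grad_norm for d, g in zip(delta, grad)]
        delta_norm = l2_norm(delta)
        if delta_norm > epsilon:
            delta = [d * epsilon / delta_norm for d in delta]  # projection
    return delta


# Toy loss L(x) = 0.5 * ||x||^2, whose gradient is simply x.
toy_grad = lambda x: list(x)
char_vector = [3.0, 4.0]  # hypothetical character embedding

delta = pgd_perturbation(char_vector, toy_grad, steps=3)
print(l2_norm(delta) <= 1.0)  # True: the perturbation stays inside the ball
```

In adversarial training, the model would then be updated on the loss computed at `embedding + delta`, so that it learns to be stable under such worst-case perturbations.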

Abstract:
Light field (LF) captures both spatial and angular information of scenes, enabling accurate depth estimation. However, previous deep learning methods have typically modeled surface depth only, while ignoring the continuous nature of depth in 3D scenes. In this paper, we use a displacement field (DF) to describe this continuous property, and propose a novel depth-continuous scene representation for robust LF depth estimation. Experiments demonstrate that our representation enables the network to generate highly detailed depth maps with fewer parameters and faster speed. Specifically, inspired by the signed distance field in 3D object description, we aim to exploit the intrinsic depth-continuous property of 3D scenes using DF, and define a novel depth-continuous scene representation. Then, we introduce a simple yet general learning framework for depth-continuous scene embedding, and the proposed network, DepthDF, achieves state-of-the-art performance on both synthetic and real-world LF datasets, ranking 1st on the HCI 4D Light Field benchmark. Furthermore, previous LF depth estimation methods can also be seamlessly integrated into this framework. Finally, we extend this framework beyond LF depth estimation to various tasks, including multi-view stereo depth inference, LF super-resolution, and LF salient object detection. Experiments demonstrate improved performance when the continuous scene representation is applied, suggesting that our framework can potentially bring insights to more fields.

Affiliations: School of Information Science and Technology the Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, the Beijing Laboratory of Smart Environmental Protection, the Beijing Key Laboratory of Computational Intelligence and Intelligent System, the Beijing Institute of Artificial Intelligence, Beijing University of Technology, Beijing, China; School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand; College of Computing and Data Science, Nanyang Technological University, Singapore; Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China

Abstract:
This paper proposed a novel image-based air pollution monitor (IAPM) by incorporating local and global information in the self-adaptive multiscale transform domain, so as to achieve the timely and effective leakage detection of typical air pollutants from a single image. To be specific, this paper first developed a screen-shaped module according to two significant findings in visual neuroscience, which include the high sensitivity of human eyes to horizontal and vertical stimuli and the center-surround inhibition, by designing and fusing the square module, horizontal strip module and vertical strip module in parallel for simulating the behaviour of human eyes to extract local features. Second, the learnable weights and proportional mapping were applied to incorporate the screen-shaped module and lightweight vision transformer as backbone, towards more richly exploiting and fusing local and global information just as a brain perceives external stimuli. Third, a new self-adaptive multiscale transform domain method was devised based on two motivations from the visual characteristics of multiscale perception and the brain characteristics of self-adaptive domain transform to modify the backbone by using the operations of pooling and pointwise convolution. Extensive experiments implemented on the datasets of carbon particulate matter and ethylene leakage confirmed the superior monitoring performance of the proposed IAPM model beyond the state-of-the-art (SOTA) peers by an accuracy gain of about 4%. Furthermore, the proposed IAPM model only required 0.089 GFLOPs and 0.15 million model parameters, remarkably outperforming SOTA competitors in computational efficiency and storage resources.

Abstract:
The image enhancement task requires a complex balance between extracting high-level contextual information and optimizing spatial details in the image to improve the visual quality. Most existing methods have limited capability in capturing contextual features and optimizing spatial details when they rely on only a single modality. To address the above issues, this paper introduces a novel multi-modal image enhancement network based on Transformer, named TITFormer, which is the first to combine textual and simulating infrared modalities for this important task. TITFormer comprises a text channel attention fusion (TCF) network block and an infrared-guided spatial detail optimization (SDO) network block. The TCF extracts contextual features from the high-dimensional features compressed after spatial channel transformation of the textual feature and image feature. The SDO module uses simulating infrared images characterized by pixel intensity to guide the optimization of spatial details with contextual features adaptively. Experimental results demonstrate that TITFormer achieves state-of-the-art performance on two publicly available benchmark datasets.

Abstract:
In previous efforts to counteract Deepfake, detection methods were the most widely adopted, but they can only act after the fact and cannot undo the harm. Preemptive defense has recently gained attention as an alternative, but such defense works have either limited their scenario to facial-reenactment Deepfake models or targeted only a specific face-swapping Deepfake model. Motivated to fill this gap, we start by establishing the Deepfake scenario modeling and finding the scenario difference among categories, then move on to the face-swapping scenario setting overlooked by previous works. Based on this scenario, we first propose a novel Black-Box Penetrating Defense Process that enables defense against face-swapping models without prior model knowledge. Then we propose a novel Double-Blind Feedback Regulation Strategy to address the previously ignored practical problem of avoiding alarming distortions after defense, which helps conduct valid preemptive defense against face-swapping Deepfake models in reality. Comparative experiments with state-of-the-art defense methods are conducted against popular face-swapping Deepfake models, showing that our proposed method is valid under practical circumstances.

Abstract:
Sign language provides communication support for deaf and severely hearing-impaired people. Sign language translation (SLT) bridges the hearing-impaired and hearing communities. Existing SLT methods use gloss as intermediate supervisory information to help the model sense gesture boundaries and understand global semantics. However, annotating gloss is costly, especially in multimodal SLT tasks. This paper proposes GFTLS-SLT, a gloss-free Transformer-based lexical and semantic awareness framework for SLT. The multimodal alignment and fusion module in GFTLS-SLT utilizes cross-attention to align multimodal features, and fuses them using the improved statistical and contrastive attention. To replace the role of gloss, GFTLS-SLT designs gesture lexical awareness (GLA) and global semantic awareness (GSA) modules. The GLA module utilizes the defined observation matrix to obtain the lexical meaning matrix, and makes the model sense gesture boundaries by the designed dynamic step-size lexical matching algorithm. The GSA module uses a multimodal semantic header to represent the global semantics of the sign language, which is aligned with the spoken-language semantics in the semantic space. In addition, the experimental results of GFTLS-SLT on publicly available multimodal SLT datasets show that its performance reaches that of SLT methods with gloss supervision.

Abstract:
Multimodal sarcasm detection, aiming to uncover sarcastic sentiment behind multimodal data, has gained substantial attention in multimodal communities. Recent advancements in multimodal sarcasm detection (MSD) methods have primarily focused on modality alignment with pre-trained vision-language (V-L) models. However, text-image pairs often exhibit weak or even opposite semantic correlations in MSD tasks. Consequently, directly aligning these modalities can potentially result in feature shift and inter-class confusion, ultimately hindering the model's ability. To alleviate this issue, we propose the Enhancing Semantic Awareness Model (ESAM) for multimodal sarcasm detection. Specifically, we first devise a Modality-decoupled Framework (MDF) to separate the textual and visual features from the fused multimodal representation. This decoupling enables the parallel integration of the Sentimental Congruity Constraint (SCC) within both visual and textual latent spaces, thereby enhancing the semantic awareness of different modalities. Furthermore, given that certain outlier samples with ambiguous sentiments can mislead the training and weaken the performance of SCC, we further incorporate Automatic Outlier Masking. This mechanism automatically detects and masks the outliers, guiding the model to focus on more informative samples during training. Experimental results on two public MSD datasets validate the robustness and superiority of our proposed ESAM model.

Abstract:
An ideal artificial intelligence (AI) system should have the capability to continually learn like humans. However, when learning new knowledge, AI systems often suffer from catastrophic forgetting of old knowledge. Although many continual learning methods have been proposed, they often ignore the issue of misclassifying similar classes and make insufficient use of textual priors of visual classes to improve continual learning performance. In this study, we propose a continual learning framework based on a pre-trained vision-language model (VLM) that does not require storing old class data. This framework utilizes parameter-efficient fine-tuning of the VLM's text encoder for constructing a shared and consistent semantic textual space throughout the continual learning process. The textual priors of visual classes are encoded by the adapted VLM's text encoder to generate discriminative semantic representations, which are then used to guide the learning of visual classes. Additionally, fake out-of-distribution (OOD) images constructed from each training image further assist in the learning of visual classes. Extensive empirical evaluations on three natural datasets and one medical dataset demonstrate the superiority of the proposed framework.

Abstract:
Optical remote sensing images are inevitably affected by cloud cover. To remove clouds from optical remote sensing images, a series of deep learning-based thin cloud removal methods have been developed. However, these methods have not explored the long-range modeling ability of state space models in optical remote sensing image thin cloud removal. In this paper, we propose a frequency-domain assisted Mamba for thin cloud removal, which is called CR-Famba. In CR-Famba, to better extract global and local features of images, we design a frequency-domain assisted state space layer (FDA-SSL). The FDA-SSL consists of two core components: residual state space block (RSSB) and frequency domain detail enhancement block (FDDEB). The RSSB utilizes the visual state space module (VSSM) to extract long-range dependencies of images from a spatial perspective while adding convolutional layers to overcome local pixel forgetting. Due to the rich detailed information of remote sensing images, we present FDDEB equipped with discrete wavelet transform (DWT) to supplement the extracted local information from the frequency domain perspective. We conduct experiments on different types of cloud-containing datasets, and the results show that our method can recover images with clearer texture details compared to other methods.
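
To illustrate what a discrete wavelet transform contributes as a frequency-domain view, the sketch below implements a one-level 1D Haar DWT and its inverse. The choice of the Haar wavelet and the 1D setting are assumptions made to keep the example short; the abstract does not say which wavelet CR-Famba uses, and an image pipeline would apply the transform along both rows and columns.

```python
import math


def haar_dwt(signal):
    """One-level orthonormal Haar DWT of an even-length sequence.

    Returns (approx, detail): the low-frequency averages and the
    high-frequency differences of neighbouring sample pairs.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt; reconstructs the original sequence."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / s, (a - d) / s])
    return signal


row = [4.0, 2.0, 5.0, 5.0]          # one row of pixel intensities
approx, detail = haar_dwt(row)
restored = haar_idwt(approx, detail)
print(detail[1])                    # 0.0: the flat pair carries no detail
print([round(x, 6) for x in restored])
```

The detail coefficients are non-zero only where neighbouring pixels differ, which is why such sub-bands are a natural place to look for texture and edge information.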

Abstract:
In this article, we develop a hierarchical uncertainty-aware 360° image salient object detection methodology that explicitly explores the geometric and spatial complementary coherence of Tangent projection (TP) and Equirectangular projection (ERP) by a collaborative learning strategy. Concretely, to mitigate spherical distortion, we first intend to learn saliency-related features from less-distorted tangent images, in which a deformation-aware attention block is introduced to mitigate the geometric distortion caused by projecting a 360° image onto a 2D plane. However, the discrepancies among tangent images pose a new challenge to 360° image salient object detection. To tackle this issue and achieve accurate localization for salient objects of all sizes, we design a spatial-frequency saliency feature aggregation module that leverages fast Fourier convolution to capture global contextual information from ERP images, thereby obtaining more representative saliency features. Moreover, a hierarchical uncertainty-aware bi-projection consistency learning module with strong local-global information embedding capabilities is constructed, which learns the geometric and spatial correlations between tangent images and ERP images via a collaborative learning strategy. Ultimately, salient object maps are produced for 360° images on the basis of the merged saliency features driven by the uncertainty. Extensive experiments show that our developed method improves $\mathrm{F}_\beta^\sigma$ by an average of 31.67% compared to twenty existing advanced methods on the publicly available 360-SOD dataset.

Abstract:
Traditional multilabel feature selection (MFS) typically relies on pre-computing global information within the feature space. However, in real-world applications, features are dynamically generated and continuously arrive over time, known as streaming features, rendering many existing approaches ineffective. Although some MFS methods for streaming features have been developed, several challenges persist: (1) Previous research often uses deterministic strategies to model streaming feature evaluation, failing to process fuzzy information effectively; (2) The maximum correlation between features and class is emphasized, while inter-class separability is ignored, leading to inaccurate feature evaluation; (3) The continuous influx of streaming features introduces dynamics and unknowns into the data distribution, which has been largely overlooked in previous work; (4) Streaming feature selection requires immediate feedback on newly arriving features, posing challenges to the algorithm’s real-time responsiveness. Motivated by these observations, this paper introduces a novel online MFS strategy for streaming features. First, the weighted manifold distance is designed, and the fuzzy manifold similarity learning strategy is formalized to analyze the instance relationships of unknown distribution. Second, the fuzzy manifold intra-class correlation and inter-class separability are devised to quantify feature discriminability. Finally, a novel multilabel streaming feature analysis framework is established, with feature discriminability as the guiding factor. Incoming features are categorized as weakly relevant, strongly relevant, or redundant, culminating in a reliable feature selection subset. Extensive experiments on fifteen public datasets demonstrate that our algorithm achieves competitive performance compared to nine state-of-the-art offline and online algorithms.

Abstract:
We investigate multitask edge-user communication-computation resource allocation for 360° video streaming in an edge-computing enabled millimeter wave (mmWave) multi-user virtual reality system. To balance the communication-computation trade-offs that arise herein, we formulate a video quality maximization problem that integrates interdependent multitask/multi-user action spaces and rebuffering time/quality variation constraints. We formulate a deep reinforcement learning framework for multi-task rate adaptation and computation distribution (MTRC) to solve the problem of interest. Our solution does not rely on a priori knowledge about the environment and uses only prior video streaming statistics (e.g., throughput, decoding time, and transmission delay), and content information, to adjust the assigned video bitrates and computation distribution, as it observes the induced streaming performance online. Moreover, to capture the task interdependence in the environment, we leverage neural network cascades to extend our MTRC method to two novel variants denoted as R1C2 and C1R2. We train all three methods with real-world mmWave network traces and 360° video datasets to evaluate their performance in terms of expected quality of experience (QoE), viewport peak signal-to-noise ratio (PSNR), rebuffering time, and quality variation. We outperform state-of-the-art rate adaptation algorithms, with C1R2 showing the best results and achieving 5.21-6.06 dB PSNR gains, 2.18-2.70x rebuffering time reduction, and 4.14-4.50 dB quality variation reduction.

Abstract:
Reversible data hiding in encrypted images (RDHEI) has been recognized as an effective method for overcoming management difficulties within picture archiving and communication systems (PACS). However, most existing RDHEI algorithms still encounter notable challenges when applied to PACS, specifically in terms of their key management, embedding capacity, and security. This paper introduces a novel framework and corresponding algorithm for reversible data hiding in encrypted medical images (RDHEMI) to bridge this gap. The framework employs a unique key for each patient and keeps the key linked to a patient's images consistent regardless of changes in doctor, thereby addressing key management challenges. In the proposed algorithm, Huffman tree coding (HTC) integrates Huffman coding with innovative leaf-to-leaf coding, achieving better compression performance for medical images than move-to-front (MTF) cache and Huffman coding, as medical images contain more smooth areas. Count-encryption (CE) produces encryption keys according to the frequency of encryption occurrences for an image and ensures a peak signal-to-noise ratio under 8 dB for multiple encryptions with the same key, enhancing the algorithm’s resistance to attacks. The experimental results demonstrate that the proposed algorithm achieves high security against various attacks and outperforms existing algorithms in terms of time complexity and embedding capacity, with an improvement of 0.21 bpp.

Abstract:
The rapid adoption of remote work, online conferencing, and shared-screen collaboration has significantly increased the usage of screen content videos (SCVs), creating a growing need for reliable quality assessment to maintain excellent quality of service. While several full-reference SCV quality assessment (SCVQA) methods have been proposed, their practical application is often limited by the unavailability of reference videos. Existing no-reference SCVQA (NR-SCVQA) methods rely on handcrafted features and focus solely on specific distortions and features, potentially limiting their generalization ability. Moreover, they fail to explore the underlying spatiotemporal information of SCVs, which could hinder their performance. In this work, we propose a novel deep learning-based NR-SCVQA model specifically tailored to capture the comprehensive spatiotemporal features of SCVs to overcome these issues and the challenges posed by the SCVQA task. Our approach incorporates a dual-channel spatiotemporal convolutional neural network (DCST-CNN) module to extract both content-aware and edge-aware spatiotemporal quality features, which enables effective spatiotemporal quality feature representation learning for the downstream SCVQA task. Building upon the DCST-CNN, we further propose a Temporal Pyramid Transformer (TPT) module to fuse spatiotemporal features across multiple temporal scales, enabling the model to capture both short-term and long-term temporal dependencies within an SCV for hierarchical learning. The proposed DCST-CNN and TPT modules work together to provide a robust and accurate NR-SCVQA framework. We conduct experiments on SCVQA databases to validate the effectiveness of our model, which outperforms existing state-of-the-art NR-SCVQA methods. The results demonstrate the strength and applicability of our approach in real-world SCVQA tasks.

Abstract:
Screen-camera resilient watermarking addresses issues such as privacy leakage and copyright infringement in digital images to some extent during screen-camera communication. However, in screen-camera scenarios, uncontrolled shooting environments, various display devices, and different lens types introduce more complex noise into the watermarked images. Because some noise generated during the screen-camera process cannot be quantitatively analyzed, the integrity of the embedded watermark is compromised, so copyright verification and information acquisition remain difficult. To solve this problem, we establish a large-scale screen-camera image dataset (SCISet) and propose a noise simulation network (NoS-Net). Specifically, we obtain 36,000 screen-camera images under various shooting environments with multiple types of screens and cameras. Then, we use SCISet to train the proposed NoS-Net based on the U-Net architecture, which can learn multi-level and complementary feature information of screen-camera images, enhancing its ability to simulate complex noise. Experimental results show that integrating the proposed NoS-Net into mainstream screen-camera resilient watermarking methods significantly improves their ability to resist screen-camera noise attacks. Furthermore, the diversity of SCISet plays an important role in advancing robust watermarking research.

Abstract:
Attribute-Based Signature (ABS) provides a critical solution for ensuring data integrity, fine-grained access control, and anonymous authentication in security-sensitive systems such as the Multimedia Internet of Things (MIoT) and multimedia streaming platforms. However, practical adoption of ABS faces three fundamental challenges: vulnerability to key exposure and escrow risks, linear growth of computational cost, and insufficient robustness in multi-authority environments. To address these issues, we propose a forward secure and threshold authorized multi-authority ABS scheme called FORT in this paper. By employing a binary tree structure to divide multiple time periods, historical signatures remain valid even in the event of key exposure. Furthermore, to balance robustness and resistance to corruption while mitigating the key escrow problem, we construct a threshold authorized multi-authority structure based on Lagrange interpolation. This structure effectively reduces the impact of a single authority on the MIoT. Additionally, through the adoption of outsourced computation technology, which offloads complex computations in the signature and verification phases to the edge server, the computational burden for both the signer and verifier is significantly reduced to a small constant. Rigorous security analysis demonstrates that the FORT scheme achieves forward security, collusion attack resistance, corrupt authority resistance and anonymity. Theoretical comparisons and simulation experiments demonstrate the lightweight nature of the FORT scheme in terms of computation and communication.
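
To make the role of Lagrange interpolation in threshold authorization concrete, the following is a minimal sketch of classic (t, n) threshold secret sharing over a prime field. It is not the FORT construction itself: the prime modulus, the function names `split_secret` and `recover_secret`, and the use of a single shared integer secret are all assumptions chosen for illustration.

```python
import secrets

PRIME = 2**127 - 1  # Mersenne prime used as the field modulus (illustrative choice)

def split_secret(secret, threshold, num_shares):
    """Split `secret` into `num_shares` points; any `threshold` of them recover it."""
    # Random polynomial f of degree threshold-1 with f(0) = secret.
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of f(x) mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Recover f(0) by Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, with `shares = split_secret(42, 3, 5)`, any three of the five shares passed to `recover_secret` return 42, whereas fewer than three reveal nothing about the secret. This is the property that limits the influence of any single authority.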

Abstract:
Recently, sentiment analysis research has made significant improvements in addressing sentiment and subjectivity within textual content. The advent of multimodal deep learning techniques has further broadened this scope, enabling the integration of diverse modalities such as voice and image features alongside text. However, despite these advancements, the analysis of the Korean language remains challenging due to its inherently agglutinative nature and linguistic ambiguity, primarily examined at the sentence level. To effectively address this challenge, we propose a novel Multimodal Sentimental Deep Learning Framework for Korean (MSDLF-K), which can examine not only Korean text but also its associated speech. Our framework, MSDLF-K, integrates spectrograms and waveforms from Korean voice data with embedding vectors derived from script sentences, creating a unified multimodal representation. This approach facilitates the identification of both shared and unique features within the latent space, thereby offering valuable insights into their respective impacts on sentiment analysis performance. To validate the efficacy of MSDLF-K, we conducted a set of experiments using the emotion speech synthesis dataset. Our findings demonstrate that MSDLF-K achieves a remarkable accuracy of 79.0% in valence and 81.7% in arousal for emotion classification, metrics previously unexplored in the literature. Furthermore, empirical analysis reveals the significant influence of multimodal representations, encompassing both text and voice, on enhancing emotion analysis performance. In summary, our study not only presents a pioneering solution for sentiment analysis in the Korean language but also underscores the importance of incorporating multimodal approaches for more comprehensive and accurate sentiment analysis across diverse linguistic contexts.

Abstract:
Semantic foggy scene understanding (SFSU) emerges as a challenging task under out-of-domain distribution (OD) due to uncertain cognition caused by degraded visibility. Under the strong assumption of data centralization, unsupervised domain adaptation (UDA) reduces vulnerability in the OD scenario. However, the enlarged domain gap and growing privacy concerns heavily challenge conventional UDA. Motivated by gap decomposition and data decentralization, we establish a decentralized domain adaptation (DDA) framework called Translate thEn Adapt (abbr. TEA) for privacy preservation. Our highlights are twofold. (1) Regarding federated hallucination translation, a Disentanglement and Contrastive-learning based Generative Adversarial Network (abbr. DisCoGAN) is proposed to impose a contrastive prior and disentangle the latent space in cycle-consistent translation. To yield domain hallucination, each client minimizes the cross-entropy of its local classifier but maximizes the entropy of the global model to train the translator. (2) Regarding source-free regularization adaptation, a Prototypical-knowledge based Regularization Adaptation (abbr. ProRA) is presented to align the joint distribution in the output space. Soft adversarial learning relaxes the binary label to rectify inter-domain discrepancy and inner-domain divergence. Structure clustering and entropy minimization drive intra-class features closer and inter-class features apart. Extensive experiments demonstrate the efficacy of TEA, which achieves 55.26% and 46.25% mIoU in adaptation from GTA5 to Foggy Cityscapes and Foggy Zurich, respectively, outperforming other DDA methods for SFSU.

Abstract:
The learned image compression (LIC) technique has surpassed the state-of-the-art traditional codecs (H.266/VVC) in terms of rate-distortion (R-D) performance. Its real-time deployments are far advanced. To achieve more flexible deployments, an LIC technique should be able to adjust its computational complexity and rate as demanded by a situation and its environment. In this paper, we propose a unified Rate-Distortion-Complexity (R-D-C) framework for LIC under channel energy concentration criteria. Specifically, we first introduce an Energy Asymptotic Nonlinear Transformation (EANT) designed to directly concentrate the channel energy of latent representations, thus laying the groundwork for scalable entropy coding. Next, leveraging this energy concentration characteristic, we propose a corresponding Heterogeneous Scalable Entropy Model (HSEM) for flexibly scaling bitstreams as needed. Finally, utilizing the proposed EANT, we construct a fine-grained scalable codec that, in combination with HSEM, forms a comprehensive scalable R-D-C framework under the energy concentration criteria. The experimental results demonstrate that the proposed method enables seamless transitions between 13 different widths of sub-models within a single network, allowing for fine-grained control over the model bitrate, complexity, and hardware inference time. Additionally, the proposed method exhibits competitive R-D performance compared to many existing methods.

Abstract:
The commonly used standard convolutional layers cannot adaptively adjust the number and locations of sampling points according to the scales and shapes of tampered regions, which increases the difficulty of detecting images containing tampered regions of different sizes. Therefore, selective sampling attention (SSA) is proposed to automatically learn, through backpropagation, the number and locations of sampling points as well as the weight of each sampling point within a certain context range of the input feature map, which helps the network better adapt to tampered regions of different scales and shapes. In addition, the self-correlation calculation (SCC), which computes the similarity between every two feature points in a feature map, necessarily incurs an expensive computational burden when used for high-resolution feature maps. To remedy this problem, a two-step SCC (TS-SCC) with a low computational burden is proposed to pick out highly similar regions by means of the feature similarity obtained from a low-resolution version of the input feature map, so that the high-resolution version only needs to calculate the similarity between every two feature points within its high-similarity regions. Finally, to predict the edges and interiors of copy-move tampered regions more precisely, an adaptive dual-branch feature fusion module is proposed, which employs a lightweight multi-scale atrous convolutional module to adaptively fuse the multi-level features before TS-SCC and the correlation features after TS-SCC, thereby improving the detection performance. Combining these three structures, a lightweight, fast, low-cost and high-precision copy-move forgery detection (CMFD) network, ST-Net, is designed in this paper. Experimental results on four publicly available datasets verify that ST-Net outperforms several related CMFD networks in terms of detection accuracy, number of parameters, computational cost and inference time.

Abstract:
Purely chromatic background images are widely used in computer wallpapers and advertisements, leading to issues such as copyright infringement and harm to the interests of rights holders. Image hashing is a technique used for comparing the similarity between images, and is often used for image verification, search, and copy detection due to its insensitivity to subtle changes in the original image. In a purely chromatic background image, the central detail of the image is the primary part and the key for copyright authentication. As the perception hash (pHash) algorithm only retains the low-frequency portion of the discrete cosine transform (DCT) matrix, it is unsuitable for purely chromatic background images. To deal with this issue, we propose an improved perception hash (ipHash) algorithm that enhances the universality of the algorithm by extracting purely chromatic background image features. Meanwhile, the development of image hashing is restricted by the requirement of a trusted third party. To solve this issue, a secure blockchain-based image copyright protection scheme is designed. It realizes copyright authentication and traceability, and overcomes the lack of trusted third parties. Experimental results show that the proposed method outperforms the state-of-the-art image copyright protection schemes.
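
The baseline pHash behaviour described above (keeping only the low-frequency corner of the DCT matrix) can be sketched as follows. This is the commonly described classic pHash, not the proposed ipHash; the function names, the 8x8 hash size, and the median threshold are assumptions for illustration, and the input is assumed to be an already-resized square grayscale image given as a list of lists.

```python
import math
from statistics import median

def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]

def dct_2d(img):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in img]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

def phash_bits(gray, hash_size=8):
    """Hash bits from the low-frequency top-left corner of the DCT matrix."""
    coeffs = dct_2d(gray)
    low = [coeffs[r][c] for r in range(hash_size) for c in range(hash_size)]
    threshold = median(low)
    return [1 if value > threshold else 0 for value in low]

def hamming(bits_a, bits_b):
    """Number of differing bits; small distances indicate similar images."""
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

Because only the low-frequency corner survives, fine central detail on a uniform background contributes little to the hash, which is the limitation the abstract attributes to pHash for purely chromatic background images.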

Abstract:
Periocular recognition is regarded as an alternative trait for biometric recognition that can effectively solve the identification problem under large occlusions. However, few datasets are tailored for periocular recognition. As a compromise, most studies use iris datasets captured at near-infrared wavelengths, which miss information about the eyebrows or eyelids. In this paper, a challenging dataset for real scenarios named CASIA-PR-V1 with evaluation protocols is released for periocular recognition. It is collected from multiple types of mobile devices with different resolutions or wavelengths. A rich set of attributes, e.g., ethnicities, is tagged to support fine-grained classification tasks. Moreover, we consider a wide range of noisy data in unconstrained environments, especially glasses. Superior to its counterparts, this periocular dataset is highly valuable for studying cross-device and cross-spectral periocular recognition with occlusions, as well as fine-grained attribute classification. Additionally, a multiscale disentangled model is proposed to extract discriminating representations for periocular recognition with severe occlusions. Extensive experiments are conducted on CASIA-PR-V1, and the results indicate the superiority of our model for unconstrained periocular recognition.

Abstract:
Displaying high-quality images on edge devices, such as augmented reality devices, is essential for enhancing the user experience. However, these devices often face power consumption and computing resource limitations, making it challenging to apply many deep learning-based image compression algorithms in this field. Implicit Neural Representation (INR) for image compression is an emerging technology that offers two key benefits compared to cutting-edge autoencoder models: low computational complexity and parameter-free decoding. It also outperforms many traditional and early neural compression methods in terms of quality. In this study, we introduce a new Mixed AutoRegressive Model (MARM) to significantly reduce the decoding time of the current INR codec, along with a new synthesis network to enhance reconstruction quality. MARM includes our proposed AutoRegressive Upsampler (ARU) blocks, which are highly computationally efficient, and ARM from previous work to balance decoding time and reconstruction quality. We also propose enhancing ARU’s performance using a checkerboard two-stage decoding strategy. Moreover, the ratio of different modules can be adjusted to maintain a balance between quality and speed. Comprehensive experiments demonstrate that our method significantly improves computational efficiency while preserving image quality. With different parameter settings, our method can achieve more than an order of magnitude acceleration in decoding time without industrial-level optimization, or achieve state-of-the-art reconstruction quality compared with other INR codecs. To the best of our knowledge, our method is the first INR-based codec comparable with (Ballé et al., 2018) in both decoding speed and quality while maintaining low complexity.

Abstract:
While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module, a novel, training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process encompasses two main steps: (i) Object generation, which adjusts the latent encoding to guarantee object generation and directs it within specified bounding boxes, and (ii) Attribute binding, which ensures that generated objects adhere to their specified attributes in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I models like Stable Diffusion and Gligen, markedly enhancing models’ performance in addressing these key challenges. We assess our technique on the well-established CompBench and TIFA score benchmarks and on the HRS dataset, where B2B not only surpasses methods specialized in either attribute binding or layout guidance but also uniquely excels by integrating these capabilities to deliver enhanced overall performance.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable abilities in image-language reasoning. However, they deal with Video Question Answering (VideoQA) insufficiently, especially for questions demanding causal-temporal reasoning. Typically, they directly concatenate features of uniformly sampled frames as visual inputs for VideoQA. This gives rise to two challenges. For one thing, uniformly sampled frames are discrete and separately distributed across different timestamps, disrupting the coherence of question-critical events or actions. For another, it considers every scene within videos equally and introduces redundant frames that may distract the model from discovering the truth. Towards this, we highlight the importance of identifying continuous frames that are crucial for answering the questions, and propose a lightweight and differentiable Coherence Recognizer (CoRe) to achieve this. Guided by the semantics of questions, CoRe computes scores recording the relevance between each frame and the question, and selects a set of continuous frames with the highest scores for answer prediction. Additionally, CoRe encodes the unselected frames into a short and coarse-grained representation as a completion of the general context. Equipped with CoRe, we can efficiently fine-tune the current MLLMs for VideoQA in an end-to-end manner, without suffering from the problems of incoherence or distraction. Extensive experiments demonstrate that our method achieves substantial improvements on several VideoQA benchmarks.
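
The idea of choosing a run of continuous frames with the highest question-relevance scores can be illustrated with a simple sliding-window search. This is only a sketch under stated assumptions: the function name `best_contiguous_window` and the fixed window length are hypothetical, and the hard argmax used here is not differentiable, whereas CoRe is described as differentiable.

```python
def best_contiguous_window(scores, window):
    """Return (start, end) of the length-`window` run of frames with the highest total score.

    `scores[i]` is the relevance of frame i to the question; `end` is exclusive.
    """
    if not 0 < window <= len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    current = sum(scores[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(scores) - window + 1):
        # Slide the window by one frame: add the new frame, drop the old one.
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + window
```

With per-frame scores `[0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.3]` and a window of three frames, the selected span is frames 2 to 4; the remaining frames would then be summarized into a coarse context representation.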

Abstract:
Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve dense predictions without laborious annotations. However, due to ambiguous contexts and fuzzy regions, the performance of WSSS, particularly during the stages of generating Class Activation Maps (CAMs) and refining pseudo masks, is widely hindered by ambiguity. Nevertheless, this issue has received little attention in previous literature. In this work, we propose UniA, a unified single-staged WSSS framework, to efficiently tackle this issue from the perspectives of uncertainty inference and affinity diversification. When activating class objects, we argue that false activation stems from the bias toward ambiguous regions during feature extraction. Therefore, we formulate a robust feature representation with a Gaussian distribution and introduce uncertainty estimation to avoid the bias. A distribution loss is proposed to supervise the process, which effectively captures the ambiguity and models the complex dependencies among features. When refining pseudo labels, we observe that the affinity from the prevailing refinement methods tends to be overly similar among ambiguities. To this end, we design an affinity diversification module to promote diversity among semantics. A mutual complementing refinement is first proposed to statically rectify the ambiguous affinity with multiple inferred pseudo labels. Then a contrastive affinity loss is further designed to dynamically diversify the relations among unrelated semantics. It stably propagates the diversity into the feature representation and helps generate better pseudo masks. Extensive experiments are conducted on PASCAL VOC, MS COCO, and medical ACDC datasets, which validate the efficiency of UniA in tackling ambiguity and its superiority over recent single-staged and even most multi-staged competitors.

Abstract:
Hyperspectral image (HSI) super-resolution through the fusion of low-resolution HSI (LrHSI) and high-resolution multispectral image (HrMSI) has emerged as a critical technique for enhancing the quality of HSIs. Recent progress in this field predominantly assumes a known mapping relationship between the high-resolution HSI (HrHSI) and its low-resolution version, relying on networks to learn this mapping to generate the HrHSI. However, this assumption is often unrealistic in practical applications. To address this limitation, we propose the Spatial-Spectral-Integrated Transparent Diffusion Model (S^2TD) for blind HSI super-resolution (HSI-SR), which is more adaptive to scene-variant degradations with a universal framework for both spatial and spectral reconstruction. Specifically, we design a multi-order degradation pool to generate diverse samples, thereby reducing the distribution gap with low-resolution images in real scenes. Additionally, we develop a spatial-spectral consistent degradation model, which is iteratively solved using an optimization algorithm and unrolled into neural networks for separate restoration in spatial and spectral aspects. Furthermore, the capability of progressive reconstruction in the diffusion model is exploited to fit various degradations in different dimensions using similar network architectures, thereby enhancing the overall robustness of the network to various and complex scenarios. Comprehensive experiments conducted on three public synthetic datasets and one real-world dataset validate the superior performance of the proposed method under the condition that the degradation remains unknown.

Affiliations: Key Lab of Digital Signal and Image Processing of Guangdong Province, Department of Electronic Engineering, Shantou University, Shantou, China; School of Mathematics, Foshan University, Foshan, China; State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China; Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China

Abstract:
The challenges in point cloud analysis are primarily attributed to the irregular and unordered nature of the data. Numerous existing approaches, inspired by the Transformer, introduce attention mechanisms to extract 3D geometric features. However, these intricate geometric extractors incur high computational overhead and unfavorable inference latency. To tackle this predicament, in this paper, we propose a lightweight and fast attention-based network, named Dual Perception MAM (DuPMAM), for point cloud analysis. Specifically, we present a novel and simple Point Multiplicative Attention Mechanism (PMAM). It is implemented solely through single feed-forward fully connected layers, hence leading to lower model complexity and superior inference speed. Based on that, we further devise a dual perception strategy by constructing both a local attention block and a global attention block to learn fine-grained geometric and overall representational features, respectively. Consequently, compared to the existing approaches, our method has excellent perception of local details and global contours of the point cloud objects. In addition, we design a Graph-Multiscale Perceptual Field (GMPF) testing strategy for model performance enhancement. It has a significant advantage over the traditional voting strategy and is generally applicable to point cloud tasks, encompassing classification, part segmentation and indoor scene segmentation. Empowered by the GMPF testing strategy, DuPMAM achieves new state-of-the-art results on the real-world dataset ScanObjectNN, the synthetic dataset ModelNet40 and the part segmentation dataset ShapeNet. Compared to the recent GB-Net, our DuPMAM trains 6 times faster and tests 2 times faster.

Abstract:
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose ETC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refines the expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further clarify the noisy expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, a learnable approach that balances the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones to obtain more precise boundaries. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.

Abstract:
Fish quality and shelf life can be evaluated using various assessment methods, such as sensory analysis, biochemical tests, microbiological evaluations, and physicochemical analyses. However, these methods are invasive and time-consuming, driving interest in technologies capable of estimating shelf life through non-invasive procedures. This study investigates the potential of hyperspectral imaging as a non-invasive technology for predicting the shelf life of Atlantic cod. A storage experiment was conducted that included both gutted fish with heads (GFWH) and fillets, with sensory evaluation and biochemical measurements employed to determine shelf life. Subsequently, hyperspectral images of the fish samples were captured under industrial production conditions, and the spectral data were analyzed using different regression algorithms. The majority of the regression techniques utilized in this research successfully predicted shelf life for both fillets and GFWH, achieving a root mean square error (RMSE) lower than one day. While most regression models exhibited comparable performance in predicting the shelf life of fillets, deep learning-based models demonstrated superior performance for GFWH. These results suggest that hyperspectral imaging technology has significant potential as a non-invasive tool for estimating the shelf life of Atlantic cod, thereby enabling effective quality-based sorting, reducing food waste, and enhancing sustainability in the seafood supply chain.

Abstract:
For visual disbalance defects (VDDs) in low-light images, such as brightness unevenness and color imbalance, existing enhancement methods struggle to extract defect features from local regions and apply adaptive enhancement based on varying degrees of these defects. To address these challenges, we propose an unsupervised multi-modal enhancement method based on a high-order adaptive curve, named CLIP-AE. Specifically, we introduce a multi-modal recurrent optimization approach utilizing contrastive language-image pre-training (CLIP). This method iteratively optimizes variable embedded prompts and an Adaptive Enhancement Module (AEM) to establish dependencies between the prompts and detailed style features in the images, guiding the AEM to perform adaptive image enhancement. Additionally, we implement a progressive feature alignment strategy to enhance the model's ability to perceive style features and improve optimization efficiency by using multiple enhanced images with identical content features and incremental style features. In the AEM, the optimized Hyperparameters Generative Network (HGN) generates the optimal hyperparameters, which drive a High-Dimensional Nested Gamma correction (HDN-Gamma) to perform pixel-wise adaptive enhancement for VDDs. HDN-Gamma further maps pixel values using specific enhancement curves to avoid artifacts. Extensive experiments demonstrate that our method effectively improves visual disbalance defects and reduces artifacts. Compared to seven state-of-the-art algorithms, our method shows significant improvements (PSNR: 16.46%, 16.89%, and 15.14%; SSIM: 9.26%, 8.02%, and 9.85%; MUSIQ: 6.37%, 6.54%, and 7.45%) on the LOL, SICE, and MIT-Adobe FiveK datasets. Our approach offers a novel solution for applying multimedia technology in low-light image enhancement tasks.
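
As a point of reference for the pixel-wise enhancement described above, the following sketch applies ordinary gamma correction with a separate exponent per pixel. It is not the HDN-Gamma formulation, which the abstract does not specify; the function name, the [0, 1] normalization, and the sample values are assumptions.

```python
def gamma_correct(image, gamma_map):
    """Apply out = in ** gamma independently at each pixel.

    Assumes pixel values are normalized to [0, 1]; gamma < 1 brightens,
    gamma > 1 darkens, and gamma == 1 leaves the pixel unchanged.
    """
    return [
        [pixel ** gamma for pixel, gamma in zip(row, gamma_row)]
        for row, gamma_row in zip(image, gamma_map)
    ]


# Hypothetical 2x2 image: the dark top row is brightened, the bottom row is kept.
image = [[0.04, 0.25], [0.64, 0.81]]
gamma_map = [[0.5, 0.5], [1.0, 1.0]]
enhanced = gamma_correct(image, gamma_map)
print(enhanced)
```

In an adaptive scheme, the per-pixel exponents would come from a learned network rather than being fixed by hand as they are here.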

Abstract:
In the field of HEVC (High Efficiency Video Coding) double compression detection, relocated I-frame (RI frame) detection and original GOP size estimation are two significant problems for video forensics. However, little research explores the interconnection between the two problems, and effective methods to resolve them are still lacking. In this paper, a novel feature model called In-loop Filtering and CU Depth Map (IFCDM) is proposed to accurately detect RI frames, and the intrinsic correlation between RI frames and GOP structure is explored, which can be used for original GOP size estimation. Theoretical and statistical analysis of the HEVC recompression process is first carried out. Then, sub-features of HEVC in-loop filtering modes and CU partition depth are extracted and transformed into grey-scale maps to construct the IFCDM. A neural network, consisting of a tiny Vision Transformer and an LSTM, is trained to learn spatial and temporal representations of the input features and to derive the RI frame detection results. Finally, an adaptive periodic analysis algorithm is designed to integrate the RI frame detection results and estimate the original GOP size of recompressed videos. Experiments show that our method outperforms the existing state-of-the-art methods at both the frame level and the video level.
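
The link between RI frames and GOP size can be illustrated with a simple periodicity score. This is a sketch under stated assumptions, not the paper's adaptive periodic analysis algorithm: it assumes a fixed original GOP, so that RI frames recur at multiples of the original GOP size, and it scores each candidate size with an F1 measure. The function name, the candidate range, and the sample detections are hypothetical.

```python
def estimate_gop_size(detected_frames, num_frames, min_gop=2, max_gop=64):
    """Pick the candidate GOP size whose multiples best match the detections.

    For each candidate, frames at multiples of the candidate are the expected
    RI positions. Precision penalizes candidates that are too large, recall
    penalizes candidates that are too small, and F1 combines the two.
    """
    detected = set(detected_frames)
    best_gop, best_score = None, 0.0
    for gop in range(min_gop, max_gop + 1):
        expected = set(range(gop, num_frames, gop))
        hits = len(detected & expected)
        if hits == 0:
            continue
        precision = hits / len(detected)
        recall = hits / len(expected)
        score = 2 * precision * recall / (precision + recall)
        if score > best_score:
            best_gop, best_score = gop, score
    return best_gop


# Hypothetical detections for a 100-frame clip with an original GOP size of 12:
# frame 48 was missed and frame 50 is a false alarm.
detections = [12, 24, 36, 50, 60, 72, 84, 96]
print(estimate_gop_size(detections, num_frames=100))
```

Scoring with F1 rather than a raw hit count avoids trivially favoring small candidates, whose multiples cover almost every frame.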

Abstract:
As two intimate reciprocal tasks, scene-aware human motion synthesis and analysis require a joint understanding between multiple modalities, including 3D body motions, 3D scenes, and textual descriptions. In this paper, we integrate these two paired processes into a Co-Evolving Synthesis-Analysis (CESA) pipeline and mutually benefit their learning. Specifically, scene-aware text-to-human synthesis generates diverse indoor motion samples from the same textual description to enrich human-scene interaction intra-class diversity, thus significantly benefiting training a robust human motion analysis system. Reciprocally, human motion analysis would enforce semantic scrutiny on each synthesized motion sample to ensure its semantic consistency with the given textual description, thus improving realistic motion synthesis. Considering that real-world indoor human motions are goal-oriented and path-guided, we propose a cascaded generation strategy that factorizes text-driven scene-specific human motion generation into three stages: goal inferring, path planning, and pose synthesizing. Coupling CESA with this powerful cascaded motion synthesis model, we jointly improve realistic human motion synthesis and robust human motion analysis in 3D scenes.

Abstract:
Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

Abstract:
Transformer architecture has demonstrated significant potential in hyperspectral object tracking by leveraging global correlation learning to accurately represent the data distribution. However, existing hyperspectral object trackers based on transformer models typically rely on costly pre-trained models, making them prone to crashing due to overfitting when tuned on small-scale hyperspectral videos, greatly limiting their performance. To address this challenge, in this paper, a Hybrid Transformer with Adaptive Content and Position Embedding (HTACPE) tracker is proposed to improve the learning efficiency of the tracking model, and fully explore the spectral-spatial information. Specifically, an Adaptive Content and Position Embedding Module (ACPEM) is designed to dynamically learn the balance between focusing on positional and content-based information, which allows the model to effectively handle datasets of various sizes. To enhance the spectral-spatial information, a Spectral Grouping Module (SGM) is designed to learn the high-frequency information in complex scenarios, thereby enhancing diversified features. It operates in parallel with the ACPEM feature learning module. Furthermore, a Dynamic Reliability Refinement Module (DRRM) is incorporated to address challenges related to accurate object position perception, iteratively refining prediction parameters to enhance the reliability of the model. Extensive experiments demonstrate that the proposed HTACPE achieves satisfactory tracking performance both qualitatively and quantitatively, especially with insufficient training data.

Abstract:
Tone-Mapping Operators (TMOs) aim at converting high dynamic range (HDR) images into standard dynamic range (SDR) ones that are suitable for display on standard screens. As the visual quality of tone-mapped images (TMIs) is paramount, quality assessment of TMIs becomes crucial. Despite the growing body of research on TMI quality assessment, the existing metrics are often limited to a narrow selection of hand-picked examples generated by a restricted range of TMOs. Consequently, their ability to generalize to the wide array of TMIs encountered in practical scenarios remains unclear. Moreover, the quality degradation in practical TMIs can be intricate and diverse. To overcome these limitations, we construct the largest subjectively annotated TMI quality assessment dataset to date, which comprises a total of 14,000 TMIs generated by applying 20 representative TMOs to 700 HDR images. The dataset, named TDCQ (TMI quality dataset in terms of Detail visibility, Color naturalness, and overall Quality), is accompanied by subjective scores that encompass these multiple quality dimensions. In addition, we design a multi-branch deep neural network tailored to characterize the multi-dimensional quality perception of TMIs, namely the Color naturalness- and Detail visibility-aware TMI Quality (CDTIQ) metric, allowing for a comprehensive and multifaceted quality assessment of TMIs. Through extensive experiments, we demonstrate the superiority of our proposed metric, which shows a higher correlation with subjective ratings than other relevant no-reference image quality metrics.

Abstract:
The operation of traditional multi-object trackers on a moving autonomous aerial vehicle (AAV) faces many difficulties due to the irregular motion of the AAV, the occlusion problem, and in particular arbitrarily oriented targets that are densely distributed against complex backgrounds. To address these difficulties, this paper proposes a novel multi-object tracking framework, namely ArbiTrack, for a moving AAV to effectively detect and track arbitrarily oriented targets on the ground. The proposed framework consists of an oriented object detection module to capture ground objects, a multi-scale context aggregation (MCA) module to improve the detection accuracy of small objects, and an adaptive motion switching (AMS) module to deal with the nonlinear relative motion between the AAV and ground objects. Historical information from multiple moments is used in this framework to learn the spatio-temporal characteristics so that the occlusion problem can be handled effectively. Experiments are conducted on our OriDrone dataset and the public UAVDT dataset. Results demonstrate that the proposed method achieves state-of-the-art tracking performance.

Abstract:
Performing semantic segmentation on point clouds is the primary method by which machines perceive 3D scenes in a fine-grained manner. Deep learning algorithms usually require many pointwise annotations obtained with specialized tools, which is a laborious and inefficient process. To this end, we develop two frameworks for training point cloud semantic segmentation networks, one that utilizes fewer projected image annotations and another that employs sparse scribble image annotations, making the process more flexible and user friendly. However, back-projecting 2D-pixel labels to 3D points during loss calculations always introduces errors. To increase the back-projection accuracy of our approach, we first identify and record potential pixel-point correspondence errors and then develop strategies for constructing an accurate back-projection mapping matrix. Specifically, we filter out occluded and noisy points to avoid incorrect label allocations and permit multiclass assignments to adjust the ambiguity of boundary points. By incorporating an accurate back-projection mechanism into the loss functions of the proposed training frameworks, our networks can perform well with only four projected image annotations or even sparse scribble image annotations for each scene. This results in state-of-the-art performance compared with that of other weakly supervised point cloud semantic segmentation approaches, and the outcomes are even comparable to those produced by fully supervised methods on the S3DIS and ScanNet-v2 datasets.
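
The occlusion filtering step described above can be sketched with a simple depth test. This is an illustrative approximation rather than the paper's back-projection mapping matrix: it assumes a pinhole camera, points already expressed in camera coordinates, and a hypothetical depth tolerance; all names and values are invented for the example.

```python
def back_project_labels(points, label_map, fx, fy, cx, cy, depth_tol=0.05):
    """Assign 2D pixel labels to 3D points, skipping occluded points.

    points: (x, y, z) tuples in camera coordinates (assumption).
    label_map: 2D list of per-pixel class labels, indexed as label_map[v][u].
    A point is treated as occluded when a noticeably closer point projects
    to the same pixel. Occluded or out-of-view points receive None.
    """
    height, width = len(label_map), len(label_map[0])
    projections = []
    nearest_depth = {}
    for x, y, z in points:
        if z <= 0:  # behind the camera
            projections.append(None)
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            projections.append(None)
            continue
        projections.append((u, v, z))
        if (u, v) not in nearest_depth or z < nearest_depth[(u, v)]:
            nearest_depth[(u, v)] = z
    labels = []
    for projection in projections:
        if projection is None:
            labels.append(None)
            continue
        u, v, z = projection
        visible = z - nearest_depth[(u, v)] <= depth_tol
        labels.append(label_map[v][u] if visible else None)
    return labels


# Hypothetical 2x2 label image and five points, for illustration only.
label_map = [[1, 2], [3, 4]]
points = [(0, 0, 1), (0, 0, 2), (2, 0, 2), (0, 0, -1), (5, 0, 1)]
print(back_project_labels(points, label_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

Points that receive None would simply be excluded from the loss, so an incorrect 2D label is never propagated to a hidden 3D point.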

Abstract:
Both the fusion expression of scene information from multi-modal images and the pipeline of downstream tasks have become new focuses in the image fusion field. Recently, most studies have proposed multi-task-driven fusion methods. However, these methods employ specifically trained multi-modal fusion components for a certain downstream task, ignoring the broader scene description and application value of the fusion task itself. To focus on the visual perception of depth features from the fusion scenes, we design a new method (CDFGAN) based on Scene Fusion (ScF), with the multi-modal geometric depth as background. Concretely, we leverage adaptive feature maps and a recoverable depth information supplement for infrared and visible image fusion. First, by devising a Successive Generating Network (SGN) based on geometric interpolation, the structural consistency of fusion scenes is enhanced. Second, we propose an Adaptive Discriminator Network (ADN) based on an Elastic Feature Mapping Module (EFMM). This reduces the time consumption caused by the design of modules in the generator and improves the effectiveness of both the discriminator and the generator. Furthermore, a multi-modal Poisson loss function is proposed to align the pixel distributions of different modalities, ensuring that the fusion results have structural information more similar to the inputs. Extensive experiments have validated that our method has advantages and applicability in multiple downstream tasks while improving fusion performance.

Abstract:
Virtual reality (VR) is increasingly capable and inexpensive, and VR devices have become indispensable in many domains, such as gaming, videoconferencing, education, and healthcare. VR has also been applied to music performance and learning. Virtual instruments, such as virtual pianos and drums, free users from the need to own physical forms of these (often bulky and expensive) instruments. VR devices enable users to enjoy music anytime and anywhere without constraints. Virtual concerts, including spatial audio simulations and reconstructions of historical performances, are becoming increasingly common. Previous studies have primarily examined virtual guitars in non-VR environments and air guitar chord recognition. However, systematic research on virtual air guitar systems in VR remains scarce. Virtual guitar games that are available on the market cannot recognize hand gestures accurately and thus cannot accurately identify the strumming patterns and chords played by the player. To overcome this problem, we propose a VR-based virtual air guitar system that can recognize 30 chords and various strumming techniques through deep learning and visual feedback. Employing a black-box approach, we combine WaveNet and FiLM to simulate electric guitar pedal effects with a knob difference loss mechanism, which simulates the turning of knobs on a guitar effects pedal, for enhanced accuracy.
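
FiLM (feature-wise linear modulation), mentioned above, scales and shifts each feature channel with parameters derived from a conditioning input such as a knob setting. The sketch below shows only this modulation step, with the per-channel parameters supplied by hand. It is not the paper's WaveNet-based model, and all names and values are hypothetical.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: out = gamma * x + beta, per channel.

    features: list of channels, each a list of samples.
    gamma, beta: one scalar per channel; in a full model these would be
    predicted from the conditioning input (an assumption for this sketch).
    """
    return [
        [g * sample + b for sample in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Hypothetical two-channel feature map with hand-picked modulation parameters.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma = [2.0, 0.5]
beta = [0.0, 1.0]
modulated = film(features, gamma, beta)
print(modulated)
```

Because the conditioning enters only through the scale and shift values, a single network can in principle cover a continuous range of knob positions.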