TMM2024

Abstract:
Currently, salience-based channel pruning makes continuous breakthroughs in network compression. In the realization, the salience mechanism is used as a metric of channel salience to guide pruning. Therefore, salience-based channel pruning can dynamically adjust the channel width at run-time, which provides a flexible pruning scheme. However, there are two problems emerging: a gating function is often needed to truncate the specific salience entries to zero, which destabilizes the forward propagation; dynamic architecture brings more cost for indexing in inference which bottlenecks the inference speed. In this article, we propose a Progressive Channel-Shrinking (PCS) method to compress the selected salience entries at run-time instead of roughly approximating them to zero. We also propose a Running Shrinking Policy to provide a testing-static pruning scheme that can reduce the memory access cost for filter indexing. We evaluate our method on ImageNet and CIFAR10 datasets over two prevalent networks: ResNet and VGG, and demonstrate that our PCS outperforms all baselines and achieves state-of-the-art in terms of compression-performance tradeoff. Moreover, we observe a significant and practical acceleration of inference. The code is available at https://github.com/JianhongPan-VLG/Progressive.Channel-Shrinking.Network.

Abstract:
Healthy dietary intake has a broad influence on the quality of life, and nutrition prediction plays a great role in the auxiliary decision-making of diet. Given a food image, existing nutrition prediction methods directly regress the nutrition content. However, due to the complex variations in food images, such as differences in viewpoint and lighting conditions, directly regressing the nutrition content faces significant challenges. The complexity of the food image data results in a high-dimensional and feature-rich input space, which poses difficulties for traditional regression models to efficiently navigate and optimize. Consequently, the direct regression paradigm usually generates inaccurate nutrition predictions. To alleviate the ambiguity challenge in the prediction process, we propose to narrow the searchable space for the model's predictions by decomposing the direct regression into two steps: first coarsely selecting the nutrition scope and then finely refining the prediction value, forming a coarse-to-fine nutrition prediction paradigm. Although the process of coarse prediction, which selects a bin from a series of scope bins, can be formulated as a standard classification problem, it exhibits a distinguishing characteristic, i.e., the closer a predicted bin is to the ground-truth bin, the smaller the penalty should be in the training phase. Most current methods ignore this property; we therefore design a linearly smoothed label for the nutrition prediction task that reveals the relative distance to the ground-truth bin, leading to substantial improvements. Furthermore, we conduct a pair-wise comparison among all bins by extending the 1D label into 2D space and propose the structure loss to guide the bin selection process effectively. Due to the narrowed decision space, the nutrition prediction problem can be effectively optimized, and the proposed method achieves promising results on three benchmarks, ECUSTFD, VFD and Nutrition5K, demonstrating the efficiency of the coarse-to-fine paradigm equipped with the linear-smoothed structure loss.
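
As a rough illustration of the coarse-to-fine targets described in this abstract, the sketch below splits a scalar value into a bin index plus an in-bin offset and builds a soft label that decays linearly with distance from the ground-truth bin. The function names, the equal-width binning, and the exact decay formula are assumptions made for illustration; the abstract does not specify them, and the 2D structure loss is not covered here.

```python
def coarse_to_fine_target(value, lo, hi, num_bins):
    """Split a scalar into (bin index, fractional offset inside that bin).

    Equal-width bins over [lo, hi] are an assumption of this sketch.
    """
    width = (hi - lo) / num_bins
    # Clamp so that value == hi falls into the last bin.
    bin_idx = min(int((value - lo) / width), num_bins - 1)
    offset = (value - lo) / width - bin_idx
    return bin_idx, offset


def linear_smoothed_label(num_bins, gt_bin):
    """Soft target whose weight decays linearly with distance from gt_bin.

    Hypothetical form: raw weight 1 - |i - gt_bin| / num_bins, then
    normalised so that the weights sum to one.
    """
    raw = [1.0 - abs(i - gt_bin) / num_bins for i in range(num_bins)]
    total = sum(raw)
    return [w / total for w in raw]


# Example: a value of 250 on a 0..1000 scale with 10 bins.
bin_idx, offset = coarse_to_fine_target(250.0, 0.0, 1000.0, 10)
label = linear_smoothed_label(10, bin_idx)
```

With such a label, a prediction that lands one bin away from the ground truth is penalised less than one that lands several bins away, which is the property the abstract highlights.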

Abstract:
Given a set of multi-view instances, the prevailing assumption in most existing clustering approaches is that they are complete and exhibit cross-view alignment. However, this assumption is often unrealistic. In such scenarios, it could be satisfied at the cost of data pre-processing, but this would be complex and inconsistent with practical applications. Therefore, developing more effective solutions for the View-unaligned Problem (VuP) is highly desirable. Several pioneering works have tackled the partially VuP, yet handling fully VuP remains a challenge due to the reliance on partially pre-aligned instances. In this paper, we propose One-pass View-unaligned Clustering (OpVuC) that simultaneously aligns and clusters instances in a unified framework. Specifically, we align shuffled instances with a selected template using an innovative global-local alignment scheme based on the notion of geometric invariance and separate the fully aligned instances using a relaxed k-means algorithm. The proposed OpVuC method can handle VuP at any alignment level without requiring any pre-aligned instances. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness and merits of the proposed OpVuC method.

Abstract:
Radio-frequency (RF) based human sensing technologies, due to their great practical value in various applications and privacy-preserving nature, have gained tremendous attention in recent years. However, without fully exploiting the characteristics of radio signals, the performance of existing methods is still limited. First, RF features of the moving human body have different representations in dimensions such as channel and scale, which is challenging when performing feature fusion. Besides, the human body is specularly reflective with respect to the radar, which means the human body cannot be fully captured by a single RF snapshot. Therefore, the radar signal reflected by the human body is sparse and incomplete, which makes it difficult to extract high-quality features for 3D human pose estimation. In this paper, we present the RF-based Pose Machines (RPM), a novel framework which can generate 3D skeletons from RF signals. Considering the characteristics of RF signals, RPM includes several modules to overcome the challenges. Firstly, a Feature Fusion Network (FFN) is designed to effectively fuse radio signals from horizontal and vertical planes based on the channels' correlation and maintain high-quality features via a multi-scale fusion block. A Spatio-Temporal Attention network is then designed to reconstruct 3D skeletons from the sparse and incomplete RF signals. Specifically, a spatial attention module is designed to model non-local relationships among joints and reconstruct body parts that a single RF snapshot cannot capture. Afterwards, a temporal attention module is proposed to refine the 3D pose based on temporal coherency learned from frame queries. To evaluate the performance of our RPM framework, we construct a large-scale dataset of synchronized 3D skeletons and RF signals, RFSkeleton3D. Our experimental results show that RPM locates 3D key points of the human body with an average error of 5.71 cm and maintains its performance in new environments with occlusion or bad illumination. The dataset and code will be made publicly available.

Abstract:
Multi-agent applications have recently gained significant popularity. In many computer vision tasks, a network of agents, such as a team of robots with cameras, could work collaboratively to perceive the environment for efficient and accurate situation awareness. However, these agents often have limited computation, communication, and storage resources. Thus, reducing resource consumption while still providing an accurate perception of the environment becomes an important goal when deploying multi-agent systems. To achieve this goal, we identify and leverage the overlap among different camera views in multi-agent systems for reducing the processing, transmission and storage of redundant/unimportant video frames. Specifically, we have developed two collaborative multi-agent video fast-forwarding frameworks in distributed and centralized settings, respectively. In these frameworks, each individual agent can selectively process or skip video frames at adjustable paces based on multiple strategies via reinforcement learning. Multiple agents then collaboratively sense the environment via either 1) a consensus-based distributed framework called DMVF that periodically updates the fast-forwarding strategies of agents by establishing communication and consensus among connected neighbors, or 2) a centralized framework called MFFNet that utilizes a central controller to decide the fast-forwarding strategies for agents based on collected data. We demonstrate the efficacy and efficiency of our proposed frameworks on a real-world surveillance video dataset VideoWeb and a new simulated driving dataset CarlaSim, through extensive simulations and deployment on an embedded platform with TCP communication. We show that compared with other approaches in the literature, our frameworks achieve better coverage of important frames, while significantly reducing the number of frames processed at each agent.

Abstract:
Event cameras are bio-inspired vision sensors with a high dynamic range (140 dB for event cameras vs. 60 dB for traditional cameras) and can be used to tackle the image degradation problem under extremely low-illumination scenarios, which is still not well-explored yet. In this article, we propose a joint framework to compose the underexposed frames and event streams captured by the event camera to reconstruct clear images with detailed textures under almost dark conditions. A residual fusion module is proposed to reduce the domain gap between event streams and frames by using the residuals of both modalities. A multi-level reconstruction loss based on the variability of the contrast distribution is proposed to reduce the perceptual errors of the output image. In addition, we construct the first real-world low-illumination image enhancement dataset (mainly under 2 lux illumination scenes), named LIE, containing event streams and frames collected under indoor and outdoor low-light scenarios together with the ground truth clear images. Experimental results on our LIE dataset demonstrate that our proposed method could achieve significant improvements compared with existing methods.

Abstract:
Knowledge distillation, a widely adopted model compression technique, distils knowledge from a large teacher model to a smaller student model, with the goal of reducing the computational resources required for the student model. However, most existing distillation approaches focus on the types of knowledge and how to distil them, which neglect the student model's neuronal responses to the knowledge. In this article, we demonstrate that the kullback-leibler loss inhibits the neuronal responses in the opposite gradient direction, which injures the student model's potential during distilling. To address this problem, we introduce a principled dual-stage distillation scheme to rejuvenate all inhibited neurons at the neuronal level. In the first stage, we detect all the neurons in the student model during the standard distillation period and divide them into two parts according to their responses. In the second stage, we propose three strategies to resuscitate the neurons differently, which allows us to exploit the full potential of the student model. Through the experiments in various aspects of knowledge distillation, it is verified that the proposed approach outperforms the current state-of-the-art approaches. Our work provides a neuronal perspective for studying the response of the student model to the knowledge from the teacher model.

Abstract:
Tensor-based multi-view subspace clustering (MSC) can capture high-order correlation in the self-representation tensor. Current tensor decompositions for MSC suffer from highly unbalanced unfolding matrices or rotation sensitivity, failing to fully explore inter/intra-view information. Using an advanced tensor network, namely, the multi-scale entanglement renormalization ansatz (MERA), we propose a low-rank MERA based MSC (MERA-MSC) algorithm, where MERA factorizes a tensor into contractions of one top core factor and the remaining orthogonal/semi-orthogonal factors. Benefiting from multiple interactions among orthogonal/semi-orthogonal (low-rank) factors, the low-rank MERA has a strong representation power to capture the complex inter/intra-view information in the self-representation tensor. The alternating direction method of multipliers is adopted to solve the optimization model. Experimental results on five multi-view datasets demonstrate that MERA-MSC outperforms the compared algorithms on six evaluation metrics. Furthermore, we extend MERA-MSC by incorporating anchor learning and develop a scalable low-rank MERA based multi-view clustering method (sMERA-MVC). To our knowledge, this is the first work to introduce MERA to the multi-view clustering topic. The effectiveness and efficiency of sMERA-MVC have been validated on three large-scale multi-view datasets.

Abstract:
Image-to-music generation aims to generate realistic pure music according to a given image. Although many previous works are conducted on bridging image and music, they mainly focus on the content-based cross-modal matching. For example, matching the Christmas song to an image that contains a Christmas tree. By comparison, image-to-music generation is a more challenging task due to its ambiguity and subjectivity. Specifically, there is no explicit correlation between the image content and music melody, without any lyric and human sound. Meanwhile, the perception of generated music varies from person to person. Inspired by the synesthesia phenomenon, we think that if an image tends to elicit a certain emotion on human, the generated music should also leave a similar impression. Therefore, in this paper, we propose a continuous emotion-based image-to-music generation framework, which uses emotion as the key for cross-modal generation. Specifically, a new image-music dataset is established, which uses valence-arousal (VA) space to capture the complex and nuanced nature of emotions. After that, a plug and play model is proposed to translate an image into a piece of music with similar emotion, which projects the emotions into continuous-valued labels, and explores both the intra-modal and inter-modal emotional consistency with contrastive learning. To our best knowledge, this is the first end-to-end framework towards the task of pure music generation from natural images. Extensive experiments show that the generated music achieves satisfactory emotional consistency with the input images, as well as impressive quality.

Abstract:
Video inpainting aims to fill in missing regions of a video after any undesired contents are removed from it. This technique can be applied to repair the broken video or edit the video content. In this paper, we propose a depth-guided deep video inpainting network (DGDVI) and demonstrate its effectiveness in processing challenging broken areas crossing multiple depth layers. To achieve our goal, we divide the inpainting into depth completion, content reconstruction, and content enhancement. Three corresponding modules are designed to implement a process-flow. Firstly, we develop a depth completion module based upon the spatio-temporal Transformer which is used to obtain the completed depth information for each video frame. Secondly, we design a content reconstruction module to generate initially inpainted video. With this module, the contents of the missing regions are composed via the depth-guided feature propagation. Thirdly, we construct a content enhancement module to enhance the temporal coherence and texture quality for the inpainted video. All of proposed modules are jointly optimized to guarantee the high inpainting efficiency. The experimental results demonstrate that our proposed method provides better inpainting results, both qualitatively and quantitatively, compared with the previous state-of-the-art.

Abstract:
Inspired by the sparse and hierarchical feature representation in the ventral stream of the human visual system, the biologically inspired multi-scale contourlet attention network (BMCAnet) is proposed to extract robust discriminative features. First, we construct multi-scale contourlet filter banks as a population of neurons in the primary visual cortex (V1) and extract sparse features in a multi-scale and multi-direction way. This simulates a simple cell in V1 that responds to stimuli in a specific direction. Second, in order to refine contourlet features adaptively, the Shannon block attention module (SBAM) is introduced by integrating Shannon entropy as the third branch of the channel attention module (CAM), so that the weights of contourlet coefficients can be learned adaptively. Third, the responses of the spatial and spectral features are pooled by the proposed contourlet pooling layer to obtain invariant structure features with the specified rules, which roughly simulates the pooling process of complex cells in the V1 area. Last, the combination of global average pooling (GAP) and a fully connected (FC) layer is used for classification. The competitive results on eight databases demonstrate that the BMCAnet can effectively extract sparse and effective features for the classification tasks.

Abstract:
If we compare how humans reason and how deep models reason, humans reason in a symbolic manner with a formal language called logic, while most deep models reason in a black-box manner. A natural question to ask is “Do the trained deep models reason similarly to humans?” or “Can we explain the reasoning of deep models in the language of logic?”. In this work, we present NeurLogX to explain the reasoning process of deep vision language models in the language of logic. Given a trained vision language model, our method starts by generating reasoning facts through augmenting the input data. We then develop a differentiable inductive logic programming framework to learn interpretable logic rules from the facts. We show our results on various popular vision language models. Interestingly, we observe that almost all of the tested models can reason logically.

Abstract:
Motion deblurring is an important topic in the field of image enhancement, which has widespread applications including video surveillance, object detection, etc. Many algorithms are designed for motion deblurring and achieve remarkable performance. However, mainstream motion blur datasets are collected under normal weather and illuminance conditions, i.e., normal domain, ignoring their variations. As a result, current methods perform poorly in dynamic real-world scenes. To address these issues, we study the work in two aspects. First, we collect the real-world motion blur dataset with a well-designed collection device from various angles, focal lengths, and street scenes. Considering its domain is single, it is augmented via a Domain Transfer Strategy (DTS) to construct a Multi-Domain dataset (MD dataset), expanding the domains of the collected dataset. Second, we propose a Multi-Domain Adaptive Deblur Network (MDADNet) with two modules. The one is the Domain Adaptation (DA) module that exploits domain invariant features to stabilize the performance of the MDADNet in multiple domains. The other is the Meta Deblurring (MDB) module that employs the auxiliary branch to enhance the deblurring ability. It also enables the MDADNet to update parameters during the testing stage, improving the generalizations of the MDADNet. Extensive experimental results demonstrate that the MD-trained methods significantly strengthen the motion deblurring ability in multiple domains. Particularly, the proposed MDADNet achieves state-of-the-art performance on the MD dataset and public motion blur datasets.

Abstract:
Talking video frames occasionally drop while streaming for reasons like network errors, which greatly hurts the online team collaboration and user experiences. Directly generating the dropped frames from the remaining ones is unfavorable since a person’s lip motion is usually non-linear and thus hard to be restored when consecutive frames are missing. Nevertheless, the audio content provides strong signals for lip motion and is less likely to drop during transmitting. Inspired by this, as an initial attempt, we present the task of audio-driven talking video frame restoration in this paper, i.e., restoring dropped video frames by jointly leveraging the audio and remaining video frames. Towards the high-quality frame generation, we devise a cross-modal frame restoration network. This network aligns the complete audio content with video frames, precisely identifies and sequentially generates the dropped frames. To justify our model, we construct a new dataset, Talking Video Frames Drop, TVFD for short, consisting of 2.5K video and 144K frames in total. We conduct extensive experiments over TVFD and another publicly accessible dataset - Voxceleb2. Our model obtains significantly improved performance as compared to other state-of-the-art competitors.

Abstract:
Pre-trained models are frequently employed in multimodal learning. However, these models have too many parameters and need too much effort to fine-tune the downstream tasks. Knowledge distillation (KD) is a method to transfer knowledge using the soft label from this pre-trained teacher model to a smaller student, where the parameters of the teacher are fixed (or partially) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, adjustment of temperature parameters, label smoothing and teacher-student joint training methods (online distillation) to smooth the soft label of a teacher network, have been proposed. But those methods rarely explain the effect of smoothed soft labels to enhance the KD performance. The main contributions of our work are the discovery, analysis, and validation of the effect of the smoothed soft label and a less time-consuming and adaptive transfer of the pre-trained teacher's knowledge method, namely PESF-KD by adaptive tuning soft labels of the teacher network. Technically, we first mathematically formulate the mismatch as the sharpness gap between teacher's and student's predictive distributions, where we show such a gap can be narrowed with the appropriate smoothness of the soft label. Then, we introduce an adapter module for the teacher and only update the adapter to obtain soft labels with appropriate smoothness. Experiments on various benchmarks including CV and NLP show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
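
The notion of a "sharpness gap" between teacher and student distributions can be illustrated with temperature scaling, which this abstract lists among the existing ways to smooth a teacher's soft label. The sketch below is not the PESF-KD adapter; it only shows how a higher temperature flattens a softmax distribution. Using entropy as the sharpness measure and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives a smoother output."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; used here as a simple (assumed) sharpness proxy."""
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher output
sharp = softmax(teacher_logits, temperature=1.0)
smooth = softmax(teacher_logits, temperature=4.0)
```

Here `smooth` keeps the same class ranking as `sharp` but has higher entropy, i.e., a soft label that is less peaked.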

Abstract:
When images undergo quality degradation caused by editing, compression or transmission, their saliency tends to shift away from its original position. Saliency shifts indicate visual behaviour change and therefore contain vital information regarding perception of visual content and its distortions. Given a pristine image and its distorted format, we want to be able to detect saliency shifts induced by distortions. The resulting saliency shift map (SSM) can be used to identify the region and degree of visual distraction caused by distortions, and consequently to perceptually optimise image coding or enhancement algorithms. To this end, we first create a largest-of-its-kind eye-tracking database, comprising 60 pristine images and their associated 540 distorted formats viewed by 96 subjects. We then propose a computational model to predict the saliency shift map (SSM), utilising transformers and convolutional neural networks. Experimental results demonstrate that the proposed model is highly effective in detecting distortion-induced saliency shifts in natural images.

Abstract:
Crowd scenes analysis plays an important role in various fields, including public security, smart cities, and intelligent transportation systems. However, traditional crowd scenes captioning methods mainly focus on a single and prominent crowd collective, which limits their ability to describe the different crowd collectives in complex crowd scenes. To address this issue, we propose a collective-guided crowd scenes captioning model (CrowdCaption++) to explore a more comprehensive and detailed description. We design a crowd features encoder (CFE) including double-query features encoder and foreground crowd features encoder, which uses double-query attention module (DQ-ATT) to capture more representative visual features and extracts foreground crowd features to avoid interference from background for collectives prediction. Moreover, we build a collective-guided captioning decoder (CCD) to generate captions of different crowd collectives without requiring extra alignment between crowd collectives and captions. To achieve this, we first design a crowd collectives predictor to identify multiple potential crowd collectives and create crowd collectives guidance information. Finally, we use the crowd collectives guidance information to merge useful visual features and further generate corresponding caption. We evaluate our approach on the latest crowd scenes dataset CrowdCaption and demonstrate that our model can achieve a comprehensive understanding and describe the different crowd collectives in complex crowd scenes.

Abstract:
The maturity of generative models and the popularity of generated data have brought new technical means and camouflage environments to steganography. Numerous generative image steganography methods have emerged, but achieving provable security, robustness, and relatively high capacity simultaneously remains challenging. This paper proposes a provably secure robust image steganography method via the generative adversarial network (GAN), named PARIS. The sender maps the secret message, following a uniform distribution, to latent vectors conforming to a standard Gaussian distribution using inverse transform sampling. Subsequently, the latent vector is fed into the generator, producing the stego image. In this way, the stego image cannot be distinguished from the normally generated image. The receiver extracts the secret message from the recovered latent vector via gradient descent optimization. To enhance the robustness, a noise layer is introduced while recovering the latent vector to simulate potential lossy operations in real scenarios. The security of the proposed method is theoretically proven. Extensive experiments have also verified the proposed method's robustness, security, and relatively high capacity in terms of different GAN architectures, noises, and datasets.
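
The message-to-latent mapping described in this abstract can be sketched with inverse transform sampling from Python's standard library. This is a minimal illustration under stated assumptions, not the PARIS implementation: the generator, the gradient-descent latent recovery, and the noise layer are omitted, and the function names, the `bits_per_dim` parameter, and the `margin` parameter are hypothetical. The message length is assumed to be a multiple of `bits_per_dim`.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def bits_to_latent(bits, bits_per_dim=1, rng=None, margin=0.05):
    """Map message bits to latent values via inverse transform sampling.

    Each group of bits selects one of 2**bits_per_dim equal-probability
    intervals of (0, 1); a point inside that interval is pushed through
    the inverse Gaussian CDF. The margin keeps samples away from interval
    edges (a hypothetical robustness choice that slightly departs from
    exact uniform sampling).
    """
    rng = rng or random.Random(0)
    levels = 2 ** bits_per_dim
    latent = []
    for i in range(0, len(bits), bits_per_dim):
        chunk = bits[i:i + bits_per_dim]
        m = int("".join(str(b) for b in chunk), 2)
        r = rng.uniform(margin, 1.0 - margin)
        u = (m + r) / levels  # strictly inside (0, 1)
        latent.append(STD_NORMAL.inv_cdf(u))
    return latent


def latent_to_bits(latent, bits_per_dim=1):
    """Recover the bits by mapping each latent value back through the CDF."""
    levels = 2 ** bits_per_dim
    bits = []
    for z in latent:
        u = STD_NORMAL.cdf(z)
        m = min(int(u * levels), levels - 1)
        bits.extend(int(c) for c in format(m, f"0{bits_per_dim}b"))
    return bits


message = [1, 0, 1, 1, 0, 0, 1, 0]
z = bits_to_latent(message, bits_per_dim=2)
recovered = latent_to_bits(z, bits_per_dim=2)
```

In the method described above, the receiver would first have to recover the latent vector from the stego image; the sketch assumes that step has already succeeded.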

Abstract:
In recent years, tremendous efforts have been made on document image rectification, but existing advanced algorithms are limited to processing restricted document images, i.e., the input images must incorporate a complete document. Once the captured image merely involves a local text region, its rectification quality is degraded and unsatisfactory. Our previously proposed DocTr, a transformer-assisted network for document image rectification, also suffers from this limitation. In this work, we present DocTr++, a novel unified framework for document image rectification, without any restrictions on the input distorted images. Our major technical improvements can be concluded in three aspects. Firstly, we upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. Secondly, we reformulate the pixel-wise mapping relationship between the unrestricted distorted document images and the distortion-free counterparts. The obtained data is used to train our DocTr++ for unrestricted document image rectification. Thirdly, we contribute a real-world test set and metrics applicable for evaluating the rectification quality. To our best knowledge, this is the first learning-based method for the rectification of unrestricted document images. Extensive experiments are conducted, and the results demonstrate the effectiveness and superiority of our method. We hope our DocTr++ will serve as a strong baseline for generic document image rectification, prompting the further advancement and application of learning-based algorithms.

Abstract:
Forgery of facial images and videos has increased the concern about digital security. It has led to the significant development of detecting forgery data recently. However, the data, especially the videos published on the Internet, are usually compressed with lossy compression algorithms such as H.264. The compressed data could significantly degrade the performance of recent detection algorithms. The existing anti-compression algorithms focus on enhancing the performance in detecting heavily compressed data but less consider the compression adaption to the data from various compression levels. We believe creating a forgery detection capable of handling data compressed with unknown levels is important. To enhance the performance of such models, we consider the weak compressed and strong compressed data as two views of the original data and they should have similar representation and relationships with other samples. We propose a novel anti-compression forgery detection framework by maintaining closer relations within data under different compression levels. Specifically, our algorithm measures the pair-wise similarity within data as the relations, ensuring that relationships between weakly and strongly compressed data remain consistent. This enhances the discriminative power for detecting highly compressed data. To achieve a better strong compressed data relation guided by the less compressed one, we apply video-level contrastive learning for weak compressed data, which forces the model to produce similar representations within the same video and far from the negative samples. The experiment results show that the proposed algorithm could boost performance for strong compressed data while improving the accuracy rate when detecting clean data.
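
The idea of keeping pair-wise relations consistent across compression levels can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' loss: cosine similarity as the relation measure, the mean squared difference as the penalty, and all names and feature values are assumptions. Feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def relation_matrix(features):
    """Pair-wise similarity of every sample with every other sample."""
    return [[cosine(fi, fj) for fj in features] for fi in features]


def relation_consistency_loss(weak_feats, strong_feats):
    """Mean squared gap between the relation matrices of the two views."""
    rw = relation_matrix(weak_feats)
    rs = relation_matrix(strong_feats)
    n = len(rw)
    return sum((rw[i][j] - rs[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


# Hypothetical features of three samples under weak and strong compression.
weak = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
strong_same = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]   # relations preserved
strong_drift = [[1.0, 0.0], [1.0, 0.2], [1.0, 1.0]]  # relations distorted

loss_same = relation_consistency_loss(weak, strong_same)
loss_drift = relation_consistency_loss(weak, strong_drift)
```

The loss stays near zero when the strongly compressed view preserves the sample-to-sample relations of the weakly compressed view, and grows when those relations drift.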

Abstract:
Recent conditional and unconditional video generation tasks have been accomplished mainly based on generative adversarial network (GAN), diffusion, and autoregressive models. However, in some circumstances, using only one modality cannot provide enough semantic information. Therefore, in this paper, we propose text-audio to video (TA2V) generation, a new task for generating realistic videos from two different guided modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is a harder task because of the complexity of processing higher-dimensional data and scarcer suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) This model uses a 3D VQ-GAN to compress high-dimension video data to a low-dimension discrete sequence, followed by an autoregressive model to guide text-conditional generation in the latent space. Then, we apply a text&audio-guided diffusion model to the generated video scenes, providing additional semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT), and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three modalities (text, audio, and video) that are aligned with each other. The results demonstrate that our model can create videos with satisfactory quality and semantic information.

Abstract:
With the explosive growth of online multi-modal applications that typically include audio, video, and haptic signals, immersive experience (IE) improvement has been broadly regarded as one of the most important tasks. Compared with traditional quality of experience (QoE) improvement for online audio/video applications, it highlights two sequential technical challenges to be resolved: i) a much more stringent demand for real-time improvement due to the incorporation of delay-sensitive haptic signals, and ii) a high-dimensional paradigm, instead of the existing one-dimensional (i.e., network-level) one, for better online improvement. To overcome this dilemma, this work systematically addresses the following three fundamental problems: i) which factors influence IE, ii) how to improve IE online, and iii) to what extent the corresponding IE improvement can be achieved. To this end, we first comprehensively explore and categorize the influence factors on IE from various dimensions. Then, by combining network resource scheduling with the multi-domain collaboration of user profile, device specification, and application type, an online IE improvement strategy is proposed based on the efficient linear contextual bandit with L1-norm estimation. Finally, we derive the theoretical bound of IE improvement, which scales as a poly-logarithmic function of the data dimension. Numerical results on a practical system also demonstrate the remarkable improvement in IE.

Abstract:
In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. Therefore, we propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.

Abstract:
In recent years, various distillation methods for semantic segmentation have been proposed. However, these methods typically train the student model to imitate the intermediate features or logits of the teacher model directly, thereby overlooking the high-discrepancy regions learned by both models, particularly the differences in instance edges. In this paper, we introduce a novel approach, called Difference-aware Distillation, to address this limitation. Our method detects the discrepancies between the teacher model and the student model in the logit space through two masking mechanisms (i.e., masking by logit differences with respect to the ground truth labels and masking by differences in the predictive class probabilities), and guides the student model to restore the teacher's features with a focus on these highly discrepant regions, resulting in improved segmentation performance. With the features jointly masked by these two mechanisms, the student model learns to preserve the teacher's features via a feature generation module, thus achieving better representation. Our experimental evaluation on three datasets, Cityscapes, Pascal2012, and ADE20K, demonstrates that our approach outperforms several baselines. Further visualization analysis confirms that our method effectively directs the student model's attention to the discrepancies, such as the edges of small objects and the interiors of large objects.

Abstract:
The text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable and only the caption of a whole audio clip can be utilized for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of the mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG, which uses matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms previous WSTAG methods. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG.

Abstract:
Most existing studies on controllable video generation either transfer disentangled motion to an appearance without detailed control over motion or generate videos of simple actions, such as the movement of arbitrary objects, conditioned on a control signal from users. In this study, we introduce the Controllable Video Generation with text-based Instructions (CVGI) framework, which allows text-based control over the action performed in a video. CVGI generates videos in which hands interact with objects to perform the desired action by generating hand motions with detailed control through text-based instructions from users. By incorporating a motion estimation layer, we divide the task into two sub-tasks: (1) control signal estimation and (2) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals for text-based instructions given the initial frames. In action generation, generative adversarial networks (GANs) generate realistic hand-based action videos as a combination of hand motions conditioned on the estimated low-level control signals. Evaluations on several datasets (EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout) show the effectiveness of CVGI in generating realistic videos and in controlling actions.

Abstract:
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs. Existing works usually tackle this task using adversarial learning and visual concept reward based on reinforcement learning. However, these works learn only limited cross-domain information in the vision and language domains, which restrains the captioning performance of UIC. Inspired by the success of Vision-Language Pre-Trained Models (VL-PTMs), we attempt to infer the cross-domain cue information about a given image from large VL-PTMs for the UIC task. This research is also motivated by recent successes of prompt learning in many downstream multi-modal tasks, including image-text retrieval and visual question answering. In this work, a semantic prompt is introduced and aggregated with visual features for more accurate caption prediction under the adversarial learning framework. In addition, a metric prompt is designed to select high-quality pseudo image-caption samples obtained from the basic captioning model and refine the model in an iterative manner. Extensive experiments on the COCO and Flickr30K datasets validate the promising captioning ability of the proposed model. We expect that the proposed prompt-based UIC model will stimulate a new line of research on VL-PTM-based captioning.

Abstract:
Text-to-image synthesis aims to generate realistic images conditioned on text descriptions. Recently, conditional affine transformations (CATs), such as conditional batch normalization and instance normalization, have been applied to different layers to control the contents of synthesized images. However, the isolated CAT blocks predict the batch statistics of neighboring layers independently. Moreover, CATs are simple multilayer perceptrons that are hard to optimize. To address the above issues, we propose a recurrent affine transformation (RAT) that connects all the CAT blocks with a recurrent neural network to model the long-term dependency between CAT blocks. To verify the effectiveness of RAT, we conduct both microscopic and macroscopic analyses, which not only demonstrate the effectiveness of RAT but also turn out to offer a useful perspective for analyzing how GANs fuse conditional information. In addition, we apply a spatial attention mechanism to the discriminator, which helps the text description supervise the generator to synthesize more relevant image contents. Extensive experiments on the CUB, Oxford-102, and COCO datasets demonstrate the proposed model's superiority in comparison to state-of-the-art models.

Abstract:
Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited 360° video datasets to study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss-function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360° specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.

Abstract:
Although generative models are still being developed, image reconstruction and generation tasks have evolved dramatically. These tasks remain challenging because the most popular generative models still have limitations. For example, while the generative adversarial network (GAN) produces clear images, it is hard to train. The hybrid VAE-GAN incorporates the benefits of both the variational autoencoder (VAE) and the GAN, although it is computationally intensive and prone to drawbacks such as overfitting and vanishing gradients. A novel generative model called the Cauchy-Schwarz Divergence-based Introspective Variational Autoencoder (CS-IntroVAE) is proposed to address this challenge. The model employs mixed Gaussian distributions as prior distributions and the Cauchy-Schwarz divergence as a measure of the distance between the prior and posterior distributions. Extensive experiments show that our model performs well on both tasks.

Abstract:
Data augmentation (DA) plays a critical role in improving the generalization of deep learning models. Recent works on automatically searching for DA policies from data have achieved great success. However, existing automated DA methods generally perform the search at the image level, which limits the exploration of diversity in local regions. In this paper, we propose a more fine-grained automated DA approach, dubbed Patch AutoAugment, to divide an image into a grid of patches and search for the joint optimal augmentation policies for the patches. We formulate it as a multi-agent reinforcement learning (MARL) problem, where each agent learns an augmentation policy for each patch based on its content together with the semantics of the whole image. The agents cooperate with each other to achieve the optimal augmentation effect of the entire image by sharing a team reward. We show the effectiveness of our method on multiple benchmark datasets of image classification, fine-grained image recognition and object detection (e.g., CIFAR-10, CIFAR-100, ImageNet, CUB-200-2011, Stanford Cars, FGVC-Aircraft and Pascal VOC 2007). Extensive experiments demonstrate that our method outperforms the state-of-the-art DA methods while requiring fewer computational resources.

Abstract:
Cross-modal hashing (CMH) has gained much attention due to its effectiveness and efficiency in retrieval between different modalities. However, most existing methods ignore the hierarchical structural information of the data and often learn a single-layer hash function that directly transforms cross-modal data into common low-dimensional hash codes in one step. This sudden drop in dimension and the huge semantic gap can cause a loss of discriminative information. To this end, we adopt a coarse-to-fine progressive mechanism and propose a novel Hierarchical Consensus Cross-Modal Hashing (HCCH) method. Specifically, to mitigate the loss of important discriminative information, we propose a coarse-to-fine hierarchical hashing scheme that utilizes a two-layer hash function to gradually refine the beneficial discriminative information. Then, the $\ell_{2,1}$-norm is imposed on the layer-wise hash function to alleviate the effects of redundant and corrupted features. Finally, we present consensus learning to effectively encode data into a consensus space in a progressive way, thereby reducing the semantic gap progressively. Through extensive comparative experiments with several advanced CMH methods, the effectiveness and efficiency of our HCCH method are demonstrated on four benchmark datasets.
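
For readers unfamiliar with the regularizer mentioned above, the $\ell_{2,1}$-norm of a matrix is conventionally the sum of the Euclidean norms of its rows, so penalizing it tends to shrink entire rows toward zero. The snippet below is a minimal sketch of that standard definition only. The function name `l21_norm` and the row-wise convention are assumptions made for illustration and are not taken from the HCCH paper.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of `matrix` (a list of lists).

    Row-wise grouping is assumed here; some papers group by columns instead.
    """
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

# Hypothetical 3x2 weight matrix: row norms are 5, 0 and 13.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
print(l21_norm(W))
```

A row that is entirely zero contributes nothing to the sum, which is why this norm is commonly used to suppress redundant or corrupted features.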

Abstract:
Unsupervised domain adaptation (UDA) attempts to learn domain invariant representations and has achieved significant progress, whereas self-training-based UDA methods have shown powerful performance. However, due to the domain gap, pseudo-labels selected through high confidence scores or uncertainty inevitably contain noise, leading to inaccurate predictions. To address this issue, we propose a novel risk-consistent training method. Specifically, both clean and noisy classifiers are introduced to estimate the noise transition matrix. The clean classifier is exploited to assign pseudo-labels for target data in each iteration. The noisy classifier is then trained with noisy target samples, and the optimal parameters are obtained through a closed-form solution. Heuristically, we also pre-train a domain predictor to select a target-like source example for the noise transition matrix estimation. In addition, we design an uncertainty-guided regularization to generate soft pseudo-labels and avoid overconfident predictions. Extensive experimental results show the effectiveness of our method, and state-of-the-art performance has been achieved. Codes are available at https://github.com/feifei-cv/RCE.

Abstract:
Learning based on multimodal data has attracted increasing interest recently. While a variety of sensory modalities can be collected for training, not all of them are always available in practical scenarios, which raises the challenge to infer with incomplete modality. This article presents a general framework termed multimodal hallucination (MMH) to bridge the gap between ideal training scenarios and real-world deployment scenarios with incomplete modality data by transferring the complete multimodal knowledge to the hallucination network with incomplete modality input. Compared with the modality hallucination methods that restore privileged modalities information for late fusion, the proposed framework not only helps to preserve the crucial cross-modal cues but relates the study in complete modalities and in incomplete modalities. Then, we introduce two strategies called region-aware distillation and discrepancy-aware distillation to transfer the response-based and joint-representation-based knowledge of pre-trained multimodal networks, respectively. Region-aware distillation establishes and weights knowledge transferring pipelines between the response of multimodal and hallucination networks at multiple regions, which guides the hallucination network to focus on discriminative regions and avoid wasted gradients. Discrepancy-aware distillation guides the hallucination network to mimic the local inter-sample distance of multimodal representations, which enables the hallucination network to acquire the inter-class discrimination refined by multimodal cues. Extensive experiments on multimodal action recognition and face anti-spoofing demonstrate the proposed multimodal hallucination framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.

Abstract:
The goal of Image-to-image (I2I) translation is to transfer an image from a source domain to a target domain, which has recently drawn increasing attention. One major branch of this research is to formulate I2I translation based on Generative Adversarial Network (GAN). As a zero-sum game, GAN can be reformulated as a Partially-observed Markov Decision Process (POMDP) for generators, where generators cannot access full state information of their environments. This formulation illustrates the information insufficiency in the GAN training. To mitigate this problem, we propose to add a communication channel between discriminators and generators. We explore multiple architecture designs to integrate the communication mechanism into the I2I translation framework. To validate the performance of the proposed approach, we have conducted extensive experiments on various benchmark datasets. The experimental results confirm the superiority of our proposed method.

Abstract:
Vehicle re-identification (Re-ID) aims to retrieve vehicles across non-overlapping cameras. Most studies consider representation learning from single appearance information of the vehicle images. Some works adopt the spatio-temporal information to remove unreasonable vehicles to refine the results in the testing phase. However, they ignore the potential topological relations among cameras under the Closed Circuit Television (CCTV) camera systems in the training phase, which usually leads to suboptimal results due to the high intra-identity variations. To handle this problem, we propose a novel vehicle re-identification framework, which explicitly models the camera topological relations of all input images to aggregate neighbor images and thus acquires camera-independent representations. Specifically, we first construct a Camera Topology Graph (CTG) to elucidate the topological relations among cameras. It takes different cameras as nodes and constructs edges from four levels of the camera system, position, orientation, and individual. Then, we introduce a Camera Topology-based Graph Convolutional Network (CT-GCN), which suppresses irrelevant neighbor images and learns different camera representation functions. Finally, we propose a topological cross-entropy loss to obtain the more discriminative vehicle representations. The whole network is trained in an end-to-end manner. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method against state-of-the-art vehicle Re-ID methods.

Abstract:
In this paper, a novel DRL-based model (VWP, VAE-WGAN-PPOE) is proposed to address the long training time and unsatisfactory training results in end-to-end autonomous driving. The model is optimized in terms of both feature extraction and decision algorithm. In feature extraction, we encode the input video by combining a variational autoencoder (VAE) with a Wasserstein generative adversarial network (WGAN). This reduces the state dimension and resolves the mode collapse and vanishing gradients caused by generative adversarial network (GAN) training. In the decision algorithm, we formulate a new reward function by analyzing the factors affecting driving performance. Furthermore, we propose an enhanced algorithm, PPOE, based on proximal policy optimization (PPO). In the CARLA simulator, compared with CNN and ResNet34, the convergence speed of the DRL model based on VAE-WGAN increases by 26.1% and 20.3%, the navigation task completion rate increases by 18.5% and 9.2%, and the collision rate decreases by 13.6% and 9.4%, respectively. Compared with the deep deterministic policy gradient (DDPG) decision algorithm, the convergence speed of the DRL model based on PPOE increases by 23.3%, the navigation task completion rate increases by 5.0% on sunny days and 8.4% in severe weather, and the collision rate decreases by 3.5% on sunny days and 6.6% in severe weather. Extensive experiments show that the proposed model enables the agent to drive safely along the navigational route in a complex environment with pedestrian and vehicle interaction, even in severe weather.
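
As background for the PPO baseline that PPOE builds on, the sketch below computes the clipped surrogate objective of standard PPO for a batch of probability ratios and advantage estimates. It illustrates the well-known baseline only. The enhancements that define PPOE are not described in the abstract and are not reproduced here, and the function name `ppo_clip_objective` and the default `eps=0.2` are assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    ratios:     new-policy / old-policy action probabilities, one per sample
    advantages: advantage estimates, one per sample
    eps:        clipping range (assumed default)
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))  # clip ratio to [1-eps, 1+eps]
        total += min(r * a, clipped * a)             # pessimistic (lower) bound
    return total / len(ratios)

# A ratio above 1 + eps earns no extra credit when the advantage is positive.
print(ppo_clip_objective([1.5], [1.0]))
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far from the old one in a single update.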

Abstract:
Recently, part information of pedestrian images has been demonstrated to be effective for person re-identification (ReID), but the part interaction is ignored when using Transformer to learn long-range dependencies. In this article, we propose a novel transformer network named Completed Part Transformer (CPT) for person ReID, where we design the part transformer layer to learn the completed part interaction. The part transformer layer includes the intra-part layer and the part-global layer, where they consider long-range dependencies from the aspects of the intra-part interaction and the part-global interaction, simultaneously. Furthermore, in order to overcome the limitation of fixed number of the patch tokens in the transformer layer, we propose the Adaptive Refined Tokens (ART) module to focus on learning the interaction between the informative patch tokens in the pedestrian image, which improves the discrimination of the pedestrian representation. Extensive experimental results on four person ReID datasets, i.e., MSMT17, Market1501, DukeMTMC-reID, and CUHK03, demonstrate that the proposed method achieves a new state-of-the-art performance, e.g., it achieves 68.0% mAP and 84.6% Rank-1 accuracy on MSMT17.

Abstract:
Video captioning focuses on generating natural language descriptions according to the video content. Existing works mainly explore this multimodal learning with the paired source video and corresponding sentence, which have achieved competitive performances. Nonetheless, learning from video-description pair cannot capture implicit external knowledge, i.e., multiple visual context information and linguistic clues existing in the video-language dataset, which may limit the cognitive capability of the model to generate diverse descriptions. To this end, we propose a Memory-based Augmentation Network (MAN), in which a memory structure is designed to augment the current encoder-decoder framework by incorporating implicit external knowledge with a neural memory. Specifically, we first propose a visual memory for the encoder to store multiple visual contexts across videos in the dataset, which is utilized to obtain memory-augmented contextual features for the source video. In addition, a textual memory is introduced for the decoder to capture the external language clues across sentences in the dataset. It is adapted to capture memory-augmented language features in each time step. The proposed approach is able to capture comprehensive contextual understanding compared to the basic encoder-decoder framework, which is more compatible with the human cognitive process. Extensive experiments on three video captioning datasets including MSVD, MSR-VTT, and VATEX demonstrate the effectiveness of the proposed method.

Abstract:
The main challenge of Unsupervised Domain Adaptation (UDA) crowd counting is the large domain gap between a synthetic domain with annotations (source) and a real-world domain of interest without annotations (target). Previous mainstream UDA crowd counting methods either employ feature alignment or a semi-supervised learning paradigm via pseudo-labels. We for the first time combine both of their advantages and propose an Adversarial Mean Teacher (AMT) framework. On the one hand, we optimize the student model with domain adversarial learning. On the other hand, we feed perturbed target images to the teacher model to generate pseudo-labels. Furthermore, to improve the quality of the pseudo-labels, we propose an Adaptive Teaching (AT) module, consisting of pseudo-label refinement and credible pseudo-label selection. Concretely, we first generate two candidate pseudo-labels from the prediction of the teacher model and obtain a refined pseudo-label by mixing them at the pixel-level. Moreover, we introduce an auxiliary task of foreground-background classification to assist credible region selection and only activate supervision signals on those regions. Extensive experiments on four real-world crowd counting benchmarks demonstrate the effectiveness of our method namely Cross-Domain Adaptive Teacher (CDAT).
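
In the generic mean-teacher scheme, the teacher's weights are usually maintained as an exponential moving average (EMA) of the student's weights. The abstract does not state how the teacher in AMT is updated, so the sketch below is only an illustration of that common convention. The function name `ema_update`, the dictionary representation of weights, and the decay value are all assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are dicts mapping parameter names to floats
    (a stand-in for real model tensors in this sketch).
    """
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

# Hypothetical single-parameter models.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher["w"])
```

A decay close to 1 makes the teacher change slowly, which is what lets it produce more stable pseudo-labels than the student.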

Abstract:
Photo enhancement is a long-standing and challenging problem in the image processing community. Despite the significant achievements of recent years, many methods are built upon supervised learning and thus require expertise in constructing a huge collection of paired data, which is well known to be a problem because acquiring such data in real life can be impractical. We address this issue by proposing a multi-scale GAN framework that can be trained in an unsupervised fashion. Notably, we unify the design principle of the generator and discriminator in our framework so as to maximize the ability to learn deep latent representations. Specifically, rather than maintaining content consistency through a complicated two-way loss, we present a one-way loss that measures the content distance between multi-scale latent representations of inputs and outputs, which speeds up training by 1.7×. Furthermore, we redesign the discriminator in a multi-scale, multi-stage manner to strengthen the adversarial learning: multiple latent features with varying scales are produced by the main discriminator and then sent to auxiliary discriminators for final recognition. Extensive experiments have been conducted on the well-known MIT-Adobe FiveK and HDR+ datasets, and the results demonstrate that the proposed multi-scale representation learning framework shows outstanding performance in the photo enhancement task.

Abstract:
Gait recognition aims to identify people through body shape and walking posture. Existing gait recognition studies focus on recognition at low vertical views, in which the person and the camera are at nearly the same height. In contrast, in this work we focus on gait recognition at high vertical views. To facilitate the research, we propose a new dataset named DroneGait, in which drones are used to collect the gait data. This dataset contains 22k sequences of 96 subjects taken at different vertical views, varying from about 0° to 80°. Furthermore, we evaluate the effectiveness of several state-of-the-art appearance-based and skeleton-based models using our dataset and establish comprehensive baselines. Our results demonstrate that the dataset is challenging and presents significant opportunities to improve existing gait recognition methods. Moreover, we propose a new method called Vertical Distillation, which is based on feature distillation across different vertical views. Our proposed method substantially outperforms the state-of-the-art models on DroneGait at high vertical views. Cross-vertical-view and cross-domain experiments are also conducted to explain the importance of gait recognition at high vertical views. Furthermore, we analyze the differences between gait recognition at different vertical views using heatmap visualization techniques. We will make our dataset and code publicly available upon acceptance.

Abstract:
According to Darwinian evolutionary theory, numerous species in the wild have developed remarkable adaptive mechanisms, involving pattern rearrangement and environmental assimilation, to evade predators. These obfuscation strategies pose significant challenges for both individuals and algorithms when performing the Camouflage Object Detection (COD) task in complex and intricate scenarios. Inspired by human strategies in the COD task, which involve assigning uncertainties to the entire input and then focusing on highly uncertain areas with the aid of prior knowledge such as boundary information, we propose the Uncertainty-Edge Dual Guide (UEDG) architecture. UEDG effectively combines probabilistic-derived uncertainty and deterministic-derived edge information to accurately detect concealed objects. The architecture consists of two independent branches dedicated to uncertainty reasoning and edge inference, which are subsequently integrated into a feature fusion module utilizing recursion feedback and feature-reuse techniques. This novel COD framework leverages the benefits of Bayesian learning and convolution-based learning, resulting in a powerful multi-task guided approach. Extensive experiments conducted on four widely employed datasets demonstrate the superior performance of UEDG compared to 12 state-of-the-art approaches, while maintaining an acceptable level of computational complexity. Overall, UEDG presents a promising solution for addressing the challenges of COD in complex environments by combining evolutionary-inspired strategies with advanced computer vision techniques.

Abstract:
Recent advances in video retrieval have been driven by large-scale visual-language pretraining models. In particular, the state-of-the-art approaches are mainly based on temporal extensions of the well-known CLIP model. However, they ignore a critical problem in video retrieval, i.e., the text often refers to a small snippet of the corresponding video. Blindly aggregating all the frames inevitably reduces the discriminative capacity of the final video token to match the text token. Hence, these approaches are limited in retrieving complex videos with diversified contents. To tackle this problem, we propose a concise and novel Attentive Snippet Prompting (ASP) framework, which can dynamically exploit the text-relevant video snippet to boost retrieval. Specifically, our ASP consists of two simple but effective modules, i.e., snippet prompting and video aggregating. Given a pair of text and video, snippet prompting uses cross-modal attention to construct a text-driven visual prompt, namely the attentive snippet token, which adaptively describes the video snippet relevant to the text query. Meanwhile, video aggregating summarizes all the frame tokens as a video token to provide the global context. With the cooperation of the attentive snippet token and the global video token, our ASP can effectively learn a robust and text-relevant visual representation for video retrieval. Finally, we evaluate our ASP framework on widely used benchmarks, where it outperforms a number of recent approaches by a large margin.

Abstract:
Virtual try-on of eyeglasses involves placing eyeglasses of different shapes and styles onto a face image without physically trying them on. While existing methods have shown impressive results, the variety of eyeglasses styles is limited and the interactions are not always intuitive or efficient. To address these limitations, we propose GlassesCLIP, a text-guided eyeglasses manipulation method with spatial constraints, which allows for control of the eyeglasses shape and style based on a binary mask and text, respectively. Specifically, we introduce a mask encoder to extract mask conditions and a modulation module that enables simultaneous injection of text and mask conditions. This design allows for fine-grained control of the eyeglasses' appearance based on both textual descriptions and spatial constraints. Our approach includes a disentangled mapper and a decoupling strategy that preserves irrelevant areas, resulting in better local editing. We employ a two-stage training scheme to handle the different convergence speeds of the various modality conditions, successfully controlling both the shape and style of eyeglasses. Extensive comparison experiments and ablation analyses demonstrate the effectiveness of our approach in achieving diverse eyeglasses styles while preserving irrelevant areas.

Abstract:
Existing methods combining skeleton and silhouette representations demonstrate explicit effectiveness for gait recognition. However, current related methods simply combine the video-level representations of model-based skeleton data and gait silhouettes for retrieval. Therefore, diverse skeleton information is not fully exploited in existing related works: Firstly, the position and movement of bones are not clear from individual silhouettes. This indicates that the frame-level interaction between features of skeletons and silhouettes is critical, which is ignored by previous methods. Secondly, diverse part-level skeleton-guided gait features are not fully captured in existing related approaches. To solve the above issues, we present a novel framework with multi-level skeleton-guided refinement, including frame-level, part-level, and video-level skeleton-guided refinement, for comprehensive skeleton-aided gait representation learning. First, two modules are proposed for frame-level skeleton-guided refinement. Specifically, Visual Skeleton Enhanced Backbone (VSEB) is proposed to visually highlight the global and part-level skeleton regions for the feature of each silhouette frame. Moreover, Cross-Visual-Model Frame-level Interaction (CVMFI) is proposed to further transfer the model-based skeleton information to features of the visual modalities. Secondly, part-level visual and model-based skeleton features are utilized to refine the final gait representation. Concretely, in VSEB, Part Skeleton Enhance Network (PSEN) is proposed to visually enhance the position and movement of part-level skeletons. In addition, Semantic Part Pooling (SPP) is proposed for capturing the model-based skeleton features of different semantic parts. Finally, as the video-level skeleton-guided refinement, multi-modal video-level features are combined to boost the final recognition performance. 
Extensive experimental results on prevailing datasets demonstrate that our approach outperforms most existing methods, including the skeleton-aided multi-modal methods. With the multi-level refinement guided by the skeleton modalities, the framework is expected to provide a deeper understanding of skeleton-aided gait recognition.

Abstract:
For accurate segmentation, effective feature extraction has always been a challenging problem, owing to the variability of appearance and the fuzziness of object boundaries. Convolutional neural networks (CNNs) have recently gained recognition in feature representation learning. However, they operate only in the spatial domain and lack an effective representation of directionality, singularity, and regularity in the spectral domain for anomaly detection in images, which is the key to learning feature representations of high-order singularities. To solve this problem, a multi-scale contourlet knowledge guide learning network is proposed in this paper. Its novelty is that, unlike CNNs in the spatial domain, the proposed method learns a multi-scale contourlet sparse representation to obtain more effective and sparser features at multiple scales and in multiple directions. Furthermore, the contourlet knowledge guide learning can enhance the representation of spectral domain features. It is shown that the proposed network can learn multi-level discriminative features and capture more accurate object boundaries. The segmentation ability is compared with existing methods through theoretical analysis and experiments on five polyp segmentation datasets (CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS-LaribPolypDB, EndoSceneStill) and two building datasets (Massachusetts, WHU). The results indicate the potential of the proposed method for effective feature representation learning and its generalization capability in deep learning, recognition, and interpretation.

Abstract:
Text-based person re-identification (ReID) has enabled canonical applications in searching for and tracking targets from large-scale surveillance images with textual descriptions. Yet, existing text-based person ReID systems employ centralized model training that gathers images captured by different institutes' cameras into one place, which poses severe privacy threats to sensitive institutional information. This work is then devoted to exploring privacy-preserving text-based person ReID and proposes the framework of FedSH by tailoring the federated learning paradigm for distributed searching knowledge extraction. Specifically, FedSH resolves the local model generalization and entity boundary obscuring limitations, caused by inner-institute data homogeneity and inter-institute data heterogeneity, via building multi-granularity feature representation and a semantically self-aligned network. Meanwhile, it reduces the communication burden introduced by the embedding for multiple modals by updating common representation subspaces during federated learning. Experimental results on two public benchmarks demonstrate that our method can achieve at most 16.47% and 16.02% person ReID performance improvement by the Rank-1 metric, compared with 6 State-of-The-Art (SoTA) baselines and 6 ablation studies. We believe that our work will inspire the community to investigate the potential of implementing Federated Learning in real-world image retrieval and ReID scenarios.

Abstract:
Incremental semantic segmentation focuses on continually learning the segmentation of new coming classes without obtaining the training data from previously seen classes. However, most current methods fail to tackle catastrophic forgetting and background shift since they 1) treat all previous classes equally without considering different forgetting paces caused by imbalanced gradient back-propagation; 2) lack strong semantic guidance between classes. In this paper, to solve the aforementioned challenges, we propose a Gradient-Semantic Compensation (GSC) model, which surmounts incremental semantic segmentation from both gradient and semantic perspectives. Specifically, to handle catastrophic forgetting from the gradient aspect, we develop a step-aware gradient compensation that can balance forgetting paces of previously seen classes by re-weighting gradient back-propagation. Meanwhile, we propose a soft-sharp semantic relation distillation to distill consistent inter-class semantic relations via soft labels for alleviating catastrophic forgetting from the semantic aspect. In addition, we design a prototypical pseudo re-labeling which provides strong semantic guidance to mitigate background shift. It produces high-quality pseudo labels for background pixels belonging to previous classes by assessing distances of pixels relative to class-wise prototypes. Experiments on three public segmentation datasets provide strong evidence for the effectiveness of our proposed GSC model.
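
The prototype-based pseudo re-labeling described above can be illustrated with a minimal sketch. It is not the GSC implementation: the Euclidean distance, the fixed threshold `max_dist`, the convention that label 0 means background, and all names and toy values below are assumptions made for illustration only.

```python
import math

def relabel_background(pixel_feats, labels, prototypes, max_dist):
    """Assign background pixels (label 0) to the nearest old-class prototype.

    A pixel is re-labeled only if its distance to the closest prototype is
    at most max_dist; all non-background labels are left untouched.
    """
    new_labels = list(labels)
    for i, (feat, lab) in enumerate(zip(pixel_feats, labels)):
        if lab != 0:
            continue  # already labeled with a current-step class
        # Find the closest class prototype in feature space.
        cls, dist = min(
            ((c, math.dist(feat, p)) for c, p in prototypes.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_dist:
            new_labels[i] = cls
    return new_labels

# Hypothetical prototypes of two previously learned classes (ids 1 and 2).
prototypes = {1: (1.0, 0.0), 2: (0.0, 1.0)}
# Toy 2D pixel features; label 3 stands for a class of the current step.
pixel_feats = [(0.9, 0.1), (0.1, 0.8), (5.0, 5.0), (0.0, 1.0)]
labels = [0, 0, 0, 3]

pseudo_labels = relabel_background(pixel_feats, labels, prototypes, max_dist=0.5)
print(pseudo_labels)
```

In this toy case the first two background pixels lie close to a prototype and receive an old-class label, the third is too far from every prototype and stays background, and the pixel of the new class is unchanged.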

Abstract:
Deep convolutional neural networks (CNNs) have achieved impressive success in enhancing the quality of compressed images/videos. These approaches mostly obtain the noise level in advance and train multiple architecture-identical models for enhancement on images/videos of known levels of noise. This largely hinders their practical application where the noise level is unknown and resources are limited. To practically perform quality enhancement, we propose a novel blind quality enhancement framework for compressed video (BQEV), which utilizes a single network to conduct enhancement on videos compressed at various and unknown quality parameters (QPs). Since there exist feature similarities and differences among videos compressed at multiple QPs, BQEV, which consists of progressive feature extraction and QP-adaptive feature fusion subnets, utilizes this prior to efficiently handle enhancement on videos compressed at blind QPs. The two subnets utilize temporal information and feature similarity to progressively extract valuable features, and employ the feature difference to conduct reasonable QP-adaptive feature fusion and quality enhancement, respectively. In the progressive feature extraction subnet, we first design a quality rank module to assign more attention to higher-quality frames for efficient utilization of temporal information, then propose a progressive extraction module to further extract features from different QPs. In the QP-adaptive feature fusion subnet, we develop a quality estimation module to guide reasonable feature fusion of these extracted progressive features for stable and promising enhancement results on multiple QPs. Experimental results demonstrate that BQEV achieves 0.31–0.69 dB PSNR improvement compared with videos compressed at various QPs, outperforming state-of-the-art approaches.

Abstract:
Identifying relationships between people from images is essential for studying social activities and interactions, and this has significant potential to further the understanding of human social behaviors. Existing image-based research mainly explores social relationships at the dyadic level, i.e., recognizing pairwise relationships based on visual features of persons, objects, and scenes and their logical constraints. Notably, social relational structures are hierarchically nested, i.e., individuals and dyads are nested within group structures, as indicated in the social relations model (SRM) of social psychology. However, existing computer vision-based studies fail to consider hierarchical nested structures, thus overlooking some of the most important interactions, which leads to poor relation reasoning. To improve the performance of reasoning neural networks, we propose a novel SRM framework for progressive graph reasoning (PGR) to explore social interactions. Specifically, we construct individual-dyad and dyad-group graphs to progressively explore the impact of individuals and groups on recognition of dyadic relationships. A transformer is utilized to fuse visual features and graph reasoning knowledge into a comprehensive representation of social relationships. We demonstrate the effectiveness of the proposed model based on PGR using several public datasets and perform extensive ablation studies to explore the reasons behind its superior performance. Experimental results demonstrate that our proposed model successfully predicts social relationships with higher accuracy than state-of-the-art methods.

Abstract:
Recently, video recognition is emerging with the help of multi-modal learning, which focuses on integrating distinct modalities to improve the performance or robustness of the model. Although various multi-modal learning methods have been proposed and offer remarkable recognition results, almost all of these methods rely on high-quality manual annotations and assume that modalities among multi-modal data provide semantically relevant information. Unfortunately, the widely used video datasets are usually coarse-annotated or collected from the Internet. Thus, it inevitably contains a portion of noisy labels and noisy correspondence. To address this challenge, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework to find anti-interference model parameters against both noisy labels and noisy correspondence. Specifically, our method consists of two phases that aim to rectify noise by the inherent correlation between modalities. First, a noise-tolerant contrastive training phase is performed to make the model immune to the possible noisy-labeled data. Despite the benefits brought by contrastive training, it would overfit the noisy correspondence and thus provide false supervision. To alleviate the influence of noisy correspondence, we propose a cross-modal noise estimation component to adjust the consistency between different modalities. As the noisy correspondence existed at the instance level, we further propose a category-level contrastive loss to reduce its interference. Second, in the hybrid-supervised training phase, we calculate the distance metric among features to obtain corrected labels, which are used as complementary supervision to guide the training. Furthermore, due to the lack of suitable datasets, we establish a benchmark of real-world noisy correspondence in audio-visual data by relabeling the Kinetics dataset. 
Extensive experiments on a wide range of noisy levels demonstrate that our method significantly improves the robustness of the action recognition model and surpasses the baselines by a clear margin.

Abstract:
Deep learning-based video compression is a challenging task, and many previous state-of-the-art learning-based video codecs use optical flows to exploit the temporal correlation between successive frames and then compress the residual error. Although these two-stage models are end-to-end optimized, the epistemic uncertainty in the motion estimation and the aleatoric uncertainty from the quantization operation lead to errors in the intermediate representations and introduce artifacts in the reconstructed frames. This inherent flaw limits the potential for higher bit rate savings. To address this issue, we propose an uncertainty-aware video compression model that can effectively capture the predictive uncertainty with deep ensembles. Additionally, we introduce an ensemble-aware loss to encourage the diversity among ensemble members and investigate the benefits of incorporating adversarial training in the video compression task. Experimental results on 1080p sequences show that our model can effectively save bits by more than 20% compared to DVC Pro.

Abstract:
Generating realistic images based on text descriptions remains challenging in computer vision. Existing multi-stage generation methods are sufficient to generate high-resolution images. However, these methods mainly use one sentence to synthesize images, from which it is difficult to extract adequate semantic features, so the generated images deviate substantially from the ground-truth images. In this article, we propose a Multi-Sentence Complementary Generative Adversarial Network (MSCGAN), which assists in generating accurate images by fusing the same semantics from different sentences and preserving their unique semantics. More specifically, the latest BERT model is employed to identify semantic features and a multi-semantic fusion module (MSFM) is designed to fuse the semantic features of different sentences. Besides, a pre-trained cross-modal contrast similarity model (CCSM) is developed to explore fine-grained loss on generated images. Moreover, a multi-sentence joint discriminator is designed to ensure that the generated images match all sentences. Experiments and ablation studies on CUB and MS-COCO datasets demonstrate the significant superiority of the proposed method compared to state-of-the-art methods.

Abstract:
Modification of the optimal recursive block encoding process is commonly adopted in HEVC steganography based on block partitioning structure to embed secret messages, which inevitably disrupts the optimal rate distortion optimization process, resulting in a degradation of visual quality and an increase in bit rate. In this paper, we analyze the intra frame recursive block encoding process, categorizing modifications based on block partitioning structures into skip-level and non-skip-level modifications. Then, the rate distortion difference between these two types is compared. Additionally, the Maintenance Principle of Quad-tree Structure is introduced, which aims to preserve the stego quad-tree structure as closely as possible to the original one. Furthermore, a new cover mapping method is designed to expand the embedding capacity, and a quad-tree structure-preserving adaptive steganography is proposed. Extensive experimental results demonstrate that the proposed scheme can embed messages with fewer disruptions to the optimal rate distortion optimization process, ultimately improving the visual quality and reducing the bit rate growth.

Affiliations: Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China; Department of Engineering, Durham University, Durham, U.K.; Shenzhen Key Laboratory of Safety and Security for Next Generation of Industrial Internet, and the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; Shenzhen Research Institute, China University of Mining and Technology, Shenzhen, China; Technology and Information Center, Shenzhen Urban Safety Monitoring and Early Warning Technology Company, Ltd., Shenzhen, China; Shenzhen Key Laboratory of Safety and Security for Next Generation of Industrial Internet, Southern University of Science and Technology, Shenzhen, China

Abstract:
Crowd gathering events deeply affect public safety. To enhance city management and avoid potential risks, many algorithms are designed for crowd analysis and deployed on video surveillance. Widely applied deep learning models can also be trained for crowd analysis. However, there are still few works focusing on crowd gathering behavior. Furthermore, the lack of interpretability of deep learning models brings a potential risk of rejection by users. In this paper, we categorize crowd behaviors into wandering, merging, walking gathering, standing gathering, and dispersing. We also propose an interpretable framework for crowd gathering understanding based on a crowd density estimation model and four proposed crowd descriptors, named Irregularity, Sparsity, Randomness, and Volatility. Experiments on the PETS2009 dataset demonstrate that our method outperforms previous works on the crowd gathering understanding task. Moreover, we further analyze the framework performance with different crowd feature extraction models and the relations between our descriptors and crowd behavior. Besides, an ablation study is conducted to investigate the effectiveness of the descriptors and the differences between density estimation models. The results demonstrate the effectiveness and the much better interpretability of our framework. Our descriptors also show significant contributions to the quantification of crowd gathering behaviors.

Abstract:
Cross-modal hashing is an effective approach for information retrieval from large and heterogeneous cross-modal datasets, owing to its low storage cost and high computational speed. However, conventional cross-modal hashing techniques for generating hashing codes rely on cross-space dimensional compression, which results in two types of information loss: quantization information loss and dimension reduction loss. To address these limitations, we propose a novel method that decouples the one-step hashing (Fig. 1 a) strategy into two sub-steps (Fig. 1 b). Specifically, in the first step, we introduce a novel differentiable hash method, which utilizes a smooth hash module for binary quantization. This method allows our model to reduce the quantization information loss and make the model optimized by gradient descent. In the second step, we design a long-short Hamming space transformation approach to project the long code into a short one, which is effective in preserving the dimension information between long and short and mitigating the dimension reduction loss. We demonstrate the effectiveness of our approach through extensive experiments on several popular cross-modal datasets, achieving a significant improvement in cross-modal retrieval performance.
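
As a rough illustration of why a smooth hash function reduces quantization loss while staying differentiable, the sketch below uses a scaled tanh as a surrogate for the sign function. This is a common relaxation in deep hashing and is only an assumption here; the abstract does not specify the form of the smooth hash module, and the names `smooth_hash`, `beta`, and the toy features are hypothetical.

```python
import math

def smooth_hash(features, beta):
    """Differentiable surrogate for sign(x): tanh(beta * x)."""
    return [math.tanh(beta * x) for x in features]

def sign_code(features):
    """Hard binary code in {-1, +1} (non-differentiable)."""
    return [1 if x >= 0 else -1 for x in features]

def quantization_gap(features, beta):
    """Mean squared difference between the relaxed and the hard code."""
    relaxed = smooth_hash(features, beta)
    hard = sign_code(features)
    return sum((r - h) ** 2 for r, h in zip(relaxed, hard)) / len(features)

# Toy real-valued features before binarization.
features = [0.8, -0.3, 1.5, -2.0]

# The gap between relaxed and hard codes shrinks as beta grows.
gaps = [quantization_gap(features, beta) for beta in (1.0, 5.0, 25.0)]
print(gaps)
```

A larger `beta` pushes the relaxed code closer to the binary code, which is why such schedules are often annealed during training; the long-to-short Hamming space transformation of the second step is not covered by this sketch.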

Abstract:
With the recent advancement of deep neural networks, visual tracking has achieved substantial progress in tracking accuracy. However, the robustness and security of tracking methods developed based on current deep models have not been thoroughly explored, a critical consideration for real-world applications. In this study, we propose a context-guided black-box attack method to investigate the robustness of recent advanced deep trackers against spatial and temporal interference. For spatial interference, the proposed algorithm generates adversarial target samples by mixing the information of the target object and the similar background regions around it in an embedded feature space of an encoder-decoder model, which evaluates the ability of trackers to handle background distractors. For temporal interference, we use the target state in the previous frame to generate the adversarial sample, which easily fools the trackers that rely too heavily on tracking prior assumptions, such as that the appearance changes and movements of a video target object are small between two consecutive frames. We assess the proposed attack method under both CNN-based and transformer-based tracking frameworks on four diverse datasets: OTB100, VOT2018, GOT-10k, and LaSOT. The experimental results demonstrate that our approach substantially deteriorates the performance of all these deep trackers across numerous datasets, even in the black-box attack mode. This reveals the weak robustness of recent deep tracking methods against background distractors and prior dependencies.

Abstract:
With the rapid advancements in deep learning technologies, person re-identification (ReID) has witnessed remarkable performance improvements. However, the majority of prior works have traditionally focused on solving the problem via extracting features solely from a single perspective, such as uniform partitioning, attention mechanisms, or semantic masks. While these approaches have demonstrated efficacy within specific contexts, they fall short in diverse situations. In this paper, we propose a novel approach, Mutual Distillation Learning For Person Re-identification (termed as MDPR), which addresses the challenging problem from multiple perspectives within a single unified model, leveraging the power of mutual distillation to enhance the feature representations collectively. Specifically, our approach encompasses two branches: a hard content branch to extract local features via a uniform horizontal partitioning strategy and a soft content branch to dynamically distinguish between foreground and background and facilitate the extraction of multi-granularity features via a carefully designed attention mechanism. To facilitate knowledge exchange between these two branches, a mutual distillation and fusion process is employed, promoting the capability of the outputs of each branch. Extensive experiments are conducted on widely used person ReID datasets to validate the effectiveness and superiority of our approach. Notably, our method achieves an impressive 88.7% / 94.4% in mAP/Rank-1 on the DukeMTMC-reID dataset, surpassing the current state-of-the-art results.

Abstract:
Few-shot learning brings the machine close to human thinking which enables fast learning with limited samples. Recent work considers local features to achieve contextual semantic complementation, while they are merely coarsened feature observations that can only extract insignificant label correlations. On the contrary, partial properties of few-shot examples significantly draw the implicit feature observations that can reveal the underlying label correlation of rare label classification. To fully explore the correlation between labels and partial features, this paper proposes a Part-Aware Correlation Network (PACNet) based on Partial Representation (PR) and Semantic Covariance Matrix (SCM). Specifically, we develop a partial representing module of an object that eliminates object-independent information and allows the model to focus on more distinctive parts. Furthermore, a semantic covariance measure function is redefined as a way to learn the semantic relationships of partial representations and to compute the partial similarity between the query sample and the support set. Experiments on three benchmark datasets consistently show that the proposed method outperforms the state-of-the-art counterparts, e.g., on the PartImageNet dataset, the performance gains of up to 12% and 5.9% are observed for the 5-way 1-shot and 5-way 5-shot settings, respectively.

Affiliations: School of Software Engineering, Xi'an Jiaotong University, and Key Laboratory for Intelligent Networks and Network Security, Ministry of Education, and the SMILES LAB, Xi'an, China; China Unicom Shaanxi Branch, Xi'an, China; Department of Computer Science, University of London, London, U.K.; Key Laboratory for Intelligent Networks and Network Security, Ministry of Education, the SMILES LAB, and the School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an, China

Abstract:
Cross-Domain Recommendation (CDR) aims to alleviate the cold-start problem by transferring knowledge from a data-rich domain (source domain) to a data-sparse domain (target domain), where knowledge needs to be transferred through a bridge connecting the two domains. Therefore, constructing a bridge connecting the two domains is fundamental for enabling cross-domain recommendation. However, existing CDR methods often overlook the value of the natural relationships between items in connecting the two domains. To address this issue, we propose DKTCDR: a Domain-oriented Knowledge Transfer method for Cross-Domain Recommendation. In DKTCDR, we leverage the rich relationships between items in a cross-domain knowledge graph as bridges to facilitate both intra- and inter-domain knowledge transfer. Additionally, we design a cross-domain knowledge transfer strategy to enhance inter-domain knowledge transfer. Furthermore, we integrate the semantic modality information of items with the knowledge graph modality information to enhance item modeling. To support our investigation, we construct two high-quality cross-domain recommendation datasets, each containing a cross-domain knowledge graph. Our experimental results on these datasets validate the effectiveness of our proposed method. Source code is available at https://github.com/zxxxl123/DKTCDR.

Abstract:
In recent years, image hashing has attracted more and more attention in practical retrieval applications due to its low storage cost and high query speed. Although existing hashing methods have achieved promising performance, they always treat both easy and hard points without discrimination, thus easily getting stuck in bad local minima, especially in the presence of noise or outliers. In this paper, we reveal that there exist dual difficulty levels hindering binarization in learning to hash, i.e., samples and bits. To overcome this problem, we propose a novel Dual Self-Paced Hashing method (DSPH) for image retrieval, which learns binary codes by not only evolving from “easy” to “hard” samples but also from “easy” to “hard” bits, mimicking the cognitive learning process from easy to difficult. Specifically, we endow each sample and bit with a weight to estimate the reliability/ease of the row and column, respectively. Then both sample- and bit-level weighting are conducted on rows and columns of Hamming space to force the model to focus on reliable/easy examples and bits. By gradually increasing the weights during model optimization, more samples and bits are automatically involved in training from “easy” to “hard” via our dual self-paced method, thereby alleviating the adverse impact of noise or outliers and yielding robust hashing models. Extensive experiments are conducted on four benchmark datasets to demonstrate the superiority and robustness of the proposed DSPH.
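
The dual weighting idea can be sketched with the classical hard-threshold self-paced rule, applied once to rows (samples) and once to columns (bits) of a loss matrix. This is an illustrative assumption, not the DSPH formulation: the thresholding rule, the use of mean losses, the `age` parameter, and the toy loss values are all hypothetical.

```python
def self_paced_weights(loss, age):
    """Hard self-paced weights for samples (rows) and bits (columns).

    A sample or bit is treated as "easy" (weight 1.0) when its mean loss is
    below the age parameter, and is excluded (weight 0.0) otherwise.
    """
    n, k = len(loss), len(loss[0])
    row_mean = [sum(row) / k for row in loss]
    col_mean = [sum(loss[i][j] for i in range(n)) / n for j in range(k)]
    sample_w = [1.0 if m < age else 0.0 for m in row_mean]
    bit_w = [1.0 if m < age else 0.0 for m in col_mean]
    return sample_w, bit_w

def weighted_loss(loss, sample_w, bit_w):
    """Total loss with each entry scaled by its sample and bit weights."""
    return sum(
        sample_w[i] * bit_w[j] * loss[i][j]
        for i in range(len(loss))
        for j in range(len(loss[0]))
    )

# Toy per-sample, per-bit quantization losses (3 samples x 3 bits).
loss = [
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.8],
    [0.9, 0.8, 1.0],
]

# Early in training: a small age admits only easy samples and bits.
early_sample_w, early_bit_w = self_paced_weights(loss, age=0.5)
early_loss = weighted_loss(loss, early_sample_w, early_bit_w)

# Later: a larger age admits everything.
late_sample_w, late_bit_w = self_paced_weights(loss, age=1.0)
print(early_sample_w, early_bit_w, early_loss, late_sample_w, late_bit_w)
```

Raising `age` over the course of optimization gradually brings the harder third sample and third bit into training, mirroring the easy-to-hard schedule described in the abstract.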

Abstract:
The recently proposed MaskFormer [Cheng et al. (2021)] gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference for the segmentation maps. In our study, we find that the per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine for rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask approach semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between the learnable queries and each spatial feature from the feature pyramid in parallel and uses cross-scale inter-query attention to exchange complementary information. We achieve competitive performance on three widely used semantic segmentation datasets. In particular, on the ADE20K validation set, our result with a Swin-B backbone surpasses that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU, respectively. Using a Swin-L backbone, we achieve single-scale 56.1 mIoU and multi-scale 57.4 mIoU, obtaining state-of-the-art performance on the dataset. Extensive experiments on three widely used semantic segmentation datasets verify the effectiveness of our proposed method.

Abstract:
Preserving important textures of the content image and achieving prominent style transfer results remains a challenge in the field of image style transfer. This challenge arises from the entanglement between color and texture during the style transfer process. To address this challenge, we propose an end-to-end network that incorporates an adaptive weighted least squares (AWLS) filter, an iterative least squares (ILS) filter, and channel separation. Given a content image $\mathcal{C}$ and a reference style image $\mathcal{S}$, we begin by separating the RGB channels and utilizing the ILS filter to decompose them into structure and texture layers. We then perform style transfer on the structural layers using WCT$^2$ (incorporating wavelet pooling and unpooling techniques for whitening and coloring transforms) in the R, G, and B channels, respectively. We address the texture distortion caused by WCT$^2$ with a texture enhancing (TE) module in the structural layer. Furthermore, we propose an estimating and compensating for the structure loss (ECSL) module. In the ECSL module, with the AWLS filter and the ILS filter, we estimate the texture loss caused by TE, convert the loss of the structural layer to the loss of the texture layer, and compensate for the loss in the texture layer. The final structural and texture layers are merged into per-channel style transfer results for the separated R, G, and B channels, which are then combined into the final style transfer result. This enables more complete texture preservation together with a pronounced style transfer effect. To evaluate our method, we conduct quantitative experiments with various metrics, including NIQE, AG, SSIM, and PSNR, as well as a user study. The experimental results demonstrate the superiority of our approach over the previous state-of-the-art methods.

Abstract:
Learning proper representations for speech and gesture is essential for co-speech gesture generation. Existing approaches either utilize direct representations or independently encode the speech and gesture, which neglect the joint representation to highlight the interplay between these two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) to jointly learn the quantized codes for speech and gesture together. Such representation highlights the speech-gesture interaction before actually learning the complex mapping, and thus better suits the intricate mapping between speech and gesture. Specifically, the Cross-modal Quantizer jointly encodes speech and gesture as discrete codebooks, enabling better cross-modal interaction. Cross-modal Predictor subsequently utilizes the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach yields much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments are conducted on both 3D and 2D datasets as well as the subjective user study, demonstrating a clear performance gain compared to several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method demonstrates a three-fold improvement in diversity compared to baseline models, while simultaneously maintaining high motion fidelity.

Abstract:
The successful application of semantic segmentation technology in the real world has been among the most exciting achievements in the computer vision community over the past decade. Although the long-tailed phenomenon has been investigated in many fields, e.g., classification and object detection, it has not received enough attention in semantic segmentation and has become a nonnegligible obstacle to applying semantic segmentation technology in autonomous driving and virtual reality. Therefore, in this work, we focus on a relatively underexplored task setting, long-tailed semantic segmentation (LTSS). We first establish three representative datasets from different aspects, i.e., scene, object, and human. We further propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions. We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching and automatically determines the number of matching queries for each class. Given the comprehensiveness of this work and the importance of the issues revealed, this work aims to promote the empirical study of semantic segmentation tasks.

Abstract:
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.
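
To make the link between coding rate and collapse concrete, the following sketch computes the coding rate used in Maximal Coding Rate Reduction, R(Z) = 1/2 log det(I + d/(n eps^2) sum_i z_i z_i^T), for n feature vectors z_i of dimension d. It is only an illustration under assumptions: the distortion `eps`, the toy features, and the pure-Python log-determinant are choices made here, and the sketch does not reproduce the paper's Anti-Collapse Loss.

```python
import math

def coding_rate(features, eps=0.5):
    """Coding rate of n feature vectors (rows) of dimension d."""
    n, d = len(features), len(features[0])
    scale = d / (n * eps ** 2)
    # Build the d x d matrix I + scale * sum_i z_i z_i^T.
    m = [
        [(1.0 if r == c else 0.0) + scale * sum(z[r] * z[c] for z in features)
         for c in range(d)]
        for r in range(d)
    ]
    # Log-determinant via Gaussian elimination. The matrix is symmetric
    # positive definite, so every pivot is positive and no pivoting is needed.
    logdet = 0.0
    for k in range(d):
        pivot = m[k][k]
        logdet += math.log(pivot)
        for r in range(k + 1, d):
            factor = m[r][k] / pivot
            for c in range(k, d):
                m[r][c] -= factor * m[k][c]
    return 0.5 * logdet

# Collapsed embedding: every sample maps to the same unit vector.
collapsed = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
# Spread embedding: samples cover different directions on the unit circle.
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

rate_collapsed = coding_rate(collapsed)
rate_spread = coding_rate(spread)
print(rate_collapsed, rate_spread)
```

The spread embedding attains the higher coding rate, so a loss term that rewards a larger coding rate pushes features away from the collapsed configuration.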

Abstract:
Instance-level human parsing is aimed at separately partitioning the human body into different semantic parts for each individual, which remains a challenging task due to human appearance/pose variation, occlusion and complex backgrounds. Most state-of-the-art methods follow the “parsing-by-detection” paradigm, which relies on a trained detector to localize persons and then sequentially performs single-person parsing for each person. However, this paradigm is closely related to the detector, and the runtime is proportional to the number of persons in an image. In this paper, we present a novel detection-free framework for instance-level human parsing in an end-to-end manner. We decompose instance-level human parsing into two subtasks via a unified network: 1) semantic segmentation for pixel-level classification as a human part and 2) instance segmentation for mask-level classification as a person. The framework can directly predict the human-part semantic mask for all persons and binary masks for instance-level persons in parallel. The parsing result of each person can be acquired via a Hadamard product between the human-part semantic mask and the corresponding person's binary mask. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art methods on the CIHP and MHP v2 datasets.
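
The final combination step described above can be sketched as an element-wise (Hadamard) product. This is a minimal sketch with toy data: the label encoding (0 = background, 1 = head, 2 = torso), the mask sizes, and the function name `parse_person` are assumptions, not the authors' implementation.

```python
def parse_person(part_mask, person_mask):
    """Element-wise product of a part-label mask and a binary person mask."""
    return [[p * m for p, m in zip(part_row, person_row)]
            for part_row, person_row in zip(part_mask, person_mask)]

# Human-part semantic mask predicted for all persons in the image.
part_mask = [[1, 1, 0, 1],
             [2, 2, 0, 2]]
# Binary instance mask of one person (left half of the image).
person_a = [[1, 1, 0, 0],
            [1, 1, 0, 0]]

result = parse_person(part_mask, person_a)
print(result)
```

Pixels outside the selected person are zeroed, while pixels inside it keep their part labels, so each person's parsing result needs no separate per-person parsing pass.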

Abstract:
Event cameras, such as dynamic vision sensors (DVS), are biologically inspired vision sensors that have advanced over conventional cameras in high dynamic range, low latency and low power consumption, showing great application potential in many fields. Event cameras are more sensitive to junction leakage current and photocurrent as they output differential signals, losing the smoothing function of the integral imaging process in the RGB camera. The logarithmic conversion further amplifies noise, especially in low-contrast conditions. Recently, researchers proposed a series of datasets and evaluation metrics but limitations remain: 1) the existing datasets are small in scale and insufficient in noise diversity, which cannot reflect the authentic working environments of event cameras; and 2) the existing denoising evaluation metrics are mostly referenced evaluation metrics, relying on APS information or manual annotation. To address the above issues, we construct a large-scale event denoising dataset (multilevel benchmark for event denoising, E-MLB) for the first time, which consists of 100 scenes, each with four noise levels, that is 12 times larger than the largest existing denoising dataset. We also propose the first nonreference event denoising metric, the event structural ratio (ESR), which measures the structural intensity of given events. ESR is inspired by the contrast metric, but is independent of the number of events and projection direction. Based on the proposed benchmark and ESR, we evaluate the most representative denoising algorithms, including classic and SOTA, and provide denoising baselines under various scenes and noise levels. The corresponding results and codes are available at https://github.com/KugaMaxx/cuke-emlb.

Abstract:
Portrait retouching aims to improve the aesthetic quality of input portrait photos and especially requires human-region priority. The deep learning-based methods largely elevate the retouching efficiency and provide promising retouched results. However, existing portrait retouching methods focus on automatic retouching, which treats all human-regions equally and ignores users' preferences for specific individuals, thus suffering from limited flexibility in interactive scenarios. In this work, we emphasize the importance of users' intents and explore the interactive portrait retouching task. Specifically, we propose a region-aware retouching framework with two branches: an automatic branch and an interactive branch. The automatic branch involves an encoding-decoding process, which searches region candidates and performs automatic region-aware retouching without user guidance. The interactive branch encodes sparse user guidance into a priority condition vector and modulates latent features with a region selection module to further emphasize the user-specified regions. Experimental results show that our interactive branch effectively captures users' intents and generalizes well to unseen scenes with sparse user guidance, while our automatic branch also outperforms the state-of-the-art retouching methods due to improved region-awareness.

Affiliations: Key Lab of Big Data and Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University, Beijing, China; Institute for Infocomm Research, A*STAR, Singapore; School of Computer Science and Information Technology, Beijing Jiaotong University, Beijing, China; School of Economics and Management, Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China; Institute of Information Science, Beijing Jiaotong University, Beijing, China

Abstract:
Interactive segmentation aims to generate high-quality pixel-level predictions from a few user-provided clicks, and it is gaining attention for its convenience in segmentation data annotation. Users can iteratively refine the prediction by adding clicks until the result is satisfactory. Existing interactive methods usually transform the clicks into a set of localization maps by Euclidean distance computation or RGB texture extraction to guide the segmentation, which makes the click transformation a core module in interactive segmentation networks. However, when applied to human images, where large pose variations, occlusions, and poor illumination are prevalent, prior transformation methods tend to cause uncorrectable overlaps across localization maps, which then fail to match the human parts well. Furthermore, the inappropriately transformed information is hard to refine with a static transformation, which is out of tune with the dynamically refined interaction process. Hence, we design a dynamic transformation scheme for interactive human parsing (IHP) named Dynamic Interaction Dilation Net (DID-Net), which serves as an initial attempt to break the limitations of static transformation while capturing long-range dependencies of clicks within each human part. Specifically, we construct a Dynamic Dilation Module (DD-Module) that dilates clicks radially in several directions, assisted by human body edge detection, to refine the dilation quality in each interaction iteration. Furthermore, we propose an Adaptive Interaction Excitation Block (AIE-Block) to exploit potential semantic clues buried in the dilated clicks. Our DID-Net achieves state-of-the-art performance on 3 public human parsing benchmarks.

Abstract:
Standard approaches for video action recognition usually operate on full input videos, which is inefficient due to the widespread spatio-temporal redundancy in videos. The recent progress in masked video modelling, specifically VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts using limited visual content. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating only on a portion of the videos. MAR includes two essential components: cell running masking and bridging classifier. Specifically, to enable the ViT to perceive the details beyond the visible patches, cell running masking is used to preserve the spatio-temporal correlations in videos. This ensures that the patches at the same spatial location can be observed in turn for easy reconstructions. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this issue, we propose a bridging classifier that can help fill the semantic gap between the ViT encoded features used for reconstruction and the specialized features used for classification. Our proposed MAR can reduce the computational cost of ViT by 53%. Extensive experiments have demonstrated that MAR consistently outperforms existing ViT models by a notable margin. Notably, we found that a ViT-Large model fine-tuned by MAR achieves comparable performance to a ViT-Huge model fine-tuned by standard training methods on both Kinetics-400 and Something-Something v2 datasets. Moreover, the computation overhead of our ViT-Large model is only 14.5% of that of the ViT-Huge model.

Abstract:
Images with low quality factor (QF) are widely available and apposite as steganography cover, which will be JPEG recompressed with a preset larger QF when uploaded to online social networks. This scenario is known as “Upward Robust,” which is currently a hotspot of robust steganography. The state-of-the-art algorithm is Generalized dither Modulation-based robust Adaptive Steganography (GMAS). However, GMAS can only realize limited resistance to detection and compression due to robust domain selection. To overcome this problem, we meticulously explore three lossy operations in JPEG recompression and discover that the key problem is spatial overflow. Then, two preprocessing methods, overall scaling (OS) and specific truncation (ST), were presented to remove overflow before message embedding and generate a reference image. After pre-processing, the stability of the image coefficients during JPEG recompression will be significantly enhanced. Therefore, we no longer need robust domain selection and all coefficients are eligible as cover, which improves security and embedding capacity. Additionally, the reference image was employed as guidance to build asymmetric distortion for removing overflow during embedding. Experimental results show that the proposed methods significantly surpass GMAS in terms of security and achieve comparable robustness.

Affiliations: College of Computer and Information Science, College of Software, Southwest University, Chongqing, China; School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China; Faculty of Information Technology, Monash University, Mulgrave, VIC, Australia; School of Computer Science, University of Sydney, Camperdown, NSW, Australia; School of Automation, Beijing Information Science and Technology University, Beijing, China; School of Computer Science, Sichuan University, Chengdu, China

Abstract:
Deep dictionary learning (DDL) shows good performance in visual classification tasks. However, almost all existing DDL methods ignore the locality relationships between the input data representations and the learned dictionary atoms, and learn sub-optimal representations in the feature coding stage, which are less conducive to classification. To this end, we propose a hierarchical locality-aware deep dictionary learning (HILADLE) framework for classification, which can learn locality-constrained dictionaries at different abstract levels through hierarchical dictionary learning. The locality constraints play an important role in learning informative dictionary atoms while preserving the data structure in the original input feature space. Moreover, instead of using an identity activation function like existing DDL methods, we further boost the generalization performance of our HILADLE method with a ReLU activation function to deal with the overfitting issue caused by over-parameterization, inspired by its effectiveness in deep neural networks. Finally, the concatenation of all feature representations learned at different layers is used as input to the final classifier. We demonstrate, through an extensive set of experiments on several benchmark face recognition, image classification, and age estimation datasets, that our method is able to surpass several dictionary learning, deep dictionary learning and deep learning methods.

Abstract:
Some unsupervised approaches have been proposed recently for the person re-identification (ReID) problem since annotations of samples across cameras are time-consuming. However, most of these methods focus on the appearance content of the sample itself, and thus seldom take the structure relations among samples into account when learning the feature representation, which would provide a valuable guide for learning the representations of the samples. Thus hard samples may not be well solved due to the limited or even misleading information of the sample itself. To address this issue, in this article, we propose a Relation-Preserving Feature Embedding (RPE) model that leverages structure relations among samples to boost the performance of the unsupervised person ReID methods without requiring any sample annotations. RPE aims at integrating the sample content and the neighborhood structure relations among samples into the learning of feature embeddings by combining the advantages of the autoencoder and graph autoencoder. Specifically, a relation and content information fusion (RCIF) module is proposed to dynamically merge the information from both perspectives of content and relation levels for feature embedding learning. Also, due to the lack of the identity labels of samples, we adopt an adaptive optimization strategy to update the affinity relations among samples instead of the reconstruction of the whole affinity matrix for optimizing the RPE model, which is more suitable for the unsupervised ReID task. Rigorous experiments on three widely-used large-scale benchmarks for person ReID demonstrate the superiority of the proposed method over current state-of-the-art unsupervised methods.

Abstract:
Semantic segmentation achieves significant success through large-scale training data. Meanwhile, few-shot semantic segmentation was proposed to segment image regions of novel classes through few labeled testing data. However, it ignores classes previously learned from training data. This paper proposes a new segmentation framework called segmentation by dynamic prototype (SDP), which can simultaneously segment the image regions of base classes learned from many training data and novel classes learned from a few testing data. SDP performs segmentation by searching for the nearest prototype of each pixel's features. Different prototypes are representative features for different classes. In testing, SDP dynamically constructs novel prototypes according to support images while maintaining base prototypes learned from training images. The main challenge of SDP is how to achieve intra-image compactness, intra-class compactness, and inter-class separability simultaneously in the feature space. To tackle these challenges, we first introduce a discriminative pixelwise feature and prototype training method to improve the above three types of feature discriminability. We then introduce a mask refinement process in testing, which refines support images' masks to extract more separable novel prototypes. In addition, we introduce a prototype adaptation process in testing, which allows all prototypes to adapt to query images to reduce the prototypes' intra-class variances. Our approach achieves state-of-the-art results for novel class segmentation against existing few-shot semantic segmentation methods on PASCAL-5i and COCO-20i benchmarks in combination with superior runtime efficiency. In addition, our method can maintain strong results in the segmentation of base classes.
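
The nearest-prototype idea can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, 2-D features, and the class names and helper names (`nearest_prototype`, `segment`) are all hypothetical, since the abstract does not specify the distance measure or feature dimension.

```python
import math

def nearest_prototype(pixel_feature, prototypes):
    """Return the class whose prototype is closest to the pixel feature."""
    return min(prototypes, key=lambda name: math.dist(pixel_feature, prototypes[name]))

def segment(feature_map, base_prototypes, novel_prototypes):
    """Label every pixel using base and novel prototypes together."""
    prototypes = {**base_prototypes, **novel_prototypes}
    return [[nearest_prototype(f, prototypes) for f in row] for row in feature_map]

# Base prototypes learned in training; a novel one built from support images at test time.
base = {"sky": (0.0, 1.0), "road": (1.0, 0.0)}
novel = {"kite": (1.0, 1.0)}
feature_map = [[(0.1, 0.9), (0.9, 0.95)],
               [(0.9, 0.1), (0.8, 0.2)]]

labels = segment(feature_map, base, novel)
print(labels)
```

Because base and novel prototypes live in one dictionary, adding a novel class at test time only adds an entry and leaves the base classes untouched.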

Abstract:
The performance of existing methods for multi-person 3D pose estimation in crowded scenes is still limited by heavy overlap among persons. To address this issue, we propose a progressive inference scheme, Articulation-aware Knowledge Exploration (AKE), which improves multi-person 3D pose models on samples with complex occlusions at the inference stage. We argue that it is beneficial to explore the underlying articulated information/knowledge of the human body, which helps to further correct the predicted poses in those samples. To exploit such information, we propose an iterative scheme to achieve a self-improving loop for keypoint association. Specifically, we introduce a kinematic validation module for locating unreasonable articulations and an occluded-keypoint discovering module for discovering occluded articulations. Extensive experiments on two challenging benchmarks under both weakly-supervised and fully-supervised settings demonstrate the superiority and generalization ability of our proposed method for crowded scenes.

Abstract:
Due to the excellent rendering capabilities, GPUs are mainstream accelerators in the Cloud-rendering industry. However, current Cloud-rendering systems suffer from a CPU-GPU workload imbalance that not only degrades application performance but also causes a significant waste of GPU resources. Recent proposals (such as API-forwarding and c-GPU) for improving CPU-GPU balance are promising but fail to solve system-resource redundancy issues (i.e., each instance tends to occupy all resources, exceeding its requirements). Such behavior will increase CPU load and lower effective GPU utilization. To demonstrate the severity of the issue, we evaluated real-world applications and results show that in most cases, nearly 50% of resources are useless. To solve this problem, we present CARE, the first framework intended to reduce the system-level redundancy by cloudifying the system from monolithic to Cloud-native. To allow users to configure required services, CARE puts forward a functional unit called Configurable Android (CA). To allow multiple instances to share certain types of resources, CARE innovates Sharing Resource (SR). To reduce the unused services, CARE introduces Pruning Resources (PR). To further alleviate the CPU pressure and achieve CPU-GPU balance, we propose rShare, a system aiming at enhancing CPU effective utilization and increasing Android instance density of the Cloud-rendering platform. Based on Kubernetes, rShare divides all the CPUs into non-overlapping shared CPU pools, allocates instances to pools within milliseconds, and dynamically migrates them by tracking their QoS status. So far, CARE primarily focuses on Android systems and can handle 60 heavyweight instances (e.g., KOG (King of Glory)) on Intel SG1. rShare can apply instance allocation within milliseconds and increase the platform density by 39.4%.

Abstract:
Multi-target multi-camera tracking (MTMCT) is an important application in intelligent transportation systems (ITS). The conventional works follow the tracking-by-detection scheme and use the information of the object image separately while matching the object from different cameras. As a result, the association information from the object image is lost. To utilize this information, we propose an efficient MTMCT application that builds features in the form of a graph and customizes graph similarity to match the vehicle objects from different cameras. We present algorithms for both the online scenario, where only the past images are used to match a vehicle object, and the offline scenario, where a given vehicle object is tracked with past and future images. For offline scenarios, our method achieves an IDF1-score of 0.8166 on the Cityflow dataset, which contains the actual scenes of the city from multiple street cameras. For online scenarios, our method achieves an IDF1-score of 0.75 with an FPS of 14.
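
For readers unfamiliar with the reported metric, IDF1 is commonly defined as 2·IDTP / (2·IDTP + IDFP + IDFN), where IDTP, IDFP, and IDFN count identity-level true positives, false positives, and false negatives. The sketch below applies that definition. The counts are made-up illustrative numbers, not results from the paper.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN); returns 0.0 when undefined."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# Hypothetical counts for illustration only.
score = idf1(idtp=80, idfp=20, idfn=20)
print(score)
```

The score is the harmonic mean of identity precision and identity recall, so it rewards trackers that keep the same identity on a vehicle across cameras, not just ones that detect it.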

Abstract:
Text-to-Image (T2I) synthesis is a cross-modality task that requires a text description as input to generate a realistic and semantically consistent image. To guarantee semantic consistency, previous studies regenerate text descriptions from synthetic images and align them with the given descriptions. However, the existing redescription modules lack explicit modeling of their training objectives, which is crucial for reliable measurement of semantic distance between redescriptions and given text inputs. Consequently, the aligned text redescriptions suffer from training bias caused by the emergence of adversarial image samples, unseen semantics, and mistaken contents from low-quality synthesized images. To this end, we propose a SEMantic distance Adversarial learning (SEMA) framework for Text-to-Image synthesis which strengthens semantic consistency from two aspects: 1) We introduce adversarial learning between the image generator and the text redescription module to mutually promote or demote the quality of generated image or text instances. This learning model ensures accurate redescription of image contents, thus diminishing the generation of adversarial image samples. 2) We introduce two-fold semantic distance discrimination (SEM distance) to characterize semantic relevance between matching text or image pairs. The unseen semantics and mistaken contents will be penalized with a large SEM distance. The proposed discrimination method also simplifies the model training process with no need to optimize multiple discriminators. Experimental results on CUB Birds 200 and MS-COCO datasets show that the proposed model outperforms the state-of-the-art methods.

Abstract:
Despite the great success of deep neural networks in style transfer tasks, the entanglement of content and style in images prevents much of the style information from being captured. To tackle this problem, a novel style disentanglement network is proposed to transfer multi-source style elements. Specifically, we design a learnable content-style separation module, which can efficiently extract content and style components from images in the latent space. This differs from previous approaches, which predefine content and style layers in the network. With content and style separated, we further propose a multi-style swap module, which allows the content image to match more style elements. Additionally, by introducing alternate training strategies for the main and auxiliary decoders as well as a style disentanglement loss, the stylized results look very similar to the original artworks. Experimental results demonstrate the superiority of our proposed method compared with existing schemes.

Abstract:
Domain generalization (DG) for person re-identification (ReID) is a challenging problem, as access to target domain data is not permitted during the training process. Most existing DG ReID methods update the feature extractor and classifier parameters based on the same features. This common practice causes the model to overfit to existing feature styles in the source domain, resulting in sub-optimal generalization ability on target domains. To solve this problem, we propose a novel style interleaved learning (IL) framework. Unlike conventional learning strategies, IL incorporates two forward propagations and one backward propagation for each iteration. We employ the features of interleaved styles to update the feature extractor and classifiers using different forward propagations, which helps to prevent the model from overfitting to certain domain styles. To generate interleaved feature styles, we further propose a new feature stylization approach. It produces a wide range of meaningful styles that are both different and independent from the original styles in the source domain, which caters to the IL methodology. Extensive experimental results show that our model not only consistently outperforms state-of-the-art methods on large-scale benchmarks for DG ReID, but also has clear advantages in computational efficiency.

Abstract:
Automatic video generation is a challenging research topic, attracting interests from different perspectives, including Image-to-Video generation (I2V), Video-to-Video generation (V2V), and Text-to-Video generation (T2V). To pursue more controllable and fine-grained video generation, a novel video generation task, named Text-Image-to-Video generation (TI2V), and a corresponding baseline solution, named Motion Anchor-based video Generator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of the TI2V task. In this article, we present a complete benchmark for the TI2V task which includes synthetic video-text paired datasets, a baseline method, and two evaluation metrics. More specifically: (1) Two versions of synthetic datasets are built based on CATER containing rich combinations of objects and actions, as well as the resulting changes of brightness and shadow. We also provide both explicit and ambiguous text descriptions to support deterministic and diverse video generation, respectively. (2) A refined version of MAGE, dubbed MAGE+, is proposed with an innovative motion anchor structure to store appearance-motion aligned representation, which can be further injected with explicit condition and implicit randomness to model the uncertainty in data distribution. (3) To evaluate the quality of generated video especially given ambiguous description, we introduce action precision and referring expression precision to assess the quality of motion based on captioning-and-matching method. Experiments conducted on proposed datasets, as well as relevant datasets, verify the effectiveness of our baseline and show appealing potentials of TI2V task.

Abstract:
Most existing tracking methods try to represent the target by exploiting as much visual information as possible with various deep networks. However, the appearance model hardly describes the attribute features of the target well, which makes the trackers fail to adapt to complex visual surroundings. In this article, inspired by brain-like intelligence, we propose a One-stream Vision-Language Memory network (OVLM) for object tracking. Firstly, we use the combination of vision and language to build the target model and use the semantic information in the language to compensate for the instability of visual information, making the target model more stable in the face of complex appearance changes. Secondly, to build a more compact target model, we propose a memory token selection mechanism that utilizes linguistic information to eliminate tokens that do not contain target information. Furthermore, to provide better visual information for target modeling, we propose a language-based evaluation method to select high-quality target samples to be stored in the memory. Finally, OVLM achieves a 64.7% success rate on the large-scale tracking benchmark dataset TNL2K, outperforming the previous best result (VLT) by 11.6%. By exposing the possibility of the vision-language memory network, we aim to draw greater attention to it and open up new avenues for vision-language tracking.

Abstract:
In arbitrary shape text detection, locating accurate text boundaries is challenging and non-trivial. Existing methods often suffer from indirect text boundary modeling or complex post-processing. In this article, we systematically present a unified coarse-to-fine framework via boundary learning for arbitrary shape text detection, which can accurately and efficiently locate text boundaries without post-processing. In our method, we explicitly model the text boundary via an innovative iterative boundary transformer in a coarse-to-fine manner. In this way, our method can directly gain accurate text boundaries and abandon complex post-processing to improve efficiency. Specifically, our method mainly consists of a feature extraction backbone, a boundary proposal module, and an iteratively optimized boundary transformer module. The boundary proposal module consisting of multi-layer dilated convolutions will predict important prior information (including classification map, distance field, and direction field) for generating coarse boundary proposals while guiding the boundary transformer's optimization. The boundary transformer module adopts an encoder-decoder structure, in which the encoder is constructed by multi-layer transformer blocks with residual connection while the decoder is a simple multi-layer perceptron network (MLP). Under the guidance of prior information, the boundary transformer module will gradually refine the coarse boundary proposals via iterative boundary deformation. Furthermore, we propose a novel boundary energy loss (BEL) that introduces an energy minimization constraint and an energy monotonically decreasing constraint to further optimize and stabilize the learning of boundary refinement. Extensive experiments on publicly available and challenging datasets demonstrate the state-of-the-art performance and promising efficiency of our method.

Abstract:
Compositional Zero-Shot Learning (CZSL) requires recognizing unseen attribute-object compositions using the visual primitives (attributes and objects) observed in a training set. This is a critical capacity for learning systems because the long tail of new combinations dominates the distribution in the real world. However, CZSL is a challenging problem because learning systems tend to learn the dependencies between objects and attributes, which is not conducive to composition classification, and incorrect dependencies will mislead the classification of new combinations of known attributes and objects. This paper introduces a novel yet effective dual-stream contrastive learning method with two main objectives: making the learned representations discriminative and transferring knowledge more efficiently from seen to unseen compositions. Specifically, we generate positive and negative pairs based on the similarity of different concepts (attributes and objects), independently capturing the discriminative representations of concepts. Meanwhile, unlike existing contrastive methods that select negative samples randomly, we construct confusable compositional representations as the negatives to explore the intrinsic relevance between attributes and objects, which can improve the generalization from seen to unseen compositions. Experimental results on two benchmarks show that the proposed method outperforms state-of-the-art methods.

Abstract:
Fine-grained image datasets have small inter-class differences and large intra-class differences, which is a difficulty of the fine-grained image classification. Traditional fine-grained image classification methods only focus on the visual features of images. However, this limitation can be eliminated when these methods are improved with multimodal information. This paper proposes an improved fine-grained image classification method with multimodal information that includes multimodal data preprocessing, multimodal feature extraction, multi-temporal feature fusion and decision correction. The preprocessing method proposed solves the problems of scattered distribution, difficult processing and uneven contribution to prediction of multimodal data through normalization, packing phrases and weighted concatenating methods. When extracting multimodal features, the SAMLP (Self-Attention MLP) module proposed combines self-attention with MLP to capture the internal correlation of multimodal information. The multi-temporal feature fusion proposed is divided into early feature fusion and late feature fusion. The former refers to adding multimodal information markers to the original image, and the latter refers to designing a multi-cascade dynamic MLP structure to fuse visual features and multimodal features. In view of the limitation of feature fusion, a decision strategy is proposed to revise the prediction results of fused features according to the prediction results of multimodal features. Ablation experiment on INAT18-1K and INAT21-1K datasets shows that our method is effective in improving classification with multimodal information. Experiments on the INAT2021_mini large dataset show that the comprehensive method in this article has higher accuracy and negligible efficiency loss compared with the state-of-the-art method.

Abstract:
One main challenge of visible-infrared person re-identification (VI Re-ID) lies in the large style discrepancy between the heterogeneous data. We present a STyle-Agnostic Representation learning (STAR) framework that bridges the modality gaps at both data and feature levels in a progressive manner. At the data level, we present Cross Modality Blending (CMB), a powerful and parameter-free data augmentation scheme that smoothly synthesizes intermediate modalities by conducting identity-preserving patch exchange and smooth cross-modality blending. At the feature level, we explore the inter-modality feature alignment problem from a new perspective of the style-related feature statistics. Specifically, we design a plug-and-play Adaptive Style Normalization (ASN) module to discard the intrinsic style distractors without losing discriminative content via dual-level adaptive distribution normalization and discriminability compensation. Moreover, considering that an appropriate modality intermediary can convey relevant information on the inter-modality distribution shift, we propose Reciprocal Modality Bridging Learning (RMBL) to better steer the modality bridging process. Two lightweight modality transformation modules are designed in RMBL to model an appropriate intermediate space by manipulating high-order statistics under our shortest distance constraint. Meanwhile, intermediary-guided distribution alignment is reciprocally conducted to align heterogeneous features to the modality intermediary. Experiments on VI Re-ID benchmarks demonstrate the superiority and flexibility of STAR over state-of-the-art methods.

Abstract:
Face hallucination in low-light environments is an extremely challenging task due to the significant loss of facial structure and facial texture information. Although cascading image relighting and face hallucination tasks is a feasible strategy, simply cascading these two tasks does not achieve satisfactory results because they do not fit into each other naturally. In this article, we propose a novel duplex fusing-embedding learning approach to tackle this challenge in low-light environments. The core of the proposed approach is the duplexity of feature fusion and embedding between relighting and hallucination tasks. In the feature fusion phase, the shallow features from the two tasks are bidirectionally fused and activated into a consistent feature space. In the feature embedding phase, the fused features from the previous iteration are fed back and bidirectionally embedded into the deep features of the two tasks in the current iteration so that they can learn feature representations that consistently represent both tasks, thereby boosting the performance of relighting and hallucination to generate photorealistic HR face images. Experimental results show that the proposed approach allows current face hallucination methods to learn to hallucinate faces in the dark.

Abstract:
Deep learning-based image compressive sensing (CS) methods have achieved great success in the past few years. However, most of them are content-independent, with a spatially uniform sampling rate allocation for the entire image. Such practices may degrade the performance of image CS with block-based sampling, since the content of different blocks in an image is different. In this article, we propose a novel rate-adaptive image CS neural network (dubbed RACSNet) to achieve adaptive sampling rate allocation based on the content characteristics of the image with a single model. Specifically, a measurement domain-based reconstruction distortion is first used to guide the sampling rate allocation for different blocks in an image without access to the ground truth image. Then, a step-wise training strategy is designed to train a reusable sampling matrix, which is capable of sampling image blocks to generate the compressed measurements under arbitrary sampling rates. Subsequently, a pyramid-shaped initial reconstruction sub-network and a hierarchical deep reconstruction sub-network that fuse the measurement information of different scales are put forward to reconstruct image blocks from the compressed measurements. Finally, a reconstruction distortion map and an improved loss function are developed to eliminate the blocking artifacts and further enhance the CS reconstruction. Experimental results on both objective metrics and subjective visual qualities show that the proposed RACSNet achieves significant improvements over the state-of-the-art methods.

Abstract:
Captured outdoor scene images are easily affected by haze. Most image dehazing methods have limited generalization capabilities for real-world hazy images owing to the complexities of real-world environments and domain gaps in the training datasets. This article proposes a semi-supervised single-image dehazing network based on disentangled meta-knowledge. The symmetric and heterogeneous design of the disentangled network is conducive to the separation of the content and mask features of hazy images and these features are used as meta-knowledge to guide feature fusion in the dehazing network. Moreover, functions describing constant-color and disentangled-reconstruction-checking losses are designed to ensure the subjective qualities of the generated dehazed images. The results of extensive experiments conducted on synthetic datasets and real-world images indicate that the proposed algorithm outperforms state-of-the-art single-image dehazing algorithms. In addition, the algorithm effectively improves the performance of object-detection tasks.

Abstract:
Aiming to match the person identity between daytime VISible (VIS) and nighttime Near-InfraRed (NIR) images, VIS-NIR re-identification (Re-ID) has attracted increasing attention due to its wide applications in low-light scenes. However, dramatic modality discrepancies between VIS and NIR images lead to a considerable intra-class gap in the feature space, which impacts identity matching. To bridge the modality gap, we propose a Tri-level Modality-information Disentanglement (TMD) to disentangle modality information at the levels of raw image, feature distribution and instance features. Our model consists of three key modules, namely Style-Aligned Converter (SAC), Two-Steps Wasserstein Loss (TSWL) and Self-supervised Orthogonal Disentanglement (SOD), to handle the modality information at the three levels. Firstly, aiming at reducing modality discrepancy at the image level, the SAC is introduced to generate style-aligned images by the designed style converter and $\mathcal{A}$-distance learning approach. The SAC can effectively alleviate the style discrepancy between VIS and NIR images with a negligible increase in model complexity. Secondly, considering the heterogeneity of VIS and NIR feature distributions caused by the structure- and style-misaligned raw images, we propose the TSWL to decrease the VIS-NIR gap at the distribution level by two distribution alignment steps. Specifically, after generating style-consistent images, we eliminate modality-related discrepancy by aligning the distribution between structure-aligned original and generated VIS/NIR images and bridge the modality-unrelated gap by aligning the style-consistent generated VIS-NIR images. Thirdly, focusing on further reducing the modality discrepancy at the instance level, the SOD is presented to construct orthogonal constraints between the extracted modality- and identity-related features. Since the modality-related factors are disentangled from the instance features, the proposed TMD efficiently learns modality-unrelated and identity-discriminative representations, which benefit person Re-ID on VIS-NIR images. Comprehensive experiments are carried out on two cross-modality pedestrian Re-ID datasets to demonstrate the effectiveness of TMD.
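
One common way to express an orthogonal constraint between two feature vectors is to penalize their squared cosine similarity. The sketch below assumes that form; the abstract does not give the actual SOD loss, and the function name and the `eps` stabilizer are hypothetical.

```python
import math

def orthogonality_penalty(modality_feat, identity_feat, eps=1e-12):
    """Squared cosine similarity: 0 when orthogonal, about 1 when parallel.

    Assumed penalty form for illustration; not the paper's SOD loss.
    """
    dot = sum(m * i for m, i in zip(modality_feat, identity_feat))
    norm_m = math.sqrt(sum(m * m for m in modality_feat))
    norm_i = math.sqrt(sum(i * i for i in identity_feat))
    cos = dot / (norm_m * norm_i + eps)  # eps guards against zero vectors
    return cos * cos

orthogonal = orthogonality_penalty([1.0, 0.0], [0.0, 1.0])  # disentangled
parallel = orthogonality_penalty([1.0, 2.0], [2.0, 4.0])    # fully entangled
```

Minimizing such a penalty pushes the modality-related and identity-related features toward orthogonal directions, which is the intuition behind the disentanglement described above.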

Abstract:
Deep learning models for point clouds have been shown to be vulnerable to adversarial attacks, which have received increasing attention in various safety-critical applications such as autonomous driving, robotics, and surveillance. Since existing 3D attack methods either modify the local points or perform global point-wise perturbations over the point cloud, they fail to capture the dependency between neighboring points for preserving the geometrical context and topological smoothness of the original 3D object. In this article, we propose a novel Geometry-Dependent Attack (GDA), which aims to generate more robust adversarial point clouds with lower perturbation costs by capturing and preserving the geometry-guided topology information. Specifically, we first analyze the geometric information of each benign point cloud following graph signal processing and disentangle it into low-frequency (flat) and high-frequency (contour) components. Then, considering the varying characteristics of smoothness and sharpness after disentanglement, we design two collaborative patch-aware and point-aware attacks to perturb these two components separately to misclassify the 3D object. We test the proposed GDA attack using five popular point cloud networks (PointNet, PointNet++, DGCNN, PointTransformer, and PointMLP) on both the ModelNet40 and ShapeNetPart datasets. Experimental results show that our GDA attack achieves 100% success rates with the lowest perturbation cost. It also demonstrates an increased capability to defeat several existing defense models compared with other competing attacks.

Abstract:
The task of Human-Object Interaction (HOI) detection is to detect humans and their interactions with surrounding objects, where transformer-based methods show dominant advances currently. However, these methods ignore the relationship among humans, objects, and interactions: 1) human features are more contributive than object ones to interaction prediction; 2) interactive information disturbs the detection of objects but helps human detection. In this article, we propose a Human and Object Disentangling Network (HODN) to model the HOI relationships explicitly, where humans and objects are first detected by two disentangling decoders independently and then processed by an interaction decoder. Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions with human features as the positional embeddings. To handle the opposite influences of interactions on humans and objects, we propose a Stop-Gradient Mechanism to stop interaction gradients from optimizing the object detection but to allow them to optimize the human detection. Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets. It can be combined with existing methods easily for state-of-the-art results.

Affiliations: Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Wangxuan Institute of Computer Technology, Peking University, Beijing, China; Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; School of Software, Dalian University of Technology, Dalian, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China

Abstract:
This article studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query. Traditional TSG methods mainly follow the top-down or bottom-up framework and are not end-to-end. They rely heavily on time-consuming post-processing to refine the grounding results. Recently, some transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods achieve significant performance to some extent, they treat the frames of the video and the words of the query equally as transformer input for correlation, failing to capture their different levels of granularity with distinct semantics. To address this issue, in this article, we propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities for learning more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. Besides, we develop a new cross-modal cycle-consistency loss to enforce interaction between the two modalities and encourage the semantic alignment between them. Finally, we design a brand-new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets (ActivityNet Captions, Charades-STA and TACoS) show that our proposed HLGT achieves a new state-of-the-art performance, demonstrating its effectiveness and computational efficiency.

Abstract:
The task of temporal language grounding (TLG), which aims to locate a video moment within an untrimmed video that matches a given textual query, has attracted considerable research attention in recent years. Typical retrieval-based TLG methods are inefficient due to their reliance on a large number of pre-segmented candidate moments, while localization-based TLG solutions adopt reinforcement learning, resulting in unstable convergence. Meanwhile, the cutting-edge capabilities of multi-modal architectures, especially the pre-training paradigm, have not been fully exploited. Therefore, performing the TLG task efficiently and stably is non-trivial. In this work, we propose a novel TLG solution named Multi-modal Multi-Prompt Tuning (MMPT), which formulates the TLG task as a prompt-based multi-modal problem and integrates multiple sub-tasks to tune the performance. In this way, off-the-shelf pre-trained models can be directly leveraged to achieve more stable performance. Specifically, a flexible multi-prompt strategy is first used to rewrite the query so that the prompt contains the query and the start and end timestamps. Various prompt templates are integrated to enhance robustness. Thereafter, a multi-modal Transformer is adopted to fully learn the multi-modal context. Moreover, we design various sub-tasks to optimize this novel framework, including the matching task, the localization task and the joint learning task. Extensive experiments on two real-world datasets validate the effectiveness and rationality of our proposed solution.
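
The multi-prompt idea of rewriting a query together with start and end timestamp slots can be sketched with plain string templates. The template wording, the `[MASK]` placeholder and the function name below are hypothetical; the abstract does not specify the actual templates used by MMPT.

```python
# Hypothetical prompt templates; the wording is an assumption for illustration.
TEMPLATES = [
    "The moment '{query}' starts at {start} and ends at {end}.",
    "From {start} to {end}: {query}",
]

def build_prompts(query, start="[MASK]", end="[MASK]"):
    """Rewrite one query into several prompts with timestamp slots."""
    return [t.format(query=query, start=start, end=end) for t in TEMPLATES]

prompts = build_prompts("a person opens the door")
```

Each prompt carries the same query and timestamp slots in different wording, which mirrors the stated goal of integrating several templates for robustness.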

Abstract:
Traditional image/video compression compresses the highly redundant visual data while preserving signal fidelity. Recently, cross modal compression (CMC) was proposed to compress the data into a compact, human-comprehensible domain (such as text) with an ultra-high compression ratio while preserving semantic fidelity for machine analysis and semantic monitoring. CMC has a constant rate because the CMC encoder can only represent the data with a fixed grain. In practice, however, a variable rate is necessary due to complicated and dynamic transmission conditions, different storage mediums, and diverse levels of application requirements. To deal with this problem, in this paper, we propose variable rate cross modal compression (VR-CMC), where we introduce a variable rate prompt to represent the data with different grains. The variable rate prompt is composed of three strategies: 1) target length prompt (TLP) introduces the target length into the language prompt to guide the generation of the text representation; 2) decaying EOS probability (DEP) exponentially decays the probability of the EOS token with regard to the decoding step and target length, where the EOS (end-of-sequence) token is a special token indicating the end of the text; 3) text augmentation (TA) enriches the training data and makes the text representation length more balanced during training. Experimental results show that our proposed VR-CMC can effectively control the rate in the CMC framework and achieve state-of-the-art performance on the MSCOCO and IM2P datasets.
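
The decaying EOS probability strategy can be illustrated as follows. The abstract states only that the EOS probability is decayed exponentially with respect to the decoding step and target length; the specific factor exp(-rate * remaining steps), the renormalization and the `rate` parameter are assumptions of this sketch.

```python
import math

def decay_eos(probs, eos_index, step, target_length, rate=1.0):
    """Scale the EOS probability down while the text is shorter than the target.

    Assumed decay: exp(-rate * remaining), where remaining is the number of
    steps left before target_length. The result is renormalized to sum to 1.
    """
    remaining = max(target_length - step, 0)
    scale = math.exp(-rate * remaining)
    adjusted = list(probs)
    adjusted[eos_index] = probs[eos_index] * scale
    total = sum(adjusted)
    return [p / total for p in adjusted]

token_probs = [0.5, 0.3, 0.2]  # toy distribution; index 2 is the EOS token
early = decay_eos(token_probs, eos_index=2, step=0, target_length=5)
late = decay_eos(token_probs, eos_index=2, step=5, target_length=5)
```

Far from the target length the EOS token is strongly suppressed, and once the target length is reached the distribution is left unchanged, so decoding tends to stop near the requested length.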

Abstract:
Existing RF-based human pose estimation methods usually require intensive computations and cannot meet the real-time processing and portability requirements for mobile devices. To tackle this limitation, in this article, we introduce a lightweight RF-based pose estimation model, i.e., MobiRFPose, to construct a portable RF-based pose camera. Different from traditional optical cameras, the RF-based camera does not capture visual information, which makes it privacy-preserving. Specifically, we only utilize a horizontal antenna array to transceive RF signals, then estimate the human locations on the RF signal heatmap and crop the human location regions, and finally estimate the fine-grained human poses based on the cropped small RF signal heatmaps. To evaluate the performance, we compare MobiRFPose with state-of-the-art methods. Experimental results demonstrate that MobiRFPose can achieve accurate 3D human pose estimation with fewer parameters and computations. We also test the trained MobiRFPose model using mobile computing devices, where the model structures and parameters only take up 268 KB and 3226 KB of disk space, and MobiRFPose can achieve 66 FPS processing speed. The pose estimation error is 11.05 cm in the case of a single person and 11.29 cm in the case of multiple people. All experimental results indicate that our proposed method can construct a portable RF camera to estimate human poses accurately.

Abstract:
Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented compared to the majority ones. Recent research prefers to utilize multi-expert architectures to mitigate the model uncertainty on the minority, where collaborative learning is employed to aggregate the knowledge of experts, i.e., online distillation. In this article, we observe that the knowledge transfer between experts is imbalanced in terms of class distribution, which results in limited performance improvement of the minority classes. To address it, we propose a re-weighted distillation loss by comparing two classifiers' predictions, which are supervised by online distillation and label annotations, respectively. We also emphasize that feature-level distillation will significantly improve model performance and increase feature robustness. Finally, we propose an Effective Collaborative Learning (ECL) framework that integrates a contrastive proxy task branch to further improve feature quality. Quantitative and qualitative experiments on four standard datasets demonstrate that ECL achieves state-of-the-art performance and the detailed ablation studies manifest the effectiveness of each component in ECL.

Abstract:
Adversarial examples have attracted widespread attention in security-critical applications because of their transferability across different models. Although many methods have been proposed to boost adversarial transferability, a gap still exists between capabilities and practical demand. In this article, we argue that the model-specific discriminative regions are a key factor causing overfitting to the source model, and thus reducing the transferability to the target model. For that, a patch-wise mask is utilized to prune the model-specific regions when calculating adversarial perturbations. To accurately localize these regions, we present a learnable approach to automatically optimize the mask. Specifically, we simulate the target models in our framework, and adjust the patch-wise mask according to the feedback of the simulated models. To improve the efficiency, the differential evolutionary (DE) algorithm is utilized to search for patch-wise masks for a specific image. During iterative attacks, the learned masks are applied to the image to drop out the patches related to model-specific regions, thus making the gradients more generic and improving the adversarial transferability. The proposed approach is a preprocessing method and can be integrated with existing methods to further boost the transferability. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our method. We incorporate the proposed approach with existing methods to perform ensemble attacks and achieve an average success rate of 93.01% against seven advanced defense methods, which can effectively enhance the state-of-the-art transfer-based attack performance.
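
Applying a patch-wise binary mask to drop selected image patches can be sketched as below. Only the masking step is shown; the mask search with differential evolution and the simulated target models are omitted, and the function name, the single-channel nested-list image and the 0/1 mask convention are assumptions.

```python
def apply_patch_mask(image, mask, patch_size):
    """Zero out every patch whose mask entry is 0.

    image: H x W nested list of pixel values (single channel for brevity).
    mask:  (H // patch_size) x (W // patch_size) nested list of 0/1 flags.
    """
    return [
        [pixel * mask[r // patch_size][c // patch_size]
         for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
mask = [[1, 0],
        [0, 1]]  # drop the top-right and bottom-left 2x2 patches
masked = apply_patch_mask(image, mask, patch_size=2)
```

In an iterative attack, gradients would then be computed on the masked image so that the dropped patches no longer contribute to the perturbation.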

Abstract:
Incomplete multi-view clustering is gaining increased attention owing to its great success in mining underlying information from the missing views. However, the existing approaches still encounter two issues: 1) They generally do not give sufficient consideration to the robustness of incomplete multi-view data with noise; 2) They only exploit the low-rank structures in the intra-view graphs, while the low-rank priors embedded in inter-view graphs are ignored. To this end, we propose a Robust Tensor Recovery for Incomplete Multi-view Clustering (RIMC) method, which transforms the view-missing problem into the tensor graph recovery problem by manipulating the comprehensive low-rank priors. Specifically, RIMC first employs a marginalized denoising operation to construct robust graphs and further builds a tensor graph by stacking these robust graphs. Then, we develop a novel tensor completion to recover the tensor graph by performing comprehensive low-rank priors: low-rank structures in the inter-view graphs (i.e., horizontal and lateral slices); low-rank structures in the intra-view graphs (i.e., frontal slices). Meanwhile, we integrate the tensor completion and spectral clustering to learn a unified indicator matrix. Extensive experiments show the promising performance of our method.

Abstract:
Referring Expression Comprehension (REC) is a multimodal comprehension task that aims to locate an object in an image, given a text description. Traditionally, during the existing REC tasks, there has been a basic assumption that the given text expression and the image are usually exactly matched to each other. However, in real-world scenarios, there is uncertainty in how well the image and text match each other exactly. Illegible objects in the image or ambiguous phrases in the text have the potential to significantly degrade the performance of conventional REC tasks. To overcome these limitations, we consider a more practical and comprehensive REC task, where the given image and its referring text expression can be inexactly matched. Our models aim to correct such inexact matching and supply corresponding interpretations. We refer to this task as Further REC (FREC). This task is divided into three subtasks: 1) correcting the erroneous text expression using visual information, 2) generating the rationale for this input expression, and 3) localizing the proper object based on the corrected expression. We introduce three new datasets for FREC: Further-RefCOCOs, Further-Copsref and Further-Talk2Car. These datasets are based on the existing REC datasets, including RefCOCO and Talk2Car. We developed a novel pipeline architecture to execute the three subtasks simultaneously in an end-to-end fashion. Next, we developed an elastic masked language modeling (EMLM) training head to rectify text errors with uncertain lengths. Our experimental results demonstrate the validity of our proposed pipeline. We hope this work sparks more research focused on inexactly matched REC.

Abstract:
Zero-shot action recognition (ZSAR) aims to recognize unseen action categories in the test set without corresponding training examples. Most existing zero-shot methods follow the feature generation framework to transfer knowledge from seen action categories to model the feature distribution of unseen categories. However, due to the complexity and diversity of actions, it remains challenging to generate unseen feature distribution, especially for the cross-dataset scenario when there is a potentially larger domain shift. This article proposes a Deconfounding Causal GAN (DeCalGAN) for generating unseen action video features with the following technical contributions: 1) Our model unifies compositional ZSAR with traditional visual-semantic models to incorporate local object information with global semantic information for feature generation. 2) A GAN-based architecture is proposed for causal inference and unseen distribution discovery. 3) A deconfounding module is proposed to refine representations of local objects and global semantic information confounder in the training data. Action descriptions and random object features after causal inference are then used to discover unseen distributions of novel actions in different datasets. Our extensive experiments on Cross-Dataset Zero-Shot Action Recognition (CD-ZSAR) demonstrate substantial improvement over the UCF101 and HMDB51 standard benchmarks for this problem.

Affiliations: School of Software, South China Normal University, Foshan, China; School of Fashion and Textiles, The Hong Kong Polytechnic University, Kowloon, Hong Kong; Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory, Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China

Abstract:
In unsupervised domain adaptation (UDA), negative transfer is one of the most challenging problems. Due to complex environments, the domain data used are often corrupted by noise or outliers in many applications. If the noisy data are directly used for domain adaptation, the disturbances and negative influence of the noise are also transferred to the target tasks. Thus, preventing the disturbances and negative effects caused by noise is a key problem in UDA that needs to be addressed. In this article, a low-rank correlation learning (LRCL) method is proposed for UDA. In LRCL, the noisy domain data are recovered by low-rank learning, so that the data of both domains are cleaned. Hence, the disturbances and negative effects of the noise are prevented. The maximally correlated features of the clean data from the source and target domains are learned by a novel correlation regularization term in a latent common space. LRCL also reduces the distribution difference of the learned clean source and target data by constructing a reconstruction term, in which the clean target data are linearly represented by the clean source data. To explore the temporal and structural information of the data, we further extend LRCL into a graph case and propose graph LRCL (GLRCL). Extensive experiments have been conducted on several public data benchmarks, and the experimental results demonstrate that our methods can effectively prevent negative transfer and obtain better classification outcomes than other compared approaches.

Abstract:
Distortion compensation is a common way to cope with the distortion drift problem in coefficient-domain HEVC steganography. However, it leaves obvious steganographic traces called centralized error (CER). Current coefficient-domain HEVC steganography is therefore vulnerable to CER-based steganalysis. In this article, a novel adaptive HEVC steganography that can resist CER-based steganalysis is proposed. First, the difference in CER between H.264/AVC and HEVC is introduced, and the CER feature in HEVC is re-modeled. Then, from the two aspects of overall average distribution and single-frame distribution, we conclude that there is a strong correlation among the four components of the CER feature. Last, an adaptive cost function is proposed that maintains the distribution of one component to resist steganalysis. Experimental results show that the proposed cost function can effectively improve security compared with other coefficient-based HEVC steganography. In addition, the proposed steganography outperforms other HEVC steganography in visual quality and bit rate increase.

Abstract:
Text-based person search aims to retrieve the most relevant pedestrian images from an image gallery based on textual descriptions. Most existing methods rely on two separate encoders to extract the image and text features, and then elaborately design various schemes to bridge the gap between image and text modalities. However, the shallow interaction between both modalities in these methods is still insufficient to eliminate the modality gap. To address the above problem, we propose TransTPS, a transformer-based framework that enables deeper interaction between both modalities through the self-attention mechanism in transformer, effectively alleviating the modality gap. In addition, due to the small inter-class variance and large intra-class variance in image modality, we further develop two techniques to overcome these limitations. Specifically, Cross-modal Multi-Granularity Matching (CMGM) is proposed to address the problem caused by small inter-class variance and facilitate distinguishing pedestrians with similar appearance. Besides, Contrastive Loss with Weakly Positive pairs (CLWP) is introduced to mitigate the impact of large intra-class variance and contribute to the retrieval of more target images. Experiments on CUHK-PEDES and RSTPReID datasets demonstrate that our proposed framework achieves state-of-the-art performance compared to previous methods.

Abstract:
Video captioning is a more challenging task compared to image captioning, primarily due to differences in content density. Video data contains redundant visual content, making it difficult for captioners to generalize diverse content and avoid being misled by irrelevant elements. Moreover, redundant content is not well-trimmed to match the corresponding visual semantics in the ground truth, further increasing the difficulty of video captioning. Current research in video captioning predominantly focuses on captioner design, neglecting the impact of content density on captioner performance. Considering the differences between videos and images, there exists another way to improve video captioning: leveraging concise and easily learned image samples to further diversify video samples. This modification to content density compels the captioner to learn more effectively against redundancy and ambiguity. In this article, we propose a novel approach called Image-Compounded learning for video Captioners (IcoCap) to facilitate better learning of complex video semantics. IcoCap comprises two components: the Image-Video Compounding Strategy (ICS) and Visual-Semantic Guided Captioning (VGC). ICS compounds easily learned image semantics into video semantics, further diversifying video content and prompting the network to generalize contents in a more diverse sample. Besides, when learning with samples compounded with image contents, the captioner is compelled to better extract valuable video cues in the presence of straightforward image semantics. This helps the captioner further focus on relevant information while filtering out extraneous content. Then, VGC guides the network in flexibly learning ground truth captions based on the compounded samples, helping to mitigate the mismatch between the ground truth and ambiguous semantics in video samples. Our experimental results demonstrate the effectiveness of IcoCap in improving the learning of video captioners. Applied to the widely used MSVD, MSR-VTT, and VATEX datasets, our approach achieves competitive or superior results compared to state-of-the-art methods, illustrating its capacity to handle redundant and ambiguous video data.

Abstract:
Although Siamese trackers have recently been gradually replaced by complicated and computationally expensive Transformer trackers, they are simpler and more applicable to real-world deployment. We believe that room for improvement still exists in the Siamese tracking framework and attribute the performance limitation to inadequate annotation, insufficient augmentation and suboptimal assignment, which heavily weakens the discriminative power. As Siamese methods directly inherit the assignment manner from object detection, they both face the imbalance between sparse annotated objects and dense background examples and between easy and hard negative examples. Moreover, existing augmentations and negative pairs are insufficient to simulate practical tracking ambiguity and failure cases. Nevertheless, work in the training vein is still overlooked. Therefore, we strive to yield a negative-driven training pipeline to unleash the potential of the Siamese framework without any extra inference cost. Specifically, 1) We devise strong negative augmentations based on random copy-paste to take full advantage of available annotations and generate more challenging tracking scenarios, especially negative examples. 2) We propose a semisupervised two-phase assignment that jointly utilizes existing annotations and model outputs to mine more appropriate and challenging negative examples. 3) We formulate a complementary reweighting loss by modifying the loss weight matrix to bridge subtasks and highlight the contributions of hard negative examples more smoothly. We choose several classic Siamese trackers to validate the pipeline effectiveness. After training, these trackers can gain, at most, a nearly 14% relative increase in performance, which is comparable to advanced Siamese trackers and even Transformer trackers. The experimental results indicate that the tracking-specific training pipeline is an efficient method for strengthening trackers and requires further development.

Abstract:
Contrastive self-supervised learning (CSL) based on instance discrimination typically attracts positive samples while repelling negatives to learn representations with pre-defined binary self-supervision. However, vanilla CSL is inadequate in modeling sophisticated instance relations, limiting the ability of the learned model to retain fine semantic structure. On the one hand, samples with the same semantic category are inevitably pushed away as negatives. On the other hand, differences among samples cannot be captured. In this paper, we present relation-aware contrastive self-supervised learning (ReCo) to integrate instance relations, i.e., global distribution relation and local interpolation relation, into the CSL framework in a plug-and-play fashion. Specifically, we align similarity distributions calculated between the positive anchor views and the negatives at the global level to exploit diverse similarity relations among instances. Local-level interpolation consistency between the pixel space and the feature space is applied to quantitatively model the feature differences of samples with distinct apparent similarities. Through explicit instance relation modeling, our ReCo avoids irrationally pushing away semantically identical samples and carves a well-structured feature space. Extensive experiments conducted on commonly used benchmarks show that our ReCo consistently yields remarkable performance improvements.
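The global-level alignment described above can be pictured with a small sketch. It assumes dot-product similarity, a softmax with a temperature, and KL divergence as the alignment measure; none of these choices is stated in the abstract, and all function names and toy embeddings are hypothetical.

```python
import math

def softmax(scores, temperature):
    """Turn similarity scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(view, negatives, temperature=0.1):
    """Distribution of one view's dot-product similarity over the negatives."""
    sims = [sum(a * b for a, b in zip(view, neg)) for neg in negatives]
    return softmax(sims, temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy unit-norm embeddings (hypothetical values, not from the paper).
view_a = [1.0, 0.0]
view_b = [0.8, 0.6]
negatives = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

p_a = similarity_distribution(view_a, negatives)
p_b = similarity_distribution(view_b, negatives)

# Assumed alignment term: small when both views relate to the negatives alike.
global_relation_loss = kl_divergence(p_a, p_b)
```

Under these assumptions the loss is zero when the two views induce the same similarity distribution over the negatives and grows as the distributions diverge.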

Abstract:
Gait recognition is a soft biotechnology to identify pedestrians observed from different camera views based on specific walking patterns. However, various dressing and wearing conditions bring great challenges to realistic gait recognition. Most existing methods take the holistic gait silhouette as input and focus on local areas through horizontal strip division or attention maps. We consider that this processing may contain mixed or incomplete information about multiple body parts, so that gait information is misused or underutilized. In this paper, we propose a parsing-guided framework for gait recognition, named GaitParsing, which explores human semantic parsing to dissect the human body into a set of specific and complete body parts. Correspondingly, a simple yet effective dual-branch feature extraction network is adopted to process holistic gait and distinct body parts. To maximize the use of highly discriminated gait frames, we propose a self-occlusion frame assessment to measure the self-occlusion in a gait sequence. Since there is no human parsing modality in current gait datasets, we further develop a general human parsing pipeline specifically tailored for gait datasets. This single training enables widespread application across various gait datasets. Extensive experiments with ablation analyses demonstrate competitive performance even in the most challenging conditions, e.g., Cloth-Changing (CC+5.9%). In particular, our model can be easily applied to existing methods and significantly outperforms the original architecture, even without much modification.

Abstract:
External knowledge has been widely applied in image captioning tasks to enrich the generated sentences. However, existing methods retrieve knowledge by considering only semantic relevance while ignoring whether it is useful for captioning. For example, when querying “person” in external knowledge, the most relevant concepts may be “wearing shirt” or “riding horse” statistically, which are not consistent with image contents and introduce noise to generated sentences. Intuitively, we humans can iteratively correlate visual clues with corresponding knowledge to distinguish useful clues from noise. Therefore, we propose an event-aware retrospective learning network for knowledge-based image captioning, which employs a retrospective validation mechanism on captioning models to align the retrieved knowledge with visual contents. This approach takes an event-aware perspective and helps select useful knowledge that corresponds to visual facts. To better align images and knowledge, 1) we design an event-aware retrieval algorithm that clusters word-centered knowledge into triplet-centered knowledge (i.e., from “<subject - predicate - object>” to “<triplet A> - edge - <triplet B>”), which provides an event context to facilitate knowledge retrieval and validation. 2) We revisit image contents to retrospectively validate retrieved knowledge by aligning the visual representation between knowledge and image. We summarize the visual characteristics of each knowledge event from the Visual Genome dataset to help learn which knowledge does not exist in the visual scene and should be discarded. 3) We adopt a dynamic knowledge fusion module that calibrates image and knowledge representations for sentence generation, which includes a knowledge-controlled gate unit that jointly calculates visual and semantic features in event-aware patterns.
Compared to current knowledge-based captioning methods, the proposed network retrospectively learns the visual facts by event-aware retrieval and knowledge-image visual alignment, which regularizes the knowledge-incorporated captioning with visual evidence. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our method. Ablation studies and visualization demonstrate the advantages of each component of the proposed model.

Abstract:
This paper explores semantic-aware representations for scoring figure skating videos. Most existing approaches to sports video analysis only focus on reasoning action scores based on visual input, limiting their ability to depict high-level semantic representations. Here, we propose a teacher-student-based network with an attention mechanism to realize an adaptive knowledge transfer from the semantic domain to the visual domain, which is termed semantics-guided network (SGN). Specifically, we use a set of learnable atomic queries in the student branch to mimic the semantic-aware distribution in the teacher branch, which is represented by the visual and semantic inputs. In addition, we propose three auxiliary losses to align features in different domains. With aligned feature representations, the adapted teacher is capable of transferring the semantic knowledge to the student. To verify the effectiveness of our method, we collect a new dataset, OlympicFS, for scoring figure skating. Besides action scores, OlympicFS also provides professional comments on actions for learning semantic representations. Evaluated on four challenging datasets, our method achieves state-of-the-art performance.

Abstract:
Scene graph generation is a significant and challenging task for scene understanding. Most existing methods are confined to the 2D space (i.e., images) or additionally use segmentation information, while neglecting the richer spatial and geometric information of 3D space. In this paper, we propose a novel method to generate scene graphs from 3D point clouds. Specifically, our model consists of three parts: a point feature extraction backbone, a box head, and a relation head. The feature extraction backbone extracts base features directly from raw point clouds, and the box head produces detected 3D bounding boxes. Final 3D scene graphs are obtained from the relation head, which takes the extracted features and 3D boxes as inputs. We also design a point RoI module which sequentially processes points inside 3D boxes with a bidirectional LSTM. To further leverage the geometric characteristics of point clouds, we propose a location attention module which learns the influence of relative locations between objects. We introduce the RelationScanNet dataset with densely annotated semantic and geometric relationships, which extends ScanNetV2, one of the most widely used datasets in 3D indoor scene understanding. We test the proposed method on the RelationScanNet dataset and the 3DSSG dataset. The results demonstrate the strength of our method.

Abstract:
Open-set action recognition (OSAR) aims to learn a recognition framework capable of both classifying known classes and identifying unknown actions in open-set scenarios. Existing OSAR methods typically reside in a data-driven paradigm, which ignores the rich semantics in both known and unknown categories. In fact, we humans have the capability of leveraging the captured semantic information, i.e., knowledge and experience, to incisively distinguish samples from known and unknown classes. Motivated by this observation, in this paper, we propose a Unified Semantic Exploration (USE) framework for recognizing actions in open-set scenarios. Specifically, we explore the explicit knowledge semantics by simulating the unknown classes with knowledge-guided virtual classes based on an external knowledge graph, which enables the model to simulate open-set perception during model training. Besides, we propose to learn the implicit data semantics by transferring the knowledge structure of action categories to the visual prototype space for semantic structure preservation. Extensive experiments on several action recognition benchmarks validate the effectiveness of our proposed method.

Abstract:
With the increasing popularity of Online Social Networks (OSNs), covert communication is rapidly shifting from lossless channels like email to lossy channels, specifically social networks. In response to this trend, robust adaptive steganography has emerged as a powerful technique for concealing information in lossy transport channels. Previous approaches have aimed to address the challenge of JPEG image compression during transmission by utilizing static compression-resistant domains, Syndrome-Trellis Codes (STC), and Error Correction Codes (ECC). However, reliance on a significant number of ECC check codes to ensure robustness could inadvertently affect security. In response to this challenge, we introduce the “Adaptive STC-ECC” strategy, which enhances security by minimizing the number of check codes without compromising robustness. We further improve the robustness by simulating the embedding process and strategically placing the wet point in unstable cover elements. Furthermore, we exploit the residual information between the pre-cover and cover images to adjust the distortion and accurately determine the direction of the dither modulation, thus improving the overall security. Extensive experiments have been conducted to evaluate the performance of our proposed approach, and the results demonstrate its superior robustness and security compared to existing state-of-the-art approaches.

Abstract:
The change in appearance is a great challenge for cloth-changing person re-identification. Existing methods tackle this challenge by learning the shape features of the human body; however, these features are easily affected by the human pose or camera perspective. Thus, the ability to learn and extract the invariant features of a person in varying conditions is crucial to overcome the above challenge. To address the issue of invariant feature extraction for cloth-changing person re-identification, a Pose-Guided Attention Learning (PGAL) framework is proposed in this paper. First, we introduce a human pose estimation network to remove the background effects and align the fine-grained key point features of the human body. Then, to fully exploit the available appearance information, we develop a Feature Enhancement Module (FEM) that improves the feature representation of non-key point regions of the human body through Multi-Head Self Attention. Finally, in order to adaptively learn the invariant features of the person, we construct an Attention Learning Module (ALM) to achieve automatic selection of multi-granularity features by utilizing three different loss functions. Compared with current popular methods on four cloth-changing person Re-ID datasets, the experimental results show the superiority of our method.

Abstract:
Temporal action localization (TAL) is a critical task in video understanding. Effectively utilizing multi-scale information and handling interactions across various scales have consistently posed challenges in TAL. In this article, we propose a novel gated multi-scale Transformer model (TransGMC) for temporal action localization. A gated control mechanism is designed to filter and aggregate the information at different scales, by which the contributions of contexts at different temporal scales are well characterized. To enhance the feature representation, rich global-local contexts are extracted at each temporal scale. A cascade attention module that seamlessly integrates channel attention and moment attention is proposed for capturing global temporal contexts. We utilize a new regression loss function for locating the time boundaries. We conducted experiments on four challenging benchmark datasets, including two third-person view datasets and two first-person view datasets. Our method achieves an average mAP of 67.5% on THUMOS14, 36.1% on ActivityNet v1.3, 24.9% on EPIC-Kitchens 100, and 23.2% on Ego4D, all of which outperform the previous state-of-the-art methods. Extensive ablation studies also validate the effectiveness of the proposed method.

Affiliations: National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen, China; School of Information Science and Engineering, Ningbo University, Ningbo, China; College of Management, Shenzhen University, Shenzhen, China; School of Computer and Control Engineering, Yantai University, Yantai, China

Abstract:
Nowadays, it is a common practice to retouch face images before sharing them on websites, social media, and even identification cards. In response, increased criticisms have appeared about taking photo retouching to an extreme. This naturally leads to the necessity of designing perceptual quality assessment methods that can measure how much a retouched face image has strayed from reality. However, such an issue has seldom been considered. In this paper, we conduct both subjective and objective studies to advance this field. Firstly, we construct a benchmark database (termed SZU-RFD) via subjective experiments. SZU-RFD consists of 200 high-quality images with Asian faces and 1,600 retouched images generated by three popular photo-editing tools under different settings. Secondly, considering that retouching usually distorts the image texture, we propose a novel no-reference (NR) quality assessment method, named TANet, for retouched face images by taking the textural artifact into account. Specifically, a texture enhancement module is embedded into the shallow layer to help the network focus on textural information, and a multi-task learning strategy is applied to improve the performance of the main task with the assistance of an auxiliary task, i.e., texture recognition. Extensive experiments on the constructed SZU-RFD show that our proposed TANet correlates well with subjective perceptual judgments and is superior to 19 mainstream NR image quality assessment methods in evaluating retouched face images.

Abstract:
Due to the excellent interpretability of non-negative matrix factorization (NMF), NMF-based multi-view clustering has attracted much attention for multi-media data analysis and processing. However, the existing clustering methods apply NMF directly to the data matrix, resulting in high computational complexity. Moreover, they are sub-optimal at exploiting the complementary information between views because they all measure the between-views error pixel by pixel. To tackle these problems, inspired by orthogonal NMF and anchor graphs, we present an efficient anchor graph factorization model with orthogonal, non-negative, and tensor low-rank constraints. We use an anchor graph instead of a data matrix to get an indicator matrix without post-processing, which remarkably reduces the computational complexity. To exploit the between-views complementary information well, we introduce tensor Schatten p-norm regularization on the third-order tensor composed of the soft label matrices of the views. The solution can be obtained by iteratively optimizing four decoupled sub-problems, which can be solved more efficiently with good convergence. Experimental results on six multi-view datasets show that our approach improves clustering performance while also improving efficiency.

Abstract:
Unsupervised 3D shape clustering is emerging as a promising research topic in the multimedia and computer vision fields. Considering the flexibility of acquiring multiple views for 3D shapes, this paper proposes a contrastive multi-view learning network (CMVL-Net) to cluster unlabeled 3D shapes from multiple views. To the best of our knowledge, this is the first multi-view-oriented 3D shape deep clustering method. The key to this method lies in how to capture highly discriminative 3D shape features suitable for clustering. By exploring consistency and complementarity among multiple views, a cross-view contrastive clustering mechanism is proposed to learn clustering-specified discriminative 3D shape features. To obtain a more compact 3D shape clustering structure, a consensus graph-guided contrastive constraint is designed to encourage cluster-wise consistency learning under the guidance of potential category associations among shapes. Experimental results on two widely used benchmark datasets demonstrate the effectiveness of the proposed method.

Abstract:
Most existing domain adaptation methods learn with both (labeled) samples in the source domain and (unlabeled) samples in the target domain. Relying on the availability of target domain samples, however, is not always feasible in real-world applications. In this article, we propose a new method to address this issue, in which the target domain samples do not need to be available for the task of interest. To improve the performance of such zero-shot domain adaptation (ZSDA), we not only learn from source samples in the task of interest but also seek additional assistance from dual-domain samples in an irrelevant task. To overcome the problems induced by the unavailability of target samples in the task of interest, we exploit the hypothesis that the domain correlation is consistent across tasks and learn to transfer it from the irrelevant task to the task of interest. Specifically, our method aims to learn a domain-invariant representation space in which the source-domain classifier is directly transferable to the target domain. We achieve this by restricting the two domains to share both inter-category structure and intra-category structure in the representation space. Experimental results on five benchmark datasets indicate that our proposed method significantly outperforms the existing representative baselines.

Abstract:
RGB-T semantic segmentation is a key technique for scene understanding in autonomous driving. Existing RGB-T semantic segmentation methods, however, do not effectively explore the complementary relationship between different modalities in the information interaction across multiple levels. To address this issue, the Context-Aware Interaction Network (CAINet) is proposed for RGB-T semantic segmentation, which constructs an interaction space to exploit auxiliary tasks and global context for explicitly guided learning. Specifically, we propose a Context-Aware Complementary Reasoning (CACR) module aimed at establishing the complementary relationship between multimodal features with the long-term context in both spatial and channel dimensions. Further, considering the importance of global contextual and detailed information, we propose the Global Context Modeling (GCM) module and Detail Aggregation (DA) module, and we introduce specific auxiliary supervision to explicitly guide the context interaction and refine the segmentation map. Extensive experiments on two benchmark datasets, MFNet and PST900, demonstrate that the proposed CAINet achieves state-of-the-art performance.

Abstract:
Task-conditional architectures offer advantages in parameter efficiency but fall short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to address this challenge. Our approach designs a Prompt-conditioned Transformer block, which incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder to further reduce parameter usage, which accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and maintains a good balance between performance and parameter size.

Affiliations: Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, Fuzhou University, Fuzhou, China; Faculty of Information Technology, the Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, the Beijing Laboratory of Smart Environmental Protection, the Beijing Key Laboratory of Computational Intelligence and Intelligent System, and the Beijing Artificial Intelligence Institute, Beijing University of Technology, Beijing, China

Abstract:
Due to the light-independent imaging characteristics, sonar images play a crucial role in fields such as underwater detection and rescue. However, the resolution of sonar images is negatively correlated with the imaging distance. To overcome this limitation, Super-Resolution (SR) techniques have been introduced into sonar image processing. Nevertheless, it is not always guaranteed that SR maintains the utility of the image. Therefore, quantifying the utility of SR reconstructed Sonar Images (SRSIs) can facilitate their optimization and usage. Existing Image Quality Assessment (IQA) methods are inadequate for evaluating SRSIs as they fail to consider both the unique characteristics of sonar images and reconstruction artifacts while meeting task requirements. In this paper, we propose a Perception-and-Cognition-inspired quality Assessment method for Sonar image Super-resolution (PCASS). Our approach incorporates a hierarchical feature fusion-based framework inspired by the cognitive process in the human brain to comprehensively evaluate SRSIs' quality under object recognition tasks. Additionally, we select features at each level considering visual perception characteristics introduced by SR reconstruction artifacts such as texture abundance, contour details, and semantic information to measure image quality accurately. Importantly, our method does not require training data and is suitable for scenarios with limited available images. Experimental results validate its superior performance.

Abstract:
Previous Few-Shot Segmentation (FSS) approaches exclusively utilize support features for prototype generation, neglecting the specific requirements of the query. To address this, we present the Query-guided Prototype Evolution Network (QPENet), a new method that integrates query features into the generation process of foreground and background prototypes, thereby yielding customized prototypes attuned to specific queries. The evolution of the foreground prototype is accomplished through a support-query-support iterative process involving two new modules: Pseudo-prototype Generation (PPG) and Dual Prototype Evolution (DPE). The PPG module employs support features to create an initial prototype for the preliminary segmentation of the query image, resulting in a pseudo-prototype reflecting the unique needs of the current query. Subsequently, the DPE module performs reverse segmentation on support images using this pseudo-prototype, leading to the generation of evolved prototypes, which can be considered as custom solutions. As for the background prototype, the evolution begins with a global background prototype that represents the generalized features of all training images. We also design a Global Background Cleansing (GBC) module to eliminate potential adverse components mirroring the characteristics of the current foreground class. Experimental results on the PASCAL-5^i and COCO-20^i datasets attest to the substantial enhancements achieved by QPENet over prevailing state-of-the-art techniques, underscoring the validity of our ideas.

Affiliations: Department of Computer Science, The University of Hong Kong, Hong Kong; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China

Abstract:
By leveraging deep neural networks, recent face swapping techniques have performed admirably in generating faces that maintain consistent identities. Nevertheless, while these methods accurately transfer source identities, they often struggle to preserve important attributes (such as head poses, expressions, and gaze directions) in the target faces. As a consequence, the current research in this domain has not resulted in satisfactory performance. In this article, we propose an efficient attribute-preserving framework for face swapping, called AP-Swap for short. Our approach incorporates two innovative modules designed specifically to preserve critical facial attributes. First, we propose a global residual attribute-preserving encoder (GRAPE), which adaptively extracts globally complete attribute features from target faces. Second, in addition to the regular network streams for the source and target facial images, we introduce a network stream that takes into account the facial landmarks of the target faces. This additional stream enables our landmark-guided feature entanglement module (LFEM), which efficiently preserves fine-grained facial attributes by conducting a landmark-based attribute-preserving (LBAP) operation. Through extensive quantitative and qualitative experiments, we demonstrate the superiority of AP-Swap over other state-of-the-art methods in terms of facial attribute preservation and model efficiency, along with satisfactory identity consistency performance.

Abstract:
Tensor-based multi-view clustering, which incorporates high-order correlations among views, has emerged as a promising research direction. These methods aim to capture intrinsic structure through a tensor-based constraint and then construct an affinity matrix. However, when constructing the affinity matrix, the negative entries in the coefficient matrices are forced to be positive via an absolute-value operation, which can inadvertently destroy the inherent relationships within the data. Furthermore, existing methods may lack the flexibility to effectively handle and fuse multiple views. To address these issues, we propose a novel approach called Tensorized Scaled Simplex Representation (TSSR) for multi-view clustering. First, TSSR leverages a low-rank tensor constraint to capture the consensus and complementary information among the views. Second, it introduces the scaled simplex representation, ensuring non-negative coefficient matrices, thus preserving inherent relationships and enhancing flexibility. Third, TSSR extends the scaling range of the affine constraint to capture authentic structural information. Finally, an auto-weighted strategy assigns ideal weights to diverse views, enabling them to contribute appropriately. We integrate these techniques into a unified framework solved by an iterative algorithm. Experimental results demonstrate that TSSR outperforms state-of-the-art methods in terms of performance and efficiency.
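To make the non-negativity point concrete, the sketch below projects a coefficient vector onto a scaled simplex, i.e., the set of non-negative vectors whose entries sum to a chosen scale. It uses the standard sort-based Euclidean projection as one generic way to enforce such a constraint; the abstract does not say how TSSR handles it, and `project_to_scaled_simplex` and the toy values are hypothetical.

```python
def project_to_scaled_simplex(v, scale=1.0):
    """Euclidean projection of v onto {c : c_i >= 0, sum(c) == scale}.

    Standard sort-based algorithm; `scale` must be positive.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - scale) / j
        if u_j - candidate > 0:
            theta = candidate  # keep the threshold of the last qualifying index
    return [max(x - theta, 0.0) for x in v]

# Toy coefficient vector with a negative entry (hypothetical values).
coefficients = [0.5, -0.2, 1.7]

# Taking absolute values would turn the negative entry into a positive affinity.
absolute_values = [abs(c) for c in coefficients]

# Projection keeps entries non-negative without flipping signs.
unit_simplex = project_to_scaled_simplex(coefficients, scale=1.0)
wider_simplex = project_to_scaled_simplex(coefficients, scale=2.0)
```

In this toy case the negative entry is mapped to zero rather than to a positive weight, and a larger scale lets more entries stay active.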

Abstract:
Conventional cameras face challenges when capturing motion information during the exposure due to their physical design, rendering the motion deblurring task ill-posed. To this end, we propose a Two-stage Residual-based Motion Deblurring (TRMD) framework for an event camera, which converts a blurry image into a sequence of sharp images, leveraging the abundant motion features encoded in events. In the first stage, a residual estimation network is trained to estimate the residual sequence, which measures the intensity difference between the intermediate frame and other frames sampled during the exposure. In the subsequent stage, the previously estimated residuals are combined with the blurry image to reconstruct the deblurred sequence based on the physical model of motion blur. To facilitate the efficient integration of image and event modalities for residual estimation, we propose a cross-modal fusion module based on spatial-channel attention, aiming to fuse the complementary spatial-temporal features of the two modalities. Extensive experiments demonstrate that our method outperforms current state-of-the-art approaches on the synthetic dataset GOPRO and produces superior visualization with less noise and fewer artifacts on the real blur event dataset REBlur.
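The second stage can be illustrated under a common blur formation assumption: the blurry image is the average of the sharp frames sampled during the exposure. The sketch below is not the authors' implementation; it takes the residuals as given (in TRMD they are estimated by a network from events), treats images as flattened pixel lists, and `reconstruct_sequence` is a hypothetical helper.

```python
def reconstruct_sequence(blurry, residuals):
    """Recover sharp frames from a blurry image and per-frame residuals.

    Assumptions (not taken from the paper):
      blurry       = mean of the sharp frames over the exposure
      residuals[i] = frame_i - intermediate_frame
    Hence intermediate_frame = blurry - mean(residuals).
    """
    n = len(residuals)
    num_pixels = len(blurry)
    mean_residual = [sum(r[p] for r in residuals) / n for p in range(num_pixels)]
    intermediate = [blurry[p] - mean_residual[p] for p in range(num_pixels)]
    return [[intermediate[p] + r[p] for p in range(num_pixels)] for r in residuals]

# Toy exposure: three sharp frames of two pixels each (hypothetical values).
sharp = [[1.0, 4.0], [2.0, 6.0], [6.0, 5.0]]
middle = sharp[1]

# Simulate the camera: average the frames to obtain the blurry image.
blurry = [sum(f[p] for f in sharp) / len(sharp) for p in range(2)]

# Ground-truth residuals stand in for the output of the first stage.
residuals = [[f[p] - middle[p] for p in range(2)] for f in sharp]

recovered = reconstruct_sequence(blurry, residuals)
```

With exact residuals the sharp sequence is recovered exactly, which suggests why the quality of residual estimation drives the final result in such a scheme.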

Abstract:
In long-tailed image instance segmentation, the existing methods deal with the imbalance problem from a single perspective, which limits their performance. Considering that imbalances exist not only between positive and negative classes, but also between foreground and background subclasses, as well as between hard and easy examples, we argue that the losses of samples should be hierarchically equalized at multiple levels (HEL). We first propose a focus based hierarchical-equalization loss (FHEL), which employs a class gradient ratio based reweighting mechanism to achieve the balance between classes, and uses a subclass-balance term and a sample-balance term to separately deal with the inter-subclass and inter-sample imbalances. FHEL can improve the performance of long-tailed instance segmentation in an end-to-end manner, avoiding the overfitting risk and manual hard division in the traditional methods. On the basis of FHEL, we further explore the relationship between inter-subclass imbalance and inter-sample imbalance, and propose a constrained-focus based hierarchical-equalization loss (CFHEL) that copes with the imbalances at multiple levels simultaneously with fewer hyperparameters. We conduct extensive experiments on the LVIS v1.0 and COCO-LT datasets with different benchmarks. Both FHEL and CFHEL are superior to the existing methods. On LVIS v1.0, with ResNet50 Mask R-CNN, ResNet101 Mask R-CNN, ResNeXt101 Mask R-CNN and ResNet101 Cascade Mask R-CNN, CFHEL outperforms its baselines with 19.8%, 18.5%, 21.6% and 21.2% AP_r gains, and with 6.7%, 6.6%, 7.6% and 6.5% AP gains, respectively, achieving a new state of the art. On COCO-LT, our CFHEL outperforms the baseline with 13.2% AP_r gains and 3.3% AP gains, also achieving the best performance.

Abstract:
The latest diffusion-based methods for many image restoration tasks outperform traditional models, but they suffer from long inference times. To tackle this, this paper proposes a Wavelet-Based Diffusion Model (WaveDM). WaveDM learns the distribution of clean images in the wavelet domain conditioned on the wavelet spectrum of degraded images after wavelet transform, which is more time-saving in each step of sampling than modeling in the spatial domain. To ensure restoration performance, a unique training strategy is proposed where the low-frequency and high-frequency spectrums are learned using distinct modules. In addition, an Efficient Conditional Sampling (ECS) strategy is developed from experiments, which reduces the number of total sampling steps to around 5. Evaluations on twelve benchmark datasets including image raindrop removal, rain streak removal, dehazing, defocus deblurring, demoiréing, and denoising demonstrate that WaveDM achieves state-of-the-art performance with efficiency that is comparable to traditional one-pass methods and over 100× faster than existing image restoration methods using vanilla diffusion models.
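
To make the efficiency argument concrete, the following sketch applies one level of a 2D Haar wavelet transform. The abstract does not say which wavelet WaveDM uses, so Haar is an assumption chosen for simplicity, and `haar_dwt2` is a hypothetical helper. The point is only that each resulting sub-band has half the height and half the width of the input, so a model operating on sub-bands processes smaller spatial maps.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four sub-bands (low-frequency band first, then three detail bands),
    each with half the height and half the width of the input.
    The 1/2 scaling makes the transform orthonormal.
    """
    height, width = len(image), len(image[0])
    low, detail_h, detail_v, detail_d = [], [], [], []
    for i in range(0, height, 2):
        row_low, row_h, row_v, row_d = [], [], [], []
        for j in range(0, width, 2):
            # The four pixels of one 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            row_low.append((a + b + c + d) / 2)  # local average
            row_h.append((a - b + c - d) / 2)    # difference across columns
            row_v.append((a + b - c - d) / 2)    # difference across rows
            row_d.append((a - b - c + d) / 2)    # diagonal difference
        low.append(row_low)
        detail_h.append(row_h)
        detail_v.append(row_v)
        detail_d.append(row_d)
    return low, detail_h, detail_v, detail_d


image = [[1.0, 1.0, 2.0, 2.0],
         [1.0, 1.0, 2.0, 2.0],
         [3.0, 3.0, 4.0, 4.0],
         [3.0, 3.0, 4.0, 4.0]]
low, detail_h, detail_v, detail_d = haar_dwt2(image)
```

For this piecewise-constant 4×4 image the low-frequency band is 2×2 and all three detail bands are zero, which also hints at why low- and high-frequency content can reasonably be handled by separate modules.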

Abstract:
Existing deep neural network (DNN)-based image fusion methods seldom consider low-rank priors for the decomposition of source images, which cannot efficiently model base and detail components in images. To exploit the low-rank priors better, we propose a deep rank-N decomposition network (DRDec-Net) according to the rank-N decomposition of source images. Specifically, a rank-N decomposition model is first established by imposing low-rank priors on the base component of source images. Then, based on the decomposition model, we construct DRDec-Net, which is composed of low-rank decomposition (LRD) modules, a detail fusion (DetailF) module, and a low-rank fusion (LRF) module. In DRDec-Net, it is assumed that source images share the same base component, which is expressed as the sum of rank-1 components. We employ N cascaded LRD modules to extract these rank-1 components from source images. Meanwhile, detail components are obtained by subtracting the base component from source images. Next, the extracted rank-1 components and detail components are integrated by LRF and DetailF modules to produce the base component and detail component of the fused image. Finally, the sum of the two obtained components is regarded as the fused image. Compared to some state-of-the-art methods, experimental results demonstrate that the proposed DRDec-Net can produce a better performance on three image fusion tasks, including infrared and visible images, multi-exposure images, and multi-focus images.
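
The decomposition idea (a base component expressed as a sum of rank-1 components, plus a detail component obtained by subtraction) can be sketched with classical linear algebra. The sketch below is only an analogy: it extracts rank-1 components from a single matrix by power iteration and deflation, whereas DRDec-Net uses learned LRD modules and a base component shared across source images. The names `rank1_component` and `decompose` are hypothetical.

```python
import math


def rank1_component(matrix, iterations=100):
    """Approximate the dominant rank-1 component of a matrix by power iteration.

    Limitation of this sketch: the all-ones start vector must not be orthogonal
    to the dominant right singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    u = [0.0] * rows
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            # Nothing left to extract.
            return [[0.0] * cols for _ in range(rows)]
        u = [x / norm for x in u]
        # v carries the singular value because u has unit length.
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def decompose(source, n):
    """Split source into a base (sum of n rank-1 components) and a detail part."""
    rows, cols = len(source), len(source[0])
    residual = [row[:] for row in source]
    base = [[0.0] * cols for _ in range(rows)]
    for _ in range(n):
        component = rank1_component(residual)
        for i in range(rows):
            for j in range(cols):
                base[i][j] += component[i][j]
                residual[i][j] -= component[i][j]
    # What remains after removing the base is the detail component.
    return base, residual


base, detail = decompose([[3.0, 4.0], [6.0, 8.0]], n=1)
```

Because the example matrix is itself rank-1, a single extracted component reproduces it and the detail component is numerically zero.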

Abstract:
Deep Neural Networks (DNNs) have primarily been demonstrated to be successful when large-scale labeled data are available. However, DNNs usually fail in few-sample learning scenarios, and the results are much worse when the limited data show large intra-class variation and inter-class similarity (a.k.a. fine-grained classification). To solve this challenging task, the idea of feature augmentation is revisited and better realized by exploiting the forward Euler method for solving ordinary differential equations (ODEs), and a novel high-order feature augmentation (HFA) model with ResNet is proposed. Specifically, the proposed method leverages the stacked residual structure to model the direction of feature change over the initial state, and uses the triplet loss as a constraint to model the step size of change in an adaptive manner. As a result, the initial features can be augmented by a residual structure with a forward Euler form to generate features of the same subcategory with a representation similar to that of the input image. Furthermore, the proposed augmentation mechanism enjoys two additional benefits: a) it can help avoid over-fitting when learning with insufficient training data; b) it can be used seamlessly with any residual structure-based classification network, and the ResNet used in this paper remains unchanged during testing. Extensive experiments are carried out on fine-grained visual categorization benchmarks, and the results demonstrate that our approach can significantly improve the categorization performance when the training data is highly insufficient.
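
The forward Euler form mentioned above updates a state as x_{k+1} = x_k + h · f(x_k), which has the same shape as a residual connection. A minimal sketch follows; `euler_augment` and the toy direction function are assumptions for illustration. In the paper the direction is modeled by stacked residual blocks and the step size is constrained by a triplet loss, neither of which is reproduced here.

```python
def euler_augment(feature, direction_fn, step_size, num_steps=1):
    """Apply forward Euler steps x <- x + step_size * direction_fn(x).

    feature:      list of floats (the initial feature vector)
    direction_fn: callable returning a direction vector of the same length
                  (a stand-in for a learned residual branch)
    step_size:    scalar step h (fixed here; adaptive in the paper)
    """
    x = list(feature)
    for _ in range(num_steps):
        direction = direction_fn(x)
        x = [xi + step_size * di for xi, di in zip(x, direction)]
    return x


# Toy direction: pull the feature toward the origin.
def shrink(x):
    return [-xi for xi in x]


augmented = euler_augment([1.0, 2.0], shrink, step_size=0.5, num_steps=2)
# Step 1 gives [0.5, 1.0]; step 2 gives [0.25, 0.5].
```

Stacking several such steps is what gives the augmentation its high-order flavor: each step moves the feature a little further along the modeled direction.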

Abstract:
Few-shot generative model adaptation aims to obtain an excellent model that generates high-quality and high-diversity images with a few training data. However, a small number of training samples often leads to overfitting of the model, which leads the generated images to lose generative diversity. Existing methods either fail to preserve structural information, leading to overfitting phenomena, or maintain too much structure in the source domain, failing to transfer styles well. To solve these problems, we propose an effective generative model adaptation method with style-guided prompt to balance generative diversity and style transformation. Firstly, by freezing the structure-related parameters of the pre-trained model, we preserve the robustness and diversity of the source domain's generative model, which helps to mitigate overfitting and maintain diversity in the generated images. Secondly, the proposed style-guided prompt method allows us to capture the target domain's style more naturally, facilitating more accurate and efficient style transfer in the generated images. Thirdly, the multi-layer deep contrastive loss is designed to further enhance the generated images' diversity and quality by preserving the generative diversity in the source domain without using extra target domain data. Extensive quantitative and qualitative experiments prove the effectiveness and superiority of our method.

Abstract:
Open-set semi-supervised learning (OSSL) provides a practical solution by filtering out-of-distribution (OOD) samples from unlabeled data, so that large amounts of unlabeled data can be relied upon in the semi-supervised setting. However, existing OSSL methods mainly focus on identifying in-distribution (ID) samples and discarding OOD samples, while failing to make full use of samples that cannot be clearly identified as either ID or OOD. Those samples are more likely to be hard samples, which should be carefully explored to boost the performance in the OSSL task. Hence, in this paper, we propose a novel framework, named Mutual Filter Teaching (MFT), where two networks are trained simultaneously to divide the unlabeled data into three parts: ID samples, OOD samples and hard samples. The samples are regarded as ID or OOD samples only if the two networks give consistent decisions according to the Mahalanobis distance between the unlabeled samples and their closest class prototypes. For those samples with inconsistent decisions, we treat them as hard samples and design an efficient mutual teaching scheme where the samples detected by only one network as positive samples are fed to its peer network for training. Furthermore, we propose to employ the prediction variance of the two networks to dynamically rectify the learning from hard samples. Experiments on multiple benchmark datasets demonstrate that our approach achieves state-of-the-art performance.
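
The three-way split can be sketched as follows. The abstract does not specify how each network turns a distance into a decision, so the single fixed threshold used here is an assumption, as are the function names and the diagonal-covariance simplification of the Mahalanobis distance.

```python
import math


def mahalanobis_diag(x, mean, variance):
    """Mahalanobis distance to a prototype, assuming a diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, variance)))


def split_unlabeled(dist_a, dist_b, threshold):
    """Split sample indices into ID, OOD and hard sets.

    dist_a[i], dist_b[i]: distance from sample i to its closest class prototype
    as measured by network A and network B.
    A sample is ID (or OOD) only when both networks agree; otherwise it is hard.
    """
    id_samples, ood_samples, hard_samples = [], [], []
    for index, (da, db) in enumerate(zip(dist_a, dist_b)):
        a_says_id = da <= threshold
        b_says_id = db <= threshold
        if a_says_id and b_says_id:
            id_samples.append(index)
        elif not a_says_id and not b_says_id:
            ood_samples.append(index)
        else:
            hard_samples.append(index)
    return id_samples, ood_samples, hard_samples


ids, oods, hard = split_unlabeled([0.5, 3.0, 0.5], [0.4, 2.5, 2.0], threshold=1.0)
```

Here sample 0 is accepted by both networks, sample 1 is rejected by both, and sample 2 is accepted by only one network, so it lands in the hard set that the mutual teaching scheme would then exploit.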

Abstract:
The susceptibility of deep neural networks (DNNs) to adversarial examples has raised significant concerns regarding the security and reliability of artificial intelligence systems. These examples contain maliciously crafted perturbations that are imperceptible to the human eye but can cause the model to make wrong predictions. Adversarial training (AT) is the de facto standard method for enhancing adversarial robustness. However, the improved robustness often comes at the cost of a significant drop in standard accuracy on clean samples. Numerous works have attempted to alleviate this trade-off by identifying its causes. A key factor lies in the variability of clean samples, which leads to different adversarial examples being generated using the same attack strategy. The other factor is the disruption of the underlying data structure caused by adversarial perturbations. To overcome these challenges, we propose a novel adversarial training framework named Hardness-Guided Sample-Dependent Adversarial Training (HGSD-AT), which dynamically adjusts the attack strategy based on the hardness of the current adversarial sample to further improve the robustness of the model. By utilizing two types of constraints, which are constructed from a temporal perspective and a spatial distribution perspective, our method directly learns the impact of attack methods on the model, rather than the indirect effects associated with sample distribution. This approach aims to improve the generation of adversarial examples while simultaneously enhancing the robustness and accuracy of DNNs. Our approach exhibits superior performance in terms of both robustness and natural accuracy compared to state-of-the-art defense methods, as validated through comprehensive experiments conducted on three benchmark datasets.

Abstract:
The acquisition of transparent 3D shapes will facilitate many multimedia and computer vision tasks, such as game/movie production and virtual environment applications. In this work, we propose a novel method for detailed reconstruction of transparent objects by exploiting polarimetric cues. Most existing transparent shape reconstruction methods lack sufficient constraints and suffer from the over-smoothing problem. Hence, we introduce polarization information as a complementary cue. Specifically, we employ an implicit representation of the object's geometry with a neural network, while the polarization renderer is capable of differentiably rendering the object's polarization images from a given illumination configuration. However, direct comparison of rendered polarization images to the real-world captured images will introduce additional errors due to the transmission in the transparent object. To make the polarimetric cues technically feasible for transparent shape reconstruction, the concept of reflection percentage, which represents the proportion of the reflection component, is introduced as the weight of the polarization loss. Based on a controllable environment setup, we build a polarization dataset containing several solid and smooth transparent objects to verify our method. Experimental results show that our method is capable of recovering detailed shapes and improving reconstruction quality of transparent objects.

Abstract:
Single-image super-resolution (SISR) has experienced vigorous growth with the rapid development of deep learning. However, handling arbitrary scales (e.g., integers, non-integers, or asymmetric) using a single model remains a challenging task. Existing super-resolution (SR) networks commonly employ static convolutions during feature extraction, which cannot effectively perceive changes in scales. Moreover, existing continuous-scale upsampling modules only utilize the scale factors, without considering the diversity of local features. To activate more information for better reconstruction, two plug-in and compatible modules for fixed-scale networks are designed to perform arbitrary-scale SR tasks. Firstly, we design a Scale-aware Local Feature Adaptation Module (SLFAM), which adaptively adjusts the attention weights of dynamic filters based on the local features and scales. It enables the network to possess stronger representation capabilities. Then we propose a Local Feature Adaptation Upsampling Module (LFAUM), which combines scales and local features to perform arbitrary-scale reconstruction. It allows the upsampling to adapt to local structures. Besides, deformable convolution is utilized to activate more information in the reconstruction, enabling the network to better adapt to the texture features. Extensive experiments on various benchmark datasets demonstrate that integrating the proposed modules into a fixed-scale SR network enables it to achieve satisfactory results with non-integer or asymmetric scales while maintaining advanced performance with integer scales.

Abstract:
Underwater images suffer from severe color distortion, due to the wavelength-dependent light attenuation and scattering. Various underwater image enhancement methods have been developed to improve the quality of degraded underwater images. However, contemporary approaches often overlook the impact of different scene colors on the overall process, potentially leading to undesired outcomes, such as enhanced images exhibiting excessive redness. In this paper, we observe that the color tones of degraded underwater images exhibit variability under the influence of different underwater targets and scenes. Each degraded color channel can be utilized to guide the color correction of other channels. Given this, a light-weight underwater color correction network, dubbed UCCNet, is presented to alleviate the issue of color corruption. In UCCNet, three parallel branches are designed to excavate the residual information within each color channel, subsequently leveraging these features to improve the quality of underwater images. Moreover, facing the challenge of effectively enhancing underwater images in diverse and complex scenes, the model UCCNet-KT is established based on UCCNet. In UCCNet-KT, the technology of knowledge transfer is designed to improve the generalization ability by enriching the dataset and constructing the loss function. Extensive experiments on various underwater datasets indicate the impressive performance of the UCCNet and UCCNet-KT qualitatively and quantitatively.

Abstract:
Semi-supervised multi-view learning is a remarkable but challenging task. Existing semi-supervised multi-view classification (SMVC) approaches mainly focus on performance improvement while ignoring decision reliability, which limits their deployment in safety-critical applications. Although several trusted multi-view classification methods have been proposed recently, they rely on manual annotations. Therefore, this work emphasizes trusted multi-view classification learning under semi-supervised conditions. Different from existing SMVC methods, this work jointly models class probabilities and uncertainties based on evidential deep learning to formulate view-specific opinions. Moreover, unlike previous works that explore cross-view consistency in a single schema, this work proposes a multi-level consistency constraint. Specifically, we explore instance-level consistency on the view-specific representation space and category-level consistency on opinions from multiple views. Our proposed trusted graph-based contrastive loss establishes the relationship between joint opinions and view-specific representations, which enables view-specific representations to enjoy a good manifold to improve classification performance. Overall, the proposed approach provides reliable and superior semi-supervised multi-view classification decisions. Extensive experiments demonstrate the effectiveness, reliability and robustness of the proposed model.

Abstract:
Vision-and-Language Navigation (VLN) requires that an agent comprehensively understand the given instructions and the immediate visual information obtained from the environment, so as to take correct actions to achieve the navigation goal. Therefore, semantic alignment across modalities is crucial for the agent to understand its own state during the navigation process. However, the potential of semantic alignment has not been systematically explored in current studies, which limits the further improvement of navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to develop the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks are used to reason about cross-modal dependencies, while the third one is able to learn semantically consistent representations across modalities. In this way, the Latent Semantic Alignment Learning method establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of our proposed methods. For example, we improve the Success Rate by 1.6% on the R2R validation unseen set and 4.3% on the R4R validation unseen set over the baseline model.

Abstract:
Although audio-visual representation has been proven to be applicable in many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory contents, remains challenging and uninvestigated. Considering the intrinsic alignment between the cadent movement of the dancer and music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework to perform the synchronization of music and dance rhythms both in explicit and implicit ways. Specifically, we derive the dance rhythms based on visual appearance and motion cues inspired by the music rhythm analysis. Then the visual rhythms are temporally aligned with the music counterparts, which are extracted by the amplitude of sound intensity. Meanwhile, we exploit the implicit coherence of rhythms implied in audio and visual streams by contrastive learning. The model learns the joint embedding by predicting the temporal consistency between audio-visual pairs. The music-dance representation, together with the capability of detecting audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin.

Abstract:
Occluded person re-identification is a challenging problem due to the interference of occluders in different camera views. Most existing paradigms focus on visible human body parts through some external models to reduce noise interference. However, the feature misalignment problem caused by discarded occlusions negatively affects the performance of the network. Different from most previous works that discard the occluded regions, we present Feature Completion Transformer (FCFormer) that reduces noise interference and complements missing features in occluded parts. Specifically, Occlusion Instance Augmentation is proposed to simulate real and diverse occlusion situations on the holistic image, which enlarges the occlusion samples in the training set and forms aligned occluded-holistic pairs. To reduce the interference of noise, a two-stream architecture is proposed to learn pairwise discriminative features from aligned image pairs, while obtaining self-aligned occluded-holistic feature level sample-label pairs without additional auxiliary models. To complement the features of occluded regions, a Feature Completion Decoder is designed to aggregate possible information from self-generated occluded features in a self-supervised manner. Further, in order to correlate the completion features with identity information, Feature Completion Consistency loss is introduced to enforce the distribution of the generated completion features to be consistent with the real holistic feature distribution. In addition, we propose the Cross Hard Triplet loss to further bridge the gap between completion features and extracted features under the same ID. Extensive experiments over five challenging datasets demonstrate that the proposed FCFormer achieves superior performance and outperforms the state-of-the-art methods by significant margins on the Occluded-Duke dataset.

Abstract:
Self-supervised video hashing methods retrieve large-scale video data without labels by making full use of visual and temporal information in original videos. Existing methods are not robust enough to handle small temporal differences between similar videos, because they ignore future unseen samples along the temporal dimension, which leads to large generalization errors. At the same time, existing self-supervised methods cannot preserve pairwise similarity information between large-scale unlabeled data efficiently and effectively. Thus, a self-supervised temporal sensitive video hashing (TSVH) method is proposed in this paper for video retrieval. TSVH uses a transformer-based autoencoder network with temporal sensitivity regularization to achieve low sensitivity to local temporal perturbations and preserve information of the global temporal sequence. The pairwise similarity between video samples is effectively preserved by applying a hashing-based affinity matrix in the method. Experiments on realistic datasets show that TSVH outperforms several state-of-the-art methods and classic methods.

Abstract:
Due to its wide applications, multimodal emotion recognition has gained increasing research attention. Although existing methods have achieved compelling success with various multimodal fusion methods, they overlook that the dominant modality (e.g., text) may cause a shortcut and hence negatively affect the representation learning of other modalities (e.g., image and audio). To alleviate this problem, we resort to knowledge distillation to narrow the gap between different modalities. In particular, we develop a new hierarchical knowledge distillation model for multi-modal emotion recognition (HKD-MER), consisting of three components: feature extraction, hierarchical knowledge distillation, and attentive multi-modal fusion. As the major contribution of our proposed model, the hierarchical knowledge distillation is designed to transfer the knowledge from the dominant modality to the others at both the feature and label levels. It boosts the performance of non-dominant modalities by modeling the inter-modal relation between different modalities. We have verified the effectiveness of our proposed model on two benchmark datasets.
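
Label-level distillation from a dominant modality to another modality can be illustrated with the standard temperature-softened KL divergence between the two branches' predictions. The abstract does not state which loss HKD-MER uses, so this formulation, the function names, and the temperature value are assumptions chosen as the most common form of knowledge distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    teacher_logits: predictions of the dominant-modality branch (e.g. text)
    student_logits: predictions of another branch (e.g. audio or image)
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


text_logits = [2.0, 0.5, -1.0]
audio_logits = [0.2, 0.1, 0.0]
loss = distillation_loss(text_logits, audio_logits)
```

The loss is zero when the two branches produce identical distributions and grows as the student's predictions drift away from the teacher's, which is the signal used to pull the weaker modality toward the dominant one.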

Affiliations: School of Cyberspace, Hangzhou Dianzi University, Hangzhou, China; Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China; Laboratory of Computer Science and Digital Society, University of Technology of Troyes, Troyes, France; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou Science and Technology Institute, Zhengzhou, China

Abstract:
Face manipulation techniques such as Deepfake have been widely used to create realistic faces, which raises growing concerns in the community. Current Deepfake detectors are mostly trained on clean, correctly labeled datasets, which usually results in reliably high detection accuracy. However, in real-world scenarios, labelers may mislabel the data, or malicious attackers may intentionally poison the training data with incorrect labels (a noisy label attack), leading to poor detection results. To overcome this issue, we propose a Deepfake detection framework that defends against noisy label attacks. Specifically, a Negative Sample Generator (NSG) utilizes the possibly-poisoned samples to generate label-reliable negative samples by simulating the blending artifacts caused by Deepfake. Next, a Noise-immune Contrastive Learner (NiCL) takes both positive and negative samples as training data, exploring blending artifacts and intrinsic forgery clues to filter the noisy samples out. Moreover, relying on label purification, the filtered noisy samples are further purified and then fed back to the feature extractor for subsequent model training. Extensive experiments on the benchmark datasets demonstrate the superiority of our proposed Deepfake detector. In particular, under noisy label attack, the proposed detector performs remarkably better than its competitors.

Abstract:
Multi-view clustering can explore consensus information from multiple views and has attracted increasing attention in the past two decades. However, existing works face two major challenges: i) how to deal with the conflict between learning view-consensus information and reconstructing inconsistent view-private information and ii) how to mitigate representation degeneration caused by implementing the consistency objective for multi-view data. To address these challenges, we propose a novel framework of self-weighted contrastive fusion for deep multi-view clustering (SCMVC). First, our method establishes a hierarchical feature fusion framework, effectively segregating the consistency objective from the reconstruction objective. Then, multi-view contrastive fusion is implemented by maximizing the consistency between the view-consensus representation and the global representation, fully exploring view consistency and complementarity. More importantly, we propose to measure the discrepancy between pairwise representations, and then introduce a self-weighting method, which adaptively strengthens useful views in feature fusion and weakens unreliable views, to mitigate representation degeneration. Extensive experiments on nine public datasets demonstrate that our proposed method achieves state-of-the-art clustering performance.

Abstract:
The significant success of machine learning models is mainly based on a large amount of data for training iterations, but this limits their generalization to few-shot data. Some existing models utilize the extensive visual and textual modal knowledge of vision-language pre-trained models (VLPs) to compensate for the data scarcity problem. However, they may suffer from a classification bias problem during the fusion of multi-modal information, since they focus on inter-modal matching while neglecting intra-modal recognition for few-shot images. In this paper, we propose a novel few-shot model with mixed-modal prototypes by partially tuning the VLPs for better information fusion. It aims to yield a high-quality class prototype representation by integrating the abundant multi-modal knowledge of VLPs and the task-specific information of low-shot visual data. Specifically, we introduce an image-text alignment module to ensure the consistency of the few-shot visual representation and the textual knowledge of VLPs in the feature space. A self-similar learning module is designed to excavate the local and detailed characteristics of each specific class, which is crucial under data scarcity. Additionally, to preserve the generalizable pre-trained knowledge to the maximum extent, we partially tune the parameters of VLPs to adapt to the few-shot tasks. To sum up, we mix multi-modal information at the feature representation level instead of fusing multi-modal matching similarities, which effectively mitigates classification bias and ultimately enhances the model performance on few-shot data. Extensive experiments are conducted to evaluate the effectiveness of our model on 11 benchmark datasets, and the results are promising.

Abstract:
Event prediction involves analyzing and forecasting events that occur at a specific time and location to inform decision-making and take the next actions. Current event prediction approaches primarily employ deep learning methods to analyze regular patterns from large amounts of historical data. However, predicting adversarial soccer events remains a significant challenge due to strong antagonism and complex relationships between players. With this consideration, we propose an object-attribute-relation (OAR) network for predicting soccer events using multimodal data, including spatiotemporal trajectory data and video data. The proposed scheme aims to enhance prediction performance by transforming multimodal data into an OAR space that integrates global and local relationships (adversarial information and multi-objective information). In particular, the scheme consists mainly of a relation module, an object attribute module, and a graph prediction module. We first use ConvLSTM to extract the visual features of players from video data and use LSTM to extract the movement features of players from spatiotemporal data. Additionally, we apply a multihead GRU attention mechanism to calculate the relation weights. These three components are then combined into an OAR graph of a clip in a soccer game. Finally, an OAR GNN is designed to determine the influence of different objects and predict events. The entire process constitutes an end-to-end event prediction learning framework. Extensive experimental results on the two challenging datasets, namely, soccER and SkillCorner, verify the effectiveness of the proposed framework.

Abstract:
Accurate pedestrian classification and localization have garnered significant attention due to their extensive use in various multimedia applications such as security monitoring and autonomous driving. We have observed that the commonly employed Intersection over Union (IoU) metric in many pedestrian detectors is susceptible to an inconsistent GT-Proposal assignment issue. This issue arises when spatially adjacent proposals, which have highly similar features, are assigned to distinct ground-truth boxes, leading to confusion during the training process and an increased number of false positives during inference. To address this challenge, our work presents a novel algorithm named Directional Assignment Strategy (DAS). Firstly, in conjunction with depth distribution, our approach transforms the assignment metric from a two-dimensional (2D) view into a three-dimensional (3D) space, enabling the optimization of the regression head under the constraint of depth direction. Secondly, in contrast to the conventional IoU-based one-to-one assignment of one proposal to one ground-truth box, our method aims to establish a more reasoned matching between sets of proposals and ground-truth boxes. By doing so, the detector is less reliant on the setting of a specific threshold. Leveraging this strategy as a plug-in module within state-of-the-art pedestrian detectors, we demonstrate a notable improvement in performance.
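
The inconsistent assignment issue described above can be reproduced with a plain IoU-based assignment. The sketch below implements only the conventional baseline (each proposal goes to the ground-truth box with the highest IoU), not DAS itself; the box coordinates and function names are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def assign_by_iou(proposals, gt_boxes):
    """Assign each proposal to the index of the ground-truth box with highest IoU."""
    return [max(range(len(gt_boxes)), key=lambda g: iou(p, gt_boxes[g]))
            for p in proposals]


# Two overlapping pedestrians and two spatially adjacent proposals.
gt_boxes = [(0, 0, 10, 20), (6, 0, 16, 20)]
proposals = [(2, 0, 12, 20), (4, 0, 14, 20)]
assignment = assign_by_iou(proposals, gt_boxes)
overlap_between_proposals = iou(proposals[0], proposals[1])
```

The two proposals overlap each other with an IoU of about 0.67, yet they are assigned to different ground-truth boxes (indices 0 and 1), which is the kind of ambiguity in crowded scenes that motivates a set-level, depth-aware assignment.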

Abstract:
Hashing has been studied extensively for large-scale image retrieval due to its efficient computation and storage. Deep hashing methods typically train models with category-balanced data and suffer from a serious performance deterioration when dealing with long-tailed training samples. Recently, several long-tailed hashing methods focus on this newly emerging field for practical purpose. However, existing methods still face challenges that fixed category centers with limited semantic information cannot effectively improve the discriminative ability of tail-category hash codes. To tackle the issue, we propose a novel method called Semantic-enhanced Proxy-guided Hashing in this paper. We leverage two sets of learnable category proxies in the feature space and the Hamming space respectively, which can describe category semantics by getting updated continuously along with the whole model via back-propagation. Based on this, we introduce the Mahalanobis distance metric to characterize relationships accurately and enhance the semantic representation of both proxies and samples concurrently, improving the hash learning process. Moreover, we capture the multilateral correlations between proxies and samples in the feature space and extend a hypergraph neural network to transfer semantic knowledge from proxies to samples in the Hamming space. Extensive experiments show that our method achieves the state-of-the-art performance and surpasses existing methods by 1.47%–7.56% MAP on long-tailed benchmarks, demonstrating the superiority of learnable category proxies and the effectiveness of our proposed learning algorithm for long-tailed hashing.
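
For readers unfamiliar with hashing-based retrieval, the sketch below shows the retrieval step that such methods ultimately rely on: real-valued outputs are binarized into hash codes and database items are ranked by Hamming distance to the query. The sign-threshold binarization and all names are assumptions for illustration; the learnable proxies, Mahalanobis metric, and hypergraph network of the proposed method are not modeled.

```python
def binarize(features):
    """Turn real-valued outputs into a binary hash code (1 if >= 0, else 0)."""
    return [1 if f >= 0 else 0 for f in features]


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest in Hamming space."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming_distance(query_code, database_codes[i]))


query = binarize([0.3, -1.2, 0.0, 2.0])          # -> [1, 0, 1, 1]
database = [[0, 1, 0, 0], [1, 0, 1, 0], [1, 0, 1, 1]]
ranking = rank_by_hamming(query, database)        # -> [2, 1, 0]
```

Because distances are computed on short binary codes, ranking is cheap in both computation and storage, which is why the discriminative quality of tail-category codes matters so much for long-tailed retrieval.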

Abstract:
Incomplete multi-view clustering primarily focuses on dividing unlabeled data with missing instances into corresponding categories, and has received intensive attention due to its superiority in real applications. Considering the influence of incomplete data, the existing methods mostly attempt to recover data by adding extra terms. However, for unsupervised methods, a simple recovery strategy will cause errors and outliers to accumulate, which degrades performance. Broadly, the previous methods have not taken the effectiveness of recovered instances into consideration, or cannot flexibly balance the discrepancies between recovered data and original data. To address these problems, we propose a novel method termed Manifold-based Incomplete Multi-view clustering via Bi-consistency guidance (MIMB), which flexibly recovers incomplete data among various views, and attempts to achieve bi-consistency guidance via reverse regularization. In particular, MIMB adds reconstruction terms to representation learning by recovering missing instances, which dynamically examines the latent consensus representation. Moreover, to preserve the consistency information among multiple views, MIMB implements a bi-consistency guidance strategy with reverse regularization of the consensus representation and proposes a manifold embedding measure for exploring the hidden structure of the recovered data. Notably, MIMB aims to balance the importance of different views, and introduces an adaptive weight term for each view. Finally, an optimization algorithm with an alternating iteration strategy is designed for final clustering. Extensive experimental results on 6 benchmark datasets confirm that MIMB obtains significantly superior results compared with several state-of-the-art baselines.

Abstract:
Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose a simple yet effective pre-training framework, namely PersonMAE, which incorporates two core designs into masked autoencoders to better serve the task of Person ReID. 1) PersonMAE generates two regions from the given image with RegionA as the input and RegionB as the prediction target. RegionA is corrupted with block-wise masking to mimic common occlusion in ReID and its remaining visible parts are fed into the encoder. 2) Then PersonMAE aims to predict the whole RegionB at both pixel level and semantic feature level. This encourages the pre-trained feature representations to have the three properties mentioned above. These properties make PersonMAE compatible with downstream Person ReID tasks, leading to state-of-the-art performance on four downstream ReID tasks, i.e., supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Notably, on the commonly adopted supervised setting, PersonMAE with ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous state-of-the-art by a large margin of +8.0 mAP and +5.3 mAP, respectively.

Abstract:
Despite the great progress of semantic segmentation with supervised learning, annotating large amounts of pixel-wise labels is very expensive and time-consuming. To this end, Unsupervised Semantic Segmentation (USS) has been proposed to learn semantic segmentation without any form of annotations. This approach involves dense prediction of semantics, which is challenging due to the unreliable nature of local representations. To solve this problem, we propose a new context-aware unsupervised semantic segmentation framework, which aims to enhance unsupervised semantic segmentation by leveraging contextual knowledge within and across images. In particular, we introduce a training strategy based on our Pyramid Semantic Guidance (PSG), which utilizes holistic semantics on pyramid views to guide pixel clustering with a siamese network-based framework. Additionally, we introduce a Context-Aware Embedding (CAE) module to fuse global features with low-level geometrical and appearance representations. We evaluate our method on the COCO-Stuff dataset and achieve competitive results compared to both convolutional and ViT-based USS methods. Specifically, we attain significant improvements of +4.5% and +5% mIoU for Stuff and all class segmentation respectively, compared to previous approaches that employ unsupervised convolutional backbones.

Abstract:
Single domain generalization (single-DG) is a realistic yet challenging domain generalization scenario where a model trained on a single domain generalizes well to multiple unseen domains. Unlike typical single-DG methods that are essentially supervised data augmentation and focus mainly on the novelty of images, we propose a simple adversarial augmentation method, termed Progressive Diversity Generation (PDG), to synthesize novel and diverse images in a fully unsupervised manner. Specifically, PDG minimizes the uncertainty coefficient to ensure that synthesized images are novel. By modeling conditional probabilities with an auxiliary network, we transfer the adversarial process from semantics to images, thus eliminating dependency on labels. To enhance diversity, we propose the f-diversity, a collection of correlation or similarity measures, to allow our model to generate potential images from diverse perspectives. The proposed architecture combines a multi-attribute generator with a progressive generation framework to improve model performance. PDG is an unsupervised and easy-to-implement method that solves single-DG with only synthesized (source) images. Extensive experiments on multiple single-DG benchmarks show that PDG achieves remarkable results and outperforms existing supervised and unsupervised methods by a large margin in single domain generalization.

Abstract:
Weakly-supervised temporal action localization aims to detect temporal intervals of actions in arbitrarily long untrimmed videos with only video-level annotations. Owing to label sparsity, learning action consistency is intractable. In this paper, we assume that frames with similar representations in a given video should be considered as the same action. To this end, we develop a query-based contrastive learning paradigm to ensure action-semantic consistency. This mechanism encourages normalized embeddings with the same class to be pulled closer together, while embeddings from different classes are repelled apart. Besides, we design a two-branch framework, consisting of a class-aware branch and a class-agnostic branch, to learn salient features and fine-grained clues respectively. To further guarantee the action-semantic consistency of the two branches, unlike previous methods that handle each branch independently, we model the relationship between the two branches to avoid unreasonable predictions. Finally, the proposed model demonstrates superior performance over existing methods on the publicly available THUMOS-14 and ActivityNet-1.3 datasets. Substantial experiments and ablation studies also demonstrate the effectiveness of our model.
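The pull-together and push-apart behaviour described above can be illustrated with an InfoNCE-style contrastive loss. This is a minimal sketch and not the authors' query-based formulation: the function names, the temperature value, and the single-query setup are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(query, embeddings, labels, query_label, temperature=0.1):
    """InfoNCE-style loss for one query (hypothetical helper).

    Embeddings whose label equals query_label are treated as positives;
    all other embeddings act as negatives.
    """
    q = l2_normalize(query)
    sims = [
        sum(a * b for a, b in zip(q, l2_normalize(e))) / temperature
        for e in embeddings
    ]
    # Numerically stable log-sum-exp over every candidate.
    top = max(sims)
    log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
    positives = [s for s, y in zip(sims, labels) if y == query_label]
    if not positives:
        raise ValueError("at least one positive embedding is required")
    # Average negative log-probability assigned to the positives.
    return -sum(s - log_denom for s in positives) / len(positives)


# Toy frames: the query matches the first embedding and not the second.
frames = [[1.0, 0.0], [0.0, 1.0]]
frame_labels = [0, 1]
aligned = contrastive_loss([1.0, 0.0], frames, frame_labels, query_label=0)
misaligned = contrastive_loss([1.0, 0.0], frames, frame_labels, query_label=1)
```

The loss is small when the query lies close to embeddings of its own class (`aligned`) and large when its positives lie far away (`misaligned`), which is the property the abstract relies on.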

Abstract:
Face anti-spoofing (FAS) plays a crucial role in securing face recognition systems against presentation attacks. However, existing FAS methods often struggle to generalize to unseen attacks and domains. Existing generalizable FAS studies generally leverage domain generalization (DG) techniques for exploiting intermediate features that support generalization while neglecting the task-specific nature of FAS. In this paper, we argue that the FAS task is an imbalanced classification problem, which renders it unsuitable to be handled by a standard discriminative classifier. In contrast, we propose a novel approach for FAS by modeling the problem from a generative perspective using an energy-based model (EBM). The EBM captures the distribution of genuine faces and detects spoofing attempts as deviations from this distribution. We train the EBM using a discriminative objective and an energy regularization term to shape the learned distribution and improve generalization. To enhance the robustness to unseen domains, we introduce an energy-based domain augmentation technique that explores the latent space around the source distribution guided by the EBM. We further leverage a meta-learning framework and a gradient-based variant to leverage the augmented data for domain generalization. For practicability, we consider a practical setting where samples are holistically collected under different environments without distinct domain labels, and show that our method can naturally harness this challenging setting by training with cluster labels. Extensive experiments on four FAS datasets demonstrate the superiority of our method in both intra- and cross-dataset settings, outperforming state-of-the-art approaches.

Abstract:
Active learning (AL) is designed to construct a high-quality labeled dataset by iteratively selecting the most informative samples. Such sampling heavily relies on data representation, while recently pre-training is popular for robust feature learning. However, as pre-training utilizes low-level pretext tasks that lack annotation, directly using pre-trained representation in AL is inadequate for determining the sampling score. To address this problem, we propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance for selecting diverse and instructive samples near the decision boundary. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. The diversity indicator constructs two feature spaces based on the pre-training pretext model and the downstream knowledge from annotation, by which it locates the neighbors of unlabeled data from the downstream space in the pretext space to explore the interaction of samples. With this mechanism, DOKT unifies the data relations of low-level and high-level representations to estimate traceback diversity. Next, in the uncertainty estimator, domain mixing is designed to enforce perceptual perturbing to unlabeled samples with similar visual patches in the pretext space. Then the divergence of perturbed samples is measured to estimate the domain uncertainty. As a result, DOKT selects the most diverse and important samples based on these two modules. The experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods and generalizes well to various application scenarios such as semantic segmentation and image captioning.

Abstract:
With the prevalence of electronic devices in our daily lives, content leakages frequently occur, and to enable leakage tracing, screen-shooting resistant watermarking has attracted tremendous attention. However, current studies often overlook a thoughtful investigation of the cross-media screen-camera process and fail to consider the effect of grayscale deviation on the screen. In this paper, we propose screen-shooting distortion simulation (SSDS), which involves a grayscale deviation function for constructing a more practical noise layer. We divide SSDS into screen displaying and camera shooting. For screen displaying, different viewing angles result in grayscale deviation with distinct intensities, and we simulate the distortions by modeling the relative position of the viewing point and the screen plane. For camera shooting, a series of distortion functions are used to approximate the perturbations in the camera pipeline, including defocus blur, noise and JPEG compression. Furthermore, the gradient-guided encoder is designed to conduct the embedding in the texture region using a modification cost map. Experimental results show that our proposed watermarking framework outperforms the state-of-the-art methods in terms of robustness and visual quality.

Abstract:
Cross-modal video retrieval aims to retrieve semantically relevant videos when given a textual query, and is one of the fundamental multimedia tasks. Most top-performing methods primarily leverage Vision Transformer (ViT) to extract video features (Lei et al., 2021), (Bain et al., 2021), (Wang et al., 2022). However, they suffer from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (e.g., 4 or 8) of frames from the target video (instead of using the whole video) as ViT inputs. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames yields better performance than using 4 frames but requires more computational resources, resulting in a trade-off. To escape this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model-level optimization process learns a cross-modal video retrieval model whose input includes the “compressed frames” learned by frame-level optimization. In turn, frame-level optimization is achieved through gradient descent using the meta loss of the video retrieval model computed on the whole video. We call this BOP method (as well as the “compressed frames”) the Meta-Optimized Frames (MOF) approach. By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in its actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation purposes, we conduct extensive cross-modal video retrieval experiments on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods and can achieve a new state-of-the-art performance.

Abstract:
Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, i.e., action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network (TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.

Abstract:
Object Re-identification (ReID) is a task focused on retrieving a probe object from a multitude of gallery images using a ReID model trained on a stationary, camera-free dataset. This training involves associating and aggregating identities across various camera views. However, when deploying ReID algorithms in real-world scenarios, several challenges, such as storage constraints, privacy considerations, and dynamic changes in camera setups, can hinder their generalizability and practicality. To address these challenges, we introduce a novel ReID task called Camera-Incremental Object Re-identification (CIOR). In CIOR, we treat each camera's data as a separate source and continually optimize the ReID model as new data streams come from various cameras. By associating and consolidating the knowledge of common identities, our aim is to enhance discrimination capabilities and mitigate the problem of catastrophic forgetting. Therefore, we propose a novel Identity Knowledge Evolution (IKE) framework for CIOR, consisting of Identity Knowledge Association (IKA), Identity Knowledge Distillation (IKD), and Identity Knowledge Update (IKU). IKA is proposed to discover common identities between the current identity and historical identities, facilitating the integration of previously acquired knowledge. IKD involves distilling historical identity knowledge from common identities, enabling rapid adaptation of the historical model to the current camera view. After each camera has been trained, IKU is applied to continually expand identity knowledge by combining historical and current identity memories. Market-CL and Veri-CL evaluations show the effectiveness of Identity Knowledge Evolution (IKE) for CIOR. Code: https://github.com/htyao89/Camera-Incremental-Object-ReID

Abstract:
While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet effective framework that acquires more explainable activation heatmaps and simultaneously increases the model performance, without the need for any extra supervision. Specifically, our concise framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning. The explanation consistency metric is utilized to measure the similarity between the model's visual explanations of the original samples and those of semantic-preserved adversarial samples, whose background regions are perturbed by using image adversarial attack techniques. Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations (i.e., low explanation consistency), for which the current model cannot provide robust interpretations. Comprehensive experimental results on various benchmarks demonstrate the superiority of our framework in multiple aspects, including higher recognition accuracy, greater data debiasing capability, stronger network robustness, and more precise localization ability on both regular networks and interpretable networks. We also provide extensive ablation studies and qualitative analyses to unveil the detailed contribution of each component.
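The reweighting idea above can be sketched in a few lines. The abstract does not give the exact similarity measure or weighting rule, so everything below is an assumption made for illustration: cosine similarity stands in for explanation consistency, the weight is taken as `1 + (1 - consistency)`, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened heatmaps (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def sample_weights(original_maps, adversarial_maps):
    """Give larger training weights to samples whose explanations change the most.

    Assumed rule: weight = 1 + (1 - consistency), rescaled so the mean weight is 1.
    """
    consistency = [
        cosine_similarity(o, a) for o, a in zip(original_maps, adversarial_maps)
    ]
    raw = [1.0 + (1.0 - c) for c in consistency]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Sample 0 keeps the same heatmap under perturbation; sample 1 does not.
orig = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
adv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights = sample_weights(orig, adv)
```

In this toy case the second sample, whose explanation shifts entirely under perturbation, receives twice the weight of the first.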

Abstract:
Image colorization is a challenging task due to its ill-posed and multimodal nature, leading to unsatisfactory results in traditional approaches that rely on reference images or user guides. Although deep learning-based methods have been proposed, they may not be sufficient due to the lack of semantic understanding. To overcome this limitation, we present an innovative end-to-end automatic colorization method that does not require any color reference images and achieves superior quantitative and qualitative results compared to state-of-the-art methods. Our approach incorporates a Multiscale Pyramid Transformer that captures both local and global contextual information and a novel attention module called Dual-Attention, which replaces the traditional Window Attention and Channel Attention with faster and lighter Separable Dilated Attention and Factorized Channel Attention. Additionally, we introduce a new color decoder called Color-Attention, which learns colorization patterns from grayscale images and color images of the current training set, resulting in improved generalizability and eliminating the need for constructing color priors. Experimental results demonstrate the effectiveness of our approach in various benchmark datasets, including high-level computer vision tasks such as classification, segmentation, and detection. Our method offers robustness, generalization ability, and improved colorization quality, making it a valuable contribution to the field of image colorization.

Abstract:
Domain generalization aims to reduce the vulnerability of deep neural networks in the out-of-domain distribution scenario. With the recent and increasing data privacy concerns, federated domain generalization, where multiple domains are distributed on different local clients, has become an important research problem and brings new challenges for learning domain-invariant information from separated domains. In this paper, we address the problem of federated domain generalization from the perspective of domain hallucination. We propose a novel federated domain hallucination learning framework, with no additional data exchange between clients other than model weights, based on the idea that a domain hallucination with enlarged prediction uncertainty for the global model is more likely to transform the samples into an unseen domain. These types of desired domain hallucinations are achieved by generating samples that maximize the entropy of the global model and minimize the cross-entropy of the local model, where the latter loss is further introduced to maintain the sample semantics. By training the local models with the learned domain hallucinations, the final model is expected to be more robust to unseen domain shifts. We perform extensive experiments on three object classification benchmarks and one medical image segmentation benchmark. The proposed method outperforms state-of-the-art methods on all the benchmarks, demonstrating its effectiveness.
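The sample-generation objective described above (maximize the entropy of the global model, minimize the cross-entropy of the local model) can be written as a single scalar to minimize. This is a sketch under stated assumptions: the trade-off parameter `weight`, the function names, and the use of raw logits as inputs are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true label."""
    return -math.log(probs[label])


def hallucination_objective(global_logits, local_logits, label, weight=1.0):
    """Scalar to minimize with respect to the generated sample (hypothetical helper).

    A low value means the local model still recognizes the label (semantics kept)
    while the global model is uncertain (sample looks like an unseen domain).
    """
    local_term = cross_entropy(softmax(local_logits), label)
    global_term = entropy(softmax(global_logits))
    return local_term - weight * global_term


local = [4.0, 0.0, 0.0]
uncertain_global = hallucination_objective([0.0, 0.0, 0.0], local, label=0)
confident_global = hallucination_objective([8.0, 0.0, 0.0], local, label=0)
```

With the local prediction held fixed, a sample on which the global model is uncertain scores lower (better) than one on which it is confident.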

Abstract:
Domain generalization (DG) aims to generalize the knowledge learned from multiple source domains to unseen target domains. Existing DG techniques can be subsumed under two broad categories, i.e., domain-invariant representation learning and domain manipulation. Nevertheless, it is extremely difficult to explicitly augment or generate the unseen target data. And when source domain variety increases, developing a domain-invariant model by simply aligning more domain-specific information becomes more challenging. In this article, we propose a simple yet effective method for domain generalization, named Knowledge Distillation based Domain-invariant Representation Learning (KDDRL), that learns domain-invariant representation while encouraging the model to maintain domain-specific features, which recently turned out to be effective for domain generalization. To this end, our method incorporates multiple auxiliary student models and one student leader model to perform a two-stage distillation. In the first-stage distillation, each domain-specific auxiliary student treats the ensemble of other auxiliary students' predictions as a target, which helps to excavate the domain-invariant representation. Also, we present an error removal module to prevent the transfer of faulty information by eliminating incorrect predictions compared to the true labels. In the second-stage distillation, the student leader model with domain-specific features combines the domain-invariant representation learned from the group of auxiliary students to make the final prediction. Extensive experiments and in-depth analysis on popular DG benchmark datasets demonstrate that our KDDRL significantly outperforms the current state-of-the-art methods.

Abstract:
In recent years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates.

Abstract:
Most existing state-of-the-art video classification methods assume that the training data obey a uniform distribution. However, video data in the real world typically exhibit an imbalanced long-tailed class distribution, resulting in a model bias towards head classes and relatively low performance on tail classes. While current long-tailed classification methods usually focus on image classification, adapting them to video data is not a trivial extension. To address these challenges, we propose an end-to-end multi-expert distribution calibration method based on two-level distribution information. The method jointly considers the distribution of samples in each class (intra-class distribution) and the overall distribution of diverse data (inter-class distribution) to solve the issue of imbalanced data under long-tailed distribution. By modeling the two-level distribution information, the model can jointly consider the head classes and the tail classes and significantly transfer the knowledge from the head classes to improve the performance of the tail classes. Extensive experiments verify that our method achieves state-of-the-art performance on the long-tailed video classification task.

Affiliations: Intelligent Visual Information Perception Laboratory, Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Communication and Information Engineering, Shanghai University, Shanghai, China; Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; State Key Laboratory of Internet of Things for Smart City, Faculty of Science and Technology, Department of Computer and Information Science, University of Macau, Macau, China; School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China; Research Center for Industries of the Future and the School of Engineering, Westlake University, Hangzhou, China

Abstract:
Recently, Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks due to their ability to extract global features. However, the need for Transformers to incorporate contextual information in order to extract features dynamically has been neglected. To address this issue, we propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixing CNN and Transformer. Specifically, in the CT block, we first propose a CNN-based Cross-Scale Information Aggregation Module (CIAM) to enable the model to better focus on potentially helpful information and improve the efficiency of the Transformer phase. Then, we design a novel Cross-receptive Field Guided Transformer (CFGT) to enable the selection of contextual information required for reconstruction by using a modulated convolutional kernel that understands the current semantic information and exploits the information interaction within different self-attention. Extensive experiments show that our proposed CFIN can effectively reconstruct images using contextual information, and that it strikes a good balance between computational cost and model performance as an efficient model.

Abstract:
Multi-modal images are required in a wide range of practical scenarios, from clinical diagnosis to public security. However, certain modalities may be incomplete or unavailable because of the restricted imaging conditions, which commonly leads to decision bias in many real-world applications. Despite the significant advancement of existing image synthesis techniques, learning complementary information from multi-modal inputs remains challenging. To address this problem, we propose an autoencoder-based collaborative attention generative adversarial network (ACA-GAN) that uses available multi-modal images to generate the missing ones. The collaborative attention mechanism deploys a single-modal attention module and a multi-modal attention module to effectively extract complementary information from multiple available modalities. Considering the significant modal gap, we further developed an autoencoder network to extract the self-representation of target modality, guiding the generative model to fuse target-specific information from multiple modalities. This considerably improves cross-modal consistency with the desired modality, thereby greatly enhancing the image synthesis performance. Quantitative and qualitative comparisons for various multi-modal image synthesis tasks highlight the superiority of our approach over several prior methods by demonstrating more precise and realistic results.

Abstract:
Visual retrieval tasks such as image retrieval and person re-identification (Re-ID) aim at effectively and thoroughly searching images with similar content or the same identity. After obtaining retrieved examples, re-ranking is a widely adopted post-processing step to reorder and improve the initial retrieval results by making use of the contextual information from semantically neighboring samples. Prevailing re-ranking approaches update distance metrics and mostly rely on inefficient crosscheck set comparison operations while computing expanded neighbors based distances. In this work, we present an efficient re-ranking method which refines initial retrieval results by updating features. Specifically, we reformulate re-ranking based on Graph Convolution Networks (GCN) and propose a novel Graph Convolution based Re-ranking (GCR) for visual retrieval tasks via feature propagation. To accelerate computation for large-scale retrieval, a decentralized and synchronous feature propagation algorithm which supports parallel or distributed computing is introduced. In particular, the plain GCR is extended for cross-camera retrieval and an improved feature propagation formulation is presented to leverage affinity relationships across different cameras. It is also extended for video-based retrieval, and Graph Convolution based Re-ranking for Video (GCRV) is proposed by mathematically deriving a novel profile vector generation method for the tracklet. Without bells and whistles, the proposed approaches achieve state-of-the-art performances on seven benchmark datasets from three different tasks, i.e., image retrieval, person Re-ID and video-based person Re-ID.
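Re-ranking by updating features rather than distances can be sketched as a simple neighbor-based propagation step. This is not the GCR formulation from the paper, only an illustration of the general idea: the k-nearest-neighbor graph, the similarity-weighted average, and the names `propagate` and `k` are assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def propagate(features, k=2, steps=1):
    """Refine each feature as a similarity-weighted sum of its k nearest neighbors.

    All features are updated from the previous iterate (synchronous update),
    so every row can be computed independently and in parallel.
    """
    feats = [normalize(f) for f in features]
    for _ in range(steps):
        updated = []
        for f in feats:
            sims = [dot(f, g) for g in feats]
            # Top-k neighbors by cosine similarity (the sample itself is included).
            top = sorted(range(len(feats)), key=lambda j: sims[j], reverse=True)[:k]
            agg = [
                sum(max(sims[j], 0.0) * feats[j][d] for j in top)
                for d in range(len(f))
            ]
            updated.append(normalize(agg))
        feats = updated
    return feats


# Two loose clusters: samples 0 and 1 are similar, samples 2 and 3 are similar.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
before = [normalize(f) for f in gallery]
after = propagate(gallery, k=2, steps=1)
```

After one step, features within a cluster move closer together while the clusters stay apart, so ranking by the refined features tends to favor samples that share neighbors with the query.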

Abstract:
Owing to its inherently dynamic nature and economical training cost, offline reinforcement learning (RL) is typically employed to implement an interactive recommender system (IRS). A crucial challenge in offline RL-based IRSs is the data sparsity issue, i.e., it is hard to mine user preferences well from the limited number of user-item interactions. In this article, we propose a knowledge-enhanced causal reinforcement learning model (KCRL) to mitigate data sparsity in IRSs. We make technical extensions to the offline RL framework in terms of the reward function and state representation. Specifically, we first propose a group preference-injected causal user model (GCUM) to learn user satisfaction (i.e., reward) estimation. We introduce beneficial group preference information, namely, the group effect, via causal inference to compensate for incomplete user interests extracted from sparse data. Then, we learn the RL recommendation policy with the reward given by the GCUM. We propose a knowledge-enhanced state encoder (KSE) to generate knowledge-enriched user state representations at each time step, which is assisted by a self-constructed user-item knowledge graph. Extensive experimental results on real-world datasets demonstrate that our model significantly outperforms the baselines.

Abstract:
Low-light image enhancement aims to recover normal-light images from images captured in dim environments. Most existing methods only improve the light appearance globally while failing to handle other degradations such as dense noise, color offset and extremely low light. Moreover, unsupervised methods proposed in recent years lack a reliable physical model as their basis, so their universality is greatly limited. To address these problems, we propose a novel low-light image enhancement method via a Retinex-inline cycle-consistent generative adversarial network named Cycle-Retinex, whose training depends entirely on unpaired datasets. Specifically, we organically combine Retinex theory with CycleGAN, by which we decouple the low-light image enhancement task into two sub-tasks, i.e., illumination map enhancement and reflectance map restoration. Retinex theory helps CycleGAN simplify the low-light image enhancement problem, and CycleGAN provides synthetic paired images to guide the training of the Retinex decomposition network. We further introduce a self-augmented method to address the color distortion and noise problems, thus making the network learn to enhance low-light images adaptively. Extensive experiments show that the proposed method can achieve promising results.

Abstract:
Visual commonsense reasoning (VCR) is a challenging reasoning task that aims not only to answer a question based on a given image but also to provide a rationale justifying the choice. Graph-based networks are appropriate to represent and extract the correlation between image and language for reasoning, where how to construct and learn graphs based on such multi-modal Euclidean data is a fundamental problem. Most existing graph-based methods view visual regions and linguistic words as identical graph nodes, ignoring the inherent characteristics of multi-modal data. In addition, these approaches typically have only one graph-learning layer, and the performance declines as the model goes deeper. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model intra-modal and inter-modal correlations in VCR simultaneously, where additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word orders for a more reasonable graph representation. Then, a structure-injecting graph transformer is designed to inject embedded structure priors into the semantic correlation matrix for the evolution of node features and structure representations, which makes it possible to stack more layers, deepen the model, and extract more powerful features with instructive priors. To adaptively fuse graph features, a scored pooling mechanism is further developed to select valuable clues for reasoning from the learnt node features. Experiments demonstrate the superiority of the proposed MSGT framework compared with state-of-the-art methods on the VCR benchmark dataset.

Affiliations: College of Computer and Information Science, College of Software, Southwest University, Chongqing, China; School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China; School of Computer Science and Technology, Yangzhou University, Yangzhou, China; Faculty of Information Technology, Monash University, Clayton, VIC, Australia; School of Big Data and Computer Science, Guizhou Normal University, Guiyang, China; School of Computer Science, Sichuan University, Chengdu, China

Abstract:
The application of Auto-Encoder (AE) to multi-view representation learning has gained traction due to advancements in deep learning. While some current AE-based multi-view representation learning algorithms incorporate the geometric structure of the input data into their feature representation learning process, their use of a shallow structured graph regularization term can be restrictive when used in conjunction with deep models. Furthermore, current multi-view representation learning algorithms do not fully utilize the diversity and consistency presented in different views, leading to a reduction in the efficacy of feature learning. This paper introduces a novel approach, reconstructed graph constrained auto-encoders (RGCAE), for multi-view representation learning. Unlike existing methods, our approach incorporates deep adaptive graph regularization based on multi-layer perceptron to ensure the preservation of the geometric similarity graph, which is constructed based on the local invariance principle. By decoupling the feature representation learning from the preservation of the geometric structure among different views, our approach can better leverage the diversity presented in multi-view data. We obtain view-specific representations that preserve the geometric structure and then combine them by averaging to obtain a common representation. To ensure the consistency of the multi-view data, we minimize the loss between the view-specific and common representations. Consequently, our RGCAE approach can maintain the geometric structure of multi-view data and is better suited for integration with deep models. Extensive experiments on six datasets demonstrate that RGCAE obtained promising performance, compared with the state-of-the-art methods.
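
The fusion step described in this abstract (averaging view-specific representations into a common one, then penalizing the gap between each view and the common representation) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the squared-error form of the consistency term is assumed, since the abstract does not specify the exact loss.

```python
def common_representation(views):
    """Average the view-specific vectors element-wise."""
    n_views = len(views)
    dim = len(views[0])
    return [sum(v[i] for v in views) / n_views for i in range(dim)]


def consistency_loss(views):
    """Mean squared distance between each view vector and the common vector.

    The squared-error form is an assumption made for this sketch.
    """
    common = common_representation(views)
    total = 0.0
    for v in views:
        total += sum((a - b) ** 2 for a, b in zip(v, common))
    return total / len(views)


# Two views of one sample, each a 2-dimensional representation.
views = [[1.0, 2.0], [3.0, 4.0]]
common = common_representation(views)
loss = consistency_loss(views)
```

The loss is zero exactly when all view-specific representations coincide, which is the sense in which minimizing it encourages consistency across views.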

Abstract:
Learning-based feature matching methods have been widely studied in recent years. The core issue in learning feature matching is how to learn (1) discriminative representations for feature points (or regions) within each image and (2) consensus representations for feature points across images. Recently, self- and cross-attention models have been exploited to address this issue. However, in many scenes, the features are large in scale, redundant, and contaminated by outliers. Previous self-/cross-attention models generally conduct message passing on all primal features, which leads to redundant learning and high computational cost. To mitigate these limitations, inspired by recent seed matching methods, in this article, we propose a novel efficient Anchor Matching Transformer (AMatFormer) for the feature matching problem. AMatFormer has two main aspects: First, it mainly conducts self-/cross-attention on some anchor features and leverages these anchor features as a message bottleneck to learn the representations for all primal features. Thus, it can be implemented efficiently and compactly. Second, AMatFormer adopts a shared FFN module to further embed the features of two images into a common domain and thus learn the consensus feature representations for the matching problem. Experiments on several benchmarks demonstrate the effectiveness and efficiency of the proposed AMatFormer matching approach.

Abstract:
Cross-modal retrieval aims to retrieve relevant content in one modality given a query in another modality. The biggest difficulty is how to bridge the heterogeneous gap between different modalities. Commonly used methods tend to focus on exploiting individual image-text pairs and mining the relations of cross-modality data therein, but ignore the role of multi-sample correlation. Moreover, the more global, structural inter-pair knowledge contained in the training dataset is under-used. To fully exploit graph-structured semantics and mine the semantic information in the dataset for learning discriminative representations, we propose the Weighted Graph-structured Semantics Constraint Network (WGSCN), a unified, graph-based, semantic-constrained learning framework, in which a GCN is used to mine comprehensive relation information from cross-modality data. Our main inspiration is to design a novel two-branch GCN-based Cross-modal Semantic Encoding (GCSE) module to produce semantic embeddings with both modality-specific and modality-shared correlation. Moreover, a GAN-based dual learning (GDL) approach is used to further improve the discriminability and model the joint distribution across different modalities. Our proposed GDL uses semantic embeddings as a supervisory signal to make the common representation semantically discriminative, while adversarial learning and dual learning are used to make the common representation modality-invariant. Through comparative experiments on five commonly used cross-modal datasets, we show the superior retrieval accuracy of our WGSCN.

Abstract:
Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details well because they are limited by the used clothing-agnostic person representation without referring to the limb textures of the person image. To address these problems, we propose Limb-aware Virtual Try-on Network named PL-VTON, which performs fine-grained clothing warping progressively and generates high-quality try-on results with realistic limb details. Specifically, we present Progressive Clothing Warping (PCW) that explicitly models the location and size of in-shop clothing and utilizes a two-stage alignment strategy to progressively align the in-shop clothing with the human body. Moreover, a novel gravity-aware loss that considers the fit of the person wearing clothing is adopted to better handle the clothing edges. Then, we design Person Parsing Estimator (PPE) with a non-limb target parsing map to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and body regions. Finally, we introduce Limb-aware Texture Fusion (LTF) that focuses on generating realistic details in limb regions, where a coarse try-on result is first generated by fusing the warped clothing image with the person image, then limb textures are further fused with the coarse result under limb-aware guidance to refine limb details. Extensive experiments demonstrate that our PL-VTON outperforms the state-of-the-art methods both qualitatively and quantitatively.

Abstract:
Learned video compression has recently emerged as an essential research topic in developing advanced video compression technologies, where motion compensation is considered one of the most challenging issues. In this article, we propose a learned video compression framework via a heterogeneous deformable compensation strategy (HDCVC) to tackle the problem of unstable compression performance caused by single-size deformable kernels in the downsampled feature domain. More specifically, instead of utilizing optical flow warping or single-size-kernel deformable alignment, the proposed algorithm extracts features from the two adjacent frames to estimate content-adaptive heterogeneous deformable (HetDeform) kernel offsets. Then we align the features extracted from the reference frames with the HetDeform convolution to accomplish motion compensation. Moreover, we design a Spatial-Neighborhood-Conditioned Divisive Normalization (SNCDN) to reduce spatial statistic dependencies and achieve more effective data Gaussianization combined with the Generalized Divisive Normalization. Furthermore, we propose a multi-frame enhanced reconstruction module for exploiting context and temporal information for final quality enhancement. Experimental results indicate that HDCVC outperforms recent state-of-the-art learned video compression approaches.

Abstract:
Image-text retrieval, as a fundamental task in the cross-modal field, aims to explore the relationship between visual and textual modalities. Recent methods address this task only by learning the conceptual and syntactical correspondences between cross-modal fragments, but these correspondences inevitably contain noise without considering external knowledge. To solve this issue, we propose a novel Commonsense-Guided Semantic and Relational Consistencies (CSRC) for image-text retrieval that can simultaneously expand the semantics and relations to reduce the cross-modal differences under the assumption that the semantics and relations of the true image-text pair should be consistent between two modalities. Specifically, we first explore commonsense knowledge to expand the specific concepts for visual and textual graphs and optimize the semantic consistency by minimizing the differences in cross-modal semantic importance. Then, we extend the same relations for cross-modal concept pairs with semantic consistency, which serves to implement relational consistency. After that, we combine external commonsense knowledge with internal correlation to enhance concept representation and further optimize relational consistency by regularizing the importance differences between association-enhanced concepts. Extensive experimental results on two popular image-text retrieval datasets demonstrate the effectiveness of our proposed method.

Abstract:
Deep hashing integrates the advantages of deep learning and hashing technology, and has become the mainstream approach in large-scale image retrieval. However, when training deep hashing models, most existing approaches regard the similarity margin of image pairs as a constant. Once the similarity distance exceeds the fixed margin, the network will not learn anything, which easily results in model collapse. In this paper, we address this dilemma with a novel unified deep hashing framework, termed Deep Neighborhood Structure-preserving Hashing (DNSH), to generate similarity-preserving and discriminative hash codes. Specifically, by extracting the discriminative object characteristics with large variances, we design an adaptive margin quadruplet loss to further explore the underlying similarity relationship between image pairs, reflecting the correct semantic structure among its neighbors. Based on the quadruplet form, we develop a quadruplet regularization to decrease quantization errors between binary-like embeddings and hash codes. Furthermore, by jointly learning bit balance and bit independence terms, we present the binary code constraint loss to alleviate redundancy in different bits. Extensive evaluations on four popular benchmark datasets demonstrate that our proposed deep hashing framework achieves better performance than the comparison methods.
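
The fixed-margin problem mentioned in this abstract can be seen in a small numeric example. The sketch below uses a standard triplet hinge loss as a simpler stand-in (it is not the adaptive margin quadruplet loss of DNSH, whose form the abstract does not give): once the negative is farther than the positive by more than the constant margin, both the loss and its gradient are zero, so that sample contributes no learning signal.

```python
def triplet_hinge(d_pos, d_neg, margin):
    """Standard triplet hinge loss: max(0, margin + d_pos - d_neg)."""
    return max(0.0, margin + d_pos - d_neg)


def triplet_hinge_grad(d_pos, d_neg, margin):
    """Gradient of the loss with respect to (d_pos, d_neg).

    Both components are zero once the margin is satisfied.
    """
    if margin + d_pos - d_neg > 0:
        return (1.0, -1.0)
    return (0.0, 0.0)


# Margin violated: the sample still produces a training signal.
active_loss = triplet_hinge(1.0, 1.5, margin=1.0)
active_grad = triplet_hinge_grad(1.0, 1.5, margin=1.0)

# Margin satisfied: loss and gradient vanish, so nothing is learned.
saturated_loss = triplet_hinge(1.0, 3.0, margin=1.0)
saturated_grad = triplet_hinge_grad(1.0, 3.0, margin=1.0)
```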

Abstract:
In recent years, conditional image synthesis has attracted growing attention due to its controllability in the image generation process. Although recent works have achieved realistic results, most of them have difficulty handling fine-grained styles with subtle details. To address this problem, a novel normalization module, named Detailed Region-Adaptive Normalization (DRAN), is proposed. It adaptively learns both fine-grained and coarse-grained style representations. Specifically, we first introduce a multi-level structure, Spatiality-aware Pyramid Pooling, to guide the model to learn coarse-to-fine features. Then, to adaptively fuse different levels of styles, we propose Dynamic Gating, making it possible to adaptively fuse different levels of styles according to different spatial regions. Finally, we collect a new makeup dataset (Makeup-Complex dataset) that contains a wide range of complex makeup styles with diverse poses and expressions. To evaluate the effectiveness and show the general use of our method, we conduct a set of experiments on makeup transfer and semantic image synthesis. Quantitative and qualitative experiments show that equipped with DRAN, simple baseline models are able to achieve promising improvements in complex style transfer and detailed texture synthesis.

Abstract:
Numerous photos are taken in daily life, and sorting them is laborious and time-consuming. The large number of similar images exacerbates the difficulty of album management, which motivates serial photo selection (SPS). As an important branch of image aesthetic quality assessment, it focuses on identifying the best image among a series of almost identical photos. Currently, most existing SPS methods focus only on extracting features from the original image, while neglecting the fact that multiple views of the image can provide much more detailed aesthetic information. In this article, we propose a Siamese network structure called SPSNet to enhance the representation learning of multi-view features by acquiring the depth, generic, and handcrafted features of images. Specifically, we implement a parallel structure to extract deep and shallow features, fusing local and global representations at different resolutions interactively. The aggregation of multiple views of an image via a self-attentive module with adaptive weights enables the model to discriminate the importance of each view. Moreover, we employ a graph neural network to construct the relationships among the multi-view features. Our proposed method, which is trained with a Siamese network, can effectively distinguish the nuances of similar images and thus select the best one from a series of almost identical photos. Extensive experiments conducted on the aesthetic dataset demonstrate that our method outperforms other state-of-the-art SPS methods, achieving 75.36% accuracy on the Phototriage dataset. Besides, our model is up to 3.04% better than the baseline methods in terms of average accuracy.

Abstract:
With the emergence of large pretrained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes for beneficial information for downstream tasks from the general knowledge stored in the pretrained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompts from the language side. However, tuning the text prompt alone can only adjust the synthesized “classifier”, while the computed visual features of the image encoder cannot be affected, thus leading to suboptimal solutions. In this article, we propose a novel dual-modality prompt tuning (DPT) paradigm through learning text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a class-aware visual prompt tuning (CAVPT) scheme is further proposed in our DPT. In this scheme, the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings to encode both the downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method.
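
The cross attention mentioned in this abstract can be illustrated with plain scaled dot-product attention. The sketch below is an assumption-laden toy: it takes text prompt features as queries and image patch tokens as keys and values (the abstract does not state which side plays which role), and it omits the learned projection matrices a real attention layer would have. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# One text prompt feature (query) attending over two image patch tokens.
text_features = [[0.0, 0.0]]
patch_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_prompt = cross_attention(text_features, patch_tokens, patch_tokens)
```

With an all-zero query every patch receives the same weight, so the output is simply the mean of the patch tokens; a non-zero query shifts the weights toward the patches it aligns with.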

Abstract:
Video object detection has attracted increasing attention in recent years. Although great success has been achieved by off-the-shelf video object detection methods through delicately designing various types of feature aggregation, they overlook the class-aware supervision and thus still suffer from the problem of classification incapability, which means the classification between objects with deteriorated or similar appearances is error-prone. In this article, we propose a novel class-aware dual-supervised aggregation network (CDANet) for video object detection, including three substantial improvements to effectively alleviate the classification incapability problem of previous methods. First, we develop a class-aware cross-modality distillation supervision that transfers the semantic knowledge of label data to the features of video data, effectively enhancing the semantic representations of features. Second, we design a graph-guided feature aggregation module that effectively models the structural relations between features by leveraging the dynamic residual graph convolutional network, enabling our CDANet to perform more effective feature aggregation in the temporal domain. Third, we present a class-aware proposal contrastive supervision to maximize the intra-class agreement and inter-class disagreement, which is conducive to improving the semantic discriminability of features. The class-aware dual supervision and feature aggregation are tightly tied into a unified end-to-end framework to make our CDANet fully exploit class-specific semantic knowledge and inter-frame temporal dependencies to enhance object appearance representations, which facilitates the classification of detected objects. We conduct experiments on the challenging ImageNet VID dataset, and the results demonstrate the superiority of our CDANet against state-of-the-art methods. More remarkably, our CDANet achieves 85.4% mAP with ResNet-101 or 86.5% mAP with ResNeXt-101.

Abstract:
Video anomaly detection, which aims to detect abnormal behaviors in surveillance videos, is a challenging task since anomalous events are diverse and complicated in different situations. This makes it difficult to use a single static network architecture to extract useful information from diverse abnormal patterns. Therefore, in this article, we propose a novel Dynamic Self-Supervised Network (DSS-Net) to explore both spatial and temporal anomalous information. In our DSS-Net, we design a dynamic network to adaptively select a suitable network architecture to extract latent features from different anomalous patterns and normal patterns. Specifically, we generate spatial and temporal pseudo-abnormal data as the input of the dynamic network to conduct self-supervised learning. We also design a Hybrid Anomaly Dynamic Convolution (HAD-Conv) to extract features for diversified anomalous events adaptively. We utilize both normal and pseudo-abnormal data to encourage the dynamic network to mine the discriminative information. Furthermore, we design a feature separation loss to maximize the difference between the anomalous and normal videos. We evaluate our proposed method on four public anomaly detection datasets and achieve competitive results compared with the state-of-the-art approaches.

Abstract:
In multimodal machine learning, proper handling of cross-modal information is essential for obtaining an ideal joint embedding. Despite the progress made by recent fusion strategies, we hold that before the fusion stage, the unimodal representation inevitably contains noise that may hinder the correct learning of cross-modal dynamics and affect multimodal fusion. It is worthwhile to investigate how the information is being utilized and how to make the full use of it. Rethinking the process of leveraging multiple modalities for the joint embedding, multimodal learning can be regarded as a chemical reaction process and two steps may benefit learning: 1) purification to filter impurity, and 2) catalyst to facilitate learning. In this paper, we propose a Multimodal Information Modulation (MIM) learning framework to modulate the contribution and utilization of the cross-modal information, which identifies and handles the ‘impurity’ and ‘catalyst’ in multimodal learning. Specifically, a Unimodal Purification Network (UPN) is proposed to identify and explicitly filter out the impurity within each modality before fusion, which reduces the possibility of learning incorrect cross-modal dynamics. Besides, based on the intuition that useful information has the potential in the guidance of model updating, it plays a role to facilitate learning, which is achieved by the design of the Knowledge Guidance Scheme (KGS) considering both the intra- and inter-modal scenarios. Different to a majority of works that emphasize the role of useful information in the fusion and inference stage, KGS considers its potential role in assisting the representation learning of weaker components. Besides, it fully considers the modality dominance problem and sample variations for optimization. In short, MIM manages to modulate the useless/useful information to minimize/emphasize their contribution. Experimental results verify the effectiveness of the proposed method.

Abstract:
Few-shot learning (FSL) is a challenging task that aims to train a classifier to recognize novel categories, where only a few annotated examples are available in each category. Recently, many FSL approaches have been proposed based on the meta-learning paradigm, which attempts to learn transferable knowledge from similar tasks by designing a meta-learner. However, most of these approaches only exploit the information from the visual modality and do not utilize information from additional modalities (e.g., textual description). Since the labeled examples in FSL are limited, increasing the information available for each example is a plausible way to improve the classification performance. This motivates us to propose a novel meta-learning method, termed textual enhanced adaptive meta-fusion FSL (TAMF-FSL), which leverages both the visual information from the visual image and semantic information from language supervision. Specifically, TAMF-FSL exploits the semantic information of textual description to improve the visual-based models. We first employ a text encoder to learn the semantic features of each visual category, and then design a modality alignment module and meta-fusion module to align and fuse the visual and semantic features for final prediction. Extensive experiments show that the proposed method outperforms many recent or competitive FSL counterparts on two popular datasets.

Abstract:
Sequential recommendation mines the user's interaction sequence or time information to produce better recommendations and is thus gaining more and more attention. Existing sequential recommendation work tends to build new models, and the study of the loss function is seriously neglected. Despite the increasing attention paid to contrastive learning recently, we believe that the key to contrastive learning is the contrastive loss (CL), which also provides a new option for sequential recommendation. However, we find that it works against the personalized representation of features. First, it is a relative constraint that keeps positive and negative samples away from each other but imposes no absolute constraint. Second, recent studies have shown that all embeddings should be uniformly distributed. However, CL only widens the distance between positive and negative samples within the training batch, rather than making all items uniformly distributed. These two shortcomings make the embedding space too compact, which is harmful to personalized representation and recommendation. Therefore, this article proposes Personalized Contrastive Loss (PCL) to combine CL with the absolute constraints of BCE/CE and employs regularization methods to make the representations uniformly distributed. State-of-the-art results are obtained in experiments on several commonly used datasets. The code and data will be available on GitHub.
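
The distinction this abstract draws between a relative and an absolute constraint can be checked numerically. The sketch below is a minimal illustration, not the paper's PCL: it uses an InfoNCE-style softmax loss as the contrastive term, a standard binary cross-entropy on raw scores as the absolute term, and a hypothetical weight `alpha` to combine them; the uniformity regularization mentioned in the abstract is omitted because its form is not given.

```python
import math


def info_nce(pos_score, neg_scores):
    """InfoNCE-style loss: softmax cross-entropy with the positive as target."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - pos_score


def bce(pos_score, neg_scores):
    """Binary cross-entropy on raw scores: positive -> 1, negatives -> 0."""
    def softplus(x):
        # Stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    losses = [softplus(-pos_score)] + [softplus(s) for s in neg_scores]
    return sum(losses) / len(losses)


def combined_loss(pos_score, neg_scores, alpha=1.0):
    """Contrastive term plus an absolute BCE term (alpha is hypothetical)."""
    return info_nce(pos_score, neg_scores) + alpha * bce(pos_score, neg_scores)


# Shifting every score by the same constant leaves the contrastive loss
# unchanged (a purely relative constraint) but changes the BCE term.
cl_a = info_nce(2.0, [0.0, 1.0])
cl_b = info_nce(7.0, [5.0, 6.0])
bce_a = bce(2.0, [0.0, 1.0])
bce_b = bce(7.0, [5.0, 6.0])
```

In this toy setting the contrastive term cannot tell the two score sets apart, while the BCE term penalizes the second set for giving negatives high absolute scores, which is the kind of absolute constraint the abstract argues for adding.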

Abstract:
Deep models for facial expression recognition achieve high performance by training on large-scale labeled data. However, publicly available datasets contain uncertain facial expressions caused by ambiguous annotations or confusing emotions, which can severely degrade robustness. Previous studies usually follow the bias elimination method in general tasks without considering the uncertainty problem from the perspective of different corresponding sources. This article proposes a novel multi-task assisted correction method, called MTAC, for addressing uncertain facial expression recognition. Specifically, a confidence estimation block and a weighted regularization module are applied to highlight solid samples and suppress uncertain samples in every batch. In addition, two auxiliary tasks, i.e., action unit detection and valence-arousal measurement, are introduced to learn semantic distributions from a data-driven AU graph and mitigate category imbalance based on latent dependencies between discrete and continuous emotions, respectively. Moreover, a re-labeling strategy guided by a feature-level similarity constraint further generates new labels for identified uncertain samples to promote model learning. The proposed method can flexibly combine with existing frameworks in a fully-supervised or weakly-supervised manner. Experiments on five popular benchmarks demonstrate that MTAC substantially improves over baselines when facing synthetic and real uncertainties and outperforms the state-of-the-art methods.

Abstract:
Depth maps suffer from multiple kinds of degradation such as noise and low resolution, due to the limitations of sensors. To improve the spatial resolution and quality of depth maps, RGB-D-based depth super-resolution (SR) methods utilize the corresponding color image to provide extra structure information. However, the inconsistency between the color texture and depth structure can lead to texture-copying artifacts if the two kinds of features are fused without selection. In this article, we propose a novel coarse-to-fine framework for RGB-D-based depth SR, which consists of two sub-networks, i.e., CONet for coarse SR and RFNet for refinement. Through the proposed coarse supervision strategy, CONet can alleviate multiple degradations in depth maps and assist with further SR in the refinement stage. Moreover, the branch attention module (BAM) is incorporated in the RFNet to adaptively select important information from RGB-D features and suppress the texture-copying artifact. Additionally, we propose an edge-aware spatial attention module (ESAM) to further locate and restore the depth discontinuity in the fused RGB-D features. Extensive experiments on multiple benchmarks demonstrate that compared to the state-of-the-art methods, the proposed method achieves improved results both quantitatively and qualitatively.

Abstract:
Entity-aware image captioning aims to describe named entities and events related to the image by utilizing the background knowledge in the associated article. This task remains challenging as it is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities. Furthermore, the complexity of the article brings difficulty in extracting fine-grained relationships between entities to generate informative event descriptions about the image. To tackle these challenges, we propose a novel approach that constructs a multi-modal knowledge graph (MMKG) to associate the visual objects with named entities and capture the relationship between entities simultaneously with the help of external knowledge collected from the web. Specifically, we build a text sub-graph by extracting named entities and their relationships from the article, and build an image sub-graph by detecting the objects in the image. To connect these two sub-graphs, we propose a cross-modal entity matching module trained using a knowledge base that contains Wikipedia entries and the corresponding images. Finally, the MMKG is integrated into the captioning model via a graph attention mechanism. Extensive experiments on both GoodNews and NYTimes800k datasets demonstrate the effectiveness of our method.

Abstract:
LiDAR sensors are widely used in many safety-critical applications such as autonomous driving and drone control, and the collected data called point clouds are subsequently processed by 3D object detectors for visual perception. Recent works have shown that attackers can inject virtual points into LiDAR sensors by strategically transmitting laser pulses to them; additionally, deep visual models have been found to be vulnerable to carefully crafted adversarial examples. Therefore, a LiDAR-based perception may be maliciously attacked with serious safety consequences. In this article, we present a highly-deceptive adversarial obstacle generation algorithm against deep 3D detection models, to mimic fake obstacles within the effective detection range of LiDAR using a limited number of points. To achieve this goal, we first perform a physical LiDAR simulation to construct sparse obstacle point clouds. Then, we devise a strong attack strategy to adversarially perturb prototype points along each direction of the ray. Our method achieves a high attack success rate while complying with physical laws at the hardware level. We perform comprehensive experiments on different types of 3D detectors and determine that the voxel-based detectors are more vulnerable to adversarial attacks than the point-based methods. For example, our approach achieves an 89% mean attack success rate against PV-RCNN by using only 20 points to spoof a fake car.

Abstract:
The long-term action in untrimmed video generally contains multiple sub-actions, among which various semantic patterns exist (e.g., the co-occurrence or sequentiality between sub-actions). These semantic patterns are temporally coarse, and correlated with multiple local contexts which encode the local temporal evolution of visual elements (e.g., hands, objects) in videos. The local contexts and semantic patterns form the inherent fine-to-coarse temporal structure of long-term actions, which is neglected by existing works. Accordingly, in this work we propose TwinFormer, which exploits a novel fine-to-coarse temporal modeling manner to uncover the temporal structure of long-term actions. The proposed TwinFormer consists of a pair of twin encoders with the same structural design, namely the Local-context Encoder and the Semantic-pattern Encoder, and a Temporal-bridged Attention to bridge the two twin encoders. The Local-context Encoder aims to model the local contexts in the long-term action. The Temporal-bridged Attention is designed to correlate the local contexts with semantic patterns. Furthermore, the Semantic-pattern Encoder reveals the temporal evolution of semantic patterns. Experimental results on three benchmarks demonstrate the effectiveness of the proposed model.

Abstract:
Automatic emotion recognition has recently gained significant attention due to the growing popularity of deep learning algorithms. One of the primary challenges in emotion recognition is effectively utilizing the various cues (modalities) available in the data. Another challenge is providing a proper explanation of the outcome of the learning. To address these challenges, we present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK), a generalized and modular system for human emotion recognition and explanation using visual information. Our system can handle multiple modalities, including facial expressions, posture, and gait, in a flexible and modular manner. The network consists of different modules that can be added or removed depending on the available data. We utilize a two-stream network architecture with convolutional neural networks (CNNs) and encoder-decoder style attention mechanisms to extract deep features from face images. Similarly, CNNs and recurrent neural networks (RNNs) with Long Short-term Memory (LSTM) are employed to extract features from posture and gait data. We also incorporate deep features from the background as contextual information for the learning process. The deep features from each module are fused using an early fusion network. Furthermore, we leverage situational knowledge derived from the location type and adjective-noun pair (ANP) extracted from the scene, as well as the spatio-temporal average distribution of emotions, to generate explanations. Ablation studies demonstrate that each sub-network can independently perform emotion recognition, and combining them in a multimodal approach significantly improves overall recognition performance. Extensive experiments conducted on various benchmark datasets, including GroupWalk, validate the superior performance of our approach compared to other state-of-the-art methods.

Abstract:
The recently developed vision transformer (ViT) has achieved promising results on image retrieval compared to convolutional neural networks. However, most of these vision transformer-based image retrieval methods use the original ViT model to extract global features, ignoring the importance of local features for image retrieval. In this work, we propose a vision transformer-based multiscale feature fusion image retrieval method (MSViT) to achieve the fusion of global features with local features. The challenge of this research work is how to exploit the feature representation ability of the transformer model so as to improve the performance of the image retrieval model. First, a transformer-based two-branch network structure is proposed to obtain different scale features by processing image patches with different granularities. Second, we present a multiscale feature fusion strategy, which can efficiently and effectively fuse the feature information of different sizes on two branches. Finally, to more fully utilize the label information to supervise the network training process, we optimize the construction rules for the triplet data. The comparison of experimental results with ten CNN-based and six transformer-based image retrieval methods on four publicly available image datasets shows that our method outperforms the state-of-the-art methods. Ablation experiments further show that the designed multiscale feature fusion strategy and improved triplet loss function contribute to the performance of MSViT.

Abstract:
We consider the problem of hyperspectral image (HSI) reconstruction, which aims to recover 3D hyperspectral data from 2D compressive HSI measurements acquired by a coded aperture snapshot spectral imaging (CASSI) system. Existing deep learning methods have achieved acceptable results in HSI reconstruction. However, these methods did not consider the imaging system degradation pattern. In this article, based on observing the initialized HSIs obtained by shifting and splitting the measurements, we propose a dynamic Fourier network based on degradation learning, called the degradation-aware dynamic Fourier-based network (DADF-Net). We estimate the degradation feature maps from the degraded hyperspectral images to realize the linear transformation and dynamic processing of the features. In particular, we use the Fourier transform to extract the HSI non-local features. Extensive experimental results show that the proposed model outperforms state-of-the-art algorithms on simulation and real-world HSI datasets.

Abstract:
Face forgery detection plays an important role in personal privacy and social security. With the development of adversarial generative models, high-quality forgery images become more and more indistinguishable from real ones to humans. Existing methods always regard the forgery detection task as a common binary or multi-label classification problem, and ignore exploring diverse multi-modality forgery image types, e.g., visible light spectrum and near-infrared scenarios. In this article, we propose a novel Hierarchical Forgery Classifier for Multi-modality Face Forgery Detection (HFC-MFFD), which can effectively learn a robust patch-based hybrid domain representation to enhance forgery authentication in multiple modality scenarios. The local hybrid domain representation is designed to explore strong discriminative forgery clues in both the image and frequency domains with the intra-attention mechanism. Furthermore, the specific hierarchical face forgery classifier is designed through the authenticity feedback strategy to integrate diverse discriminative clues. Experimental results on representative multi-modality face forgery datasets demonstrate the superior performance of the proposed HFC-MFFD compared with state-of-the-art algorithms.

Abstract:
Referring image segmentation aims to segment objects that are described by natural language expressions. Although remarkable advancements have been made to align natural language expressions with visual representations for better performance, the interaction between image-level and text-level information is still not formulated properly. Most of the previous works focus on building correlations between vision and language, ignoring the variety of objects. The target objects with unique appearances may not be correctly located or completely segmented. In this article, we propose a novel Bilateral Knowledge Interaction Network, termed BKINet, which reformulates the image-text interaction in a bilateral manner to adapt concrete knowledge of the target object in the image. BKINet contains two key components: a knowledge learning module (KLM) and a knowledge applying module (KAM). In the KLM, the abstract knowledge from text features is replenished with concrete knowledge from visual features to adapt to the target objects in the input images, which generates the knowledge interaction kernels (KI kernels) containing abundant referring information. With the referring information of KI kernels, the KAM is designed to highlight the most relevant visual features for predicting the accurate segmentation mask. Extensive experiments on three widely-used datasets, i.e. RefCOCO, RefCOCO+, and G-ref, demonstrate the superiority of BKINet over the state-of-the-art.

Abstract:
Multi-Label Continual Learning (MLCL) is a framework designed for class-incremental multi-label image recognition. However, MLCL faces two critical challenges: the construction of label relationships on past-missing and future-missing partial labels of training data, and the problem of catastrophic forgetting, which leads to poor generalization. To address these challenges, this study proposes an enhanced version of the Augmented Graph Convolutional Network (AGCN++), capable of constructing cross-task label relationships and mitigating catastrophic forgetting. First, an Augmented Correlation Matrix (ACM) is constructed across all observed classes, incorporating intra-task relationships derived from hard label statistics. Additionally, inter-task relationships are established by leveraging both hard and soft labels obtained from the data, as well as a constructed expert network. Next, a novel partial label encoder (PLE) is introduced for MLCL, enabling the extraction of dynamic class representations for each partial label image as graph nodes. This PLE also facilitates the generation of soft labels, which contribute to the creation of a more persuasive ACM and effectively mitigate forgetting. Lastly, a relationship-preserving constrainter is proposed to address the issue of forgetting label dependencies across old tasks. In the AGCN++, the label relationships topology can be augmented automatically, thereby generating efficient class representations. The effectiveness of the proposed method is evaluated using two multi-label image benchmarks. The experimental results demonstrate that the proposed approach is highly effective in the context of MLCL image recognition. It can establish compelling correlations across tasks, even in scenarios where the old task labels are missing.

Abstract:
As a popular research topic in computer vision, video frame interpolation is widely used in video processing tasks. However, this task is often limited by slow processing speed or high memory consumption in practical applications. To address these drawbacks, a frame interpolation network focusing on motion regions named MFNet is proposed, which consists of a sampler for adaptive and efficient separation of motion regions from the background, a fine-grained module for direct approximation of intermediate streams, and a lightweight module for bi-directional optical stream fusion. Extensive experiments show that our MFNet achieves optimal accuracy on some frame interpolation tasks and is much faster than other state-of-the-art methods. In addition, transplantation of the core components of MFNet to other frame interpolation networks can significantly improve the performance.

Abstract:
Supervised deep hashing aims to learn hash functions using label information. Existing methods learn hash functions by employing either pairwise/triplet loss to explore the point-to-point relation or center loss to explore the point-to-class relation. However, these methods overlook the collaboration between the above two kinds of relations and the hardness of pairs. In this work, we propose a novel Self-Paced Relational Contrastive Hashing (SPRCH) method with a single learning objective to capture valuable discriminative information from hard pairs using both the point-to-point and point-to-class relations. To exploit the above two kinds of relations, the Relational Contrastive Hash (RCH) loss is proposed, which ensures that each data anchor is closer to all similar data points and corresponding class centers in the Hamming space compared to dissimilar ones. Moreover, the proposed RCH loss reduces the drastic imbalance between point-to-point pairs and point-to-class pairs by rebalancing their weights. To prioritize hard pairs, a self-paced learning schedule is proposed, assigning higher weights to these pairs in the RCH loss. The self-paced learning schedule assigns dynamic weights to pairs according to their similarities and the training process. In this way, deep hash model can initially learn universal patterns from the entire set of pairs and then gradually acquire more valuable discriminative information from hard pairs. Experimental results on four widely-used image retrieval datasets demonstrate that our proposed SPRCH method significantly outperforms the state-of-the-art supervised deep hash methods.
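The notion of "closer in the Hamming space" used above can be illustrated with a minimal sketch. It is not the SPRCH method or the RCH loss; it only shows how binary hash codes are compared and ranked at retrieval time. Representing codes as Python lists of 0/1 and the helper names `hamming_distance` and `rank_by_hamming` are assumptions made for illustration.

```python
def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query, database):
    """Return database indices sorted from nearest to farthest in Hamming space.

    Python's sort is stable, so ties keep their original database order.
    """
    return sorted(range(len(database)),
                  key=lambda i: hamming_distance(query, database[i]))


if __name__ == "__main__":
    query = [1, 0, 1]
    database = [[0, 1, 0], [1, 0, 1], [1, 0, 0]]
    print(rank_by_hamming(query, database))
```

A hashing loss of the kind described in the abstract would aim to make similar items and their class centers land at small Hamming distances from the anchor, so that a ranking like this one returns them first.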

Abstract:
Video prediction is a pixel-level task that generates future frames by employing the historical frames. There often exist continuous complex motions, such as object overlapping and scene occlusion in video, which pose great challenges to this task. Previous works either fail to capture the long-term temporal dynamics well or do not handle the occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which includes two primary components, i.e., the occlusion inpainter and the spatiotemporal translator. The former adopts the fast Fourier convolutions to enlarge the receptive field, such that the missing areas (occlusion) with complex geometric structures are filled by the inpainter. The latter employs the stacked Fourier transform inception module to learn the temporal evolution by group convolutions and the spatial movement by channel-wise Fourier convolutions, which captures both the local and the global spatiotemporal features. This encourages generating more realistic and high-quality future frames. To optimize the model, the recovery loss is imposed on the objective, i.e., minimizing the mean square error between the ground-truth frame and the recovery frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, have demonstrated the superiority of the proposed approach.

Abstract:
The multi-modal robotic teleoperation, as an important application in human-computer interaction (HCI), is playing a significant role in various domains such as industry, healthcare, and education. However, existing robotic teleoperation systems face significant challenges with multi-modal signals, primarily in designing a cross-modal communication architecture that caters to diverse modal requirements and ensuring high-quality cross-modal signal reconstruction even in poor network conditions. To this end, this work proposes a general cross-modal signal reconstruction scheme by taking full advantage of the correlation among different modality signals. Specifically, we first propose a scalable cross-modal communication architecture that meets the diverse needs of various modality signals using multi-modal encoding and multi-directional decoding, eliminating the need for a specialized feature extraction model. Next, we design a masked auto-encoder with discriminator assistance (MAE-D) cross-modal signal reconstruction method, which leverages the idea of generative confrontation by combining the codec for signal reconstruction with the discriminator responsible for assessing the authenticity of the reconstructed signal to achieve accurate and efficient cross-modal signal reconstruction. Finally, numerical experiments conducted on our self-built multi-modal dataset, a public dataset, and a teleoperation simulation platform demonstrate that the proposed scheme offers significant advantages in cross-modal signal reconstruction.

Abstract:
Major Depressive Disorder (MDD) detection with cross-domain datasets is a crucial yet challenging application due to the data scarcity and isolated data island issues in multimedia computing research. Given the domain shift issue in MDD datasets and a continuous stream of incoming data in clinical settings, Semi-supervised Domain Adaptation (SDA) is suitable for addressing these challenges in MDD detection. However, existing mainstream Domain Adaptation (DA) methods have the following limitations that still need to be addressed, such as semantic misalignment, challenges in extending to various DA paradigms, and difficulty in addressing the classifier bias caused by class imbalance issues. To relieve the above issues, we propose a flexible Graph Neural Network-based Semi-supervised Domain Adaptation (GNN-SDA) for MDD detection. The proposed framework comprises a feature extraction backbone along with two essential modules: a GNN-based domain alignment module and an uncertainty-guided optimization module. The GNN-based domain alignment module is designed to reduce the domain gap in a flexible manner, which is able to align multiple domains through the information propagation mechanism instead of the explicit alignment operation. The uncertainty-guided optimization module discusses the uncertainty of pseudo-labels, mitigating the adverse impact of noisy predictions and taking into account the class distribution of unlabeled data. Finally, we evaluate the proposed GNN-SDA framework for MDD detection under different domain adaptation paradigms on four benchmark datasets, i.e., DAIC-WOZ, EATD, CMDC, and MODMA. The promising results indicate the flexibility and effectiveness of the proposed framework for MDD detection.

Abstract:
While saliency detection for 3D meshes has been extensively studied in the past decades, only a little work considers color information, and most existing 3D mesh saliency databases are collected using meshes without color information. The lack of a publicly available 3D colored mesh saliency database hinders research progress in 3D colored mesh saliency detection. In this article, we establish a novel 3D colored mesh saliency database (3DCMS) based on an eye-tracking experiment and investigate subjects' visual attention behavior towards 3D colored meshes. Based on the investigations, a novel 3D colored mesh saliency detection framework is proposed which takes both color and geometric features into consideration. To evaluate the performance of the proposed algorithm, we compare it with several relevant methods and apply it to the 3D mesh simplification task. The quantitative and qualitative evaluation results demonstrate the superior performance of the proposed framework. The proposed 3DCMS database will be made publicly available.

Abstract:
In recent years, we have seen the success of deep video enhancement models. However, the performance improvement of new methods has gradually entered a bottleneck period. Optimizing model structures or increasing training data brings less and less improvement. We argue that existing models with advanced structures have not fully demonstrated their performance and demand further exploration. In this study, we statistically analyze the relationship between motion estimation accuracy and video interpolation quality of existing video frame interpolation methods, and find that only supervising the final output leads to inaccurate motion and further affects the interpolation performance. Based on this important observation, we propose a general motion distillation framework that can be widely applied to flow-based and kernel-based video frame interpolation methods. Specifically, we begin by training a teacher model, which uses the ground-truth target frame and adjacent frames to estimate motion. These motion estimates then guide the training of a student model for video frame interpolation. Our experimental results demonstrate the effectiveness of this approach in enhancing performance across diverse advanced video interpolation model structures. For example, after applying our motion distillation framework, the CtxSyn model achieves a PSNR gain of 3.047 dB.
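For readers unfamiliar with the metric behind the reported gain, the following is a minimal sketch of how PSNR is conventionally computed from the mean squared error. It is not taken from the paper; the function name `psnr`, the flat-list frame representation, and the default peak value of 255 (8-bit pixels) are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel positions.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    target = [100, 120, 140, 160]
    coarse = [102, 122, 142, 162]   # every pixel off by 2
    refined = [101, 121, 141, 161]  # every pixel off by 1
    print(psnr(target, refined) - psnr(target, coarse))
```

Because PSNR is logarithmic in the MSE, halving the per-pixel error in this toy case raises PSNR by about 6 dB, which gives a sense of scale for a gain of a few dB.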

Abstract:
Heterogeneous face recognition (HFR), which refers to matching face images with different modalities, is essential to public safety. Although HFR has made promising progress in recent years, disguised faces in HFR scenarios still remain a major challenge for the following reasons. First, most existing HFR methods focus on traditional scenarios without disguised accessories, and the performance degrades when dealing directly with disguised faces. Second, there is a need for disguised heterogeneous face datasets, which is essential for developing the related research community. Third, colorful accessories are distinct from heterogeneous face images in terms of their modalities, and their direct combination results in style inconsistency and poor quality. Therefore, we propose a disguised heterogeneous face generation method based on an iterative-adversarial style unification framework. Our approach aims to gradually learn frame textures to detail textures in multiple confrontation iterations, resulting in style unification for disguised accessories and heterogeneous faces. We also construct a disguised heterogeneous face dataset, which contains a disguised NIR-VIS subset and a disguised sketch-photo subset. Moreover, we provide benchmark evaluations conducted on our proposed dataset with face recognition and image quality assessment, demonstrating the superiority of our method over direct addition and two representative disguised face generation techniques.

Abstract:
Recent studies have shown impressive performance in object detection. However, most current detectors only explore the appearance feature to locate and classify objects but disregard or underestimate the valuable contextual information in the image, which limits the detection performance for those hard objects, such as small objects, occluded objects, blurred objects, etc. In this article, we instead seek to build a novel context modeling framework and conduct more effective context reasoning for object detection. Specifically, we design a Context-guided Reasoning Network (CRNet) to explore the relationships between objects and use easy detected objects to help understand hard ones. In our CRNet, an image is modeled as a graph and local features of objects are viewed as nodes of the graph to learn the relationships between objects. By passing contextual information in the built graph, the features of hard objects can be updated to discriminative features. To this end, we first develop a cascaded center prediction module built upon CenterNet to produce a set of high-quality proposals viewed as nodes of the graph. In addition, to maximize the value of global context information, we present a multi-granularity feature fusion network to encode the whole scene information which is also viewed as nodes of the graph. Then, the spatial and semantic relationships between objects are learned to initialize edges of the graph. Finally, context reasoning is conducted to update the node states iteratively. Extensive experiments are conducted on MS COCO and Pascal VOC to demonstrate the effectiveness of the proposed CRNet. Experimental results show that the proposed CRNet greatly improves the detection performance over existing context-based detectors, and it is comparable with state-of-the-art detectors.

Abstract:
Cross-domain facial expression recognition is confronted by the problem of large distribution discrepancy and sample inconsistencies between the source domain and target domain. To solve this problem, we propose a cross-domain sample relationship learning (CSRL) method that explores useful intrinsic sample relationships of two domains to narrow the domain discrepancy. Specifically, during the training stage, we first design inter-domain sample transformers to explore the sample similarity relationships between the source and target domains, and then deploy intra-domain sample transformers to capture the internal similarity structure of the samples in each domain. Thus, dual sample relationships can be learned to align the cross-domain similar samples and preserve the domain-specific information, which can facilitate the learning of both inter-domain and intra-domain invariant features. Subsequently, we design a joint alignment strategy by simultaneously deploying the feature distribution alignment and cross-domain sample relationship learning. Thus, both local similar samples and the global domain distribution of two domains can be well aligned to enhance the generalization ability of the model. Experimental results on several benchmark databases show the superiority of CSRL over some state-of-the-art methods.

Abstract:
Aspect-based multimodal sentiment analysis (ABMSA) is an important sentiment analysis task that analyses aspect-specific sentiment in data with different modalities (usually multimodal data with text and images). Previous works usually ignore the overall sentiment tendency when analyzing the sentiment of each aspect term. However, the overall sentiment tendency is highly correlated with aspect-specific sentiment. In addition, existing methods neglect to explore and make full use of the fine-grained multimodal information closely related to aspect terms. To address these limitations, we propose a dual-perspective fusion network (DPFN) that considers both global and local fine-grained sentiment information in multimodal data. From the global perspective, we use text-image caption pairs to obtain a global representation containing information about the overall sentiment tendencies. From the local fine-grained perspective, we construct two graph structures to explore the fine-grained information in texts and images. Finally, aspect-level sentiment polarities can be obtained by analyzing the combination of global and local fine-grained sentiment information. Experimental results on two multimodal Twitter datasets show that the proposed DPFN model outperforms state-of-the-art methods.

Abstract:
Skill determination aims to evaluate how well a participant performs a specific action. The task is rather challenging, due to the diversity of action types and the scarcity of samples. Many existing works train a skill determination model on limited samples of each action type separately. However, they neglect the skill similarities shared by different action types that can be exploited to enhance the skill determination process. How to transfer useful assessment skills from source actions to a related target action remains a challenge, and existing works have not yet found an effective way to accomplish this. In this work, we propose to achieve skill transfer for action assessment by an Adaptive Stage-aware Assessment Skill Transfer framework (AdaST) that transfers assessment skills from source actions to different stages of a target action adaptively. A source action search scheme is proposed to select relevant source actions for each target action. Furthermore, to encourage transferring effective and non-redundant assessment skills, a consistency loss and an orthogonality loss are introduced to ensure that the transferred assessment skills do not degrade the accuracy of determination and that they provide complementary information. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed method.

Abstract:
Aesthetic attributes are crucial for aesthetics because they explicitly present some photo quality cues that a human expert might use to evaluate a photo’s aesthetic quality. However, annotating aesthetic attributes is a time-consuming, costly, and error-prone task, which leads to the issue that photos available are partially annotated with attributes. To alleviate this issue, we propose a novel semi-supervised adversarial learning method for photo aesthetic assessment from partially attribute-annotated photos, which can greatly reduce the reliance on manual attribute annotation. Specifically, the proposed method consists of a score-attributes generator R, a photo generator G, and a discriminator D. The score-attributes generator learns the aesthetic score and attributes simultaneously to capture their dependencies and construct better feature representations. The photo generator reconstructs the photo by feeding aesthetic attributes, score, and informative feature representation. A discriminator is used to force the convergence of the features-attributes-score tuples generated from the score-attributes generator, the photo generator, and the ground-truth distribution in labeled data for training data. The proposed method significantly outperforms the state of the art, increasing the Spearman rank-order correlation coefficient (SRCC) from the existing best reported of 0.726 to 0.761 on Aesthetic and attributes database and 0.756 to 0.774 on Aesthetic visual analysis database, respectively.
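The SRCC figures quoted above measure how well predicted scores preserve the ranking of ground-truth scores. The following is a minimal sketch of the standard Spearman computation (Pearson correlation of average ranks, with ties sharing a rank); it is not the authors' evaluation code, and the helper names `average_ranks` and `srcc` are hypothetical.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, ground_truth):
    """Spearman rank-order correlation coefficient between two score lists."""
    if len(predicted) != len(ground_truth) or len(predicted) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = average_ranks(predicted), average_ranks(ground_truth)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Any monotonically increasing relationship gives a coefficient of 1.
    print(srcc([0.1, 0.5, 0.9], [1, 10, 1000]))
```

Since only ranks matter, a model can score well on SRCC even if its predicted values are on a different scale from the annotations, which is why the metric is common in quality and aesthetic assessment.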

Affiliations: Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology, Shanghai, China; IMI, University of Shanghai for Science and Technology, Shanghai, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing, China; TAMS Group, Department of Informatics, Universität Hamburg, Hamburg, Germany

Abstract:
Source-free domain adaptation (SFDA) tends to forget the source domain, which limits it in real-world scenarios. Recently, the generalized source-free domain adaptation (GSFDA) problem has naturally emerged, aiming for good performance on both target and source domains. The existing methods attempt to retain model parameters associated with the source domain to prevent such forgetting. However, this strategy prioritizes mitigating forgetting on the source domain and is not conducive to improving cross-domain performance on the target domain. This article introduces a Progressive Source-Aware Transformer approach for GSFDA, dubbed PSAT-GDA. Our core idea is to enforce the domain adaptation process to remember the source domain by imposing source guidance, offering a target domain-centric anti-forgetting mechanism. Specifically, for each epoch, a Transformer-based deep network is adapted to do domain alignment like the traditional SFDA method, because the transformer working on the image patch sequence helps to reduce image noise caused by domain shift. Meanwhile, another Transformer is designed to generate source guidance supervising domain alignment. By augmenting target samples and mining the source information from the historical models before the current epoch, a source-injected feature group is constructed. Based on the Transformer mechanism, the attention block can select useful source information for each target sample. From it, we devise neighbour-based and augmentation-based regularizations to shape the source guidance. Experiments on three challenging datasets show that our method can achieve evident cross-domain improvement on the target domains. Also, it can mitigate forgetting on all domains after adapting to single or multiple target domains.

Abstract:
Consistency regularization has achieved great success in Semi-Supervised Single-Label Image Classification (SS-SLC) with deep learning models, while little effort has been devoted to Semi-Supervised Multi-Label Image Classification (SS-MLC) with deep learning models. One intuitive solution for introducing consistency regularization to SS-MLC is to regularize model predictions to be invariant to different augmented data of the same input image. However, the solution lacks the consideration of label relations, which are key elements in multi-label image classification. In this article, we go beyond the consistency regularization for multi-view input images, and propose Conditional Consistency Regularization (CCR) that is tailored for SS-MLC. Specifically, for two augmented input images, we make the two model predictions conditioned on different label states (i.e., positive, negative, or unknown for each class). By encouraging the two predictions to be consistent, the model is able to build relations between the given two different label states, which helps to make use of label relations for boosting image classification. The experiments on large-scale real-world SS-MLC benchmarks demonstrate that the proposed method can surpass state-of-the-art methods by a large margin.

Abstract:
In the tracking literature, foreground and background information have been extensively investigated to discriminate a target from its surrounding background. However, both foreground and background possess their own spatial-temporal correlation relationships that provide significant information to separate the target from its surrounding background, which has usually been ignored by existing work. To address this issue, we propose a bidirectional transductive network based tracker, which incorporates long-range spatial-temporal and bidirectional constraints. Specifically, our tracker consists of two modules, namely the mask generation module (MGM) and the transduction attention module (TAM). MGM aggregates long-range interdependencies of a target along the history frames for generating accurate target masks. TAM looks back through the history frames to find patches similar to the current frame, which are then forwarded along with the target masks generated by MGM. In this manner, each position in the current frame can determine its own identity, i.e., whether it belongs to the background or the foreground, hence accurately distinguishing the target from its distractors. We conduct systematic experiments and achieve state-of-the-art performance on several benchmarks, obtaining 69.2% AO on GOT-10k and 82.1% on TrackingNet.

Abstract:
The mainstream tracking-by-detection paradigm for multi-object tracking generally conducts detection first, followed by Re-IDentification (Re-ID) and motion estimation. The associations between the predicted boxes and existing tracks are then performed via visual and motion association. However, challenges such as irregular motion patterns, similar appearances, and frequent occlusions often arise, making object tracking a nontrivial task. In this article, we propose a multi-object tracker based on Spatio-TemporAl Topological (STAT) constraints to address the above issues. More specifically, we design the Feature Adaptive Association Module (FAAM) to establish the association between motion and appearance regionally, completing a complementary combination of appearance and motion features. Among these, the Appearance Feature Update Module (AFUM) is proposed to manage the appearance updates of tracked objects by imposing constraints based on the spatial locations and the degree of object occlusion, while temporal consistency is adopted to smooth the appearance states of tracks to mitigate the accumulation of appearance noise. Moreover, the Robust Motion Tracking Module (RMTM) is established to reduce the impact of irregular motions and certain unreliable detection results. The proposed module includes a higher weighted momentum term to accommodate the excessive motion amplitude and considers low-confidence boxes accompanied by the stage-wise association strategy for high-confidence boxes. Extensive experiments on DanceTrack and benchmark MOT datasets verify the effectiveness of our STAT tracker, especially the state-of-the-art results on DanceTrack, which is characterized by irregular motion and indistinguishable appearance attributes.

Abstract:
Real-world multi-modal retrieval tasks always encounter modal imbalance scenarios. The scales of instances from different modalities are inconsistent and unaligned with each other. Though several methods alleviate the issue by establishing miscellaneous representations of multi-modal data, they still suffer from difficulties like laborious human annotations and complex common-space optimization. In this research, we present Constrained Bipartite Graph Learning (CBGL) for imbalanced correlations, where a size-flexible correlation graph is learned from instances' representations. To guide the graph learning, we take advantage of prior side information, including positive pairs and negative pairs, which readily express intra-modality affinity and inter-modality discrepancy. Accordingly, both positive and negative correlations are propagated over instances, and a similarity graph with satisfactory neighbors is achieved. Benefiting from the probabilistic similarities, a query graph is then naturally constructed that directly achieves multi-modal retrieval. To validate the effect, we build a Music-Video-Image (MVI) dataset in regard to music and images with imbalanced scales. Experimental results reported on the MVI dataset and three benchmarks demonstrate our prominent superiority over ten representative competitors in multi-modal retrieval.

Abstract:
Exploiting the rich temporal information in human pose sequences to facilitate 3D pose estimation has garnered particular attention. While various learning architectures have been designed for temporal exploiting, these architectures are usually trained via the 3D pose loss independently imposed on every single frame, without explicit temporal signals introduced for supervision. This inevitably increases the difficulty of temporal exploiting, since the network must first reason about the meaningful temporal information based on the non-temporal single-frame supervision. Only then can the network utilize this information to guide sequence modeling. Recently, some works have introduced temporal smoothness as an explicit supervision signal, which lets the network obtain the temporal information more directly from the supervision signal, thus improving the temporal exploiting. However, the temporal smoothness only roughly measures the short-term temporal properties between adjacent frame pairs. In this work, we propose to generalize the supervision of temporal smoothness to temporal correlations, letting the network precisely consider more comprehensive temporal properties in sequences. We contribute two novel correlation-based loss functions, which adopt different strategies to respectively regularize the encoder and decoder sides of the network for temporal exploiting. Besides, we design a pre-training scheme to ensure a general convergence of existing pose estimators under our correlation losses. Experiments on three benchmarks demonstrate that our method can be compatible with different networks, improving their temporal exploiting ability to output more accurate and robust pose estimations.

Abstract:
Salient Object Detection (SOD) is dominated by Encoder-Decoder networks which involve multi-scale feature fusion and multi-resolution dense supervision. First, it is prevalent yet problematic to interpolate feature maps or pool ground truth (GT) to fit the size of decoder stages in SOD. Structural properties are unavoidably damaged since pixels are discarded or changed during scaling, resulting in restoration difficulties and poor predictions. Second, it is intuitive yet suboptimal to posit the last layer of an encoder as global context, even though it has been widely accepted that high-level encoder features contain global information that contributes to the overall shape of a salient object. To this end, this paper aims to enhance the abovementioned techniques for richer details and a more complete shape. First, we developed a Global Context Branch (GCB) which is a patch-wise supervised SOD on top of the encoder for better global context modeling. Second, we developed a Context Refinement Module (CRM) to improve high/low-level feature fusion and enhance detail reconstruction. Lastly, we adopt Pixel Shuffle (PS) when scaling features and GT maps to preserve structural information. Experiments demonstrated that our proposed framework achieved state-of-the-art performance on all five benchmark datasets across six existing evaluation metrics.
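The claim that Pixel Shuffle preserves structure where interpolation or pooling does not can be illustrated with a small pure-Python sketch. It only shows the general rearrangement idea; the function names and the single-channel list-of-lists layout are assumptions for illustration, and the abstract does not say how the authors apply PS inside their network.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W map into r*r channels of size (H/r) x (W/r).

    No pixel is discarded or averaged: the map is only re-indexed.
    """
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    # Channel dy * r + dx holds the pixel at offset (dy, dx) of every r x r cell.
    return [
        [[image[y * r + dy][x * r + dx] for x in range(w // r)]
         for y in range(h // r)]
        for dy in range(r)
        for dx in range(r)
    ]


def pixel_shuffle(channels, r):
    """Inverse of pixel_unshuffle: interleave r*r channels into one larger map."""
    if len(channels) != r * r:
        raise ValueError("expected r*r channels")
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [channels[(y % r) * r + (x % r)][y // r][x // r] for x in range(w * r)]
        for y in range(h * r)
    ]


# A toy 4 x 4 "GT map" is downscaled by 2 and restored exactly.
gt = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
low_res = pixel_unshuffle(gt, 2)
restored = pixel_shuffle(low_res, 2)
```

Because the round trip is exact, every original pixel survives the change of resolution, which is the property that pooling and interpolation lack.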

Abstract:
Typically, infrared small target detection aims to accurately localize objects from complex backgrounds where the object textures are often dim and the object shapes are varying. A feasible solution is learning discriminative representations with deep convolutional neural networks (CNNs). However, the representations learned by traditional deep CNNs often suffer from low shape bias. In this work, we propose a unified framework to learn shape-biased representations for facilitating infrared small target detection by explicitly incorporating shape information into model learning. The framework cascades a large-kernel encoder and a shape-guided decoder to learn discriminative shape-biased representations in an end-to-end manner. The large-kernel encoder encodes infrared images into shape-preserving representations by using a few convolutions whose kernel size is as large as 9×9, in contrast to the commonly used 3×3. The shape-guided decoder simultaneously addresses two tasks: decoding the encoder representations via upsampling reconstruction to reconstruct the segmentation, and hierarchically fusing the decoder representations and edge information via cascaded gated ResNet blocks to reconstruct the contour. In this way, the learned shape-biased representations are effective for identifying infrared small targets. Extensive experiments show our approach outperforms 18 state-of-the-art methods.

Abstract:
Asymmetric image retrieval is a task that seeks to balance retrieval accuracy and efficiency by leveraging lightweight and large models for the query and gallery sides, respectively. The key to asymmetric image retrieval is realizing feature compatibility between different models. Despite the great progress, most existing approaches either rely on classifiers inherited from gallery models or simply impose constraints at the instance level, ignoring the structure of embedding space. In this work, we propose a simple yet effective structure similarity preserving method to achieve feature compatibility between query and gallery models. Specifically, we first train a product quantizer offline with the image features embedded by the gallery model. The centroid vectors in the quantizer serve as anchor points in the embedding space of the gallery model to characterize its structure. During the training of the query model, anchor points are shared by the query and gallery models. The relationships between image features and centroid vectors are considered as structure similarities and constrained to be consistent. Moreover, our approach makes no assumption about the existence of any labeled training data and thus can be extended to an unlimited amount of data. Comprehensive experiments on large-scale landmark retrieval demonstrate the effectiveness of our approach.

Abstract:
Recent years have witnessed extensive applications of Deep Neural Networks (DNNs) in various vision tasks. However, DNNs are vulnerable to adversarial images crafted by introducing perturbations into inputs to induce incorrect predictions. Unlike L_p-norm restricted adversarial attacks, many unrestricted attacks have been proposed by modifying attributes of the image (e.g., edge, color), while the critical components of the image are preserved. However, most existing unrestricted attacks easily introduce unnatural distortions, colors, stains and schemes, in the generated adversarial images. This paper proposes a novel unrestricted attack (named AdvST) to create stylized, natural-looking, and high-transferability adversarial images. The basic idea of AdvST is to embed adversarial perturbations when transferring the style from the reference image onto the original image (i.e., rendering the original image's semantic contents into the reference image's style). To further improve the image quality of generated adversarial images, we refine two kinds of reference images (i.e., photographs and artworks) based on different attractive styles and design two attacks accordingly. For photorealistic attack, we incorporate semantic information obtained from segmentation maps to improve the photo realism of adversarial images. For artistic attack, we propose integrating edge information extracted by the Laplace operator to preserve the structural integrity of the original image. Extensive experimental results validate the superior performance of AdvST in terms of adversarial image quality and black-box transferability compared to benchmark methods.
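To make the edge-extraction step of the artistic attack more concrete, here is a minimal sketch of a discrete Laplace operator applied to a grayscale image. The 4-neighbour stencil, the zeroed border, and the name `laplacian_edges` are assumptions chosen for illustration; the abstract does not state which discretization AdvST uses.

```python
def laplacian_edges(image):
    """Apply the 4-neighbour discrete Laplacian to the interior of a 2D image.

    `image` is a list of equal-length rows of numbers. Border pixels are left
    at zero because they lack a full neighbourhood.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    return out


# A vertical step edge: the response is non-zero only next to the intensity jump.
step = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
edges = laplacian_edges(step)
```

Flat regions give a zero response, so the output highlights only structural boundaries, which is why such a map can serve as a structure-preserving cue.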

Abstract:
Compressive Sensing (CS) surpasses the limitations of the sampling theorem by reducing signal dimensions during sampling. Recent works integrate measurement coding into CS to enhance the compression ratio. However, these works significantly decrease image quality, and both encoding and decoding become time-consuming. This article proposes a Compressive Sensing based Image Codec with Partial Pre-calculation (CSCP) to solve these issues. The CSCP separates the original reconstruction procedure into two parts: reconstructing the frequency domain data and the inverse calculation. Depending on the feature of the chosen deterministic sensing matrix, the complex reconstruction procedure is reduced to two matrix multiplications, resulting in a low time cost. Moreover, we can further optimize the reconstruction process by moving the frequency domain data reconstruction to the encoder, referred to as the partial pre-calculation process. The sparse data in the frequency domain is then compressed. This approach has two main benefits: 1) it reduces the complexity of the decoder, and 2) it results in less degradation in quality compared to existing measurement coding methods. Additionally, this work proposes the One-Row-Two-Tables strategy for defining Huffman Coding units. This approach leverages the quantized data distribution to improve compression efficiency while maintaining low complexity. In the decoder, the sequence of operations includes Huffman decoding, dequantization, and inverse calculation. Compared to the state-of-the-art, this work reduces bpp by 22.61% while increasing quality by 17.72%. Meanwhile, it achieves speedups of 649.13× on the encoder, 11.03× on the decoder, and 288.46× in total.

Abstract:
Underwater image enhancement (UIE) aims to improve the visual quality of raw underwater images. Current UIE algorithms primarily train a deep neural network (DNN) on synthetic datasets or datasets with pseudo labels by minimizing the reconstruction loss between enhanced images and ground truth images. However, there is a domain gap between synthetic and real-world underwater images, and the widely used \ell_1 or \ell_2 loss tends to overlook the importance of human perception, resulting in unsatisfactory perceptual quality of the final enhanced results. In this paper, we propose an unsupervised perception-driven DNN called PDD-Net for generalizable UIE. Instead of relying on paired images for training, we resort to an unsupervised generative adversarial network (GAN) with a large-scale set of easily available natural images as the target domain. This enables training on larger image sets collected from various domains while avoiding overfitting to any specific data generation protocol. Additionally, to make the visual quality of enhanced underwater images more in line with human perception, we pre-train a DNN-based pairwise quality ranking (PQR) model, based on which a PQR loss is formulated to progressively guide the enhancement of raw underwater images toward higher quality. In addition, we introduce a global attention module (GAM) that integrates modulation and attention mechanisms to enable capturing rich global and local information, leading to improvements in both brightness and contrast. Extensive experiments demonstrate that our proposed PDD-Net exhibits excellent generalization capabilities and outperforms existing methods in terms of both visual perception quality and quantitative indicators across different datasets.

Abstract:
Aligning multiple heterogeneous modalities in a parameter-sharing encoder to mine consistent information is a core idea of multimodal learning. However, two drawbacks hinder the development of such methods for clustering tasks: 1) each modality contains a considerable amount of superfluous information that cannot be aligned, impeding the mining of consistent information and 2) one-to-one alignment is contradictory to the clustering principle of minimum intra-cluster distance, leading to suboptimal clustering results. In this paper, we propose a novel Consistency-Guided Multimodal Clustering method (ConGMC) to remove superfluous information within the modalities unsupervised through information theory while improving one-to-one alignment for the clustering task. ConGMC contains multiple unimodal encoders and a multimodal shared encoder, where the former learns unimodal representation while the latter aligns multiple modalities to learn the cluster partition. Specifically, we first construct a mutual information maximin function to distinguish consistent information from superfluous information, in which the consistent and superfluous information are maximally retained and removed, respectively. Then a Clustering-Friendly Alignment strategy (CF-Align) is designed to address the contradiction between the alignment and clustering tasks. CF-Align dynamically adjusts the set of negative samples according to the learned cluster partition to avoid increasing the intra-cluster distance. Finally, we consider the cluster partition as a consistent constraint to optimize the multimodal shared encoder, enabling consistent information to guide the training process iteratively. Moreover, a variational optimization algorithm is proposed to ensure that ConGMC converges to a local optimum. Numerous experimental results on twelve real-world datasets validate that the proposed ConGMC method outperforms the state-of-the-art multimodal clustering methods.
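The idea behind CF-Align, restricting negatives so that samples from the same cluster are not pushed apart, can be sketched as follows. This is only an illustration of that one selection step under the assumption that each sample carries a hard cluster index; the helper name `select_negatives` is hypothetical and the abstract does not give the authors' exact rule.

```python
def select_negatives(cluster_ids, anchor):
    """Return indices usable as negatives for `anchor`.

    In plain one-to-one alignment every other sample is a negative. Here,
    samples currently assigned to the anchor's cluster are excluded, so the
    contrastive objective does not enlarge the intra-cluster distance.
    """
    anchor_cluster = cluster_ids[anchor]
    return [j for j, c in enumerate(cluster_ids) if c != anchor_cluster]


# Five samples with a learned partition into clusters 0, 1 and 2.
partition = [0, 0, 1, 2, 1]
negatives_for_2 = select_negatives(partition, anchor=2)
```

Since the partition changes during training, the negative set would be recomputed whenever the cluster assignments are updated, which matches the "dynamically adjusts" wording above.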

Abstract:
Video-based commonsense captioning aims to generate captions for the video content while providing multiple commonsense descriptions of the underlying event. Existing methods utilize video features to explore and generate commonsense containing latent semantics. However, this process needs to overcome the complex semantic gap between visible videos and invisible commonsense, which is not supported by the limited knowledge in existing video captioning datasets. To this end, we propose a novel GPT-based Two-stage Knowledge Guiding Network (TKG-Net), which uses GPT to augment dataset knowledge and introduces a cross-attention mechanism to fuse multimodal knowledge. Specifically, to augment knowledge, we set prompts and finetune GPT to imagine and reason based on the video content description at the first stage. At the second stage, to prevent over-reasoning caused by the loss of visual features in GPT, TKG-Net extracts high-level semantic representations of commonsense knowledge and fuses them with video features in a cross-attention mechanism for multimodal semantic interaction. Our experiments on the large-scale Video-to-Commonsense dataset show significant improvements over the previous state-of-the-art approach on all metrics.

Abstract:
Human pose estimation and tracking are fundamental tasks for understanding human behaviors in videos. Existing top-down framework-based methods usually perform three-stage tasks: human detection, pose estimation and tracking. Although promising results have been achieved, these methods rely heavily on high-performance detectors and may fail to track persons who are occluded or missed by the detector. To overcome these problems, in this article, we develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation in top-down approaches. Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded, and it is incorporated into the pose estimation module. In the tracking pipeline, we propose the Bbox-revision module to reduce missed detections and the ID-retrieve module to correct lost trajectories, improving the performance of the detection stage. Experimental results show that our approach is universal in human detection and pose estimation, achieving state-of-the-art performance on both PoseTrack 2017 and 2018 datasets.

Abstract:
Occluded person re-identification (Re-ID) is a challenging task, as various object-to-person (OTP) and person-to-person (PTP) occlusion scenarios cause diverse occlusion interference and target person feature loss problems in person matching. Most existing methods, which utilize auxiliary models to evaluate the unoccluded person parts for occlusion feature elimination, are inefficient and cannot handle the PTP occlusion scenarios and person feature loss problems. To solve these issues, we propose a novel Occlusion-Aware Feature Recover (OAFR) model. OAFR simulates diverse occlusions to help the model perceive OTP and PTP occlusions, and recovers occluded query features with unoccluded retrieved gallery features. Concretely, the Prior Knowledge-based Occlusion Simulation method is first introduced to synthesize OTP and PTP occlusions and corresponding occlusion labels, empowering model target person perception and occlusion-aware capability through self-supervised learning. Afterward, the feature recovery module reconstructs occluded query features with corresponding unoccluded local features of the top-K retrieved images by the visibility weighted average scheme, thus recovering the occluded query features to maintain more comprehensive features for better retrieval. Extensive experiments demonstrate that the proposed OAFR achieves superior performance to the state-of-the-art for both holistic and occluded Re-ID. In particular, on the Occluded-DukeMTMC dataset, OAFR outperforms the state-of-the-art by 6.0% for Rank-1 accuracy and 2.2% for mAP score.

Abstract:
Hyperspectral salient object detection (HSOD) aims to detect spectrally salient objects in hyperspectral images (HSIs). However, existing methods inadequately utilize spectral information by either converting HSIs into false-color images or converging neural networks with clustering. We propose a novel approach that fully leverages the spectral characteristics by extracting two distinct frequency components from the spectrum: low-frequency Spectral Saliency and high-frequency Spectral Edge. The Spectral Saliency approximates the region of salient objects, while the Spectral Edge captures edge information of salient objects. These two complementary components, crucial for HSOD, are derived by computing from the inter-layer spectral angular distance of the Gaussian pyramid and the intra-neighborhood spectral angular gradients, respectively. To effectively utilize this dual-frequency information, we introduce a novel lightweight Spectrum-driven Mixed-frequency Network (SMN). SMN incorporates two parameter-free plug-and-play operators, namely Spectral Saliency Generator and Spectral Edge Operator, to extract the Spectral Saliency and Spectral Edge components from the input HSI independently. Subsequently, the Mixed-frequency Attention module, comprised of two frequency-dependent heads, intelligently combines the embedded features of edge and saliency information, resulting in a mixed-frequency feature representation. Furthermore, a saliency-edge-aware decoder progressively scales up the mixed-frequency feature while preserving rich detail and saliency information for accurate salient object prediction. Extensive experiments conducted on the HS-SOD benchmark and our custom dataset HSOD-BIT demonstrate that our SMN outperforms state-of-the-art methods regarding HSOD performance.
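Both spectral components described above are built on the spectral angular distance between pixel spectra. As a minimal sketch of that measure alone, the following computes the angle between two spectra given as sequences of band values; the function name `spectral_angle` and the toy spectra are assumptions for illustration and do not reproduce the authors' Gaussian-pyramid or gradient operators.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra of equal length."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


# The same material under brighter illumination: the spectrum is scaled, the angle stays near 0.
same_material = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Spectra with no overlapping bands are maximally different for non-negative data.
different = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

The measure depends only on the shape of the spectrum, not its magnitude, which makes it insensitive to uniform brightness changes.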

Abstract:
An effective Adaptive Bitrate (ABR) algorithm or policy is of paramount importance for Real-Time Video Communication (RTVC) amid this pandemic to pursue uncompromised quality of experience (QoE). Existing ABR methods mainly separate the network bandwidth estimation and video encoder control, and fine-tune video bitrate towards estimated bandwidth, assuming the maximization of bandwidth utilization yields the optimal QoE. However, the QoE of an RTVC system is jointly determined by the quality of the compressed video, fluency of video playback, and interaction delay. Solely maximizing the bandwidth utilization, without comprehensively considering the compound impacts incurred by both transport and video application layers, does not assure a satisfactory QoE. The decoupling of the transport and application layers further degrades the user experience due to codec-transport incoordination. This work, therefore, proposes Palette, a reinforcement learning-based ABR scheme that unifies the processing of transport and video application layers to directly maximize the QoE formulated as the weighted function of video quality, stalling rate, and delay. To this end, a cross-layer optimization is proposed to derive the fine-grained compression factor of the upcoming frame(s) using cross-layer observations like network conditions, video encoding parameters, and video content complexity. As a result, Palette manages to resolve the codec-transport incoordination and to best catch up with the network fluctuation. Compared with state-of-the-art schemes in real-world tests, Palette not only reduces the stalling rate by 3.1%–46.3% and the delay by 20.2%–50.8%, but also improves the video quality by 0.2%–7.2% with comparable bandwidth consumption, under a variety of application scenarios.
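A QoE objective expressed as a weighted function of video quality, stalling rate, and delay could look like the sketch below. The linear form, the weights, the units, and the name `qoe_reward` are all assumptions made for illustration; the abstract does not give Palette's actual formula.

```python
def qoe_reward(quality, stall_rate, delay_ms,
               w_quality=1.0, w_stall=1.0, w_delay=0.001):
    """Hypothetical linear QoE: reward quality, penalize stalling and delay.

    quality    -- normalized video quality score, higher is better
    stall_rate -- fraction of playback time spent stalled
    delay_ms   -- interaction delay in milliseconds
    """
    return w_quality * quality - w_stall * stall_rate - w_delay * delay_ms


# Two candidate operating points with equal quality: less stalling and delay scores higher.
smooth = qoe_reward(quality=0.9, stall_rate=0.01, delay_ms=100.0)
choppy = qoe_reward(quality=0.9, stall_rate=0.10, delay_ms=300.0)
```

A scalar of this kind is what a reinforcement learning agent would receive as its reward, so the choice of weights directly sets the trade-off among the three factors.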

Abstract:
Video summarization aims to distill the most important information from a source video into either an abridged video clip or a textual narrative. Existing methods often treat the generation of video and text summaries as independent tasks, thus neglecting the semantic correlation between visual and textual summarization. In other words, these methods only study a single modality as output without considering coherent video and text as outputs. In this work, we first introduce a novel task: cross-modal video summarization. This task seeks to transfer a long video into a condensed video clip and a semantically aligned textual summary, collectively referred to as a cross-modal summary. We then establish VideoXum (X refers to different modalities), a new large-scale human-annotated video benchmark for cross-modal video summarization. VideoXum is reannotated based on ActivityNet Captions with diverse open-domain videos. In the current version, VideoXum provides 14K long videos, with a total of 140K pairs of aligned video and text summaries. Compared to existing datasets, VideoXum offers superior scalability while preserving a comparable level of annotation quality. To validate the dataset's quality, we provide a comprehensive analysis of VideoXum, comparing it with existing datasets. Further, we perform an extensive empirical evaluation of several state-of-the-art methods on this dataset. Our findings highlight the impressive generalization capability of the vision-language encoder-decoder framework on VideoXum. In particular, we propose VTSUM-BLIP, an end-to-end framework, serving as a strong baseline for this novel benchmark. Moreover, we adapt CLIPScore for VideoXum to measure the semantic consistency of cross-modal summaries effectively.

Abstract:
RGB-D cross-modal person re-identification is designed to match people across the RGB and depth image modalities, where the large modality discrepancy makes this task difficult to tackle. To alleviate the negative effect brought by the discrepancy, this paper proposes a novel SYnergistic RElational Reasoning (SYRER) method, which aims to explore the synergy between hetero-modalities for recognizing persons. We design a heterogeneous relationship contrast branch to establish intra-class and inter-class cross-modal relationships, which implements cross-modal relation contrast learning to cope with imperceptible cross-modal inter-class differences and large cross-modal intra-class discrepancy. Additionally, in order to adequately represent the irregular depth images, we propose a point-wise depth extractor to extract non-uniform discriminative point features from depth images. Experimental results on two public datasets indicate that the proposed SYRER surpasses state-of-the-art methods. We also perform a series of analytic experiments to verify the effectiveness of each submodule of our SYRER.

Abstract:
Recently, many compression algorithms have been applied to decrease the cost of video storage and transmission. This introduces undesirable artifacts, which severely degrade visual quality. Therefore, Video Compression Artifacts Removal (VCAR) aims at reconstructing a high-quality video from its compressed, corrupted version. Generally, this task is considered a vision-related rather than a media-related problem. In vision-related research, the visual quality has been significantly improved while the computational complexity and bitrate issues are less considered. In this work, we review the performance constraints of video coding and transfer to evaluate the VCAR outputs. Based on the analyses, we propose a Spatial-Temporal Attention-Guided Enhancement Network (STAGE-Net). First, we employ dynamic filter processing, instead of the conventional optical flow method, to reduce the computational cost of VCAR. Second, we introduce a self-attention mechanism to design Sequential Residual Attention Blocks (SRABs) to improve the visual quality of enhanced video frames under bitrate constraints. Both quantitative and qualitative experimental results have demonstrated the superiority of our proposed method, which achieves high visual quality and low computational cost.

Abstract:
Discovering temporal boundaries is critical for untrimmed video tasks, such as temporal sentence grounding and action detection. Due to the labor-intensive boundary annotations, recent studies focus on the weakly-supervised setting, with only sentences or action tags in the training videos. However, how to align temporal boundaries and textual descriptions is problematic in most weakly-supervised approaches. To alleviate this difficulty, we propose a novel Dual Masked Modeling (DM2) framework, which can effectively enhance clip-text alignment to boost temporal boundary discovery, by cross-modal masked modeling in a dual fashion. Specifically, we introduce two coupled reconstruction branches, i.e., Clip-Aware Masked Text Modeling (C-MTM), and Text-Aware Masked Clip Modeling (T-MCM), after generating a temporal proposal of the underlying clip. In C-MTM, we recover the masked text with visual assistance of the clip proposal. In T-MCM, we recover the masked clip proposal with lingual assistance of the text. Via such complementary reconstruction supervision, our DM2 can cooperatively exploit robust matching between the video clip and the referred text, making it possible to unify grounding and localization in a concise manner. Finally, we perform extensive experiments on popular temporal benchmarks, i.e., Charades-STA, ActivityNet Captions, ActivityNet-v1.3 and THUMOS-14. Our DM2 achieves state-of-the-art performance for both weakly-supervised temporal grounding and localization. Codes and models will be released afterward.

Abstract:
Due to seasonal and illumination variance, long-term visual localization in dynamic environments is a crucial problem in the field of autonomous driving and robotics. At present, image-based retrieval is an effective method to solve this problem. However, it is difficult to completely distinguish changes in the same location over time by relying on content information alone. To solve these problems, a double-domain network model combining semantic information and content information is proposed for the visual localization task. In addition, this approach only needs to use the virtual KITTI 2 dataset for training. To reduce the domain difference between real scenes and virtual images, a cross-predictive semantic segmentation mechanism is introduced. Furthermore, by introducing a domain loss function and a triplet semantic loss function, the obtained model achieves good domain adaptation and generalizes well to other real datasets. A series of experiments on the Extended CMU-Seasons dataset and the Oxford RobotCar-Seasons dataset demonstrate that the proposed network model outperforms the state-of-the-art baselines for retrieval-based visual localization in challenging environments.

Abstract:
In this paper, we propose a novel framework named DRL-CPG to learn disentangled latent representation for controllable person image generation, which can produce realistic person images with desired poses and human attributes (e.g. pose, head, upper clothes, and pants) provided by various source persons. Unlike the existing works leveraging the semantic masks to obtain the representation of each component, we propose to generate disentangled latent code via a novel attribute encoder with transformers trained in a manner of curriculum learning from a relatively easy step to a gradually hard one. A random component mask-agnostic strategy is introduced to randomly remove component masks from the person segmentation masks, which aims at increasing the difficulty of training and promoting the transformer encoder to recognize the underlying boundaries between each component. This enables the model to transfer both the shape and texture of the components. Furthermore, we propose a novel attribute decoder network to integrate multi-level attributes (e.g. the structure feature and the attribute representation) with well-designed Dual Adaptive Denormalization (DAD) residual blocks. Extensive experiments strongly demonstrate that the proposed approach is able to transfer both the texture and shape of different human parts and yield realistic results. To our knowledge, we are the first to learn disentangled latent representations with transformers for person image generation.

Abstract:
Class activation maps generated by image classifiers are widely used as priors for image-level weakly supervised semantic segmentation. However, these activation maps mainly focus on the sparse discriminative regions, which has been a bottleneck for the segmentation task. Based on our observations, the activation maps actually capture almost the entire target regions, but some regions with lower activation values are easily neglected. Thus, to solve the issue, we propose an adaptive activation network with two branches to recalibrate the low-confidence regions in the activation maps. Specifically, an activation enhancement branch is designed to redistribute the activation values by leveraging an attention mechanism. Since multi-scale images can provide complementary information, a scale adaptation branch runs in parallel to supervise the activation enhancement branch. The mutual supervision and fusion of the two branches can promote the less-discriminative parts and deactivate the background regions. Based on them, a simple yet effective denoising module is proposed to further improve the quality of pseudo masks, which makes use of the large-scale predictions of the trained segmentation network. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 benchmarks show that our method achieves state-of-the-art performance, demonstrating the effectiveness of our algorithm. Code will be made publicly available.

Abstract:
Multimodal learning involves developing models that can integrate information from various sources like images and texts. In this field, multimodal text generation is a crucial aspect that involves processing data from multiple modalities and outputting text. Image-guided story ending generation (IgSEG) is a particularly significant task, which aims to understand the complex relationships between text and image data and to produce a complete story ending. Unfortunately, deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. Current adversarial attack methods mainly focus on single-modality data and do not analyze adversarial attacks for multimodal text generation tasks that use cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image and text modality attacks, allowing for an attack search for adversarial text and image in a more effective iterative way. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating the potential for improving the adversarial robustness of multimodal text generation models, such as multimodal machine translation, multimodal question answering, etc.

Abstract:
Multi-focus image fusion can extract the focus regions from different source images and combine them into a fully clear image. Existing unsupervised methods typically use gradient information to measure the focus regions in images and generate a fusion weight map, but ordinary gradient operators struggle to measure information accurately in regions with weaker textures. In addition, using only gradient information as a constraint cannot make the model fully distinguish all the focus regions in the image, which seriously restricts the clarity of the fused image. To address these issues, a novel unsupervised multi-focus image fusion method is proposed in this paper. Specifically, a neighborhood information fusion network is designed to generate an initial fusion weight map. It can capture features within different neighborhood ranges at once, which enhances the information association between different regions. In addition, to further improve the feature extraction ability of the model in the regions with low texture information, a local difference evaluation loss function is proposed. It is combined with the gradient measure loss function to constrain the network. Finally, a fusion weight optimization module is proposed to improve the clarity of the fused image in the repeated defocusing regions and overexposed regions of different source images, which redistributes the weights of different source images. The proposed fusion method is compared with advanced methods on three public multi-focus datasets. Experimental results indicate that the proposed method achieves better performance in both qualitative and quantitative aspects.

Abstract:
JPEG is a widely used compression scheme that efficiently reduces the volume of transmitted images at the expense of a drop in visual quality. Artifacts appear among blocks due to the information loss in the compression process, which not only affects the quality of images but also harms the subsequent high-level tasks in terms of feature drifting. High-level vision models trained on high-quality images will suffer performance degradation when dealing with compressed images, especially on mobile devices. In recent years, numerous learning-based JPEG artifacts removal methods have been proposed to handle visual artifacts. However, it is not an ideal choice to use these JPEG artifacts removal methods as a pre-processing step for compressed image classification for the following reasons: 1) These methods are designed for human vision rather than high-level vision models. 2) These methods are not efficient enough to serve as a pre-processing step on resource-constrained devices. To address these issues, this paper proposes a novel lightweight adaptive feature de-drifting module (AFD-Module) to boost the performance of pre-trained image classification models when facing compressed images. First, a Feature Drifting Estimation Network (FDE-Net) is devised to generate the spatial-wise Feature Drifting Map (FDM) in the DCT domain. Next, the estimated FDM is transmitted to the Feature Enhancement Network (FE-Net) to generate the mapping relationship between degraded features and corresponding high-quality features. Specifically, a simple but effective RepConv block equipped with structural re-parameterization is utilized in FE-Net, which enriches feature representation in the training phase while keeping efficiency in the deployment phase. After training on limited compressed images, the AFD-Module can serve as a “plug-and-play” module for pre-trained classification models to improve their performance on compressed images. Experiments on images compressed once (i.e. ImageNet-C) and multiple times demonstrate that our proposed AFD-Module can comprehensively improve the accuracy of the pre-trained classification models and significantly outperform the existing methods.
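
As a rough illustration of structural re-parameterization in general (not the AFD-Module's actual RepConv block, whose layout the abstract does not specify), the sketch below assumes a single-channel layer with a parallel 3x3 branch and 1x1 branch, no bias and no normalization. The helper names `conv2d_same` and `merge_branches` are hypothetical. Because convolution is linear, the 1x1 branch can be folded into the centre tap of the 3x3 kernel, so a single convolution at deployment reproduces the two-branch training-time output.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps input size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the image
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def merge_branches(k3, k1):
    """Fold a parallel 1x1 kernel into the centre tap of a 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


def two_branch_forward(image, k3, k1):
    """Training-time form: sum of the 3x3 branch and the 1x1 branch outputs."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, k1)
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))] for i in range(len(a))]
```

Calling `conv2d_same(image, merge_branches(k3, k1))` then gives the same result as `two_branch_forward(image, k3, k1)` with one convolution instead of two.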

Abstract:
Many recent image restoration methods use Transformer as the backbone network and redesign the Transformer blocks. In contrast, we explore the parameter-sharing mechanism over Transformer blocks and propose a dynamic recursive process to address the image super-resolution task efficiently. We first present a Recursive Image Super-resolution Transformer (RIST). By sharing the weights across different blocks, a plain forward process through the whole Transformer network can be folded into recursive iterations through a Transformer block. Such a parameter-sharing based recursive process can not only reduce the model size greatly, but also enable restoring images progressively. Features in the recursive process are modeled as a sequence and propagated with a temporal attention network. Besides, by analyzing the prediction variation across different iterations in RIST, we design a dynamic recursive process that can allocate adaptive computation costs to different samples. Specifically, a quality assessment network estimates the restoration quality and terminates the recursive process dynamically. We propose a relativistic learning strategy to simplify the objective from absolute image quality assessment to relativistic quality comparison. The proposed Recursive Image Super-resolution Transformer with Relativistic Assessment (RISTRA) reduces the model size greatly with the parameter-sharing mechanism, and achieves an instance-wise dynamic restoration process as well. Extensive experiments on several image super-resolution benchmarks show the superiority of our approach over state-of-the-art counterparts.

Abstract:
In recent years, interactive recommender systems (IRSs) have attracted extensive interest. Existing IRSs are typically implemented with offline reinforcement learning (RL). They are devoted to improving recommendation accuracy by optimizing the extraction of users' inherent preferences. However, little attention has been paid to recommendation diversity, which could result in the monotony effect, i.e., categories of recommended items are consistently fixed and unchanging. In this paper, we center on category diversification in IRSs while largely preserving or even boosting recommendation accuracy. To this end, we propose a ChatGPT-aided diversity-aware causal model (CDCM) to enhance the offline RL framework with causal inference and ChatGPT. Specifically, we first propose a diversity-aware causal user model (DCUM) to estimate user satisfaction. This model disentangles the causal effect of users' inherent preferences and the monotony effect to obtain user satisfaction with both accuracy and diversity. Then, DCUM is used to assist the RL agent in recommendation policy learning. A ChatGPT-aided state encoder (CSE) is proposed to provide user state representation for each time step of policy learning. With the help of ChatGPT, CSE incorporates multi-category information in line with users' potential preferences to promote diverse and relevant category recommendations. Extensive experimental results on two real-world datasets validate the superiority of our CDCM regarding both accuracy and diversity.

Affiliations: School of Internet of Things, Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China; Ant Group, Hangzhou, China; College of Computer Science and Technology/College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computer Science and Technology, Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China; College of Automation, Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China; School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications (NJUPT), Nanjing, China; Guangdong Provincial Key Laboratory of Petrochemical Equipment Fault Diagnosis, Guangdong University of Petrochemical Technology, Maoming, China; Department of Computer Science, University of Rochester, Rochester, NY, USA

Abstract:
Video-based visible-infrared person re-identification (VVI-ReID) aims to match the identity of a person captured in video sequences from both visible and infrared cameras. The VVI-ReID task requires considering both the spatial relationship between body parts within each frame and the temporal change of appearance between successive frames. Existing VVI-ReID methods employ Convolutional Neural Networks to extract local spatial features and Long Short-Term Memory to form temporal associations. However, these methods cannot effectively capture the global spatial feature and the long-range temporal dependencies in ultra-long sequences. In this paper, we propose a Cross-modality Spatial-temporal Transformer (CST) including a Cross-frame Tube Transformer Module (CTTM) and a Multi-frame Transformer Fusion Module (MTFM) to address these challenges. Firstly, CTTM tokenizes a video clip into multiple 3D tubes, each encapsulating local spatial-temporal information of pedestrians, and then obtains global spatial-temporal representations by establishing the relationship between tubes. Secondly, we design MTFM to exchange information between multiple frames using message tokens, thus modeling the long-range temporal dependencies of features of pedestrians. In addition, to prevent the potential representation collapse caused by triplet-based loss functions, we propose a diversity-consistency (DC) loss function to preserve the diversity and consistency of cross-modality feature representations by imposing variance, invariance, and covariance constraints in feature representations. Extensive benchmark experiments demonstrate that our approach outperforms the state-of-the-art methods by large margins.
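
The abstract does not give the form of the DC loss. As one plausible reading only, the sketch below writes variance, invariance, and covariance terms in the style of VICReg-like regularizers, operating on small batches of paired embeddings stored as lists of rows. The function names, the hinge target of 1.0, the epsilon, and the normalization are all assumptions rather than the paper's definitions.

```python
import math


def column_means(batch):
    """Per-dimension mean of a batch given as a list of equal-length rows."""
    n = len(batch)
    return [sum(row[d] for row in batch) / n for d in range(len(batch[0]))]


def invariance_term(a, b):
    """Consistency: mean squared difference between paired embeddings."""
    n, dim = len(a), len(a[0])
    return sum((a[i][d] - b[i][d]) ** 2 for i in range(n) for d in range(dim)) / (n * dim)


def variance_term(batch, target_std=1.0, eps=1e-4):
    """Diversity: hinge penalty when a dimension's std drops below target_std."""
    n, dim = len(batch), len(batch[0])
    means = column_means(batch)
    total = 0.0
    for d in range(dim):
        var = sum((row[d] - means[d]) ** 2 for row in batch) / (n - 1)
        total += max(0.0, target_std - math.sqrt(var + eps))
    return total / dim


def covariance_term(batch):
    """Decorrelation: squared off-diagonal covariance entries, summed and divided by dim."""
    n, dim = len(batch), len(batch[0])
    means = column_means(batch)
    total = 0.0
    for p in range(dim):
        for q in range(dim):
            if p != q:
                cov = sum((row[p] - means[p]) * (row[q] - means[q]) for row in batch) / (n - 1)
                total += cov ** 2
    return total / dim
```

Under this reading, a collapsed batch (all rows identical) is penalized by the variance term, while the invariance term pulls the two modalities' embeddings of the same identity together.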

Abstract:
Text-to-image person re-identification (ReID) aims to retrieve images of a person based on a given textual description. The key challenge is to learn the relations between detailed information from visual and textual modalities. Existing work focuses on learning a latent space to narrow the modality gap and further build local correspondences between two modalities. However, these methods assume that image-to-text and text-to-image associations are modality-agnostic, resulting in suboptimal associations. In this work, we demonstrate the discrepancy between image-to-text association and text-to-image association and propose cross-modal adaptive dual association (CADA) to build fine bidirectional image-text detailed associations. Our approach features a decoder-based adaptive dual association module that enables full interaction between visual and textual modalities, allowing bidirectional and adaptive cross-modal correspondence associations. Specifically, this paper proposes a bidirectional association mechanism: Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA). We adaptively model the ATP based on the fact that aggregating cross-modal features based on mistaken associations will lead to feature distortion. For modeling the ARA, since attributes are typically the first distinguishing cues of a person, we explore attribute-level associations by predicting the masked text phrase using the related image region. Finally, we learn the dual associations between texts and images, and the experimental results demonstrate the superiority of our dual formulation.

Abstract:
Existing unbiased visual question answering (VQA) models reduce the spurious correlation between questions and answers to force the models to focus on visual information. However, the visual information captured by these unbiased models is irrelevant to the correct answer, resulting in leveraging spurious correlation to predict incorrect answers. This makes these unbiased methods fail to obtain critical visual information, thus performing poorly on questions dominated by the visual information. To capture the valuable visual information, this article proposes a novel unbiased VQA model based on causal inference, leveraging Instrumental Variable (IVar) to increase the causal effect between visual features and answers. First, to obtain suitable instrumental variables, the noise generator is proposed according to the constraints of IVar. The generated noise can be regarded as IVar, which is used to pollute the original visual features. Then, this article proposes IVar loss which utilizes the generated IVar to increase the causal effect between visual features and answers. When the visual feature is polluted by IVar, IVar loss guides the model to predict incorrect answers to enhance the correlation between IVar and the answer. Since the correlation between IVar and the answer is proportional to the causal effect between the visual feature and the answer, IVar loss enhances the importance of the visual information, thereby rectifying the model to capture critical visual information. The extensive experimental results on widely-used benchmarks demonstrate the advantages of the proposed method. The proposed method gains the best accuracy on answer type Other of VQA-CP v2. These results demonstrate the superiority of the proposed method in capturing critical visual information since most questions on the answer type Other are dominated by visual information.

Abstract:
Unsupervised cross-domain scene segmentation approach adapts the source model to the target domain, which utilizes two-stage strategies to minimize the inter-domain and intra-domain gap. However, the accumulation of errors in the previous stages affects the training of the subsequent stages. In this paper, a framework called statistical and structural domain adaptation (SSDA) is proposed to optimize inter-domain and intra-domain adaptation jointly. Firstly, the statistical inter-domain adaptation (StaIA) is proposed to model dynamic subdomains, which continuously adjust seed samples during the process of domain adaptation to mitigate error accumulation. The dynamic subdomains are modeled by exploring Bayesian uncertainty statistics and global balance statistics, which alleviate the imbalance problem in uncertainty estimation. StaIA encourages the model to transfer comprehensive and genuine knowledge through the seed loss for inter-domain adaptation. Secondly, the structural intra-domain adaptation (StrIA) is proposed to align the intra-domain gap among dynamic subdomains by the structural priors. Specifically, the StrIA models structural priors by truncated conditional random field (TruCRF) loss within the neighborhood, which constrains intra-domain semantic consistency to reduce the intra-domain gap. Experimental results demonstrate the effectiveness of the proposed cross-domain scene segmentation approaches on two commonly-used unsupervised domain adaptation benchmarks.

Abstract:
The abuse of deepfake techniques has raised serious concerns about social security and ethical problems, which motivates the development of deepfake detection. However, without fully addressing the domain gap issue, existing deepfake detection methods still show weak generalization ability among datasets belonging to different domains with domain-specific characteristics like identities and generation methods, limiting their practical applications. In this article, we propose the Invariant Domain-oriented Deepfake Detection method (ID_3), which improves the generalization of deepfake detection on multiple domains through invariant risk minimization, a novel learning paradigm that addresses the domain gap problem by jointly training a purified invariant predictor and learning an aligned invariant representation. To train a purified invariant predictor, we design the Domain Refinement Data Augmentation strategy with self-face-swapping and region-erasing approaches, which suppresses domain-specific features and encourages the models to focus on critical domain-invariant characteristics. To learn an aligned invariant representation, we propose the Domain Calibration Batch Normalization approach with multiple BN branches, which normalizes input features from different domains into aligned representations during both training and testing. Extensive experiments on multiple datasets demonstrate that our framework can boost the deepfake detection generalization ability and outperform other baselines by large margins. Our codes can be found here.

Abstract:
Referring image segmentation (RIS) is a fundamental vision-language task that intends to segment a desired object from an image based on a given natural language expression. Due to the essentially distinct data properties between image and text, most existing methods either introduce complex designs towards fine-grained vision-language alignment or lack the required dense alignment, resulting in scalability issues or mis-segmentation problems such as over- or under-segmentation. To achieve effective and efficient fine-grained feature alignment in the RIS task, we explore the potential of masked multimodal modeling coupled with self-distillation and propose a novel cross-modality masked self-distillation framework named CM-MaskSD, in which our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment for better segmentation accuracy. Moreover, our CM-MaskSD framework can considerably boost model performance in a nearly parameter-free manner, since it shares weights between the main segmentation branch and the introduced masked self-distillation branches, and introduces only negligible parameters for coordinating the multimodal features. Comprehensive experiments on three benchmark datasets (i.e. RefCOCO, RefCOCO+, G-Ref) for the RIS task convincingly demonstrate the superiority of our proposed framework over previous state-of-the-art methods.

Abstract:
Unsupervised domain adaptive (UDA) gaze estimation aims to predict gaze directions of unlabeled target face or eye images given a set of annotated source images, and has been widely applied in practice. However, existing methods still perform poorly due to two major challenges. 1) There exist large personalized differences and style discrepancies between source and target samples, which cause the learned source model to easily collapse to biased results. 2) Data uncertainties inherent in reference samples affect the generalization ability of the models. To tackle the above challenges, in this paper, we propose a novel Domain-Consistent and Uncertainty-Aware (DCUA) network for generalizable gaze estimation. Our DCUA network employs a two-phase framework where a primary training sub-network (PTNet) and a refined adaptation sub-network (RANet) are trained on the source and target domain, respectively. Firstly, to obtain robust and pure gaze-related features, we propose two domain-consistent constraints, that is, the intra-domain consistent constraint and the inter-domain consistent constraint. These two constraints could eliminate the impact of gaze-irrelevant factors by maintaining consistency between label and feature space. Secondly, to further improve the adaptability of our model, we propose dual uncertainty perception modules, which include an intrinsic uncertainty module and an extrinsic uncertainty module. These modules help the DCUA network distinguish inferior reference samples and avoid overfitting to them. Experiments on four cross-domain gaze estimation tasks demonstrate the effectiveness of our method.

Abstract:
Recently, camouflaged object detection (COD), which suffers from numerous challenges such as low contrast between camouflaged objects and background and large variations of camouflaged object appearances, has received increasing attention. However, the performance of existing camouflaged object detection methods is still unsatisfactory, especially when dealing with complex scenes. Therefore, in this article, we propose a novel Decoupling and Integration Network (DINet) to detect camouflaged objects. Here, the depiction of camouflaged objects can be regarded as the iterative decoupling and integration of the body features and detail features, where the former focuses on the center of camouflaged objects and the latter contains pixels around edges. Concretely, we first deploy two complementary decoder branches, a detail branch and a body branch, to learn the decoupling features, namely body decoder features and detail decoder features. In particular, each decoder block of the two branches incorporates features from three components, i.e., the previous interactive feature fusion (IFF) module, adjacent encoder layers, and the corresponding encoder layer. Besides, to further elevate the body decoder features, the body blocks also introduce the global contextual information, which is the combination of all body encoder features via the global context (GC) unit, to provide coarse object location information. Secondly, to integrate the two decoupling decoder features, we deploy the interactive feature fusion (IFF) module based on the interactive combination and channel attention. In this way, we can progressively provide a complete and accurate representation for camouflaged objects. Extensive experiments on three challenging public datasets, including CAMO, COD10K, and NC4K, show that our DINet presents competitive performance when compared with the state-of-the-art models.

Abstract:
Video camouflaged object detection aims to identify objects that are visually concealed within the surroundings in a video. Most of the existing methods focus on analyzing the implicit inter-frame motion to capture the camouflaged object. However, because they do not explore the prior explicit motion of the camouflaged object, these works generally encounter difficulty in capturing the complete camouflaged object. To address this issue, we propose to integrate implicit and explicit motion learning into a unified framework, namely Implicit-Explicit Motion Learning network (IMEX), for video camouflaged object detection. Specifically, to promote the identifiability of the camouflaged object, a cross-scale representation fusion is proposed for global inter-frame alignment. By establishing cross-scale temporal-spatial association and aggregating the temporal-spatial attentive representations, it also eliminates the implicit inter-frame motion to some extent. Moreover, to further improve the discriminability of boundary regions of the detected object, an explicit motion-induced consistency preserving of camouflaged objects is proposed, in which the prior boundary-aware explicit motion field is leveraged to supervise the consistency of camouflaged objects in consecutive frames. Extensive experiments show that our proposed IMEX achieves substantial performance improvements.

Abstract:
Domain adaptation (DA) addresses the challenge of distribution discrepancy between the training and test data, while multi-source domain adaptation (MSDA) is particularly appealing for realistic scenarios. With the emergence of extensive unlabeled datasets, self-supervised learning has gained significant popularity in deep learning. It is noteworthy that multi-source domain adaptation and self-supervised learning share a common objective: leveraging unlabeled data to acquire more informative representations. However, conventional self-supervised learning encounters two main limitations. Firstly, the traditional pretext task fails to transfer fine-grained knowledge to downstream tasks with general representation learning. Secondly, the scheme of the same feature extractor with distinct prediction heads makes the cross-task knowledge exchange and information sharing ineffective. In order to tackle these challenges, we introduce a novel approach called Domain-Aware Graph Network (DAGNet). DAGNet utilizes a graph neural network as a bridge to facilitate efficient cross-task knowledge exchange. By employing a mask token strategy, we enhance the robustness of representations by selectively masking certain domain or self-supervised information. In terms of datasets, the uneven and style-based domain shifts in current datasets make it challenging to measure the model's domain adaptation performance in real-world applications. To address this issue, we introduce a benchmark dataset DomainVerse with continuous spatio-temporal domain shifts encountered in the real world. Our extensive experiments demonstrate that DAGNet achieves state-of-the-art performance not only on mainstream multi-source domain adaptation datasets but also on different settings within DomainVerse.

Abstract:
Recently, multi-view clustering methods have been widely used in handling multi-media data and have achieved impressive performances. Among the many multi-view clustering methods, anchor graph-based multi-view clustering has been proven to be highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods necessitate post-processing to obtain clustering labels and are unable to effectively utilize the information within anchor graphs. To address this issue, we draw inspiration from regression and feature selection to propose Anchor Graph-Based Feature Selection for One-Step Multi-View Clustering (AGFS-OMVC). Our method combines embedding learning and sparse constraint to perform feature selection, allowing us to remove noisy anchor points and redundant connections in the anchor graph. This results in a clean anchor graph that can be projected into the label space, enabling us to obtain clustering labels in a single step without post-processing. Lastly, we employ the tensor Schatten p-norm as a tensor rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches.

Abstract:
Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their ability to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at ×2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at ×4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area.
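
For reference, PSNR, the metric quoted in this abstract, is conventionally computed as 10·log10(MAX² / MSE). The minimal sketch below assumes 8-bit pixel values (peak 255) stored as flat lists. The function name `psnr` is a hypothetical helper, and the paper's evaluation protocol (for example color channel handling or border cropping) is not stated in the abstract and is not reproduced here.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, a 0.94 dB gain corresponds to the mean squared error shrinking by a factor of 10^0.094, which is roughly 1.24.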

Abstract:
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision. Existing literature addresses this challenge by employing local-based representation approaches, which may not sufficiently facilitate meaningful object-specific semantic understanding, leading to a reliance on apparent background correlations. Moreover, they primarily rely on high-dimensional local descriptors to construct complex embedding space, potentially limiting the generalization. To address the above challenges, this article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition. RSaD introduces additional saliency-aware supervision via saliency detection to guide the model toward focusing on the intrinsic discriminative regions. Specifically, RSaD utilizes the saliency detection model to emphasize the critical regions of each sub-category, providing additional object-specific information for fine-grained prediction. RSaD transfers such information with two symmetric branches in a mutual learning paradigm. Furthermore, RSaD exploits inter-regional relationships to enhance the informativeness of the representation and subsequently summarize the highlighted details into contextual embeddings to facilitate the effective transfer, enabling quick generalization to novel sub-categories. The proposed approach is empirically evaluated on three widely used benchmarks, demonstrating its superior performance.

Abstract:
Dense captioning creates diverse Regions of Interest (RoIs) descriptions for complex visual scenes. While promising results have been obtained, several issues persist. In particular: 1) it is hard to find the optimal parameters for artificially designed modules (e.g., non-maximum suppression (NMS)), causing redundancies and fewer interactions to benefit the two sub-tasks of RoI detection and RoI captioning; 2) the absence of a multi-scale decoder in current methods hinders the acquisition of scale-invariant features, thus leading to poor performance. To tackle these limitations, we bypass the artificially designed modules and present an end-to-end dense captioning framework via multi-scale transformer decoding (DCMSTRD). DCMSTRD solves dense captioning by set matching and prediction instead. To further enhance the discriminative quality of the multi-scale representations during caption generation, we introduce a multi-scale module, termed multi-scale language decoder (MSLD). Our proposed method, tested on standard datasets, achieves a mean Average Precision (mAP) of 16.7% on the challenging VG-COCO dataset, demonstrating its effectiveness against the current methods.
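
To illustrate what "set matching" replaces NMS with, the sketch below finds the one-to-one assignment of predictions to ground-truth regions with minimum total cost. It is a generic toy, not the DCMSTRD matcher: the cost matrix is assumed to be given, the exhaustive search stands in for the Hungarian algorithm that set-prediction models typically use, and the name `match_sets` is hypothetical.

```python
from itertools import permutations


def match_sets(cost):
    """Return (pairs, total) for the cheapest one-to-one matching.

    cost[i][j] is the cost of assigning prediction i to ground-truth j.
    Assumes at least as many predictions as ground truths; unmatched
    predictions would be treated as "no object" by the training loss.
    Exhaustive search is only practical for very small sets.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_pairs, best_total = None, float("inf")
    # Each candidate picks a distinct prediction for every ground-truth index.
    for chosen in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(chosen))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return best_pairs, best_total
```

Since each ground-truth region is matched to exactly one prediction, duplicate predictions are penalized during training rather than filtered afterwards by a hand-tuned suppression threshold.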

Abstract:
Unsupervised person re-identification (Re-ID) aims to learn discriminative representations without annotations. Recently, clustering-based methods, which utilize clustering to generate identity pseudo labels for model optimization, have shown promising performance. Large intra-class variance, mainly caused by domain discrepancy among cameras, could lead to noisy clustering results. However, abundant camera-aware sample-pair relations have not been fully exploited to facilitate the learning of features with comprehensive knowledge and thereby tackle this issue. In this paper, we propose hierarchical camera-aware contrast extension (HCACE) for unsupervised person Re-ID. Firstly, a cognitive collaboration contrast scheme (CCCS) is introduced to explore hierarchical camera-aware relations at the proxy level, so as to collaboratively promote the model to learn representative knowledge. Secondly, an aggregative instance contrast extension scheme (AICES) is proposed to promote the learning of potential fine-grained knowledge by aggregating refined camera-aware inter-instance relations. In particular, within AICES, hard negative instance extension (HNIE) is designed to generate extended negative instances, so as to assist the exploration of transitional cross-camera inter-instance relations. Finally, extensive experiments on three benchmark datasets validate the superior performance of the proposed HCACE.

Abstract:
Spread spectrum (SS) watermarking has gained significant attention as it prevents attackers from reading, tampering with, or removing watermarks. Secret key estimation can help with the first two unauthorized operations but cannot remove watermarks. Moreover, existing deep-learning watermark removal methods do not consider the characteristics of SS watermarking, thus leading to unsatisfactory results. In this paper, we design a secret key estimation method that treats secret key estimation as a binary classification problem and updates the estimated key via backpropagation and parameter optimization algorithms. We develop a watermark removal network using quaternion convolutional neural networks (QCNNs) to learn watermark features while capturing the relationship between channels to improve image quality. Based on our estimation method and QCNN-based network, we propose a two-stage watermark removal framework that utilizes information of the secret key to train the network. A loss function is introduced to directly prevent watermark extraction, thereby improving removal performance. Extensive experiments demonstrate the superiority of our methods over the state-of-the-art methods.

Affiliations: National Engineering Research Center of Communications and Networking, Nanjing University of Posts & Telecommunications, Nanjing, China; New HC Technologies Company Ltd., Hangzhou, China; Institute of Advanced Technology, Nanjing University of Posts & Telecommunications, Nanjing, China; Department of Internet of Things, Nanjing University of Posts & Telecommunications, Nanjing, China; School of Big Data and Computer Science, Guizhou Normal University, Guiyang, China; School of Automation, Southeast University, Nanjing, China

Abstract:
Lightweight semantic segmentation plays an essential role in image signal processing and benefits many multimedia applications, such as self-driving, robotic vision, and virtual reality. Due to their powerful capability to encode image details and semantics, many lightweight dual-resolution networks have been proposed in recent years for semantic segmentation. Despite achieving remarkable progress, they often ignore semantic context across different scales. Furthermore, most of them neglect object boundaries, which serve as significant assistance for lightweight semantic segmentation. To alleviate these problems, this paper develops a Boundary-guided dual-resolution lightweight network with multi-scale Semantic Context, called BSCNet, for semantic segmentation. Specifically, to enhance the capability of feature representation, an Extremely Lightweight Pyramid Pooling Module (ELPPM) is designed to capture multi-scale semantic context at the top of the low-resolution branch of BSCNet. In addition, to increase feature similarity of the same object while keeping feature discrimination of different objects, pixel information is propagated throughout the entire object area using a simple Boundary Auxiliary Fusion Module (BAFM), where the predicted object boundaries serve as high-level guidance to refine low-level convolutional features. Comprehensive experimental results demonstrate that our BSCNet is simple and effective, achieving a state-of-the-art trade-off between segmentation accuracy and running efficiency on the CityScapes, CamVid, and KITTI datasets.

Abstract:
Point cloud analysis faces computational system overhead, limiting its application on mobile or edge devices. Directly employing small models may result in a significant drop in performance since it is difficult for a small model to adequately capture local structure and global shape information simultaneously, which are essential clues for point cloud analysis. This paper explores feature distillation for lightweight point cloud models. To mitigate the semantic gap between the lightweight student and the cumbersome teacher, we propose bidirectional knowledge reconfiguration (BKR) to distill informative contextual knowledge from the teacher to the student. Specifically, a top-down knowledge reconfiguration and a bottom-up knowledge reconfiguration are developed to inherit diverse local structure information and consistent global shape knowledge from the teacher, respectively. However, due to the farthest point sampling in most point cloud models, the intermediate features between teacher and student are misaligned, deteriorating the feature distillation performance. To eliminate it, we propose a feature mover's distance (FMD) loss based on optimal transportation, which can measure the distance between unordered point cloud features effectively. Extensive experiments conducted on shape classification, part segmentation, and semantic segmentation benchmarks demonstrate the universality and superiority of our method.

Abstract:
The objective of image super-resolution reconstruction is to overcome the limitations imposed by hardware imaging conditions and patterns, aiming to restore high-frequency details in images through signal processing techniques. Recently, deep learning-based single-image super-resolution reconstruction (SISR) has achieved remarkable performance. However, the current methods exhibit inadequate performance in the reconstruction of texture details, thereby posing a challenge for further enhancing the accuracy of super-resolution reconstruction. In this study, we propose a novel contextual texture enhancement network (CTE-Net) aimed at improving the level of texture details in image super-resolution. The CTE-Net comprises two crucial components: the multi-level feature aggregation module (MFAM) and the contextual information enhancement module (CIEM). The MFAM integrates global and local low-resolution (LR) features from both the pixel space and channel dimensions, thereby enhancing the feature representation capability of the network. The CIEM is deployed to enhance the network's learning capacity by integrating a meticulously designed context-attention mechanism, which effectively explores the adjacent contextual information of images and thereby amplifies the expressive capability of the generated features. Moreover, we utilize local binary patterns (LBP) to guide the feature selection strategies for MFAM and CIEM, thereby prioritizing the network's decision logic towards the recovery of texture details. Extensive experiments demonstrate that our method yields satisfactory results. In comparison to the state-of-the-art approaches, our method exhibits superior performance on the benchmark datasets.

Abstract:
Traditionally, restoring art images requires professionals and takes a very long time. It is also possible to maintain the artistic value of damaged art images by digitizing them and restoring them through computer-aided means. However, existing advanced image inpainting methods are mainly intended for natural images and are not suitable for art images. Thus, we propose a novel style-guided dual-branch inpainting network (SDI-Net) to address the above-mentioned issue. Specifically, our SDI-Net consists of a style reconstruction (SR) branch and a style inpainting (SI) branch, in which the SR branch provides intermediate supervision (style and content supervision) for the SI branch. The SI branch performs art image inpainting with a coarse-to-fine approach. At the coarse inpainting stage, the content and style of the art image are separated and preliminarily inpainted under the supervision of the SI branch. In addition, we propose a class style learning (CSL) module to inpaint the style feature guided by the style label, which can provide more effective brushstrokes from the same class of art images. The coarse inpainted results can be obtained by fusing the inpainted style feature with the inpainted content feature. At the fine inpainting stage, a style attention (SA) module is proposed in the decoder to further refine the coarse inpainted results. We employ the style loss, the content loss, the multi-class style adversarial loss, and the reconstruction loss to jointly train the proposed SDI-Net. A variety of experiments demonstrate the effectiveness of the proposed method, which allows the filled brushstrokes to appear as realistic as possible.

Abstract:
In this paper, we study facial expression recognition (FER) in the class-incremental learning (CIL) setting, which defines the classification of well-studied and easily-accessible basic expressions as an initial task while learning new compound expressions gradually. Motivated by the fact that compound expressions are meaningful combinations of basic expressions, we treat basic expressions as attributes (i.e., semantic descriptors), and thus compound expressions are represented in terms of attributes. To this end, we propose a novel visual-textual attribute learning network (VTA-Net), mainly consisting of a textual-guided visual module (TVM) and a textual compositional module (TCM), for class-incremental FER. Specifically, TVM extracts textual-aware visual features and classifies expressions by incorporating the textual information into visual attribute learning. Meanwhile, TCM generates visual-aware textual features and predicts expressions by exploiting the dependency between textual attributes and category names of old and new expressions based on a textual compositional graph. In particular, a visual-textual distillation loss is introduced to calibrate TVM and TCM during incremental learning. Finally, the outputs from TVM and TCM are fused to make a final prediction. On the one hand, at each incremental task, the representations of visual attributes are enhanced since visual attributes are shared across old and new expressions. This increases the stability of our method. On the other hand, the textual modality, which involves rich prior knowledge of the relevance between expressions, facilitates our model to identify subtle visual distinctions between compound expressions, improving the plasticity of our method. Experimental results on both in-the-lab and in-the-wild facial expression databases show the superiority of our method against several state-of-the-art methods for class-incremental FER.

Abstract:
We present MosaicMVS, a novel learning-based depth estimation framework for a mosaic-based omnidirectional multi-view stereo (MVS) camera setup. It uses a regular field of view (FOV) MVS network for an omnidirectional imaging setup with explicit consideration of hypothetical voxel-wise FOV overlaps. The resulting depth predictions are accurate and agree on the omnidirectional multi-view geometry. Unlike existing MVS setups, the MosaicMVS camera setup can be easily applied to omnidirectional indoor scenes without having to account for constraints such as intricate epipolar constraints and the distortion of omnidirectional cameras. We validate the effectiveness of our framework on a new challenging indoor dataset in terms of depth estimation, reconstruction, and view synthesis. We also present a new evaluation metric that checks reconstruction performance using post-processed masks, allowing accurate evaluation without any ground truth depth map or laser-scanned reconstructions. Experimental results show that our framework outperforms the state-of-the-art MVS methods by a large margin in all test scenes.

Abstract:
This paper proposes a new multi-modal question-answering task, named Cross-Modal Information Complementation based Question Answering (CroMIC-QA), to promote the exploration of bridging the semantic gap between visual and linguistic signals. The proposed task is inspired by the common phenomenon that, in most user-generated QA scenarios, the information of the given textual question is incomplete, and thus it is required to merge the semantics of both the text and the accompanying image to infer the complete real question. In this work, the CroMIC-QA task is first formally defined and compared with the classic Visual Question Answering (VQA) task. On this basis, a specified dataset, CroMIC-QA-Agri, is collected from an online QA community in the agriculture domain for the proposed task. A group of experiments is conducted on this dataset, with the typical multi-modal deep architectures implemented and compared. The experimental results show that the appropriate text/image presentations and text-image semantic interaction methods are effective in improving the performance of the framework.

Abstract:
By providing an immersive experience with panoramic views, 360-degree video streaming has gained increasing popularity recently. In many cases, videos are transmitted to mobile users over cellular networks. However, due to the high bandwidth requirement of 360-degree videos and the growing number of users, it is challenging to provide high-quality live streaming services to all users with limited bandwidth. Improving spectral efficiency and reducing bandwidth consumption are two major approaches to address this issue, which can be achieved with non-orthogonal multiple access (NOMA) and scalable video coding (SVC), respectively. In this article, we apply NOMA and SVC to 360-degree video streaming over a cellular network and propose a multicast scheme called SVCast, aiming to maximize the sum quality of experience of users served by a base station. Such a problem is formulated as an NP-hard problem, and we decompose it into two levels of subproblems. The lower-level subproblem is inter-group spectrum allocation, which is solved by a knapsack approach. The higher-level subproblem is intra-group multicast scheduling, and we propose a recursive algorithm to solve it. Simulation results demonstrate that SVCast improves the system utility by 31.9% on average. Furthermore, SVCast eliminates the need for viewport prediction by aggregating the contents from the viewports of multiple users.
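The knapsack step mentioned in this abstract can be illustrated with a generic 0/1 knapsack dynamic program. This is only a sketch: the abstract does not give the actual formulation, so the function name `knapsack` and the reading of `values` as per-group utilities and `weights` as integer spectrum units are assumptions made for illustration.

```python
def knapsack(values, weights, capacity):
    """Generic 0/1 knapsack: maximum total value within an integer capacity.

    Hypothetical reading for illustration: values are per-group utilities,
    weights are spectrum units, capacity is the total spectrum budget.
    """
    best = [0] * (capacity + 1)  # best[c] = best value using capacity c
    for value, weight in zip(values, weights):
        # Iterate capacity downwards so each item is chosen at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


if __name__ == "__main__":
    print(knapsack([60, 100, 120], [10, 20, 30], 50))
```

The one-dimensional table keeps memory at O(capacity); a formulation with several candidate allocations per group would need a multiple-choice variant instead.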

Abstract:
Domain adaptive hashing has received increasing attention since it is capable of enhancing the performance of retrieval if the target domain for testing meets domain shift. However, owing to data security and transmission constraints nowadays, abundant source data is often not available. Towards this end, this paper investigates a novel yet practical problem named test-time adaptive hashing, which aims to enhance the performance of hashing models without access to the source domain data when tested on the target domain with domain shift. This problem is challenging due to both fugacious domain shift and label scarcity on the target domain. In this paper, we propose a novel hashing approach named Discrepancy and Structure-based Contrast (DISC) for effective test-time adaptive retrieval. In particular, DISC first trains the hashing model using the source domain data and stores the distribution of each class in the hidden space. During test-time adaptation, we generate simulated source features based on stored distributions and compare class-specific distributions across domains using maximum mean discrepancy (MMD) to overcome potential domain shift. Furthermore, to tackle the label scarcity, we estimate the graph structure using deep features on the target domain, which guides effective hashing contrastive learning for generating discriminative and domain-invariant hash codes. Extensive experiments on various benchmark datasets validate the superiority of our proposed DISC compared with a range of competing baselines.
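The maximum mean discrepancy (MMD) used here to compare distributions across domains can be sketched as follows. This is a minimal, standard biased estimator of squared MMD with an RBF kernel, not the authors' implementation; the function names and the `gamma` bandwidth parameter are assumptions.

```python
import math


def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors given as sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(p, q):
        total = sum(rbf(a, b, gamma) for a in p for b in q)
        return total / (len(p) * len(q))

    # MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    source = [[0.0, 0.0], [1.0, 0.0]]   # e.g. simulated source features
    target = [[5.0, 5.0], [6.0, 5.0]]   # e.g. target-domain features
    print(mmd2(source, source), mmd2(source, target))
```

The estimate is zero when both sets are identical and grows as the two sets move apart, which is what makes it usable as an alignment objective.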

Abstract:
RGB-Thermal pedestrian detection has shown many notable advantages in various lighting and weather conditions by combining the information from RGB-T images. Due to distinct imaging principles, RGB-T modalities consist of modality-specific and modality-consistent information. However, most existing RGB-T pedestrian detection methods indiscriminately integrate these two types of information, which leads to the pollution of modality information. To address this issue, we propose a novel mask-guided multi-level fusion network (M2FNet) for RGB-T pedestrian detection. M2FNet independently explores consistent and specific features in RGB-T modalities at three different levels, utilizing pixel-level positional information in masks to exclusively focus on pedestrian-related features. Specifically, at the feature extraction level, we selectively embed cross-modality differential compensation (CDC) modules and design the bidirectional multiscale fusion (BMF) module to fully utilize the complementary modality-specific information and enhance the precision of predicted pedestrian masks. At the feature fusion level, the mask-guided global consistency mining (MGCM) module is introduced to capture intra-modal and inter-modal consistent information of pedestrians, which generates highly discriminative RGB-T features. Finally, to further reduce inter-modal differences, we propose a mask-guided pixel-level decision fusion (MPDF) strategy to dynamically weight the RGB-T predictions. Extensive experiments and comparisons demonstrate that our proposed M2FNet, with different backbones, outperforms the state-of-the-art detectors on both publicly available KAIST and CVC-14 RGB-T pedestrian detection datasets.

Abstract:
By formulating the data generation as a sequence procedure of denoising autoencoding, diffusion models have achieved superior in-painting performance on image data and beyond. Nevertheless, it is not trivial to capitalize on diffusion models to generate missing 3D points. The difficulty originates from the intrinsic structure, where a 3D point cloud is a set of unordered and irregular coordinates. That motivates us to delve into the 3D structural information for designing the point cloud encoder-decoder and shape latent generator, to precisely formulate the latent distribution of the complete point cloud and partial observation. In this paper, we propose Point cloud completion with Latent Diffusion Models (PointLDM), a new approach that leverages the conditional denoising diffusion probabilistic modeling (DDPM) in the 3D latent space for shape reconstruction. The architecture of PointLDM consists of a transformer-based variational auto-encoder (VAE) to model the complete shape latent, and a diffusion network for shape latent prediction. The encoder of the VAE exploits both the global shape latent and local point features in shape distribution learning. With the learnt shape latent, the decoder first decodes the shape latent into coarse points, and then recovers the fine-grained details around each coarse point by deforming a 2D grid. To reconstruct the shape latent from partial observation, the diffusion network treats the partial observation as the conditional input and generates the shape latent via DDPM. Extensive experiments conducted on MVP, Completion3D, and KITTI quantitatively and qualitatively demonstrate the efficacy of PointLDM over the state-of-the-art shape completion approaches.

Abstract:
Pixel-level adaptive convolution, which overcomes the deficiency of the spatial-invariance of standard convolution, is always limited to performing feature extraction from local patches and ignores the latent long-range dependencies imperceptible in the feature space, which are more significant in pixel-level tasks such as hyperspectral image super-resolution (HSISR). To handle such limitations, we propose kernel-space non-local convolution (KNLConv), which explores non-local dependencies in the generated kernel space and leverages this global information to guide the network to extract image features more flexibly. Technically, the proposed KNLConv first decomposes the convolutional kernel space into spatial and channel dimensions, and designs a depth-wise non-local expansion convolution (NLEC) in the spatial dimension of the kernel space to explore underlying global correlations. It then introduces an adaptive point-wise convolution (APC), generalizing the NLEC to the pixel level while integrating features in the channel dimension. In addition, applying KNLConv, we design an effective network architecture for hyperspectral image super-resolution. Extensive experiments demonstrate that our approach performs favorably against current state-of-the-art HSISR methods, both on quantitative indicators and visual quality.

Abstract:
Few-shot semantic segmentation is a challenging task that aims to segment novel classes in the query images given only a few annotated support samples. Most existing prototype-based approaches extract global or local prototypes by global average pooling (GAP) or clustering to represent all object information. Subsequently, the prototype information is employed as guidance for query image segmentation. However, these frameworks fail to fully mine the object details and ignore information from query images. Consequently, we propose a Dual-Guided Frequency Prototype Network (DGFPNet) to solve these issues. Specifically, to mine the global and local object information, a Frequency Prototype Generation Module (FPGM) is first proposed to extract more comprehensive frequency prototypes by multi-frequency pooling (MFP) in the DCT domain. Then, with the guidance of support and query information, a Dual-Guided Selection Module (DGSM) is presented to produce the query attention mask and select more effective prototypes. Based on the query attention mask and support information, the generalized object information is integrated into the feature with the proposed Feature Generalization Module (FGM). Finally, we propose a Multi-Dimension Feature Enrichment Decoder Module (MDFEDM) to capture multi-dimension object information and tackle hard pixels for refining the final segmentation results. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ show that our model achieves new state-of-the-art performances.

Abstract:
Temporal dynamics represent the evolution of video content over time, which is critical for action recognition. In this paper, we ask the question: can the off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose Multidimensional Stacked Image (MSImage) as a new arrangement of video data, which can be fed to image transformers. Technically, MSImage is a high-resolution image that is composed of several evenly-sampled video clips stacked along the channel and space dimensions. The frames in each clip are concatenated along the channel dimension for the transformers to infer short-term dynamics. Meanwhile, the clips are arranged at different spatial positions for learning long-term dynamics. On this basis, we propose MSImageFormer, a new variant of image transformer that takes MSImage as the input and is jointly optimized by a video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of MSImageFormer on six action recognition benchmarks. With only a 2D image transformer as the classifier, our MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively. Despite requiring fewer computations, the results are comparable to the SOTA 3D CNNs and video transformers.
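The arrangement described in this abstract, with frames of a clip concatenated along the channel axis and clips tiled over spatial positions, can be sketched with NumPy. This is a sketch under stated assumptions, not the authors' code: the function name `build_msimage`, the contiguous-frame clips, the row-major grid layout, and the `(T, H, W, C)` input layout are all hypothetical.

```python
import numpy as np


def build_msimage(video, num_clips, frames_per_clip, grid):
    """Arrange a video of shape (T, H, W, C) into one stacked image.

    Assumptions (not from the abstract): clips are contiguous frame runs
    with evenly spaced start indices, placed row-major on a (rows, cols) grid.
    """
    rows, cols = grid
    if rows * cols != num_clips:
        raise ValueError("grid must have exactly num_clips cells")
    num_frames = video.shape[0]
    starts = np.linspace(0, num_frames - frames_per_clip, num_clips).astype(int)
    clips = []
    for start in starts:
        frames = video[start:start + frames_per_clip]          # (k, H, W, C)
        # Short-term dynamics: stack the clip's frames along the channel axis.
        clips.append(np.concatenate(list(frames), axis=-1))    # (H, W, k*C)
    # Long-term dynamics: place clips at different spatial positions.
    row_images = [
        np.concatenate(clips[r * cols:(r + 1) * cols], axis=1)
        for r in range(rows)
    ]
    return np.concatenate(row_images, axis=0)                   # (rows*H, cols*W, k*C)


if __name__ == "__main__":
    video = np.zeros((16, 4, 5, 3), dtype=np.float32)
    print(build_msimage(video, num_clips=4, frames_per_clip=2, grid=(2, 2)).shape)
```

The result keeps an ordinary image layout (height, width, channels), so only the input channel count of a 2D model would need to change to consume it.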

Abstract:
Deep network-based point cloud geometry compression is becoming more crucial and attractive due to constantly expanding 3D applications. The current strategy employing holistic point clouds as input imposes limitations on the compressed point cloud size and results in the loss of local geometry information, simplifying the repetition of local point cloud patterns. Nevertheless, employing point cloud patches as input results in a fresh set of erroneous points within the spliced point cloud. This occurrence can be attributed to the loss of global information and the relationship among the point cloud patches. Hence, this paper introduces a novel framework for progressive downsampling point cloud compression that follows the principles of two distinct methodologies. Our strategy still uses the autoencoder structure. Specifically, the encoder reduces the point cloud size and learns features incorporating local patch structure information and global semantic information. At the same time, the decoder upsamples the quantized and entropy-encoded features to reconstruct the original point cloud. More specifically, the geometrical details inside the patch and the relationship between patches are encoded to obtain the point cloud's global semantic and local geometry information. Concurrently, the computational burden is reduced by partitioning the complete point cloud input into segments and conducting continuous downsampling. Furthermore, we introduced an attention-based point cloud deconvolution module to address the localized repeating concentrations in reconstructed point clouds resulting from linear interpolation. This module samples the parent node and its neighboring relationships in the multi-space domain to improve the characteristics of the parent nodes to be sampled. It then uses the deconvolution to create sub-node characteristics that are more varied than the linear interpolation. 
Empirical evidence demonstrates that the proposed methodology achieves an effective equilibrium between the compression quality and ratio.

Abstract:
Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles in multimedia analysis. The crucial issue in this task is to model the global and the local matching between the image and different languages. Existing cross-modal embedding methods based on the transformer architecture overlook the local matching between the image region and monolingual words, especially when dealing with diverse languages. To overcome these limitations, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to establish cross-domain relationships and local correspondences between images and different languages by using a heterogeneous network. EHAT comprises Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). The HARN serves as the core network and captures cross-domain relationships by leveraging visual bounding box representation features to connect word features from two languages and to learn heterogeneous maps. MHCA and HCA facilitate cross-domain integration in the encoder through specialized heterogeneous attention mechanisms, enabling a single model to generate captions in two languages. We evaluate our approach on the MSCOCO dataset to generate captions in English and Chinese, two languages that exhibit significant differences in their language families. The experimental results demonstrate the superior performance of our method compared to existing advanced monolingual methods. Our proposed EHAT framework effectively addresses the challenges of cross-lingual image captioning, paving the way for improved multilingual image analysis and understanding.

Abstract:
In the field of 3D shape recognition, the view-based approach has achieved state-of-the-art performance. A major challenge that needs to be addressed by the view-based approach is how to effectively aggregate multi-view features to obtain a better 3D shape representation. Existing methods which rely on networks with static parameters for feature aggregation adversely coerce the network to learn a general feature aggregation strategy for all inputs, ignoring the diversity of input 3D shapes in real-world scenarios. In this work, we propose a novel Dynamic View Aggregation Network called DVA-Net to address this challenge. DVA-Net can dynamically adjust the network parameter depending on the input 3D shapes to flexibly fuse multi-view information. The shape-specific parameter adaptation is achieved by our designed Dynamic Relation-aware Aggregation module, dubbed DRA module. It is responsible for learning relations among views and adaptively integrating multi-view features. Comprehensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance for 3D shape classification and retrieval.

Abstract:
Accurate 2D human pose estimation from images is vital for understanding human actions. However, deploying the latest models, e.g., regression-based models, on resource-limited devices remains challenging due to their high computational requirements. In this paper, we address the resolution dilemma in regression-based multiperson pose estimation, where low-resolution inputs cause performance degradation, while high-resolution inputs drastically increase computational costs. To achieve a lightweight regression approach, it becomes crucial to enhance the model's capabilities in low-resolution scenarios. We propose the staggered alignment self-distillation (SASD) method and a corresponding network architecture. Our approach involves training two twin networks with shared weights: a high-resolution network and a low-resolution network. The high-resolution network serves as a teacher, guiding the learning process of the low-resolution network through feature map staggered alignment. The knowledge from the high-resolution network enhances the performance of the low-resolution network during low-resolution inference. Additionally, we employ a normalized skeleton loss to capture the loss of bone-related structure during training. Through extensive experiments on the MS-COCO and CrowdPose datasets, we demonstrate the superiority of our proposed method over state-of-the-art, lightweight multiperson pose estimation techniques, achieving much better performance with lower computational costs. Furthermore, our method achieves comparable performance to recent advanced regression-based pose estimation methods but with only 1/4 of the computational cost.

Abstract:
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube, including temporal boundaries and object bounding boxes, that semantically corresponds to a given language description in an untrimmed video. The existing one-stage solutions in this task face two significant challenges, namely, vision-text semantic misalignment and spatial mislocalization, which limit their performance in grounding. These two limitations are mainly caused by the neglect of fine-grained alignment in cross-modality fusion and the reliance on a text-agnostic query in sequential spatial localization. To address these issues, we propose an effective model with a newly designed Feature Semantic Matching (FSM) module based on a Transformer architecture. Our method introduces a cross-modal feature matching module to achieve multi-granularity alignment between video and text while preventing the weakening of important features during the feature fusion stage. Additionally, we design a query-modulated matching module to facilitate text-relevant tube construction by multiple query generation and tubelet sequence matching. To ensure the quality of tube construction, we employ a novel mismatching rectify contrastive loss to rectify the mismatching between the learnable query and the objects corresponding to the text descriptions by restricting the generated spatial query. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods on two challenging STVG benchmarks.

Abstract:
Image aesthetics assessment (IAA) is an interesting but challenging task, owing to the ineffable nature of the human sense of beauty. The study of IAA has evolved from simple binary classification to more complex score regression and distribution prediction. It is effortless for people to perform aesthetic binary classification, i.e., to judge whether an image is aesthetically pleasing or not. However, further judgment on the fine-level scalar aesthetic score is complex and typically determined by aesthetic attributes presented in the image, such as content, lighting and color. Motivated by the above facts, this paper presents a Coarse-to-fine image Aesthetics assessment model guided by Dynamic Attribute Selection, dubbed CADAS. The underlying idea is to simulate the process of human aesthetic perception by performing coarse-to-fine aesthetic reasoning. Specifically, a hierarchical AttributeNet is first pre-trained by imitating the staged mechanism of human aesthetic experience, producing the candidate aesthetic attributes. Then, an AestheticNet is introduced to perform the coarse-level binary classification, based on which a confidence-based attribute selection strategy is designed to dynamically pick out the dominant aesthetic attributes from the candidate ones. Finally, a self-attention-based FusionNet is designed to explore the interaction between dominant aesthetic attributes and aesthetic features, producing the fine-level aesthetic prediction. Extensive experiments demonstrate that the proposed model is superior to state-of-the-art methods. Furthermore, CADAS can also output the dominant aesthetic attributes in images, facilitating model explainability.

Abstract:
Multi-Object Tracking (MOT) remains a vital component of intelligent video analysis, which aims to locate targets and maintain a consistent identity for each target throughout a video sequence. Existing works usually learn a discriminative feature representation, such as motion and appearance, to associate the detections across frames, but such representations are easily affected by mutual occlusion and background clutter in practice. In this paper, we propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets, so as to achieve robust data association in the tracking process. For detections that have not yet been associated, we design a novel single-shot feature learning module to extract discriminative features of each detection, which can efficiently associate targets between adjacent frames. For tracklets that have been lost for several frames, we design a novel multi-shot feature learning module to extract discriminative features of each tracklet, which can accurately re-find these lost targets after a long period. Once equipped with a simple data association logic, the resulting VisualTracker can perform robust MOT based on the single-shot and multi-shot feature representations. Extensive experimental results demonstrate that our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.

Abstract:
Recently, the source-free domain adaptation (SFDA) problem has attracted much attention, where the pre-trained model for the source domain is adapted to the target domain in the absence of source data. However, due to domain shift, negative alignment often exists between samples from the same class, which may lower intra-class feature similarity. To address this issue, we present a self-supervised representation learning strategy for SFDA, named neighborhood-aware mutual information (NAMI), which maximizes the mutual information (MI) between the representations of target samples and their corresponding neighbors. Moreover, we theoretically demonstrate that NAMI can be decomposed into a weighted sum of local MI, which suggests that the weighted terms can better estimate NAMI. To this end, we introduce a neighborhood consensus score over the set of weakly and strongly augmented views and a point-wise density based on the neighborhood, both of which determine the weights of local MI for NAMI by leveraging the neighborhood information of samples. The proposed method can effectively handle domain shift and adaptively reduce the noise in the neighborhood of each target sample. In combination with the consistency loss over views, NAMI leads to consistent improvement over existing state-of-the-art methods on three popular SFDA benchmarks.

Abstract:
Constructing effective proxy data is one of the core challenges in data-free knowledge distillation. The existing models ignore the influence of the category entanglement of the generated data on the distillation. To alleviate this issue, this paper imitates the human learning process and proposes a new category-aware curriculum learning mechanism for data-free knowledge distillation, called CCL-D. The main idea of this mechanism is to provide a new learning mode for data generation and network training, which enables the model to carry out the knowledge distillation process from easy to difficult through automated curriculum learning. In this learning mechanism, a category-aware monitoring module is proposed to constrain the category attribute of generated data. Based on this monitoring module, the curriculum learning process for data generation and network training is designed and applied. Initially, the generator is guided to obtain new data with clear category features. Data with apparent category features are easy for the student network to train on, and they enable the student network to learn clear and significant category features at the early training stage. Subsequently, the generator is guided to generate data with category entanglement. Utilizing these category-entangled data can improve the robustness of the student network to interclass interference. The effectiveness of CCL-D is verified on six benchmark datasets (MNIST, CIFAR-10, CIFAR-100, SVHN, Caltech-101, Tiny-ImageNet).

Abstract:
3D object detection aims to recover the 3D information of objects of interest and serves as a fundamental task of autonomous driving perception. Its performance greatly depends on the scale of labeled training data, yet it is costly to obtain high-quality annotations for point cloud data. This motivates the use of semi-supervised learning, which can additionally exploit unlabeled data to further boost the performance. While 2D semi-supervised learning methods focus on generating pseudo-labels for existing unlabeled samples as supplements for training, the structural nature of 3D point cloud data facilitates the composition of objects and backgrounds to synthesize realistic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes to improve the generalization of the detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. As the scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels. In addition, we maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. We also conduct a series of experiments to analyze the effectiveness of our method, including pseudo-label quality analysis, the effect of different filtering and thresholding strategies, and ablations of each component.

Abstract:
Blind image quality assessment (BIQA) is a regression task with a continuous label space, and its feature space is expected to exhibit a corresponding continuity. However, existing approaches typically learn quality score regression directly in an end-to-end fashion, which leaves networks susceptible to interference from task-agnostic information and fails to capture the continuity of BIQA. In this work, by explicitly establishing inter-sample associations, a simple yet effective BIQA framework based on perceptual comparison is proposed to capture the continuity. To this end, besides the basic quality score regression, the relative quality scores between images are predicted to exploit the relative quality relationships between samples for optimizing the representation of image perceptual quality. In addition, based on human perceptual characteristics, we derive a novel sample weighting strategy to dynamically adjust the weights for different samples in the network learning process, further improving the robustness of the model. The proposed method achieves state-of-the-art performance in both single-database and cross-database experiments, indicating its effectiveness. Besides, the proposed framework is model-agnostic and can effectively improve the performance of the benchmark model with no extra inference cost.
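
As a rough illustration of combining score regression with a relative-score term, the sketch below adds a pairwise-difference loss to a plain regression loss. The abstract does not give the loss forms, so the L1 losses, the use of score differences as "relative quality scores", the `rel_weight` parameter, and all function names are assumptions for illustration only; the sample weighting strategy is not modeled.

```python
from itertools import combinations

def score_regression_loss(pred, target):
    """Mean absolute error between predicted and ground-truth quality scores."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def relative_score_loss(pred, target):
    """Mean absolute error on pairwise score differences within a batch.

    Assumption: the relative quality of images i and j is modeled as the
    difference of their scores; the paper may predict it differently.
    """
    pairs = list(combinations(range(len(pred)), 2))
    total = 0.0
    for i, j in pairs:
        total += abs((pred[i] - pred[j]) - (target[i] - target[j]))
    return total / len(pairs)

def total_loss(pred, target, rel_weight=1.0):
    """Regression loss plus a weighted relative-score loss (hypothetical weighting)."""
    return score_regression_loss(pred, target) + rel_weight * relative_score_loss(pred, target)
```

Under these assumptions, a constant offset in every prediction is penalized by the regression term only, while an error in the ordering or spacing of scores is also penalized by the relative term.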

Abstract:
Multi-modal applications are expected to dominate in the 5G and B5G era. However, traditional source coding methods are neither efficient nor reliable because they neglect semantic redundancy and mutual influences between the sources of different modalities. To address this, cross-modal source coding (CMSC) has been proposed as a promising solution. However, there are still two main challenges: determining the optimum rate of CMSC under delay and reliability constraints, and designing a practical CMSC scheme that operates near the optimum rate. To tackle these challenges, this paper studies the optimum source coding rate of CMSC and its practical implementation. On the theoretical side, an (n,\epsilon)-achievable rate region is derived, representing the source coding rates subject to a fixed blocklength n and a target error probability \epsilon. Additionally, the optimum source coding rate can be approximated by calculating the infimum of the (n,\epsilon)-achievable rate region with a rate dispersion function. On the technical side, a general implementation for CMSC is proposed, which fully leverages channel coding and artificial intelligence (AI) semantic analysis to approach the optimum rate. Numerical results demonstrate that CMSC can obtain a 50% improvement in theory and a 37.5% enhancement in practice against the baseline model abstracted from traditional schemes when multi-modal sources are semantically correlated.
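
To give a feel for how a blocklength n, an error probability \epsilon, and a dispersion term interact, the sketch below evaluates the standard second-order (normal) approximation for single-source fixed-length lossless coding, R ≈ H + sqrt(V/n) · Q^{-1}(\epsilon), with lower-order terms omitted. This is not the paper's (n,\epsilon)-achievable rate region for CMSC, which the abstract does not state; the function names and example values are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def q_inverse(eps):
    """Inverse of the Gaussian tail function Q, i.e. the x with P(Z > x) = eps."""
    return NormalDist().inv_cdf(1.0 - eps)

def approx_rate(entropy_bits, dispersion, n, eps):
    """Normal approximation of the coding rate in bits per source symbol.

    entropy_bits: source entropy H (first-order term)
    dispersion:   source dispersion V (variance of the information density)
    n:            blocklength
    eps:          target error probability
    """
    return entropy_bits + sqrt(dispersion / n) * q_inverse(eps)

# Hypothetical source with H = 1.0 bit and V = 0.5: the rate penalty over H
# shrinks as the blocklength grows.
for n in (100, 1000, 10000):
    print(n, approx_rate(1.0, 0.5, n, 0.01))
```

The penalty above the entropy decays on the order of 1/sqrt(n), which is why a delay constraint (small n) and a reliability constraint (small \epsilon) both push the required rate upward.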

Abstract:
In this paper, we propose an effective method for mismatch removal, termed graph neighborhood motion consensus, to address the feature matching problem, which plays a pivotal role in various computer vision tasks. In our method, we convert each feature correspondence into a motion field sample and model it with a probabilistic graphical model (PGM). To differentiate mismatches from true matches, we first design a metric based on neighborhood topology consensus and neighborhood interaction to evaluate the correctness of each match. We also design a variance-based similarity search module to make the information used more reliable for better matching performance. To solve the PGM, we transform the problem into an integer quadratic programming problem and obtain its closed-form solution with linear time complexity. Extensive experiments on general feature matching, fundamental matrix estimation and image registration tasks demonstrate that our proposed method achieves superior performance over several state-of-the-art approaches.

Abstract:
Machine learning models often suffer from severe performance degradation due to distributional shifts between testing and training data. To address this issue, researchers have focused on domain generalization (DG), which aims to generalize a model trained on source domains to arbitrary unseen target domains. Recently, ensemble learning has emerged as a popular strategy for addressing the DG problem, and domain-specific experts are typically involved. However, the existing methods do not sufficiently consider the generalizability of individual experts or leverage the consistency and diversity among them, thus limiting the generalizability of the constructed models. In this paper, we propose a consistent-but-diverse mixture of experts (CBDMoE) algorithm, which is an improved MoE framework that effectively harnesses ensemble learning for solving the DG problem. Specifically, we introduce individual expert learning (IEL), which incorporates a novel domain-class-balanced subset division (DCBSD)-based sampling strategy to facilitate a generalizable expert learning process. Additionally, we present consistent-but-diverse learning (CBDL), which employs two regularizing losses to encourage consistency and diversity in the predictions of the experts. Our proposed strategy significantly enhances the generalizability of the MoE framework. Extensive experiments conducted on three popular DG benchmark datasets demonstrate that our method outperforms the state-of-the-art approaches.

Abstract:
Text recognition remains challenging, primarily due to the scarcity of annotated real data and the heavy labor required to annotate large-scale real data. Most existing solutions rely on synthetic training data, where the synthetic-to-real domain gaps limit the model performance on real data. Unsupervised domain adaptation (UDA) methods have been proposed, aiming to obtain domain-invariant representations. However, they commonly focus on domain-level alignment, neglecting the fine-grained character features and thus leading to indistinguishable characters. In this paper, we propose a simple yet effective self-supervised UDA framework tailored for cross-domain text recognition, named TextAdapter, which integrates contrastive learning and consistency regularization to mitigate domain gaps. Specifically, a fine-grained feature alignment module based on character contrastive learning is designed to learn domain-invariant character representations by category-level alignment. Additionally, to address the task-agnostic problem in contrastive learning, i.e., ignoring the sequence semantics, an instance consistency matching module is proposed to perceive the contextual semantics by matching the prediction consistency among different augmented views of the target data. Experimental results on cross-domain benchmarks demonstrate the effectiveness of our method. Furthermore, TextAdapter can be embedded in most off-the-shelf text recognition models and yields new state-of-the-art performance, which illustrates the generality of our framework.

Abstract:
Low-light images commonly exhibit issues such as reduced contrast, heightened noise, faded colors, and the absence of critical details. Enhancing these images is challenging due to the complex interplay of various factors. Existing methods primarily focus on learning the intricate mapping between low-light input and normal-light output through well-designed deep neural networks, potentially overlooking the valuable priors inherent in normal-light images. In this paper, we introduce a Code Bank-Guided Transformer (CodedBGT) for low-light image enhancement. Initially, we pre-train a VQGAN on an extensive collection of high-quality normal-light images to capture a high-quality prior. This prior is stored in a discrete codebook along with its corresponding decoded feature space, forming the code bank that guides the enhancement process. To effectively align low-light features with undistorted normal-light code bank features, we design a Code Bank-Guided Block (CBGB) within our enhancement network. The CBGB is integrated into the transformer to aggregate prior information into the enhancement network. Benefiting from the high-quality code bank, our method produces results with more satisfying visual quality. Compared with state-of-the-art methods, our method achieves better quantitative and qualitative results on paired and unpaired datasets under various evaluation metrics.

Abstract:
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pre-training (MAC), for video-text retrieval tasks. MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model through a mask sampling mechanism, thereby improving pre-training efficiency. In contrast to conventional temporal sparse sampling, we propose to randomly mask a high ratio of spatial regions and feed only the visible regions into the encoder as sparse spatial sampling. Similarly, we adopt the mask sampling technique for text inputs for consistency. Instead of blindly applying the mask-then-prediction paradigm from MAE, we propose a masked-then-alignment paradigm for efficient video-text alignment. The motivation is that video-text retrieval tasks rely on high-level alignment rather than low-level reconstruction, and multimodal alignment with masked modeling encourages the model to learn a robust and general multimodal representation from incomplete and unstable inputs. Coupling these designs enables efficient end-to-end pre-training: a 3× speed-up, a 60%+ computation reduction, and a 4%+ performance improvement. MAC achieves state-of-the-art results on various video-text retrieval datasets including MSR-VTT, DiDeMo, and ActivityNet. Our approach is also flexible with respect to input modalities: with minimal modifications, we achieve competitive results on image-text retrieval tasks.
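
The sparse spatial sampling step can be sketched as follows: a random subset of patch positions is kept, and only those tokens would be passed to the encoder. This is a minimal sketch under assumptions; the uniform random sampling, the 0.75 mask ratio in the usage lines, and the function names are illustrative and not taken from the paper.

```python
import random

def sample_visible_indices(num_patches, mask_ratio, rng):
    """Pick the patch positions that stay visible after random masking.

    Assumption: patches are masked uniformly at random; at least one
    patch is always kept.
    """
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))

def keep_visible(tokens, visible_indices):
    """Drop masked tokens so that only visible ones reach the encoder."""
    return [tokens[i] for i in visible_indices]

# Usage: a frame split into 16 patches, with 75% of them masked.
rng = random.Random(0)
patch_tokens = [f"patch_{i}" for i in range(16)]
visible = sample_visible_indices(len(patch_tokens), 0.75, rng)
encoder_input = keep_visible(patch_tokens, visible)
```

Because the encoder sees only the visible tokens, its input length, and hence its compute, shrinks roughly in proportion to the mask ratio.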

Abstract:
One tough problem in image inpainting is restoring complex structures in the corrupted regions. This motivates interactive image inpainting, which leverages additional hints, e.g., sketches, to assist the inpainting process. A sketch is simple and intuitive for end users to provide, but it is free-form and exhibits much randomness. Such randomness may confuse the inpainting models and incur severe artifacts in completed images. To better facilitate image inpainting with sketch guidance, we propose a two-stage image inpainting system, termed SketchRefiner. The first stage of our approach serves as a data provider that simulates real sketches and derives the capability of sketch calibration from the simulated data. In the second stage, our approach aligns the sketch guidance with the inpainting process so as to improve image inpainting with sketches. We also propose a real-world test protocol to evaluate inpainting methods in practical applications with user sketches. Experimental results on three prevailing benchmark datasets, i.e., CelebA-HQ, Places2, and ImageNet, and on the proposed test protocol demonstrate the state-of-the-art performance of our approach and its great potential for real-world applications. Further analyses illustrate that our approach effectively utilizes sketch information as guidance and eliminates the artifacts caused by free-form sketches.

Abstract:
Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. These contaminations make it difficult for AVS models to build a consistent semantic mapping between audio and visual signals and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework (BAVS) that incorporates multi-modal foundation knowledge. In a nutshell, BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Because the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment real-sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise.

Abstract:
Micro-videos, as an increasingly popular form of user-generated content (UGC), naturally include diverse multimodal cues. However, in pursuit of consistent representations, existing methods fail to explore modality discrepancy and preserve modality diversity at the same time. In this paper, we propose a multimodal progressive modulation network (MPMNet) for micro-video multi-label classification, which enhances the indicative ability of each modality by gradually regulating various modality biases. In MPMNet, we first leverage a unimodal-centered parallel aggregation strategy to obtain preliminary comprehensive representations. We then integrate a feature-domain disentangled modulation process and a category-domain adaptive modulation process into a unified framework to jointly refine modality-oriented representations. In the former modulation process, we constrain inter-modal dependencies in a latent space to obtain modality-oriented sample representations, and introduce a disentangled paradigm to further maintain modality diversity. In the latter modulation process, we construct global-context-aware graph convolutional networks to acquire modality-oriented category representations, and develop two instance-level parameter generators to further regulate unimodal semantic biases. Extensive experiments on two micro-video multi-label datasets show that our proposed approach outperforms the state-of-the-art methods.

Abstract:
Remote photoplethysmography (rPPG) is a non-invasive technique that aims to capture subtle variations in facial pixels caused by changes in blood volume resulting from cardiac activities. Most existing unsupervised methods for rPPG tasks focus on the contrastive learning between samples while neglecting the inherent self-similarity prior in physiological signals. In this paper, we propose a Self-Similarity Prior Distillation (SSPD) framework for unsupervised rPPG estimation, which capitalizes on the intrinsic temporal self-similarity of cardiac activities. Specifically, we first introduce a physical-prior embedded augmentation technique to mitigate the effect of various types of noise. Then, we tailor a self-similarity-aware network to disentangle more reliable self-similar physiological features. Finally, we develop a hierarchical self-distillation paradigm for self-similarity-aware learning and rPPG signal decoupling. Comprehensive experiments demonstrate that the unsupervised SSPD framework achieves comparable or even superior performance compared to the state-of-the-art supervised methods. Meanwhile, SSPD has the lowest inference time and computation cost among end-to-end models.

Abstract:
Cross-modality vessel re-identification (ReID) presents a formidable challenge in the domain of maritime surveillance, necessitating the development of robust methodologies to accurately match vessels across disparate imaging modalities. This paper introduces a novel Cross-modality Alignment Decomposition Network (CAD-Net) to address the inherent complexities associated with this task. CAD-Net incorporates a geometric-semantic cross-modal alignment module for effectively mitigating geometric and modality variances within the global features. Additionally, it integrates an adaptive local decomposition module with a diversity regularization, enabling the capture of local vessel features without relying on predefined part separation criteria. To address the scarcity of cross-modal vessel datasets, which are predominantly biased towards the visible light modality, and to evaluate the performance of the proposed framework, we have constructed a novel dataset named KongTong-boat (KT-boat). It comprises 2,826 high-resolution images, including 1,443 RGB images and 1,383 IR images, featuring 117 distinct vessels. This dataset can serve as a new fundamental benchmark for evaluating the efficacy of cross-modality vessel ReID algorithms, filling a critical gap in the field. The experimental results obtained on the KT-boat dataset demonstrate the effectiveness of CAD-Net for cross-modality ReID. Notably, when compared on the KT-boat and RegDB datasets with state-of-the-art cross-modality ReID algorithms developed for general cross-modality pedestrian benchmarks, CAD-Net consistently outperforms them across key evaluation metrics, including the rank-1 index and mean Average Precision (mAP).

Abstract:
Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering that emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial-pose-based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing a spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion features, and audio beat features, we generate 3D co-speech gestures through a transformer architecture. However, the poses in existing datasets often contain jittering effects, which would lead to unstable generated gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.

Abstract:
Generalized Zero-Shot Learning (GZSL) aims to recognize both seen and unseen categories by establishing visual and semantic relations. Recently, generation-based methods that focus on synthesizing fictitious visual features from corresponding attributes have gained significant attention. However, these generated features often lack discriminative capabilities due to inadequate training of the generative model. To address this issue, we propose a novel Discriminative Enhanced Network (DENet) to harness the potential of the generative model by adapting the training features and imposing constraints on the generated features. Our approach incorporates three pivotal modules. 1) Before the generative network training, we implement a Pre-Tuning Module (PTM) to eliminate irrelevant background noise in the raw features extracted from a fixed CNN backbone. PTM can therefore provide tuned training features without redundant noise for the generative model. 2) During the generative network training, we propose an Asymmetry Cross-authenticity Contrastive (AC2) loss to group visual features of the same category while repelling features from different categories by optimizing a large number of sample pairs. Additionally, we incorporate intra-class and relation-specific inter-class boundaries within the AC2 loss to enrich sample diversity and preserve valid semantic information. 3) Also during the generative network training, a Dual-semantic Alignment Module (DAM) is designed to align visual features with both attributes and label embeddings, enabling the model to learn attribute-related information and discriminative extended semantics. Experiments on four standard benchmarks demonstrate that our approach learns more discriminative features and surpasses the existing methods.

Abstract:
This paper focuses on the advancement of single-object tracking technologies in computer vision, which have broad applications including robotic vision, video surveillance, and sports video analysis. Current methods relying solely on the target's initial visual information encounter performance bottlenecks and limited applications, due to the scarcity of target semantics in appearance features and the continuous change in the target's appearance. To address these issues, we propose a novel visual-language dual-modal approach to single-object tracking that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we propose a new global relocalization method that utilizes visual-language bimodal information to perceive target disappearance and misalignment and adaptively reposition the target in the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods of time, enabling long-term single-target tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, which demonstrates the effectiveness and efficiency of our approach.

Abstract:
Text-Based Person Retrieval (TBPR) aims to identify a particular individual within an extensive image gallery using text as the query. The principal challenge in the TBPR task is how to map cross-modal information to a potential common space and learn a generic representation. Previous methods have primarily focused on aligning singular text-image pairs, disregarding the inherent polymorphism within both images and natural language expressions for the same individual. Moreover, these methods have also ignored the impact of semantic polymorphism-based intra-modal data distribution on cross-modal matching. Recent methods employ cross-modal implicit information reconstruction to enhance inter-modal connections. However, the process of information reconstruction remains ambiguous. To address these issues, we propose the Learning Semantic Polymorphic Mapping (LSPM) framework, built on pre-trained cross-modal models. First, to learn more robust cross-modal information representations, we design the Inter-modal Information Aggregation (Inter-IA) module to achieve cross-modal polymorphic mapping, strengthening the foundation of our information representations. Second, to attain a more concentrated intra-modal information representation based on semantic polymorphism, we design the Intra-modal Information Aggregation (Intra-IA) module to further constrain the embeddings. Third, to further explore the potential of cross-modal interactions within the model, we design an implicit reasoning module, Masked Information Guided Reconstruction (MIGR), with constraint guidance to improve overall performance. Extensive experiments on both the CUHK-PEDES and ICFG-PEDES datasets show that we achieve state-of-the-art results on Rank-1, mAP and mINP compared to existing methods.

Abstract:
This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation.

Abstract:
Assessing the artness of AI-generated images continues to be a challenge within the realm of image generation. Most existing metrics cannot be used to perform instance-level and reference-free artness evaluation. This paper presents ArtScore, a metric designed to evaluate the degree to which an image resembles authentic artworks by artists (or conversely photographs), thereby offering a novel approach to artness assessment. We first blend pre-trained models for photo and artwork generation, resulting in a series of mixed models. Subsequently, we utilize these mixed models to generate images exhibiting varying degrees of artness with pseudo-annotations. Each photorealistic image has a corresponding artistic counterpart and a series of interpolated images that range from realistic to artistic. This dataset is then employed to train a neural network that learns to estimate quantized artness levels of arbitrary images. Extensive experiments reveal that the artness levels predicted by ArtScore align more closely with human artistic evaluation than existing evaluation metrics, such as Gram loss and ArtFID.

Abstract:
Unsupervised continual learning (UCL) has made remarkable progress over the past two years, significantly expanding the application of continual learning (CL). However, existing UCL approaches have only focused on transferring continual strategies from supervised to unsupervised. They have overlooked the relationship issue between visual features and representational continuity. This work draws attention to the texture bias problem in existing UCL methods. To address this problem, we propose a new UCL framework called InfoUCL, in which we develop InfoDrop contrastive loss to guide continual learners to extract more informative shape features of objects and discard useless texture features simultaneously. The proposed InfoDrop contrastive loss is general and can be combined with various UCL methods. Extensive experiments on various benchmarks have demonstrated that our InfoUCL framework can lead to higher classification accuracy and superior robustness to catastrophic forgetting.

Abstract:
Federated learning (FL) enables distributed clients to collaboratively learn a global model, suggesting its potential for use in improving data privacy in machine learning. However, although FL has made many advances, its performance usually suffers from degradation due to the impact of domain shift when the trained models are applied to unseen domains. To enhance the model's generalization ability, we focus on solving federated domain generalization, which aims to properly generalize a federated model trained based on multiple source domains belonging to different distributions to an unseen target domain. A novel approach, namely Prototype-Decomposed Knowledge Distillation (PDKD), is proposed herein. Concretely, we first aggregate the local class prototypes that are learned from different clients. Subsequently, Singular Value Decomposition (SVD) is employed to decompose the local prototypes to obtain discriminative and generalized global prototypes that contain rich category-related information. Finally, the global prototypes are sent back to all clients. We exploit knowledge distillation to encourage local client models to distill generalized knowledge from the global prototypes, which boosts the generalization ability. Extensive experiments on multiple datasets demonstrate the effectiveness of our method. In particular, when implemented on the Office dataset, our method outperforms FedAvg by around 13.5%, which shows that our method is instrumental in ameliorating the generalization ability of federated models.

Abstract:
Deep learning models for time series analysis often require large-scale labeled datasets for training. However, acquiring such datasets is cost-intensive and challenging, particularly for individual institutions. To overcome this challenge and address concerns about data confidentiality among institutions, federated learning (FL) serves as a viable solution by offering a decentralized learning framework. However, the datasets collected by each institution often suffer from imbalance and may not adhere to uniform protocols, leading to diverse data distributions. To address this problem, we design a global model to approximate the global data distribution of all participant clients, then transfer it to local clients as an induction in the training phase. However, discrepancies between the approximate distribution and the actual distribution result in uncertainty in the predicted results. Moreover, the diverse data distributions among various clients within the FL framework, combined with the inherent lack of reliability and interpretability in deep learning models, further amplify the uncertainty of the prediction results. To address these issues, we propose an uncertainty calibration method based on Bayesian deep learning techniques, which captures uncertainty by learning a fidelity transformation to reconstruct the output of time series regression and classification tasks, utilizing deterministic pre-trained models. Extensive experiments on the regression dataset (C-MAPSS) and classification datasets (ESR, Sleep-EDF, HAR, and FD) in the Independent and Identically Distributed (IID) and non-IID settings show that our approach effectively calibrates uncertainty within the FL framework and facilitates better generalization performance in both the regression and classification tasks, achieving state-of-the-art performance.

Abstract:
Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling object-level spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) the spatio-temporal constraints between objects are not considered when defining the adjacency relationship; (2) the semantic correlation between objects is not fully considered when generating edge weights. As a result, the model lacks a representation of the spatio-temporal interaction between objects, which directly affects its ability to reason about object relations. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneous graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on the benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on the benchmark MSRVTT-QA and ActivityNet-QA datasets. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.

Abstract:
The 3D Lookup Table (3DLUT)-based methods are gaining popularity due to their satisfactory and stable performance in achieving automatic and adaptive real-time image enhancement. In this paper, we present a new solution to the intractability in handling continuous color transformations of 3DLUT due to the lookup via three independent color channel coordinates in RGB space. Inspired by the inherent merits of the HSV color space, we separately enhance image intensity and color composition. The Transformer-based Pixel-Learnable 3D Lookup Table is proposed to suppress contouring artifacts, which enhances images in a pixel-wise manner with non-local information to emphasize the diverse spatially variant context. In addition, noticing the underestimation of the composition color component, we develop the Saturation-Aware Compensation (SAC) module to enhance the under-saturated region determined by an adaptive SA map with a Saturation-Interaction block, achieving a good balance between preserving details and color rendition. Our approach can be applied to image retouching and tone mapping tasks with fairly good generality, especially in restoring localized regions with weak visibility. Both theoretical analysis and comparative experiments show that the proposed solution is effective and robust.

Abstract:
Motion models are vital for solving multiple object tracking (MOT), which makes instance-level position predictions of targets to handle occlusions and noisy detections. Recent methods have proposed the use of Single Object Tracking (SOT) techniques to build motion models and unify the SOT tracker with the object detector into a single network for high-efficiency MOT. However, three feature incompatibility issues in the required features of this paradigm are ignored, leading to inferior performance. First, the object detector requires class-specific features to localize objects of pre-defined classes. Contrarily, target-specific features are required in SOT to track the target of interest with an unknown category. Second, MOT relies on intra-class differences to associate targets of the same identity (ID). On the other hand, the SOT trackers focus on inter-class differences to distinguish the tracking target from the background. Third, classification confidence is used to determine the existence of targets, which is obtained with category-related features and cannot accurately reveal the existence of targets in tracking scenes. To address these issues, we propose a novel Task-specific Feature Encoding Network (TFEN) to extract task-driven features for different sub-networks. Besides, we propose a novel Quadruplet State Sampling (QSS) strategy to form the training samples of the motion model and guide the SOT trackers to capture identity-discriminative features in position predictions. Finally, we propose an Existence Aware Tracking (EAT) algorithm by estimating the existence confidence of targets and re-considering low-scored predictions to recover missed targets. Experimental results indicate that the proposed Discriminative Motion Model-based tracker (DMMTracker) can effectively address these issues when employing SOT trackers as motion models, leading to highly competitive results on MOT benchmarks.

Abstract:
We present a novel approach for generating isotropic surface triangle meshes directly from unoriented 3D point clouds, with the mesh density adapting to the estimated local feature size (LFS). Popular reconstruction pipelines first reconstruct a dense mesh from the input point cloud and then apply remeshing to obtain an isotropic mesh. The sequential pipeline makes it hard to find a lower-density mesh while preserving more details. Instead, our approach reconstructs both an implicit function and an LFS-aware mesh sizing function directly from the input point cloud, which is then used to produce the final LFS-aware mesh without remeshing. We combine local curvature radius and shape diameter to estimate the LFS directly from the input point clouds. Additionally, we propose a new mesh solver to solve an implicit function whose zero level set delineates the surface without requiring normal orientation. The added value of our approach is generating isotropic meshes directly from 3D point clouds with an LFS-aware density, thus achieving a trade-off between geometric detail and mesh complexity. Our experiments also demonstrate the robustness of our method to noise, outliers, and missing data and can preserve sharp features for CAD point clouds.

Abstract:
Zero-shot learning (ZSL) has received extensive attention recently especially in areas of fine-grained object recognition, retrieval, and image captioning. Due to the complete lack of training samples and high requirement of defense transferability, the ZSL model learned is particularly vulnerable against adversarial attacks. Recent work also showed adversarially robust generalization requires more data. This may significantly affect the robustness of ZSL. However, very few efforts have been devoted towards this direction. In this paper, we take an initial attempt, and propose a generic formulation to provide a systematical solution (named ATZSL) for learning a defensive ZSL model. It is capable of achieving better generalization on various adversarial objects recognition while only losing a negligible performance on clean images for unseen classes, by casting ZSL into a min-max optimization problem. To address it, we design a defensive relation prediction network, which can bridge the seen and unseen class domains via attributes to generalize prediction and defense strategy. Additionally, our framework can be extended to deal with the poisoned scenario of unseen class attributes. An extensive group of experiments are then presented, demonstrating that ATZSL obtains remarkably more favorable trade-off between model transferability and robustness, over currently available alternatives under various settings.

Abstract:
Pedestrian attribute recognition (PAR) aims to generate a structured description of pedestrians and plays an important role in surveillance. Current work focusing on 2D images can achieve decent performance when there is no variation in the captured pedestrian orientation. However, the performance of these works cannot be maintained in scenarios when the orientation of pedestrians is ignored. To mitigate this problem, this paper proposes orientation-aware pedestrian attribute recognition based on graph convolution network (GCN), which is composed of an orientation-aware spatial attention (OSA) module and an orientation-guided attribute-relation learning (OAL) module. Since some attributes can be invisible for certain orientations, OSA is proposed for orientation-aware feature extraction to enhance the learned representation of the visual attributes. Moreover, since different orientations result in different relations among attributes, OAL is proposed to achieve distinguishable and impactful attribute relations by eliminating the confusion of attribute relations in different orientations. Experiments on three challenging datasets (PETA, RAP, and PA100K) demonstrate that the proposed PAR outperforms the state-of-the-art methods by considerable margins.

Abstract:
In a composite image, the foreground and background are filmed under different scenarios, such as different lighting conditions, causing inconsistency and reducing the overall realism of the image. Image harmonization aims to generate visually realistic composite images by adjusting the foreground to the background conditions while maintaining the structure. Existing methods focus on adjusting the foreground object by directly training the foreground generation network with the ground truth, neglecting the different roles of the illumination and structure of the foreground in image harmonization. Moreover, the use of background, except for providing illumination, is not thoroughly investigated in this task. In this paper, we propose a structure-preserving and illumination-consistent cycle (SP-IC cycle) framework for image harmonization by exploring the illumination and structure of both the foreground and background. It achieves image harmonization by specifically changing the illumination and keeping the structure instead of ambiguously changing the foreground. Then, an illumination-consistent foreground harmonization cycle is developed to change the foreground illumination, while a structure-preserving cycle is designed to keep the foreground structure. Background information is explored in both cycles to assist in decomposing the illumination and structure of the foreground. In addition, the proposed SP-IC cycle framework can be applied to any image harmonization method to further boost its performance. Experimental results demonstrate that our method achieves better harmonious image quality than state-of-the-art methods, especially on an illumination-varying dataset.

Abstract:
Handwritten mathematical expression recognition (HMER) is an essential task in the OCR community, which consists of two sub-tasks, i.e., symbol recognition and structure parsing. Modern literature treats HMER as a LaTeX sequence prediction problem that simultaneously recognizes symbols and parses the structures of MEs. Although deep learning-based HMER methods have been achieving promising results on public benchmarks, it is admitted that the misclassification error between visually similar symbols still prevents these approaches from generalizing to broader scenes. In this paper, we try to solve this issue from three aspects. 1) We enhanced the feature extraction process by introducing path signature features, which incorporate local writing details and global spatial information. 2) We developed a language model that uses contextual information to correct the symbols misclassified by vision-only-based recognition models. 3) We solved the misalignment problem in existing ensemble methods by designing a dynamic time warping (DTW) based algorithm. By combining the above improvements, our method achieved state-of-the-art results on three CROHME benchmarks, outperforming previous methods by a large margin.
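
As an illustration of the alignment idea in point 3, the following is a minimal sketch of classic dynamic time warping applied to two predicted symbol sequences of different lengths. It is not the algorithm from the paper: the function name `dtw_align`, the 0/1 symbol distance, and the toy sequences are all assumptions made for this example.

```python
def dtw_align(a, b, dist=lambda x, y: 0.0 if x == y else 1.0):
    """Align sequences a and b with textbook DTW.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs saying that a[i] is matched with b[j].
    The default distance (0 if equal, 1 otherwise) is an assumption.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both
                                 cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1])      # repeat a[i-1]

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path


# Two hypothetical model outputs; the second repeats the symbol "2".
seq_a = ["x", "^", "2", "+", "y"]
seq_b = ["x", "^", "2", "2", "+", "y"]
total, path = dtw_align(seq_a, seq_b)
print(total, path)
```

Once positions are aligned in this way, per-position predictions from several models can be combined (for example by voting) even when the models emit sequences of different lengths.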

Abstract:
Compared to many other dense prediction tasks, object detection plays a fundamental role in visual perception and scene understanding. Dense object detection, aiming at localizing objects directly from the feature map, has drawn great attention due to its low cost and high efficiency. Though it has been developed for a long time, the training pipeline of dense object detectors is still compromised to lots of conjunctions. In this paper, we demonstrate the existence of three conjunctions lying in the current paradigm of one-stage detectors: 1) only samples assigned as positive in classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. Based on this, we propose Disentangled Dense Object Detector (DDOD), a simple, direct, and efficient framework for 2D detection with strong performance. We derive two DDOD variants (i.e., DR-CNN, and DDETR) following the basic one-stage/two-stage and recently developed transformer-based pipelines. Specifically, we develop three effective disentanglement mechanisms and integrate them into the current state-of-the-art object detectors. Extensive experiments on MS COCO benchmark show that our approach obtains significant enhancements with negligible extra overhead on various detectors. Notably, our best model reaches 55.4 mAP on the COCO test-dev set, achieving new state-of-the-art performance on this competitive benchmark. Additionally, we validate our model on several challenging tasks including small object detection and crowded object detection. The experimental results further prove the superiority of disentanglement on these conjunctions. Code is available at https://github.com/zehuichen123/DDOD.

Abstract:
Convolutional Neural Networks (CNNs) and Transformer are two powerful representation learning techniques for visual tracking. Although CNNs can effectively reduce local redundancy via small-neighborhood convolution operations, their limited receptive fields make it difficult to capture global dependency. Self-attention in Transformer uses patches as the input representation, which can effectively capture long-range dependency. However, blind similarity comparisons between all patches can lead to high redundancy. Is there then a technique that combines well the advantages of both paradigms for visual tracking? In this work, we design a novel backbone network for feature extraction. First, we choose Depthwise Convolution and Pointwise Convolution to build a Convolution Mixer, which effectively separates spatial mixing from channel-wise mixing of information. The Convolution Mixer reduces redundancy in spatial and channel features while increasing receptive field. Then, to exploit the global modeling ability of self-attention, we construct a module by aggregating Convolution Mixer and self-attention. The module shares dominant computational complexity (the square of the channel size) in the first stage. In the second stage, the shift and summation operations are lightweight. Finally, to alleviate the overfitting of the backbone network during training, a dropout layer is added at the end of the module to improve the generalization ability of the network model. Stronger image features are provided for subsequent feature fusion and prediction. The proposed tracker (named CMAT) achieves satisfying tracking performance on ten challenging datasets. In particular, CMAT achieves a 64.1% AUC on LaSOT and a 68.9% AUC on UAV123 while running at 23 FPS.

Abstract:
Stereo superpixel segmentation aims at grouping discrete pixels into perceptual regions through left and right views more collaboratively and efficiently. Existing superpixel segmentation algorithms mostly utilize color and spatial features as input, which may impose strong constraints on spatial information while utilizing the disparity information in terms of stereo image pairs. To alleviate this issue, we propose a stereo superpixel segmentation method with a decoupling mechanism of spatial information in this work. To decouple stereo disparity information and spatial information, the spatial information is temporarily removed before fusing the features of stereo image pairs, and a decoupled stereo fusion module (DSFM) is designed to handle the stereo features alignment as well as occlusion problems. Moreover, since the spatial information is vital to superpixel segmentation, we further design a dynamic spatiality embedding module (DSEM) to re-add spatial information, and the weights of spatial information will be adaptively adjusted through the dynamic fusion (DF) mechanism in DSEM for achieving a finer segmentation. Comprehensive experimental results demonstrate that our method can achieve state-of-the-art performance on the KITTI2015 and Cityscapes datasets, and also verify its efficiency when applied to salient object detection on the NJU2K dataset. The source code will be made publicly available after the paper is accepted.

Affiliations: School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an, China; School of Electronic Engineering, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an, China; Yantai Research Institute, Harbin Engineering University, Yantai, China; School of Artificial Intelligence, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, China

Abstract:
The self-attention (SA) network revisits the essence of data and has achieved remarkable results in text processing and image analysis. SA is conceptualized as a set operator that is insensitive to the order and number of data, making it suitable for point sets embedded in 3D space. However, working with point clouds still poses challenges. To tackle the issue of exponential growth in complexity and singularity induced by the original SA network without position encoding, we modify the attention mechanism by incorporating position encoding to make it linear, thus reducing its computational cost and memory usage and making it more feasible for point clouds. This article presents a new framework called multiscale point cloud transformer (MPCT), which improves upon prior methods in cross-domain applications. The utilization of multiple embeddings enables the complete capture of the remote and local contextual connections within point clouds, as determined by our proposed attention mechanism. Additionally, we use a residual network to facilitate the fusion of multiscale features, allowing MPCT to better comprehend the representations of point clouds at each stage of attention. Experiments conducted on several datasets demonstrate that MPCT outperforms the existing methods, such as achieving accuracies of 94.2% and 84.9% in classification tasks implemented on ModelNet40 and ScanObjectNN, respectively.

Abstract:
Multi-attribute editing aims to synthesize new facial images with multiple desired attributes while at the same time preserving other contents. Generative Adversarial Networks (GANs) with encoder-decoder-based generators are typically applied to this task, while the co-occurrence nature of attributes is overlooked by generic discriminators when identifying real and synthesized instances. To address the issue, we focus on precisely capturing semantics associated with target attributes in this work, and propose a Graph-based Discriminator architecture for a GAN model, which is referred to as GD-GAN, for explicitly modeling and leveraging the attribute dependencies. Specifically, the co-occurrence ratio between attributes is used to build a correlation matrix, which captures inter-attribute relationships. We design a discriminator with a Graph Convolutional Network (GCN) to integrate knowledge about the attribute dependencies into the adversarial training process. Different from the existing methods that identify the synthesized data conditioned on the attributes individually, we leverage the attribute correlations by performing feature propagation over the graph of attributes, which leads to interdependent representations for real-fake instance identification. Incorporating the relationships of attributes eventually induces the generator to capture precise semantics associated with the attributes. Empirical results on multiple benchmarks demonstrate the superior performance of GD-GAN in high-quality semantic manipulation.
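
To make the correlation-matrix step concrete, here is a minimal sketch of one common way to turn attribute co-occurrence into a matrix: entry (i, j) is the fraction of samples with attribute i that also have attribute j. The abstract does not specify the exact ratio used, so this conditional form, the function name `cooccurrence_ratio`, and the toy labels are assumptions for illustration only.

```python
def cooccurrence_ratio(labels):
    """Build a K x K co-occurrence ratio matrix from binary labels.

    labels: list of N rows, each a list of K values in {0, 1}.
    Returns ratio where ratio[i][j] = count(i and j) / count(i),
    i.e. an empirical estimate of P(attribute j | attribute i).
    Rows for attributes that never occur are left as zeros.
    """
    k = len(labels[0])
    counts = [[0] * k for _ in range(k)]
    for row in labels:
        active = [idx for idx, v in enumerate(row) if v]
        for i in active:
            for j in active:
                counts[i][j] += 1  # counts[i][i] is the count of attribute i

    ratio = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if counts[i][i] > 0:
            for j in range(k):
                ratio[i][j] = counts[i][j] / counts[i][i]
    return ratio


# Toy annotations for three hypothetical attributes.
toy_labels = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
]
for r in cooccurrence_ratio(toy_labels):
    print(r)
```

Note that this matrix is generally asymmetric, since P(j | i) and P(i | j) differ; a graph convolution over attributes could use it, possibly after thresholding or normalization, as its adjacency.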

Affiliations: Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC, Australia; School of Computing Technologies, RMIT University, Melbourne, VIC, Australia; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; College of Computer Science and Technology, Zhejiang University, Zhejiang, China; ReLER Lab, AAII, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW, Australia

Abstract:
The human visual system is capable of not only recognizing individual objects but also comprehending the contextual relationship between them in real-world scenarios, making it highly advantageous for object detection. However, in practical applications, such contextual information is often not available. Previous attempts to compensate for this by utilizing cross-modal data such as language and statistics to obtain contextual priors have been deemed sub-optimal due to a semantic gap. To overcome this challenge, we present a seamless integration of context into an object detector through Knowledge Distillation. Our approach intuitively represents context as a knowledge graph, describing the relative location and semantic relevance of different visual concepts. Leveraging recent advancements in graph representation learning with Transformer, we exploit the contextual information among objects using edge encoding and graph attention. Specifically, each image region propagates and aggregates the representation from its highly similar neighbors to form the knowledge graph in the Transformer encoder. Extensive experiments and a thorough ablation study conducted on challenging benchmarks MS-COCO, Pascal VOC and LVIS demonstrate the superiority of our method.

Abstract:
Cross-domain semantic segmentation, which aims to address the distribution shift while adapting from a labeled source domain to an unlabeled target domain, has achieved great progress in recent years. However, most existing work adopts a source-to-target adaptation path, which often suffers from clear class mismatching or class imbalance issues. We design PBAL, a prototypical bidirectional adaptation and learning technique that introduces bidirectional prototype learning and prototypical self-training for optimal inter-domain alignment and adaptation. We perform bidirectional alignments in a complementary and cooperative manner, which effectively balances both dominant and tail categories as well as easy and hard samples. In addition, we derive prototypes efficiently from a source-trained classifier, which enables class-aware adaptation as well as synchronous prototype updating and network optimization. Further, we re-examine self-training and introduce prototypical contrast on top of it, which greatly improves inter-domain alignment by promoting better intra-class compactness and inter-class separability in the feature space. Extensive experiments over two widely studied benchmarks show that the proposed PBAL achieves superior domain adaptation performance as compared with the state-of-the-art.

Abstract:
Copy-move forgery poses a major challenge to copy-move forgery detection (CMFD) because the photometric characteristics of genuine and tampered regions in the same image remain highly consistent. A novel U-Net-like architecture with multiple asymmetric cross-layer connections associated with self-correlation and atrous spatial pyramid pooling (ASPP) between feature extraction module (FEM) and tampered region localization module (TRLM), called UCM-Net, is proposed in this article. Different from existing deep learning based CMFD networks which indiscriminately process large or small tampered regions without considering the statistical characteristics of regions, FEM differentially treats large or small tampered regions by exploiting deep backbone networks to extract high-level features with rich semantic information for large tampered regions while utilizing lightweight backbone networks to extract low-level features for small tampered regions. Multiple cross-layer connections between two modules utilize the self-correlation calculation and ASPP to remove as much irrelevant semantic information as possible while retaining multi-scale tampered features from shallow to deep convolutional layers of FEM. Unlike the previous CMFD networks, which cannot capture multi-scale features because of simply stacking convolution blocks in the upsampling step, TRLM exploits multiple U-shaped residual U-block modules with different depths to change the receptive field of each point in the tampered feature maps so as to capture global and local information, greatly improving the localization accuracy of tampered regions. Experimental results on three publicly available databases demonstrate that UCM-Net outperforms several state-of-the-art algorithms in terms of various evaluation metrics.

Abstract:
Skeleton-based gesture recognition methods have achieved high success using Graph Convolutional Network (GCN), which commonly uses an adjacency matrix to model the spatial topology of skeletons. However, previous methods use the same adjacency matrix for skeletons from different frames, which limits the flexibility of GCN to model temporal information. To solve this problem, we propose a Temporal Decoupling Graph Convolutional Network (TD-GCN), which applies different adjacency matrices for skeletons from different frames. The main steps of each convolution layer in our proposed TD-GCN are as follows. To extract deep spatiotemporal information from skeleton joints, we first extract high-level spatiotemporal features from skeleton data. Then, channel-dependent and temporal-dependent adjacency matrices corresponding to different channels and frames are calculated to capture the spatiotemporal dependencies between skeleton joints. Finally, to fuse topology information from neighbor skeleton joints, spatiotemporal features of skeleton joints are fused based on channel-dependent and temporal-dependent adjacency matrices. To the best of our knowledge, we are the first to use temporal-dependent adjacency matrices for temporal-sensitive topology learning from skeleton joints. The proposed TD-GCN effectively improves the modeling ability of GCN and achieves state-of-the-art results on gesture datasets including SHREC'17 Track and DHG-14/28.
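The core idea of applying a different adjacency matrix to each frame can be illustrated with a minimal sketch. It covers only the temporal-dependent part of the aggregation; the channel-dependent matrices and the learning of the adjacency values are omitted. The function names and the hand-written toy matrices are hypothetical and are not taken from the TD-GCN implementation.

```python
from typing import List

Matrix = List[List[float]]


def aggregate_joints(adjacency: Matrix, features: Matrix) -> Matrix:
    """Fuse neighbor features for one frame: out[i] = sum_j adjacency[i][j] * features[j]."""
    num_joints = len(features)
    dim = len(features[0])
    return [
        [sum(adjacency[i][j] * features[j][d] for j in range(num_joints)) for d in range(dim)]
        for i in range(num_joints)
    ]


def temporal_decoupled_aggregation(adjacencies: List[Matrix], sequence: List[Matrix]) -> List[Matrix]:
    """Apply a separate adjacency matrix to each frame instead of one shared matrix."""
    if len(adjacencies) != len(sequence):
        raise ValueError("need exactly one adjacency matrix per frame")
    return [aggregate_joints(a, x) for a, x in zip(adjacencies, sequence)]


# Toy example (assumed values): 2 frames, 3 joints, 1 feature channel.
sequence = [
    [[1.0], [2.0], [3.0]],
    [[1.0], [2.0], [3.0]],
]
adjacencies = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # frame 0: identity, no mixing
    [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],  # frame 1: chain neighbors
]
fused = temporal_decoupled_aggregation(adjacencies, sequence)
```

Although both frames carry identical joint features, the fused outputs differ because each frame uses its own topology, which is the flexibility a single shared adjacency matrix cannot provide.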

Affiliations: Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China; Regional Medical Center for the National Institute of Respiratory Diseases, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, China; School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; AI Lab, Lenovo Research, Beijing, China; Department of Medical Oncology at Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, China

Abstract:
Medical report generation, which produces a report for a given radiology image, has been attracting increasing research interest. However, existing methods mainly adopt supervised training, which relies on a large number of medical reports that are in practice unavailable owing to the labor-intensive labeling process and privacy protection protocols. Meanwhile, the intrinsic relationships between local pathological changes in the image are often ignored, although they are important hints for high-quality report generation. To this end, we propose a Relation-Aware Mean Teacher (RAMT) framework, which follows a standard mean teacher paradigm for semi-supervised report generation. The key to the encoder of the backbone network is the Graph-guided Hybrid Feature Encoding (GHFE) module, which exploits a prior disease knowledge graph to encode the intrinsic relations between pathological changes into the graph embedding and learns a word dictionary to retrieve the semantic embedding for each potential pathological change. GHFE combines the graph embedding, semantic embedding and visual features to form hybrid features, which are sent to a Transformer-based decoder for report generation. Extensive experiments on the MIMIC-CXR and IU X-Ray datasets demonstrate the effectiveness of our proposed approach.
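In the standard mean teacher paradigm, the teacher's weights are maintained as an exponential moving average (EMA) of the student's weights. The sketch below illustrates only that update rule; the parameter names, the dictionary representation of the weights and the decay value are assumptions for illustration, not details of RAMT.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float], decay: float = 0.99) -> Dict[str, float]:
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name] for name in teacher}


# Toy weights (hypothetical); a decay of 0.5 keeps the arithmetic easy to follow.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.5)
```

In practice the decay is typically chosen close to 1 so that the teacher changes slowly and provides stable targets for unlabeled samples.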

Affiliations: National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; PCA Lab, the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; College of Computer and Control Engineering, Minjiang University, Fuzhou, China; College of Computer Science, Sichuan University, Chengdu, China

Abstract:
Light field salient object detection (SOD) has shown remarkable success and gained considerable attention from the computer vision community. Existing methods usually employ a single-/two-stream network to detect saliency. However, these methods can only handle up to two different modalities at a time, preventing them from being able to fully explore the rich information in multi-modal light field derived data. To address this, we propose the first joint multi-modal learning framework, called FES-Net, for light field SOD, which can take rich inputs not limited to two modalities. Specifically, we propose an attention-aware adaptation module to first transform the multi-modal inputs for use in our joint learning framework. The transformed inputs are then fed to a Siamese network along with multiple embedded feature fusion modules to extract informative multi-modal features. Finally, we predict saliency maps from the high-level extracted features using a saliency decoder module. Our joint multi-modal learning framework effectively resolves the limitations of existing methods, providing efficient and effective multi-modal learning that can fully explore the valuable information in light field data for accurate saliency detection. Furthermore, we improve the performance by introducing the Transformer as our backbone network. To the best of our knowledge, the improved version of our model, called FES-Trans, is the first attempt to address the challenging light field SOD with the powerful Transformer technique. Extensive experiments on benchmark datasets demonstrate that our models are superior light field SOD approaches and outperform cutting-edge models remarkably.

Abstract:
Emotion is one of the most crucial attributes of music. However, due to the scarcity of emotional music datasets, emotion-conditioned symbolic music generation using deep learning techniques has not been investigated in depth. In particular, no study explores conditional music generation with the guidance of emotion, and few studies adopt time-varying emotional conditions. To address these issues, first, we endow three public lead sheet datasets with fine-grained emotions by automatically computing the valence labels from the chord progressions. Second, we propose a novel and effective encoder-decoder architecture named EmoMusicTV to explore the impact of emotional conditions on multiple music generation tasks and to capture the rich variability of musical sequences. EmoMusicTV is a transformer-based variational autoencoder (VAE) that contains a hierarchical latent variable structure to model holistic properties of the music segments and short-term variations within bars. The piece-level and bar-level emotional labels are embedded in their corresponding latent spaces to guide music generation. Third, we pretrain EmoMusicTV with the lead sheet continuation task to further improve its performance on conditional melody or harmony generation. Experimental results demonstrate that EmoMusicTV outperforms previous methods on three tasks, i.e., melody harmonization, melody generation given harmony, and lead sheet generation. Ablation studies verify the significant roles of emotional conditions and hierarchical latent variable structure on conditional music generation. Human listening shows that the lead sheets generated by EmoMusicTV are closer to the ground truth (GT) and perform slightly worse than the GT in conveying emotional polarity.

Abstract:
The availability of datasets annotated with events that can be verified by the public is a necessary prerequisite for unleashing the potential of multimodal deep learning for news event detection. Publicly available datasets are either incompletely annotated due to the high cost of annotation, or ignore the verifiability of event labels, which makes them susceptible to bias and errors introduced by a limited number of annotators. In this article, we provide a YouTube dataset labelled with real-world news events that can be verified on Wikipedia-like crowdsourcing platforms, with the aim of advancing temporal event detection. The events in our dataset cover a wide range of topics, including public security, natural disasters, elections, sports, and entertainment events. In the dataset, each sample is labelled with a real-world event that is verifiable by the public. We extensively evaluate the performance of 13 state-of-the-art algorithms on our dataset in a temporal manner, involving the multiple relationships between training and testing event labels, and provide a thorough analysis of the findings.

Abstract:
Unsupervised fine-grained image generation is a challenging issue in computer vision. Although many recent advances have improved performance, synthesizing photo-realistic images in an unsupervised manner remains extremely difficult. Existing methods compose an image via complex three-stage generative adversarial networks and impose constraints between the latent codes. This pipeline focuses on disentanglement and ignores the quality of the generated images. In this article, we propose a novel two-stage approach for unsupervised fine-grained image generation, termed Model-Guided Generative Adversarial Networks (MG-GAN). We introduce an attention module for exploring the correlation between fine-grained latent codes and image features in the foreground generation stage. The attention module enables the network to automatically focus on the color details and semantic concepts of objects related to different fine-grained classes. Furthermore, we incorporate a knowledge distillation strategy and design a simple but effective inverse background image generator as a teacher to guide the background image generation. With the help of the knowledge learned by the pre-trained inverse background image generator, a suitable canvas is synthesized and combined with the foreground object more reasonably. Extensive experiments on three popular fine-grained datasets demonstrate that our approach achieves state-of-the-art performance and is even competitive with semi-supervised methods.

Abstract:
Local feature learning is of great significance in classic vision tasks such as visual localization, image matching and 3D reconstruction. Because training samples are limited, the weakly-supervised strategy has become one of the widely studied and effective schemes for local feature learning. It still has weaknesses that need further improvement, mainly the discrimination power of the extracted local descriptors, the localization accuracy of the detected keypoints, and the efficiency of weakly-supervised local feature learning. Focusing on improving the performance of sparse local feature learning with camera pose supervision, this article proposes a Shared Coupling-bridge scheme with four light-weight yet effective improvements for weakly-supervised local feature (SCFeat) learning. It mainly contains: i) the Feature-Fusion-ResUNet Backbone (F2R-Backbone) for local descriptor learning, ii) a shared coupling-bridge normalization to improve the decoupled training of the description network and the detection network, iii) an improved detection network with peakiness measurement to detect keypoints and iv) a new reward factor of fundamental matrix error to further optimize feature detection training. Extensive experiments show that our SCFeat scheme is effective and adapts to a wide range of tasks. It often obtains state-of-the-art performance on classic image matching and visual localization. Even in terms of 3D reconstruction, it still achieves competitive results.

Abstract:
Large collections of geo-referenced panoramic images are freely available for cities across the globe, as well as detailed maps with location and meta-data on a great variety of urban objects. They provide a potentially rich source of information on urban objects, but manual annotation for object detection is expensive, laborious and challenging. Can we utilize such multimedia sources to automatically annotate street level images as an inexpensive alternative to manual labeling? With the PanorAMS framework we introduce a method to automatically generate bounding box annotations for panoramic images based on urban context information. Following this method, we acquire large-scale, albeit noisy, annotations for an urban dataset solely from open data sources in a fast and automatic manner. The dataset covers the City of Amsterdam and includes over 14 million noisy bounding box annotations of 22 object categories present in 771,299 panoramic images. For many objects further fine-grained information is available, obtained from geospatial meta-data, such as building value, function and average surface area. Such information would have been difficult, if not impossible, to acquire via manual labeling based on the image alone. For detailed evaluation, we introduce an efficient crowdsourcing protocol for bounding box annotations in panoramic images, which we deploy to acquire 147,075 ground-truth object annotations for a subset of 7,348 images, the PanorAMS-clean dataset. For our PanorAMS-noisy dataset, we provide an extensive analysis of the noise and how different types of noise affect image classification and object detection performance.

Abstract:
Multiple human parsing (MHP) is typically treated as two sub-tasks, i.e., instance separation and body part segmentation. Existing methods usually tackle the sub-tasks by adopting a two-stage strategy, which regards MHP as an ROI-based (i.e., detect-then-segment) or grouping-based (i.e., segment-then-group) paradigm. However, the strong dependence between the two sub-tasks limits the potential of an MHP method, since it often requires qualified prior predictions. Besides, isolated models responsible for the two sub-tasks bring a significant computational burden. Unlike existing methods, we regard MHP as a hierarchical set prediction problem and handle the two sub-tasks using several landmarks of body parts. Motivated by this, we propose a novel multiple human parser with representative sets, termed ReSParser. In ReSParser, several landmarks of body parts are hierarchically estimated, resulting in coarse-to-fine representative sets. After that, each representative set is adaptively responsible for segmenting pixels into semantically consistent regions belonging to the corresponding person. In this manner, ReSParser simultaneously addresses the two sub-tasks in a fully convolutional fashion, thus eliminating the dependence between them and significantly reducing computational complexity. Extensive experiments on two challenging benchmarks demonstrate that our proposed ReSParser is an efficient framework with superior parsing performance, significantly outperforming other ROI-free and grouping-free methods. Besides, it achieves results competitive with those of the best two-stage methods such as RP-RCNN, but requires a much lower inference time, showing a good precision-speed trade-off. We hope that ReSParser serves as a new baseline for multiple human parsing research in the future.

Abstract:
In vision Transformers, attention visualization methods are used to generate heatmaps highlighting the class-corresponding areas in input images, which offer explanations of how the models make predictions. However, these methods are not directly applicable to explaining automatic speech recognition (ASR) Transformers. An ASR Transformer makes a particular prediction for every input token to form a sentence, whereas a vision Transformer only makes an overall classification for the input data. Therefore, traditional attention visualization methods may fail in ASR Transformers. In this work, we propose a novel attention visualization method for ASR Transformers and try to explain which frames of the audio result in the output text. Inspired by model explainability, we also explore ways of improving the effectiveness of the ASR model. Compared with other Transformer attention visualization methods, our method is more efficient and intuitively understandable, as it unravels the attention calculation from the information flow of Transformer attention modules. In addition, we demonstrate the use of the visualization results in three ways: (1) We visualize attention with respect to connectionist temporal classification (CTC) loss to train an ASR model with adversarial attention erasing regularization, which effectively decreases the word error rate (WER) of the model and improves its generalization capability. (2) We visualize the attention on some specific words, interpreting the model by effectively demonstrating the semantic and grammatical relationships between these words. (3) Similarly, we analyze how the model manages to distinguish homophones, using contrastive explanation with respect to homophones.

Abstract:
Arbitrary style transfer is attracting increasing attention in the computer vision community due to its application flexibility. Existing approaches directly fuse deep style features with deep content features or adaptively normalize content features for global statistical matching. Although effective, these approaches are prone to locally unnatural outputs and artifacts because they do not explore the global contextual semantic distribution of style image features. In this article, a novel global context self-attentional network (GCSANet) is proposed to efficiently generate high-quality stylized results based on the global semantic spatial distributions of style images. First, a context modeling module is proposed to aggregate the deep features of style images into global context features. Then, channel-wise interdependencies are captured with the feature transform module. Finally, the style features are appropriately aggregated to each location of the content image. In addition, novel external contrastive losses are proposed to balance the distribution of content and style features to ensure the reasonableness of the texture patterns in the stylized images. The ablation studies validate the effectiveness of the proposed components. Various quantitative and qualitative experiments demonstrate the superiority of our method for real-time arbitrary image/video style transfer.

Abstract:
Single-stage models for multi-person pose estimation have garnered significant attention due to their streamlined approach of performing person position localization and body structure perception in a single pass. These two parts, however, are processed individually by existing methods, leading to suboptimal results, e.g., candidates with high person localization confidence but poor structure estimation. To this end, we propose a simple yet effective approach, namely Structure-guided Person Localization (SPL), which jointly leverages the advantages of the two aspects to solve the multi-person pose estimation problem, with two complementary novelties. First, we propose to incorporate body structure perception to guide person position localization. Accordingly, we introduce Structure-guided Center Learning (SCL) to unify the quality of the body structure perception in the displacement map with the confidence of person existence in the center map, thus achieving more accurate keypoint localization results even with extreme poses. Second, to facilitate the end-to-end training of SPL, we propose the efficient Agency-based Scale-adaptive Learning (ASL). Specifically, we predict an agency map of the same size as the center map, which focuses on the foreground area and can adaptively adjust the scale of each central area according to the body structure perception confidence. Comprehensive experiments on challenging benchmarks including COCO and CrowdPose clearly verify the superiority of our framework, which achieves new state-of-the-art single-stage multi-person pose estimation results. Specifically, SPL obtains 72.1 AP and 69.5 AP on the COCO test-dev2017 set and the CrowdPose test set, respectively.

Abstract:
Single-view point cloud reconstruction aims to generate a 3D point cloud of an object given one 2D image taken from an arbitrary viewpoint. Most previous works assume that all the test categories have been presented to the model during training. However, it is impossible to know in advance all the test categories that the model will meet, and we find that these methods cannot handle novel categories well. Therefore, in this article, we investigate a more realistic and challenging zero-shot setting of single-view point cloud reconstruction, where the model's performance on novel categories is pursued. For this task, we propose the Cross-Category Knowledge Transferring Network (CCKTN), which maintains a knowledge bank to mine transferable knowledge from known categories to help reconstruct novel categories. Additionally, we conduct auxiliary learning for the point cloud reconstruction model with a point cloud autoencoder that shares the same knowledge bank. This design enables the knowledge bank to collect richer 3D knowledge of point clouds. Moreover, we devise a diversity loss regularization for the knowledge vectors to guarantee their diversity, further enhancing CCKTN's performance. Comprehensive experiments conducted on the ShapeNet and ModelNet datasets show CCKTN's superiority over existing methods and demonstrate its effectiveness for reconstructing objects of novel categories.

Abstract:
Few-shot image classification aims to recognize unseen classes with few labeled samples. Existing meta-learning models learn to learn good representations or model parameters in order to adapt to new tasks with a few training samples. However, when there is a domain gap between training and test tasks, the learned ability often does not generalize well across domains, resulting in degraded performance on new tasks. In this article, we propose variational neuron shifting to generate adapted feature representations for few-shot learning. To do so, we introduce a working memory module to store the shifted neurons from the support set, which is accessed to generate adapted feature representations of query samples. Under the meta-learning paradigm, the model learns to adapt with a single sample at meta-training time so that it can further adapt itself to each single test sample at meta-test time. We formulate the adaptation process as a variational Bayesian inference problem, which incorporates the test sample as the condition into the generation of the model neuron shifting. We conduct extensive experiments on both within-domain and cross-domain few-shot classification tasks. The new state-of-the-art performance substantiates the effectiveness of our variational neuron shifting. Thorough ablation studies further demonstrate the benefit of each component in our model.

Abstract:
Video scene graph generation has been an emerging research topic, which aims to interpret a video as a temporally-evolving graph structure by representing video objects as nodes and their relations as edges. Existing approaches predominantly follow a multi-step scheme, including frame-level object detection, relation recognition and temporal association. Although effective, these approaches neglect the mutual interactions between independent steps, resulting in a sub-optimal solution. We present a novel end-to-end framework for video scene graph generation, which naturally unifies object detection, object tracking, and relation recognition via a new Transformer structure, namely Temporal Propagation Transformer (TPT). Particularly, TPT extends the existing Transformer-based object detector (e.g., DETR) along the temporal dimension by involving a query propagation module, which can additionally associate the detected instances by identities across frames. A temporal dynamics encoder is then leveraged to dynamically enrich the features of the detected instances for relation recognition by attending to their historic states in previous frames. Meanwhile, the relation propagation strategy is devised to emphasize the temporal consistency of relation recognition results among adjacent frames. Extensive experiments conducted on VidHOI and Action Genome benchmarks demonstrate the superior performance of the proposed TPT over the state-of-the-art methods.

Abstract:
Fashion attribute recognition is not a new topic, but it remains a core task in understanding fashion from the perspective of computer vision. This article proposes a structured relation-aware network (sRA-Net), which exploits multiple hidden relations in fashion images to enrich and achieve accurate attribute representations to boost the performance of fashion attribute recognition. Specifically, it deconstructs the features of a clothing fashion item into three levels, including low-level attribute-related image region information, mid-level attribute dependency information, and high-level clothing look information. To learn these multi-relational embeddings, we present three relation-aware attention mechanisms. The attribute attention mechanism describes the relationship among different attribute vectors through self-attention and uses the attention map to update the attribute embedding. Then, the spatial attention mechanism associates the attribute with the image features and enhances the attribute embedding by leveraging the attribute-related image region. Finally, the channel attention mechanism selects attribute-related image feature channels to obtain a more fine-grained attribute embedding. Furthermore, we introduce structure-aware embedding to constrain attribute recognition in images from a global perspective by identifying the inner structure of the clothing. Without bells and whistles, sRA-Net outperforms all state-of-the-art attribute recognition methods on two mainstream fashion attribute datasets, namely the DeepFashion-C dataset and iFashion-Attribute dataset, with over 1%-3% improvement.

Abstract:
Flow-based and deformable convolution (DConv)-based methods are two mainstream approaches for solving the video frame interpolation (VFI) problem, which have made remarkable progress with the development of deep convolutional networks over the past years. However, flow-based VFI methods often suffer from the inaccuracy of flow map estimation, especially in dealing with complex and irregular real-world motions. DConv-based VFI methods have advantages in handling complex motions, while the increased degree of freedom makes the training of the DConv model difficult. To address these problems, in this article, we propose a flow guidance deformable compensation network (FGDCN) for the VFI task. FGDCN decomposes the frame sampling process into two steps: a flow step and a deformation step. Specifically, the flow step utilizes a coarse-to-fine flow estimation network to directly estimate the intermediate flows and synthesizes an anchor frame simultaneously. To ensure the accuracy of the estimated flow, a distillation loss and a task-oriented loss are jointly employed in this step. Under the guidance of the flow priors learned in step one, the deformation step designs a new pyramid deformable compensation network to compensate for the missing details of the flow step. In addition, a pyramid loss is proposed to supervise the model in both the image and frequency domains. Experimental results show that the proposed algorithm achieves excellent performance on various datasets with fewer parameters.

Abstract:
Text-driven image editing aims to manipulate images with the guidance of natural language descriptions. Text is much more natural and intuitive than many other interaction modes and has attracted increasing attention recently. However, unlike classical supervised learning tasks, there is so far no standard benchmark dataset for text-driven interactive image editing. Therefore, it is hard to train an end-to-end model for pixel-aligned interactive image editing driven by text. Some methods follow the paradigm of text-to-image models by incorporating the target image into the process of text-to-image generation. However, these methods, which rely on cross-modal text-to-image generation, involve complicated and expensive models and can lead to inconsistent editing effects. In this article, a novel text-driven image editing method is proposed. Our key observation is that this task can be learned more efficiently using image-to-image translation. To ensure effective learning for image editing, our framework takes paired text and the corresponding images for training, and disentangles each image into content and attributes, such that the content is maintained while the attributes are modified according to the text. Our network is a lightweight encoder-decoder architecture that accomplishes pixel-aligned end-to-end training via cycle-consistent supervision. Quantitative and qualitative experimental results show that the proposed method achieves state-of-the-art performance.

Abstract:
Personalized image aesthetics assessment (PIAA) aims to model the unique aesthetic preferences of individuals, based on which personalized aesthetic scores are predicted. People have different standards for image aesthetics, and accordingly, images rated at the same aesthetic level by different users explicitly reveal their aesthetic preferences. However, previous PIAA models treat each individual as an isolated optimization target, failing to take full advantage of the contrastive information among users. Further, although people's aesthetic preferences are unique, they still share some commonalities, meaning that PIAA models could be built on the basis of generic aesthetics. Motivated by these facts, this article presents a Multi-level Transitional Contrast Learning (MTCL) framework for PIAA that transitions features from generic aesthetics to personalized aesthetics via contrastive learning. First, a generic image aesthetics assessment network is pre-trained to learn the common aesthetic features. Then, image sets rated at the same aesthetic levels by different users are employed to learn the differentiated aesthetic features through multiple level-wise contrast learning based on the generic aesthetic features. Finally, a target user's PIAA model is built by integrating generic and differentiated aesthetic features. Extensive experiments on four benchmark PIAA databases demonstrate that the proposed MTCL model outperforms state-of-the-art methods.

Abstract:
Next Point-of-Interest (POI) recommendation seeks to recommend locations that users are most likely to visit next based on their historical trajectories, providing both users and service providers with substantial benefits. However, most next POI recommendation methods calculate the distances between POIs when mining spatial information and adjust their weights accordingly, ignoring the characteristics and multimedia content features of the regions in which POIs are located. In addition, next POI recommendation suffers from the long tail effect, in which only a small portion of POIs appear frequently in users' recommendation lists due to their high popularity, while the remainder maintain a low presence. To this end, we propose a cross-task multimodal reinforcement method, which enriches the representations of regions by incorporating information from auxiliary domains. Moreover, we devise a cross-task reinforcement module to effectively integrate the local representations with pre-trained encoders from auxiliary domains. The enhanced region representations contain constructive district properties, which help to find proper POIs that suit users' tastes and thus alleviate the long tail effect. Experiments conducted on two real-world datasets indicate that our proposed method outperforms state-of-the-art models in terms of both general performance and performance on niche POIs.

Abstract:
Siamese trackers usually use the target in the first frame as a fixed template, but the static template cannot adapt to target changes. Existing updaters struggle to deal with target deformation and update noise, and there is a high risk of updating with an inaccurate updater. In our research, a dynamic template updating strategy based on spatial-temporal information is proposed to improve the tracking accuracy of the Siamese tracker. Furthermore, a Tracking Confidence Network (TCNet) is proposed to judge whether to update, which ensures that high-quality target features are used for updating and reduces the noise caused by adding unreliable targets. In experiments, the proposed method is embedded into two baseline trackers, SiamRPN and SiamFC++, and tested on five popular benchmarks. The experimental results show that the proposed method can improve the performance of the Siamese trackers while maintaining real-time speed.
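
The confidence-gated update described above can be illustrated with a minimal sketch. This is not the authors' implementation: TCNet is a learned network, whereas here the confidence is simply a number passed in, and the function name `update_template`, the threshold, and the linear blending rate are all assumptions made for illustration.

```python
def update_template(template, candidate, confidence, threshold=0.8, rate=0.1):
    """Blend candidate features into the template only when confidence is high.

    The threshold and the linear blending rule are illustrative assumptions,
    not values or rules taken from the paper.
    """
    if confidence < threshold:
        # Unreliable candidate: keep the current template unchanged.
        return list(template)
    # Reliable candidate: move the template a small step toward it.
    return [(1.0 - rate) * t + rate * c for t, c in zip(template, candidate)]


template = [1.0, 0.0, 2.0]
candidate = [0.0, 1.0, 2.0]

kept = update_template(template, candidate, confidence=0.3)      # update rejected
updated = update_template(template, candidate, confidence=0.95)  # update accepted
```

Gating the update this way means a low-confidence frame leaves the template untouched, which is the intuition behind reducing noise from unreliable targets.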

Abstract:
Recent research has demonstrated that Vision Transformers (ViTs) are capable of comparable or even better performance than convolutional neural network (CNN) baselines. The differences in their structural designs are obvious, but our understanding of the differences in their feature representations remains limited. In this work, we propose several techniques to achieve high-quality visualization of representations in ViTs. Both qualitative and quantitative experiments show that our technical improvements can observably improve ViT visualization quality compared to previous studies. Furthermore, we conduct visualizations to explore the disparities between ViTs and CNNs pre-trained on ImageNet1K, revealing three intriguing properties of ViTs: a) ViT feature propagation retains image detail information with minimal loss, whereas CNNs discard most image details for class discrimination. b) Different from CNNs, object-related features do not show in ViT higher layers, suggesting that class-discriminative features may not be required for ViT classification. c) Our visualization-assisted texture-bias experiment reveals that both ViTs and CNNs exhibit texture bias, of which ViTs seem to be more biased towards local textures.

Abstract:
As the cornerstone of human-behavior analysis in video understanding, temporal action proposal generation aims to predict the starting and ending time of human action instances in untrimmed videos. Although great progress has been made in temporal action proposal generation, most previous studies ignore the variability of action frequency in raw videos, leading to unsatisfactory performance on high-action-frequency videos. In fact, there exist two main issues that should be addressed: data imbalance between high and low action-frequency videos, and inferior detection of short actions in high-action-frequency videos. To address the above issues, we propose an effective framework that adapts to the variability of action frequency, namely Action Frequency Adaptive Network (AFAN), which can be flexibly built upon any temporal action proposal generation method. AFAN consists of two modules: Learning From Experts (LFE) and Fine-Grained Processing (FGP). The LFE first trains a series of action proposal generators on different subsets of imbalanced data as experts and then teaches a unified student model via knowledge distillation. To better detect short actions, FGP first identifies high-action-frequency videos and then performs fine-grained detection. Extensive experimental results on four benchmark datasets (ActivityNet-1.3, HACS, THUMOS14 and FineAction) demonstrate the effectiveness and generalizability of the proposed AFAN, especially for high-action-frequency videos.

Abstract:
Due to the light absorption and scattering in waterbodies, acquired underwater images frequently suffer from color cast, blur, low contrast, noise, etc., which seriously degrade the image quality and affect their subsequent applications. Therefore, it is necessary to propose a reliable and practical underwater image quality assessment (IQA) model that can faithfully evaluate underwater image quality. To this end, in this article, we establish a novel quality assessment model for underwater images by in-depth analysis and characterization of multiple image properties. Specifically, we propose characterizing the image luminance, color cast, sharpness, contrast, fog density and noise to comprehensively describe the image quality to evaluate the underwater image quality more accurately. Dedicated features are elaborately investigated to characterize those quality-aware image properties. After feature extraction, we employ support vector regression (SVR) to integrate all the quality-aware features and regress them onto the underwater image quality score. Extensive tests performed on standard underwater image quality databases demonstrate the superior prediction performance of the proposed underwater IQA model to state-of-the-art congeneric quality assessment models.

Abstract:
The existing face forgery algorithms have achieved remarkable progress in generating plausible facial images and can even successfully deceive human beings. Considering public security, face forgery detection is of vital importance, making it essential to design face forgery detection algorithms to detect forgery images over the Internet. Despite the great success achieved by the existing Deepfake detection algorithms, they usually fail to achieve satisfactory Deepfake detection performance when deployed to handle forgery videos in practice. One significant reason is compression. Videos on the Internet are inevitably compressed for transmission efficiency, and this compression significantly degrades the performance of existing Deepfake detection algorithms. To address this issue, in this article, we propose a generic, simple yet effective “bleaching” pre-processing module based on the generative model and the high-level feature representations to produce a bleached image, which shares a similar appearance with the compressed images. The bleached images with recovered information can be identified accurately by the optimized Deepfake detection models without retraining. The proposed method utilizes a redesigned feature representation, which serves as a navigator to effectively and sufficiently alter the feature distribution in the high-dimensional space to remedy the difference between real facial images and forgery counterparts. Thus, the proposed method can successfully avoid misclassification. Comprehensive and extensive experiments are carried out on four low-quality Faceforensics++ datasets, demonstrating the effectiveness of our method in recovering the information loss caused by the compression artifacts across various backbones and compression.

Abstract:
This article draws inspiration from deep learning image restoration technology. If users involved in the rumor topic are regarded as pixels in an image, the uncertainty of user behavior is similar to the ambiguity of pixels in a mosaic image. The prediction of user behavior is influenced by the user and neighboring friends. Similarly, the recovery of mosaic image pixels is also influenced by these pixels and neighboring pixels. Thus, during rumor propagation, the prediction of user behavior is equivalent to the restoration of pixels in the mosaic image. Based on this inspiration, this study proposes a rumor propagation prediction model based on image restoration technology. First, we propose the concept of topic images and design the rumor2pixel algorithm to pixelate the topic of rumor propagation. Second, through the Generative Adversarial Network model, fuzzy pixels in the “rumor topic image” are compensated to learn more realistic rumor propagation trends. Finally, a dynamic approach for predicting the propagation of rumor and countering it based on evolutionary game theory is proposed, named Rumor-DPM (rumor dynamic propagation model). This approach is focused on reconstructing rumor images while taking into account the conflict between rumors and anti-rumors as well as its timeliness. The experimental findings demonstrate that this strategy can more accurately depict the internal dynamics between rumors and anti-rumors and effectively and successfully improve the ability to forecast user behavior throughout the rumor-propagation process.

Affiliations: School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, China; Shandong Provincial Key Laboratory of Digital Media Technology, School of Computer Science and Technology, Shandong China-U.S. Digital Media International Cooperation Research Center, Shandong University of Finance and Economics, Jinan, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; School of Software, Shandong University, Jinan, China

Abstract:
Tiny objects often occupy a small proportion of pixels in the image, leading to significant differences in the number of positive and negative samples and a lack of feature information. Accurately determining the position and category of tiny objects remains a huge challenge for object detection research. Therefore, we design an Adaptive Sample Assignment Strategy (ASAS) and a tiny object focusing enhancement module to solve the above two problems. Specifically, starting from the study of positive and negative sample selection and balance strategies for tiny objects, we construct a lightweight Object Existence Probability Determination Network (OEPD-Net) to focus on the areas where tiny objects exist, and achieve adaptive assignment and balance of samples. A top-down, layer-by-layer focusing enhancement module is designed to effectively enhance the propagation ability of high-level semantic information for tiny objects. The above two solutions have excellent generalization and migration capabilities and can be applied to any one-stage and two-stage object detection network, effectively enhancing tiny object detection (TOD) performance. Finally, this article provides a performance analysis of the detection network based on the OEPD-Net output results, and demonstrates the effectiveness of the proposed OEPD-Net and focusing enhancement module through extensive experiments on a public dataset.

Abstract:
Undoubtedly, the object is the primary factor in 3D single-object tracking (SOT) tasks. However, prior Siamese-based trackers overlook the adverse effects resulting from randomly dropped object points during backbone sampling, hindering the prediction of accurate bounding boxes (BBoxes). Therefore, developing an approach that maximizes the preservation of object points and their object-aware features is of the utmost significance. To address this, we propose an object-preserving Siamese network (OPSNet) that can effectively maintain object integrity and boost tracking performance. First, an object highlighting module amplifies the object-aware features and extracts discriminative features from the template and search area. Next, object-preserving sampling selects object candidates, obtains object-preserving search area seeds, and discards background points that have less impact on tracking. Finally, an object localization network accurately locates 3D BBoxes based on the object-preserving search area seeds. Extensive experiments demonstrate that the performance of OPSNet exceeds the state-of-the-art performance, achieving success gains of ~9.4% and ~2.5% on the KITTI and Waymo Open datasets, respectively.

Abstract:
In multimodal representation learning, different modalities do not contribute equally. Especially when learning with noisy modalities that convey non-discriminative information, the prediction based on multimodal representation is often biased and even ignores the knowledge from informative modalities. In this paper, we aim to address the noisy modality problem and balance the contributions of multiple modalities dynamically in a parallel format. Specifically, we construct multiple base learners and formulate our framework as a boosting-like algorithm, where different base learners focus on different aspects of multimodal learning. To identify the contributions of individual base learners, we develop a contribution learning network that dynamically determines the contribution and noise level of each base learner. In contrast to the commonly considered attention mechanism, we define the transformation of predictive loss as the supervision signal to train the contribution learning network, which enables more accurate learning of modality importance. We derive the final prediction by incorporating the predictions of base learners based on their contributions. Notably, different from late fusion, we devise a multimodal base learner to explore the cross-modal interactions. To update the network, we design the ‘complementary update mechanism’, where for each base learner, we assign higher weights to those samples that are incorrectly predicted by other base learners. In this way, we can leverage the available information to correctly predict each sample to the utmost extent and enable different base learners to learn different aspects of multimodal information. Extensive experiments demonstrate that the proposed method achieves superior performance on multimodal sentiment analysis and emotion recognition.
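
As a rough illustration of the complementary update idea (higher weights for samples that the other base learners get wrong), the sketch below computes per-sample weights for one learner. The weighting rule `1 + alpha * miss_rate`, the normalization, and the function name `complementary_weights` are assumptions for illustration; the abstract does not specify the exact formula.

```python
def complementary_weights(correct, learner, alpha=1.0):
    """Per-sample weights for one base learner.

    correct[k][i] is True if base learner k predicted sample i correctly.
    A sample gets a larger weight the more often the OTHER learners missed it.
    The rule 1 + alpha * miss_rate is an illustrative assumption.
    """
    others = [row for k, row in enumerate(correct) if k != learner]
    n_samples = len(correct[learner])
    raw = []
    for i in range(n_samples):
        miss_rate = sum(1 for row in others if not row[i]) / len(others)
        raw.append(1.0 + alpha * miss_rate)
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


# Rows are base learners, columns are samples.
correct = [
    [True, True, False],
    [True, False, False],
    [True, False, True],
]
weights = complementary_weights(correct, learner=0)
```

For learner 0, the second sample is missed by both other learners and therefore receives the largest weight, while the first sample, which everyone predicts correctly, receives the smallest.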

Abstract:
Product quantization is an effective strategy for compact feature learning in image retrieval, which generates compact quantization codes of different lengths for varying scenarios. However, existing deep quantization methods obtain quantization codes with different lengths by training multiple models separately for each code length, which brings about large training time cost and degrades deployment flexibility. To this end, we propose a new deep scalable Progressive Similarity Preservation Product Quantization (PSPPQ) framework, which enables us to train the quantized features in different code lengths simultaneously and imposes no additional cost during inference. By progressively approximating the ground truth similarity of image pairs, we achieve direct optimization of similarity ranking, which improves the retrieval accuracy and generates sequential quantization codes with more efficiency. Besides, by combining the advantages of classification loss and hinge loss, we design a semantic ArcFace loss to optimize our network architecture. Experiments on three datasets demonstrate the effectiveness of our proposed method with variable code lengths for scalable image retrieval.
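
For readers unfamiliar with product quantization, the following minimal sketch shows the basic encoding step only: a vector is split into sub-vectors and each is replaced by the index of its nearest codeword. It does not reproduce PSPPQ or its losses; the toy codebooks are made up, and treating a prefix of the code as a shorter code is an assumption used here merely to illustrate variable code lengths.

```python
def nearest(codebook, sub):
    """Index of the codeword closest to sub in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(word, sub)) for word in codebook]
    return dists.index(min(dists))


def pq_encode(vector, codebooks):
    """Split vector into equal sub-vectors and quantize each with its own codebook."""
    m = len(codebooks)      # number of sub-spaces
    d = len(vector) // m    # dimension of each sub-vector
    return [nearest(codebooks[j], vector[j * d:(j + 1) * d]) for j in range(m)]


# Two toy codebooks (hypothetical values), each with two 2-D codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
code = pq_encode([0.9, 1.2, 0.8, 0.1], codebooks)
short_code = code[:1]  # assumption: a prefix serves as a shorter code
```

Each entry of `code` needs only enough bits to index one codebook, which is what makes the representation compact.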

Abstract:
Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories. In this article, we contend that the learning bias originates from two factors: 1) the unequal competition arising from the imbalanced distribution of foreground categories, and 2) the lack of sample diversity in tail categories. To tackle these issues, we introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution and dynamic intensification of sample diversities in a synchronized manner. Specifically, a novel foreground classification balance loss (FCBL) is developed to ameliorate the domination of head categories and shift attention to difficult-to-differentiate categories by introducing pairwise class-aware margins and auto-adjusted weight terms, respectively. This loss prevents the over-suppression of tail categories in the context of unequal competition. Moreover, we propose a dynamic feature hallucination module (FHM), which enhances the representation of tail categories in the feature space by synthesizing hallucinated samples to introduce additional data variances. In this divide-and-conquer approach, BACL sets a new state-of-the-art on the challenging LVIS benchmark with a decoupled training pipeline, surpassing vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP and 16.1% AP for overall and tail categories. Extensive experiments demonstrate that BACL consistently achieves performance improvements across various datasets with different backbones and architectures.

Abstract:
Fine-grained image classification (FGIC) aims to separate different subcategories from one general superclass, which requires the classification model to extract distinctive representations from subtle yet discriminative regions of the objects. Learning multiple part representations can give a detailed description of the object from different perspectives, boosting the classification performance. However, it remains a challenging problem to effectively locate diverse parts and extract their features without the assistance of part annotations. In this article, we present a novel method to achieve accurate fine-grained image classification by learning a set of diverse and discriminative part representations without requiring additional supervision. First, our method utilizes a simple attention interaction module to lead learned spatial attentions to focus on different parts, resulting in mutually exclusive part representations. Then, to reduce the impairment of channel coupling among part representations, a part-wise channel weighting module is designed to adjust the amplitudes of different representations adaptively, making them diverse along the channel dimension. Moreover, to ensure comprehensive and sufficient part representations, our method introduces multi-granularity feature learning. It enables the extraction of part representations from different semantic and content levels, capturing fine-grained details effectively. To evaluate our method, extensive experiments are conducted on various benchmark fine-grained image datasets, and the results show that our method can achieve outstanding performance for FGIC, demonstrating its effectiveness.

Abstract:
Multi-image steganography refers to a steganographic method where a user tries to hide multiple confidential images within a single cover image, and all confidential images can be recovered perfectly by the recipient. Multi-image steganography is essentially a high-capacity image steganographic scheme, but such high hiding capacity may easily cause severe contour shadows or color distortion in steganographic images, resulting in a significant reduction in anti-steganalysis capability. To address the above problem, this article designs a deep invertible neural network by introducing a spatial-channel joint attention mechanism, in which confidential image hiding and recovery can be regarded as a pair of coupled invertible processes. Specifically, a series of simple invertible networks having the same structure are first used to construct a cascaded deep invertible neural network framework, in which multiple confidential images can be sequentially embedded into a single cover image through a series of flexible cascaded iterative operations. Subsequently, a spatial-channel joint attention module is designed to reconstruct the invertible network model, which can guide the embedding of secret information into more secure image regions. Accordingly, this joint attention mechanism can effectively address the problem of visual quality and security degradation of steganographic images due to high embedding capacity. Extensive experiments demonstrate that our scheme obtains superior performance over different large-scale image sets, and outperforms state-of-the-art methods with higher visual quality and stronger anti-steganalysis capability.

Abstract:
In this article, we propose a novel historical snapshot-based ensemble tracker (HSET) to address visual object tracking. Specifically, our HSET tracker collects multiple historical tracker snapshots to model various appearance patterns of the target object during tracking, and performs ensemble operations based on these tracker snapshots to successfully detect the target object. To obtain diverse and representative tracker snapshots for ensemble tracking, we design a tracker snapshot verification scheme to handle dynamic appearance variations of the target object and filter out unreliable tracker snapshots. Furthermore, the weights of different tracker snapshots are given by an online weight assignment algorithm that considers both historical and recent appearance information of the target object. By employing ensemble learning and historical tracker snapshots, the proposed HSET method achieves impressive generalization power and tracking robustness in handling significant appearance changes and model drift. Extensive experimental results on public tracking benchmarks indicate that the proposed HSET tracking algorithm reaches encouraging tracking performance compared to multiple state-of-the-art tracking algorithms.

Abstract:
In the current spatial Scalable High Efficiency Video Coding (SHVC) standard, the main techniques involve exploiting the correlation between pixel values of different layers to obtain inter-layer prediction samples, allowing the enhancement layer (EL) to predict samples from the upsampled base layer (BL) frame and remove temporal redundancy. However, existing network-based methods cannot effectively handle multi-layer compressed images with different resolutions to generate reference frames in spatial SHVC. Meanwhile, spatial SHVC only uses traditional interpolation filters to upsample the BL frame for EL frame sample prediction, which cannot handle different structures and contents. Therefore, considering the high correlation of multi-scale distortion characteristics across different layers, this article proposes a spatial-temporal inter-layer reference frame generation network (ST-ILR) for spatial SHVC, which can generate a high-fidelity reference frame for efficient inter-prediction and insert it into the EL reference picture list. The proposed method consists of two modules: a multi-scale motion restoration (MMR) module and a guided multi-scale feature reconstruction (GMFR) module. The MMR module is designed to accurately predict the motion trend of the EL based on the BL motion information, while implicitly compensating for previous EL frames. This is achieved by dynamically modeling the current EL motion information from the BL, capturing compression downsampling differences of prior motion vectors across different layers. The GMFR module adaptively super-resolves compressed BL frames and selectively aggregates high-frequency information from aligned EL features to preserve precise spatial detail, fusing abundant features from different layers to achieve better ILR frame quality. Extensive experiments show that our network achieves a 13.6% BD-rate (Bjøntegaard Delta Rate) reduction in random access configuration compared to the SHVC baseline, which offers state-of-the-art coding performance.

Abstract:
Depth information plays a pivotal role in numerous computer vision applications, including autonomous driving, 3D reconstruction, and 3D content generation. When deploying depth estimation models in practical applications, it is essential to ensure that the models have strong generalization capabilities. However, existing depth estimation methods primarily concentrate on robust single-image depth estimation, leading to the occurrence of flickering artifacts when applied to video inputs. On the other hand, video depth estimation methods either consume excessive computational resources or lack robustness. To address the above issues, we propose ViTA, a video transformer adaptor, to estimate temporally consistent video depth in the wild. In particular, we leverage a pre-trained image transformer (i.e., DPT) and introduce additional temporal embeddings in the transformer blocks. Such designs enable our ViTA to output reliable results given an unconstrained video. Besides, we present a spatio-temporal consistency loss for supervision. The spatial loss computes the per-pixel discrepancy between the prediction and the ground truth in space, while the temporal loss regularizes the inconsistent outputs of the same point in consecutive frames. To find the correspondences between consecutive frames, we design a bi-directional warping strategy based on the forward and backward optical flow. During inference, our ViTA no longer requires optical flow estimation, which enables it to estimate spatially accurate and temporally consistent video depth maps with fine-grained details in real time. We conduct a detailed ablation study to verify the effectiveness of the proposed components. Extensive experiments on the zero-shot cross-dataset evaluation demonstrate that the proposed method is superior to previous methods.
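
The two loss terms described above can be sketched on toy one-dimensional "frames". This is a simplified illustration, not the ViTA loss itself: the L1 distance, the weight of 0.5, and the precomputed `correspondence` list (standing in for flow-based warping, with `None` marking pixels without a valid match) are all assumptions.

```python
def spatial_loss(pred, target):
    """Mean per-pixel absolute difference between prediction and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def temporal_loss(pred_t, pred_next, correspondence):
    """Mean absolute depth difference between matched pixels of two frames.

    correspondence[i] is the index in the next frame that matches pixel i,
    or None if no valid match exists (e.g. occlusion). In practice such
    matches would come from optical flow; here they are given directly.
    """
    diffs = [abs(pred_t[i] - pred_next[j])
             for i, j in enumerate(correspondence) if j is not None]
    return sum(diffs) / len(diffs) if diffs else 0.0


pred_t = [1.0, 2.0, 3.0]       # predicted depth, frame t
pred_next = [2.1, 3.0, 5.0]    # predicted depth, frame t+1
gt_t = [1.0, 2.5, 3.0]         # ground-truth depth, frame t
correspondence = [None, 0, 1]  # pixel 1 moves to index 0, pixel 2 to index 1

s = spatial_loss(pred_t, gt_t)
t = temporal_loss(pred_t, pred_next, correspondence)
total = s + 0.5 * t  # the 0.5 weight is an arbitrary illustrative choice
```

Because the correspondences are only needed to compute the temporal term during training, a model trained this way does not need them at inference time, consistent with the abstract's remark about optical flow.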

Abstract:
Inspired by the recent progress in object detection (i.e., DETR), the set prediction mechanism significantly advances the research of semantic segmentation and achieves state-of-the-art performance on popular segmentation benchmarks. The generic pipeline of such a mechanism often first takes learnable query features to predict classes and segment masks separately and then blends these class-aware segment masks into the final segmentation mask. One key factor behind the successful training of this pipeline is to apply the bipartite matching strategy between the set of predictions and ground-truth segments. However, we find that the bipartite matching-based assignment often tends to segment one target class with only a few learnable queries, making many other pre-defined queries useless. In this article, we propose a simple way, named DropQueries (DQ), to facilitate the set prediction based segmentation architectures. At each iteration of training, our DQ randomly and independently drops each learnable query with a certain probability before bipartite matching. In this way, more queries are encouraged to participate in the segmentation process to discover comprehensive segment representations. We conduct extensive experiments using MaskFormer and Mask2Former as two basic yet powerful segmentation architectures. Without bells and whistles, our DQ strategy can bring consistent improvements over strong baselines on popular semantic segmentation benchmarks, including ADE20K, Cityscapes, COCO-Stuff-10K and VSPW.
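
The query-dropping step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `drop_queries` is hypothetical, and the fallback that keeps one query when every query happens to be dropped is an added safeguard not mentioned in the abstract.

```python
import random


def drop_queries(num_queries, drop_prob, rng):
    """Return the indices of queries kept for one training iteration.

    Each query is dropped independently with probability drop_prob.
    The kept indices would then be the only ones passed to bipartite matching.
    """
    kept = [q for q in range(num_queries) if rng.random() >= drop_prob]
    if not kept:
        # Assumption: never pass an empty query set to the matcher.
        kept = [rng.randrange(num_queries)]
    return kept


rng = random.Random(0)  # seeded for reproducibility
kept = drop_queries(100, 0.3, rng)
all_kept = drop_queries(10, 0.0, rng)   # nothing is dropped
fallback = drop_queries(10, 1.0, rng)   # everything dropped, one query retained
```

Since a different random subset survives at each iteration, queries that would otherwise never be matched get a chance to be assigned to ground-truth segments.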

Abstract:
Semi-supervised video learning is an increasingly popular approach for improving video understanding tasks by utilizing large-scale unlabeled videos along with a few labels. Recent studies have shown that multimodal contrastive learning and consistency regularization are effective techniques for generating high-quality pseudo-labels for semi-supervised action recognition. However, existing pseudo-labeling approaches are solely based on the model's class predictions and can suffer from confirmation biases due to the accumulation of false predictions. To address this issue, we propose exploiting audio-visual feature correlations to achieve high-quality pseudo-labels instead of relying on model confidence. To achieve this goal, we introduce Audio-visual Contrastive and Consistency Learning (AvCLR) for semi-supervised action recognition. AvCLR generates reliable pseudo-labels from audio-visual feature correlations using deep embedded clustering to mitigate confirmation biases. Additionally, AvCLR introduces two contrastive modules: intra-modal contrastive learning (ImCL) and cross-modal contrastive learning (XmCL) to discover complementary information from audio-visual alignments. The ImCL module learns informative representations within audio and video independently, while the XmCL module aims to leverage global high-level features of audio-visual information. Furthermore, the XmCL is constrained by introducing intra-instance negatives from one modality to the other. We jointly optimize the model with ImCL, XmCL, and consistency regularization in an end-to-end semi-supervised manner. Experimental results have demonstrated that the proposed AvCLR framework is effective in reducing confirmation biases and outperforms existing confidence-based semi-supervised action recognition methods.

Abstract:
Online action detection plays a vital role in video action understanding and can be widely used in various video analysis applications. This task aims to detect actions at the current moment within long untrimmed video streams. However, accurately identifying action-background transitions that are ambiguous in terms of time during detection can be challenging due to the similarity between the action and background clips, adding to the difficulty in finding a suitable division between them. To address this issue, we propose a hard video clip mining method based on deep metric learning for online action detection named HCM. The HCM method first selects video clips that are hard to distinguish to determine the optimization objects. Then, a hard clip mining loss is adopted to push the features toward the centers of the categories to which they belong and away from others. Furthermore, we introduce an intra-class feature compaction loss to constrain the divergence of action features, ensuring the stability of their distribution. We evaluated the proposed method on two challenging online action detection datasets, THUMOS14 and TVSeries. The results show that HCM is effective and efficient in online action detection and action anticipation tasks.
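
The idea of pulling a clip feature toward its own class center and pushing it away from other centers can be sketched with a simple margin-based loss. The exact form of the HCM losses is not given in the abstract, so the formula below (distance to the own center plus a hinge on the nearest other center), the margin value, and the name `center_margin_loss` are assumptions for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_margin_loss(feature, label, centers, margin=1.0):
    """Pull the feature toward its class center, push it from the nearest other one.

    pull: distance to the center of the feature's own class.
    push: hinge penalty if the nearest other-class center is closer than margin.
    """
    pull = dist(feature, centers[label])
    nearest_other = min(dist(feature, c) for k, c in centers.items() if k != label)
    push = max(0.0, margin - nearest_other)
    return pull + push


# Hypothetical 2-D class centers for an action class and the background.
centers = {"action": [1.0, 0.0], "background": [0.0, 1.0]}

easy = center_margin_loss([1.0, 0.0], "action", centers)  # sits on its own center
hard = center_margin_loss([0.5, 0.5], "action", centers)  # ambiguous transition clip
```

A clip lying midway between the action and background centers, as ambiguous transition clips tend to, incurs a larger loss than one sitting on its own center, so such hard clips dominate the optimization.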

Abstract:
Research in psychology demonstrates that visual features and semantic content can convey various emotions. Furthermore, studies have shown that image emotion and aesthetics are inextricably linked. During the image aesthetics assessment (IAA) process, images elicit emotional responses from individuals, leading to emotional resonance and influencing the evaluation of images. This article proposes an image aesthetics assessment method based on a hypernetwork of emotion fusion (HNEF). Our method incorporates the emotions depicted in images into the process of IAA. To accomplish this, we extract both aesthetic and emotional features from the images. We then employ the self-attention mechanism of the transformer to comprehensively investigate the intimate connection between aesthetics and emotion. In addition, the hypernetwork is designed to establish perception rules governing the high-level semantic information in images. The experimental results validate the strong correlation between emotion and aesthetics. Furthermore, the proposed method exhibits a significant competitive advantage over existing methods on the Aesthetic Visual Analysis (AVA) dataset.

Abstract:
Video corpus moment retrieval (VCMR), which aims to localize a video moment highly relevant to a given natural-language query from a video corpus, has recently become a hot topic. Existing methods for this challenging task struggle when the visual and textual information in a video differ greatly from each other, or when redundant video content is semantically irrelevant to the query, both of which make it hard for the model to identify the truly useful within- and cross-modality information. In this article, we propose a novel Cross-Modality Knowledge Calibration Network (CKCN) to solve the issue mentioned above. Specifically, a dual calibration transformer module with improved multi-head attention is proposed to simultaneously capture the within- and cross-modality features between the visual and textual modalities of the video while automatically compressing the redundant information. Then, a query-dependent fusion module is designed to guide feature fusion of the video's multi-modal information using the prior knowledge of the query, which further refines the more important modality features. Finally, a query-guided calibration transformer module with a well-designed learnable cell is utilized to align the query and video, forming a single joint representation for moment localization. Meanwhile, we introduce transfer learning into the VCMR task for the first time to address the shortage of labeled data. Extensive experiments on the widely used TVR and DiDeMo datasets achieve new state-of-the-art results, verifying the effectiveness of our proposed CKCN.

Abstract:
The Deep Contextual Video Compression framework (DCVC) utilizes a conditional coding paradigm, where the context is extracted and employed as a condition for the contextual encoder-decoder and entropy model. In this paper, we propose enhanced context mining and filtering to improve the compression efficiency of DCVC. Firstly, considering the context of DCVC is generated without supervision and redundancy may exist among context channels, an enhanced context mining model is proposed to mitigate redundancy across context channels to obtain superior context features. Then, we introduce a transformer-based enhancement network as a filtering module to capture long-distance dependencies and further enhance compression efficiency. The transformer-based enhancement adopts a full-resolution pipeline and calculates self-attention across channel dimensions. By combining the local modeling ability of the enhanced context mining model and the non-local modeling ability of the transformer-based enhancement network, our model outperforms LDP configurations of Versatile Video Coding (VVC), achieving an average bit savings of 6.7% in terms of MS-SSIM.

Abstract:
Current image-text retrieval methods mainly utilize region features that provide object-level information to represent images, making the retrieval results more accurate and interpretable. However, there are several issues with region features, such as lack of rich contextual information, loss of object details and risk of detection redundancy. The ideal visual features in image-text retrieval should have three characteristics: object-level, semantically-rich, and language-aligned. To this end, we propose a novel visual representation framework to capture more comprehensive and powerful visual features. Specifically, since these region feature disadvantages are the grid feature advantages, we first build a two-step interaction model to explore the complex relationship between them from the spatial and semantic perspectives to integrate their complementary information, making the fused visual features both object-level and semantic-rich. Then, we design a text-integrated visual embedding module that utilizes textual information as guidance to filter redundant regions, further endowing visual features with language-aligned capabilities. Finally, we develop a multi-attention pooling module to better aggregate these enhanced visual features in a more fine-grained manner. Extensive experiments demonstrate that our proposed model achieves state-of-the-art performance on the benchmark datasets Flickr30K and MS-COCO.

Abstract:
Due to high labeling costs, a certain proportion of noisy correspondence is inevitably introduced into visual-text datasets, resulting in poor model robustness for cross-modal matching. Although recent methods achieve promising results by dividing the datasets into clean and noisy pair subsets, they still suffer from deep neural networks over-fitting to noisy correspondence. In particular, similar positive pairs with partially relevant semantic correspondence are easily assigned to the noisy pair subset by mistake without careful selection, which harms robust learning. Meanwhile, similar negative pairs with partially relevant semantic correspondence lead to ambiguous distance relations in common space learning, which also damages the stability of performance. To solve the coarse-grained dataset division problem, we propose the Correspondence Tri-Partition Rectifier (CTPR) to partition the training set into clean, hard, and noisy pair subsets based on the memorization effect of neural networks and prediction inconsistency. Then, we refine the correspondence labels for each subset to indicate the real semantic correspondence between visual-text pairs. The differences between rectified labels of anchors and hard negatives are recast as the adaptive margin in the improved triplet loss for robust training in a co-teaching manner. To verify the effectiveness and robustness of our method, we conduct experiments on image-text and video-text matching as two showcases. Extensive experiments on the Flickr30K, MS-COCO, MSR-VTT, and LSMDC datasets verify that our method successfully partitions the visual-text pairs according to their semantic correspondence and improves performance when training on noisy data.

Abstract:
Masked autoencoder (MAE) is a widely used self-supervised learning method that has recently achieved great success in NLP and computer vision. However, the potential advantages of masked pre-training for point cloud understanding have not been fully explored. There is preliminary work on MAE-based point clouds using the Transformer architecture to explore low-level geometric representations in 3D space, which is insufficient for fine-grained decoding completion and downstream tasks. Inspired by multimodality, we propose Inter-MAE, an inter-modal MAE method for self-supervised learning on point clouds. Specifically, we first use Point-MAE as a baseline to randomly partition point clouds into a low percentage of visible point patches and a high percentage of masked point patches. Then, a standard Transformer-based autoencoder is built with an asymmetric design and shifting mask operations, and latent features are learned from the visible point patches with the aim of recovering the masked point patches. In addition, we generate image features based on ViT after point cloud rendering to form inter-modal contrastive learning with the decoded features of the completed point patches. Extensive experiments show that the proposed Inter-MAE generates pre-trained models that are effective and exhibit superior results in various downstream tasks. For example, an accuracy of 85.4% is achieved on ScanObjectNN and 86.3% on ShapeNetPart, outperforming other state-of-the-art self-supervised learning methods. Notably, our work establishes for the first time the feasibility of applying image modality to masked point clouds.
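
The random split into visible and masked point patches described above can be sketched as follows. This is only an illustration of generic MAE-style masking, not the Inter-MAE code: the function name `random_patch_mask`, the patch count of 64, and the mask ratio of 0.6 are assumptions that do not come from the abstract.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked patch.

    The names and the ratio passed by the caller are hypothetical; the
    abstract only states that a high percentage of patches is masked.
    """
    num_masked = int(round(num_patches * mask_ratio))
    # Shuffle patch indices and mask the first num_masked of them.
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

rng = np.random.default_rng(0)
mask = random_patch_mask(64, 0.6, rng)
visible_idx = np.flatnonzero(~mask)  # patches fed to the encoder
masked_idx = np.flatnonzero(mask)    # patches the decoder must recover
```

Only the visible indices would be passed to the encoder, which is what makes the encoder and decoder workloads asymmetric.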

Abstract:
Amodal instance segmentation (AIS) predicts the complete shape of the occluded object, including both visible and occluded regions. Because visual clues are lacking, the occluded region is difficult to segment accurately. In human amodal perception, shape-prior knowledge is helpful for AIS. The previous method uses a 2D shape prior by rote memorizing, establishing a shape dictionary and retrieving the closest mask to the segmentation result. However, this approach cannot obtain a shape prior that is not prestored in the shape dictionary. In this article, to improve generalization ability, we propose a generative invariant shape-prior network (GIN), simulating the human perception process that learns the basic shape, which is invariant to transformations, including translation, rotation, and scaling. We design a novel framework that decouples the learning of shape priors from transformation. GIN is end-to-end trainable and needs no dictionary establishment, making the whole pipeline efficient. GIN outperforms state-of-the-art methods on three public datasets (D2SA, COCOA-cls, and KINS) by large margins.

Abstract:
For skeleton-based gesture recognition, graph convolutional networks (GCNs) have achieved remarkable performance since the human skeleton is a natural graph. However, the biological structure might not be the crucial one for motion analysis. Also, spatial differential information such as joint distances and angles between bones may be overlooked during graph convolution. In this article, we focus on obtaining meaningful joint groups and extracting their discriminative features using path signature (PS) theory. Firstly, to characterize the constraints and dependencies of various joints, we propose three types of paths, i.e., spatial, temporal, and learnable paths. In particular, a learnable path generation mechanism can group together joints that are not directly connected or are far apart, according to their kinematic characteristics. Secondly, to obtain informative and compact features, a deep integration of PS with few parameters is introduced. The whole computational process is packed into two modules, i.e., the spatial-temporal path signature module (ST-PSM) and the learnable path signature module (L-PSM), for ease of use. They are plug-and-play modules that can be added to any neural network, such as CNNs and GCNs, to enhance its feature extraction ability. Extensive experiments have been conducted on three mainstream datasets (ChaLearn 2013, ChaLearn 2016, and AUTSL). We achieve state-of-the-art results with a simpler framework and a much smaller model size. By inserting our two modules into several GCN-based networks, we observe clear improvements, demonstrating the effectiveness of our proposed method.

Abstract:
Automatic analysis of image sentiment has gained considerable attention with the increasing volume of user-generated visual content online. Recently, researchers have tended to design different Convolutional Neural Networks (CNNs) to extract image content features for sentiment analysis. However, they underestimate the importance of image color, which psychology and art theory have shown to be crucial for expressing image sentiment. Moreover, we further observe that the coordination of content and color is the main form of image sentiment expression. Different combinations of content and color can express extremely different sentiments. To that end, in this paper, we propose a Color Enhanced Cross Correlation Net (CECCN), a novel architecture for image sentiment analysis that not only leverages content and color simultaneously, but also takes their correlations into consideration. Specifically, we first use a pre-trained CNN to extract content features and color moments to collect color features from multiple color spaces. Then, we propose a novel Cross Correlation (CC) method to model the correlations between content features and color features with an attention mechanism and sequence convolution, in which the sentiment expression of content and color can be enhanced by each other. Finally, we integrate these two types of information for better image sentiment analysis. Extensive experiments on two popular and well-studied benchmark datasets demonstrate the superiority and rationality of our proposed CECCN.
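
For readers unfamiliar with color moments, the sketch below computes the commonly used first three moments (mean, standard deviation, and the cube root of the third central moment) per channel. The abstract does not say which moments or color spaces CECCN uses, so the choice of three moments and the function name `color_moments` are assumptions made for illustration.

```python
import numpy as np

def color_moments(image):
    """Per-channel color moments of an H x W x C image.

    Returns a vector of length 3 * C: means, standard deviations, then
    skewness terms (cube root of the third central moment). Which moments
    the paper uses is not stated; this is the common three-moment variant.
    """
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    mean = pixels.mean(axis=0)
    centered = pixels - mean
    std = np.sqrt((centered ** 2).mean(axis=0))
    skew = np.cbrt((centered ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])

# Example: a random 8 x 8 RGB image yields a 9-dimensional descriptor.
rng = np.random.default_rng(0)
features = color_moments(rng.random((8, 8, 3)))
```

To cover multiple color spaces, the same function could be applied to each converted copy of the image and the resulting vectors concatenated.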

Abstract:
Recently, video saliency prediction has attracted increasing attention, yet the improvement of its accuracy is still subject to the insufficient use of multi-scale spatiotemporal features. To address this issue, we propose a 3D convolutional Multi-scale Spatiotemporal Feature Fusion Network (MSFF-Net) to achieve the full utilization of spatiotemporal features. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of bi-directional fusion architectures in this field, which adds the flow of shallow location information on the basis of the previous flow of deep semantic information. Then, different from simple addition and concatenation, we design an Attention-Guided Fusion (AGF) mechanism that can adaptively learn the fusion weights of adjacent features to integrate them appropriately. Moreover, a Frame-wise Attention (FA) module is introduced to selectively emphasize the useful frames, augmenting the multi-scale temporal features to be fused. Our model is simple but effective, and it can run in real-time. Experimental results on the DHF1K, Hollywood-2, and UCF-sports datasets demonstrate that the proposed MSFF-Net outperforms existing state-of-the-art methods in accuracy.

Abstract:
In this article, we present skeleton-based isolated sign language recognition (IsoSLR) with part mixing (SKIM), an IsoSLR model that solely takes the skeleton representation of the human body as input. Previous skeleton-based works either perform worse than their RGB-based counterparts or require fusion with other modalities to obtain competitive results. With SKIM, a single skeleton-based model without complex pre-training can obtain similar or even higher accuracy than current state-of-the-art methods. This margin can be further increased by simple late fusion within the same modality. To achieve this, we first develop a novel data augmentation technique called part mixing. It swaps the corresponding keypoints within one region (e.g., the hand) between two randomly selected samples and combines their labels linearly as the new label. As regions like the hand and face are key articulators for sign language, direct swapping of such parts creates a believable pseudo sign that encourages the model to recognize the true pairs. Secondly, following current advances in skeleton-based action recognition, we devise a channel-wise graph neural network with multi-scale awareness and per-keypoint temporal re-weighting. With this design, the backbone is capable of leveraging both manual and non-manual features. The combination of hand mixing and the channel-wise multi-scale GCN backbone allows us to achieve state-of-the-art accuracy on both the WLASL and NMFs-CSL benchmarks.
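
The part mixing augmentation described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the SKIM implementation: the array layout (frames x keypoints x coordinates), the keypoint indices chosen for the swapped region, and the way the mixing weight `lam` is set are all hypothetical.

```python
import numpy as np

def part_mix(x_a, y_a, x_b, y_b, part_idx, lam):
    """Swap one body region from sample B into sample A.

    x_a, x_b : arrays of shape (frames, keypoints, coords)
    y_a, y_b : one-hot label vectors
    part_idx : keypoint indices of the swapped region (hypothetical here)
    lam      : weight kept for A's label (an assumed hyperparameter)
    """
    mixed = x_a.copy()
    # Replace the chosen region's keypoints in every frame with B's.
    mixed[:, part_idx, :] = x_b[:, part_idx, :]
    # Combine the two labels linearly, as the abstract describes.
    label = lam * y_a + (1.0 - lam) * y_b
    return mixed, label

x_a, x_b = np.zeros((4, 10, 2)), np.ones((4, 10, 2))
y_a, y_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
hand_idx = [7, 8, 9]  # placeholder indices, not a real skeleton layout
mixed, label = part_mix(x_a, y_a, x_b, y_b, hand_idx, lam=0.7)
```

The inputs are left unmodified because the swap is applied to a copy, so the same pair of samples can be reused in other mixes.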

Abstract:
Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to the visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. Furthermore, our approach even outperforms existing weakly supervised methods.

Abstract:
The goal of multiple-image hiding is to hide several secret images within another carrier image without significantly changing its appearance, and then perfectly reconstruct all of the secret images. The challenge is to ensure that the stego-image has high visual quality and can resist various steganalysis methods while hiding as much information as possible in one image. To address this issue, the majority of known image-hiding methods focus on hiding images using compression techniques. In this article, we present a novel multiple-image hiding method based on up-sampling and reversible color transformation. First, the interpolation algorithm up-samples the carrier image, so that the similarity of neighboring pixel values in the up-sampled image can significantly improve the effect of image hiding. The embedding procedure is then performed using the proposed Euclidean Distance (ED)-based block matching and reversible color transformation, which decreases the chance of local blurring in the stego-image. Experimental results show that the proposed method surpasses existing advanced methods by achieving an average PSNR of 33 dB and 28 dB for the stego-image at hiding capacities of 2 BPP and 8 BPP, respectively, and obtaining 100% reconstruction accuracy for all secret images. It also has a high level of resistance to steganalysis and strong robustness against various image-processing attacks.

Abstract:
Content-based 3D object retrieval is a challenging problem in computer vision and graphics, especially for non-rigid 3D shapes. This article proposes a multiview-based robust point representation approach for 3D non-rigid shape retrieval. First, we propose an efficient local descriptor called the local point histogram, which is robust to non-rigid changes in shape. Second, we encode local point histogram features into high-level point features (HPF) using Fisher vectors. Finally, we present an efficient feature fusion method that can further enhance the performance of 3D non-rigid shape retrieval. We extensively tested our approach on two benchmark 3D non-rigid shape datasets, including the SHREC2015 non-rigid shape and SHREC2015 canonical forms. Our method achieves 98.33% and 90.55% retrieval accuracy on the SHREC2015 non-rigid shape and SHREC2015 canonical forms datasets, surpassing previous state-of-the-art methods by nearly 2% and 7%, respectively. In addition, we further tested our method on the well-known 3D rigid shape dataset ModelNet, and the experimental results demonstrate that our method is also effective for 3D rigid shape retrieval. We also combine the proposed HPF shape features with deep convolutional features for the 3D rigid shape retrieval task, achieving a retrieval performance comparable to the prior state-of-the-art methods, which indicates a strong complementarity between HPF shape features and deep convolutional features.

Abstract:
To improve the compression performance of screen content coding, extension coding standards (HEVC-SCC, VVC-SCC) have been developed. However, when only the compression ratio is considered, packet losses in the bitstream may cause many images to be decoded incorrectly, degrading the video quality at the receiver side. Thus, there is an urgent need to study joint source-channel coding schemes for screen content video. The most significant challenge lies in the complex spatial-temporal characteristics of screen content video, which complicate the creation of an accurate end-to-end distortion model. In this article, we delve into the traits of screen content video and construct an end-to-end distortion model. Building upon this, we introduce an error resilient coding scheme specifically for screen content video. More specifically, we first consider the characteristic of non-stationary temporal domain variation and classify the screen content images into three types of frames using a fast block-searching method. We then propose an adaptive error concealment method, taking into account the spatial-temporal prediction characteristics. Following this, we derive a pixel-level end-to-end distortion model and incorporate it into the rate distortion optimization process. Our experimental results reveal that, compared to state-of-the-art methods, our proposed method significantly enhances both objective and subjective quality across a variety of channel conditions.

Abstract:
Weakly supervised object localization (WSOL) aims to localize the entire and well-defined objects only via image-level labels for reducing the need of labor-intensive annotation and mitigating annotation errors. However, many WSOL methods via class activation maps (CAMs) often suffer from incomplete activation and inaccurate boundaries for object localization. In this article, we propose a novel multi-layer decoupling attention localization (MDAL) network to address these issues. We first present a simple yet effective multi-layer comparison decoupling mechanism including a maximum decoupling function and a minimum decoupling function to sufficiently activate and fuse multi-layer features. Then, we introduce the multi-layer maximum decoupling function into the attention modules, and develop a channel attention activation decoupling (CAAD) module and a spatial attention activation decoupling (SAAD) module, which can mine much more useful information for more possible regions' activation. Furthermore, the multi-layer minimum decoupling function is introduced to efficiently fuse and refine multi-layer features, which can suppress the over-activation and background noise. Finally, we develop a joint loss function to train the MDAL network. Experimental results on CUB-200-2011 and ILSVRC2012 demonstrate that our proposed network can provide accurate and complete object localization.

Affiliations: School of Computer and Communication Engineering, University of Science and Technology, Beijing, China; Division of Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong, SAR, China; School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW, Australia; Department of Electrical and Computer Engineering, National University of Singapore, Singapore; Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China

Abstract:
Cross-modal Retrieval (CMR) is formulated for scenarios where the queries and retrieval results are of different modalities. Existing CMR studies mainly focus on the common contextualized information between text transcripts and images, and the synchronized event information in audio-visual recordings. Unlike all previous works, in this article, we investigate the geometric correspondence between images and speech recordings captured in the same space and formulate a novel CMR task, called Spatial Image-Acoustic Retrieval (SIAR). To this end, we first design a novel speech encoder that consists of convolutional neural networks and transformer layers to learn space-aware speech representations. Then, to eliminate the inherent cross-modal discrepancy, we propose the Contrastive Speech Image Retrieval (CSIR) method, which uses supervised contrastive learning to attract same-space cross-modal features while repelling those from different spaces. Finally, image and speech features are directly compared and we predict the SIAR result with the maximum similarity. Extensive experiments demonstrate that our proposed speech encoder can recognize space from human speech with superior performance over the other prevailing networks. It also sets our penultimate goal of speech-to-speech retrieval. Furthermore, our CSIR proposal can successfully perform bi-directional SIAR between spatial images and reverberant speech with promising results. Code and data will be available.
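
The idea of attracting same-space features and repelling different-space ones can be illustrated with a generic supervised contrastive loss. The sketch below is an assumption-laden illustration, not the CSIR objective: it covers only the image-to-speech direction, and the function name `cross_modal_supcon`, the temperature `tau`, and the use of integer space labels are all hypothetical.

```python
import numpy as np

def cross_modal_supcon(img, spe, labels, tau=0.1):
    """Image-to-speech supervised contrastive loss (illustrative only).

    img, spe : (N, D) feature matrices, row i of each from the same recording
    labels   : (N,) integer id of the space each pair was captured in
    tau      : temperature, an assumed hyperparameter
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    spe = spe / np.linalg.norm(spe, axis=1, keepdims=True)
    logits = img @ spe.T / tau
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives: speech features recorded in the same space as the image.
    pos = (labels[:, None] == labels[None, :]).astype(np.float64)
    per_anchor = -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)
    return per_anchor.mean()

labels = np.array([0, 1])
img = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = cross_modal_supcon(img, img.copy(), labels)
swapped = cross_modal_supcon(img, img[::-1].copy(), labels)
```

A bi-directional variant would add the same term with the roles of `img` and `spe` exchanged; at retrieval time the candidate with the maximum similarity is returned, as the abstract states.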

Abstract:
Mirror segmentation, an emerging discipline in the field of computer vision, involves the identification and marking of mirrors in an image. Current mirror segmentation methods rely on fixed mirror elements as features for object segmentation. However, these methods do not account for the varied quality of feature images obtained under complex real-world conditions, leading to inaccurate segmentation results. To address these limitations, we propose a novel uncertainty-aware transformer localization network (UTLNet) for RGB-D mirror segmentation. Our approach draws inspiration from biomimicry, specifically the behavior pattern of human observation. We aim to explore features from different angles and focus on complex features that are challenging to determine during the coding stage. Additionally, we employ graph convolution to construct complementary dual-modal fusion features. Furthermore, we design a multiscale interaction transformer module using the shifted-window self-attention mechanism to acquire precise position information. In our experiments, the proposed UTLNet surpasses the current state-of-the-art mirror segmentation method as well as alternative task-specific methods. It achieves superior performance across various evaluation scenarios.

Abstract:
Existing convolutional recurrent neural network (ConvRNN)-based memory cells mainly take advantage of gated structures and attention mechanisms to extract discontinuous latent associations for spatial-temporal sequence forecasting (STSF) problems, which may lead to serious over-fitting and spurious relationships with correlated noise. It is widely agreed that incorporating cause-effect relationships in modeling can alleviate these problems. In this paper, we propose a Causality Attention Unit (CAU) to assist ConvRNNs by complementing the causal inference ability in a plug-and-play way. Specifically, CAU consists of an attention module followed by a causality module. The former is constructed from a spatial-channel attention layer, which preliminarily generates the correlated future from the correlations between historical memories and the current state. The latter borrows the idea of transfer entropy (TE) to detect the latent cause-effect relationships and precisely correct the correlated future. A space-time exchange strategy for accelerating the calculation of TE in CAU is also designed. CAU can be easily combined with existing ConvRNN cells, and we construct a simple general model to predict long-term spatial-temporal series, which consists of an encoder/decoder and stacked CAUs in parallel with stacked ConvRNN cells. After determining the optimal model structure, we carry out a series of experiments to evaluate model performance, including comparisons with other advanced models, training loss analysis, and multiple ablation and sensitivity studies. Experimental results show that our proposed model can effectively improve the performance of existing ConvRNNs to the state-of-the-art level on representative public datasets, including Moving MNIST, KTH, BAIR, and WeatherBench. The ablation and sensitivity studies verify the superiority of CAU. The learned causal maps precisely distinguish the pixel attributions and motion characteristics in sophisticated entangled scenarios.
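
As background on transfer entropy, the sketch below gives a plug-in estimate of the textbook quantity for two discrete sequences with history length 1. It is not the CAU causality module or its space-time exchange strategy; the function name `transfer_entropy` and the toy sequences are assumptions for illustration.

```python
from collections import Counter
import math

def transfer_entropy(x, y):
    """Plug-in estimate of TE(X -> Y) in bits with history length 1.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next value of Y and y0, x0 are the current values.
    """
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))
    pairs_yx = Counter(zip(y[:-1], x[:-1]))
    pairs_yy = Counter(zip(y[1:], y[:-1]))
    singles = Counter(y[:-1])
    n = len(y) - 1
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y0, x0)]
        p_given_self = pairs_yy[(y1, y0)] / singles[y0]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te

# Toy example: y copies x with a one-step delay, so x drives y.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y)
```

In this toy case the past of `x` fully determines the next value of `y`, so the estimate equals the conditional entropy of `y` given its own past, which is positive here.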

Abstract:
Weakly supervised semantic segmentation (WSSS) with image-level labels has witnessed promising advances with the help of class activation maps (CAM). However, CAM is always confined to small discriminative seed regions because its training is guided by a simple classification loss. To handle this problem, recent works introduced specifically designed regularizations and modules to expand the CAM seed regions, which serve as the final segmentation masks. In this paper, we surprisingly find that the classification loss could suppress the gains from these regularizations and modules in the late training phase, thereby limiting the further growth of CAM, which we call the explicit supervision disturb (ESD) issue. Interestingly, we find that specific data augmentation (DA) operations (e.g., CutMix) can relieve the ESD issue, and the benefits introduced by different DA operations vary a lot. To maximize the benefits, we propose differentiable data augmentation (DDAug) to automatically search for the proper DA policy. Specifically, we design a multi-level search space to sequentially sample DA operations with different properties. Extensive experiments demonstrate that the proposed DDAug can alleviate the ESD issue and introduce consistent improvements to various popular WSSS methods, achieving state-of-the-art performance on the MS COCO 2014 and PASCAL VOC 2012 datasets.

Abstract:
Synthesizing images in line with a given condition is a cardinal issue in image generation. Fine-grained conditional image generation, due to its emphasis on the fidelity of details, is of profound value to studies in this field. To learn the conditional distribution of the data, it is necessary to discriminate the class semantics of generated samples. However, most existing methods do so solely based on the condensed global feature, which potentially impedes the model's focus on detailed local features and in turn causes inaccurate or unstable local appearances in generated images. In this context, we propose PartGAN, which features a novel part perception mechanism to strengthen the model's concentration on the nuts and bolts of fine-grained objects. In the proposed method, the image given to the discriminator is deconstructed and encoded into a set of embeddings that represent the semantics of parts. This scheme not only helps the model capture the discriminative local features more accurately, but also prevents the omission of other general local features. Under the effect of the newly designed condition loss term, every part of the generated image is equally encouraged to be closer to the corresponding real part, which helps to ensure that the general parts have a stable appearance that conforms to the class semantics. Experiments on popular benchmarks show that the proposed method significantly improves the generation quality for fine-grained images.

Abstract:
On the one hand, the dehazing task is an ill-posed problem, which means that no unique solution exists. On the other hand, the dehazing task should take into account subjective factors, that is, it should give the user selectable dehazed images rather than a single result. Therefore, this paper proposes a multi-output dehazing network with illumination-controllable ability, called IC-Dehazing. The proposed IC-Dehazing can change the illumination intensity by adjusting the factor of the illumination controllable module, which is realized based on the interpretable Retinex model. Moreover, the backbone dehazing network of IC-Dehazing consists of a Transformer with double decoders for high-quality image restoration. Further, the prior-based loss function and unsupervised training strategy enable IC-Dehazing to complete the parameter learning process without the need for paired data. To demonstrate the effectiveness of the proposed IC-Dehazing, quantitative and qualitative experiments are conducted. Code is available at https://github.com/Xiaofeng-life/ICDehazing.

Abstract:
Webly supervised fine-grained recognition aims to distinguish subordinate categories (e.g., bird species) with freely available web data. It has significant research and application value for alleviating the dependence on costly professional manual annotations in the fine-grained recognition task. Nevertheless, label noise in web data degrades the model's recognition performance. Most existing methods attempt to select clean data via loss analyses, which favors easy samples and hinders the mining of subtle differences contained in hard samples. Inspired by the intrinsic trait of consistent semantic predictions among different hierarchies of clean samples in fine-grained recognition, we propose a hierarchical consistency learning (HCL) approach for detecting noisy samples and capturing multi-hierarchy discriminative clues simultaneously. Specifically, our HCL approach works in a coarse-to-fine order, first exploring the semantic consistency between the image level and object level through prediction distribution conformance analyses. The open-set noise (i.e., samples irrelevant to any fine-grained subcategory) is thus detected, and the visual object information is highlighted with image-object contrastive learning. Then, the semantic consistency between object-level and part-level prediction distributions is utilized for detecting closed-set noise (i.e., samples mislabeled as other fine-grained subcategories), and local discriminative information is enhanced with object-part contrastive learning. Extensive experiments and analyses on three widely used webly supervised fine-grained benchmark datasets demonstrate that the proposed HCL approach achieves new state-of-the-art performance.

Abstract:
Super-resolution is essential in improving the image quality of Magnetic Resonance Imaging (MRI). Existing MRI super-resolution methods leverage multi-contrast MRI and achieve satisfactory results. However, these methods perform alignment by calculating the similarity of single-scale semantic features between reference images and low-resolution images, which causes misalignment and limits the performance of MRI super-resolution. To tackle this problem, we propose the Flexible Alignment Super-Resolution Network (FASR-Net) for multi-contrast MRI super-resolution, which explores the interaction of multi-scale features. To this end, we first use the feature extractor to generate multi-scale features, including hierarchical features and semantic pyramid features. Subsequently, we introduce the Hierarchical-Feature Alignment (HF) module and the Semantic-Pyramid-Feature Alignment (SF) module to align hierarchical features and semantic pyramid features, respectively. Finally, the Cross-Hierarchical Progressive Fusion (CHPF) module fuses these aligned features at different scales, which further improves the model's performance. Extensive experiments on the FastMRI and IXI datasets show that FASR-Net achieves the most competitive results compared with state-of-the-art approaches.

Abstract:
Salient object detection (SOD) has rapidly developed in recent years, and detection performance has greatly improved. However, the price of these improvements is increasingly complex networks that require more computing resources and sacrifice real-time performance. This makes it difficult to deploy these approaches on devices with limited computing resources (such as mobile phones, embedded platforms, etc.). Considering recently developed lightweight SOD models, their detection and real-time performance are always compromised in demanding practical application scenarios. To solve these problems, we propose a novel lightweight SOD method called LARNet and its corresponding extremely lightweight method LARNet^ according to application requirements. These methods balance the relationship between lightweight requirements, detection accuracy and real-time performance. First, we propose a saliency backbone network tailored for SOD, which removes the need for pre-training with ImageNet and effectively reduces feature redundancy. Subsequently, we propose a novel context gating module (CGM), which simulates the physiological mechanism of human brain neurons and visual information processing, and realizes the deep fusion of multi-level features at the global level. Finally, the saliency map is output after fusion of multi-level features. Extensive experiments on popular benchmark datasets demonstrate that the proposed LARNet (LARNet^) achieves 98 (113) FPS on a GPU and 3 (6) FPS on a CPU. With approximately 680 K (90 K) parameters, the model has significant performance advantages over (extremely) lightweight methods, even surpassing some heavyweight models.

Abstract:
The training of depth image-based hand pose estimation models typically relies on real-life datasets, which are expected to 1) be large-scale and cover a diverse range of hand poses and hand shapes, and 2) always come with high-precision annotations. However, existing datasets are rather limited in these regards due to a multitude of practical constraints, with time and cost being the major concerns. This observation motivates us to propose an alternative approach, where the hand pose model is primarily trained with synthesized hand depth images that closely mimic the characteristic noise patterns of the specific depth camera make under consideration. This is achieved by first mapping a Gaussian-distributed variable to a specific non-i.i.d. (independent and identically distributed) depth noise pattern, and then transforming a “vanilla” noise-free synthetic depth image into a realistic-looking image. Extensive empirical experiments demonstrate that our approach is capable of generating camera-specific realistic-looking hand depth images with precise annotations; compared to relying entirely on annotated real images, a hand pose model with better performance is obtained by using only a small fraction (10%) of annotated real images together with our synthesized images.

Abstract:
Despite the remarkable accomplishments of deep neural networks in computer vision tasks, the inherent opacity of their operations remains a pressing concern. Attribution methods generating visual explanatory maps representing the importance of image pixels for model classification are popular for explaining neural network decisions. However, the small and diverse decision regions in fine-grained or medical images limit the precision and comprehensiveness of the existing attribution methods when explaining decisions made for such a data type. This paper introduces a novel attribution method called hierarchical dynamic masks (HDM) to overcome these concerns to generate saliency maps with high recognition reliability and localization capability. Specifically, we suggest dynamic masks (DM), which enable multiple small-sized benchmark mask vectors to learn the image's critical information roughly through an optimization method. The benchmark mask vectors guide the learning of the large-sized combination mask vectors so that their overlay mask accurately learns detailed pixel importance information. Additionally, we construct the HDM by hierarchically concatenating DM modules. These DM modules search and combine the regions of interest in the remaining neural network classification decisions within the masked image in a learning-based way. Since HDM forces DM to perform importance analysis in different areas, it makes the fused saliency map more comprehensive. The experiments reveal that the proposed method outperforms existing approaches significantly regarding recognition credibility and positioning ability when qualitatively and quantitatively tested on CUB-200-2011 and iChallenge-PM datasets.

Abstract:
Previous fully-supervised defocus deblurring has made significant progress. However, training such deep models requires abundant paired ground truth, which is expensive and error-prone. This paper makes an attempt to train a defocus deblurring model without using paired ground truth or any other unpaired data. Related reblur-to-deblur schemes generally use physics-based reblur or GAN-based reblur, and thus suffer from limited robustness of the blur kernel or from hallucinations generated by the GAN. Besides, the domain gap between the realistic blurred image and the reblurred image hinders deblurring performance. To address these challenges, we propose a weakly-supervised defocus deblurring framework via defocus detection attack. On one hand, we build a focused area detection attack (FADA) to force the focused area to reblur, thereby reversing its detection result from a pretrained defocus blur detection network. Moreover, we introduce a blur-aware transfer modulated from the defocused region to help FADA render a robust reblurred region. On the other hand, we implement a defocused region detection attack to guide the realistic blurred region to deblur while training the deblurring network with simulated paired areas. Extensive experiments on three widely used datasets verify the effectiveness of our framework.

Affiliations: Wangxuan Institute of Computer Technology, Peking University, Beijing, China; Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Energy Research Institute @ NTU, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore; College of Informatics, Huazhong Agricultural University, Wuhan, China; School of Aerospace Engineering, Huazhong University of Science and Technology, Wuhan, China

Abstract:
Temporal sentence grounding (TSG) aims to locate a semantically related segment of an untrimmed video guided by a sentence query. Since the untrimmed videos are too long, almost all existing TSG works first sparsely down-sample each video into a shorter video of a fixed length and then conduct multimodal interactions with the query sentence for reasoning. However, this video down-sampling process may introduce a challenging issue that confuses the latter grounding process: Due to the video down-sampling, some query-related frames may be filtered out; this process may remove the specific boundary frames of the target segment and take the adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary-bias and reasoning-bias. Therefore, it is important to keep the grounding consistency (both temporal annotations and boundary predictions) between the original and the sampled videos. To this end, in this paper, we propose a novel Conditional Video Diffusion Network (CVDN) for TSG to learn extra visual semantics to enrich and refine the biased new boundaries, which enables soft-label boundary prediction for fine-grained frame-query reasoning. Specifically, we first construct a conditional video diffusion model which is separately trained to recover the consecutive semantics of the filtered frames between the adjacent sampled frames. Through the designed stochastic interval sampling strategies in the training process, this diffusion model is able to generate absent coherent semantics between the sparsely sampled frames and in turn enrich and refine them, benefiting the integral activity understanding for TSG. In this manner, the incorrect new boundaries will be refined to be closely correlated to the original boundary frames and contain sufficient query-related information, which is crucial for accurate segment prediction. Extensive experiments on three challenging datasets demonstrate the effectiveness of CVDN.
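The boundary bias described above can be illustrated with a minimal sketch. It assumes uniform sparse sampling, which the abstract does not specify; the function names `sample_indices` and `sampled_boundaries` are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Uniformly pick `num_samples` frame indices from a video (an assumed sampling scheme)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def sampled_boundaries(idx: np.ndarray, start: int, end: int):
    """Return the first and last sampled frames that fall inside the annotated segment."""
    inside = idx[(idx >= start) & (idx <= end)]
    if inside.size == 0:
        return None  # the whole segment was filtered out by down-sampling
    return int(inside[0]), int(inside[-1])

# A 100-frame video reduced to 10 frames; the annotated segment spans frames 25..60.
idx = sample_indices(100, 10)
new_bounds = sampled_boundaries(idx, 25, 60)
print(idx, new_bounds)
```

In this toy case the true boundary frames 25 and 60 are not among the sampled frames, so the nearest sampled frames inside the segment become the new, biased boundaries.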

Abstract:
Current salient object detection methods achieve good performance by aggregating multi-level features from a fully convolutional network. However, in the process of feature aggregation, noise is introduced due to the information difference between features at different levels. Besides, the semantics of high-level features are diluted as they pass along the top-down pathway, which makes it difficult for the model to completely separate salient objects from the background in complex scenes. To address the above problems, we propose two deep priors: a global location prior (GLP) and a local contrast prior (LCP). The GLP is generated from the prediction and can enhance the semantics of aggregated features at each level to locate salient objects. Compared with aggregation-based models that directly use high-level features as enhanced semantics, the proposed GLP contains richer semantics and details. The LCP is inferred from the weighted differences between the center pixel and surrounding pixels in backbone features, and can select discriminative features and suppress the noise in aggregated features by multiplication with a residual connection. Based on the two priors, we propose a novel twice-decoding network, where the first decoding generates the GLP by aggregating multi-level features and the LCP, and the second decoding refines salient objects by using the GLP and LCP. Different from previous methods which use a recurrent structure to merge the output into input images, the proposed network only applies the output in decoding to avoid interference from raw images. Comprehensive experiments on five datasets show that the proposed method outperforms state-of-the-art ones on five evaluation metrics.

Abstract:
Few-shot semantic segmentation aims to extract information from a few annotated support images to segment objects of unknown classes in the query image. Traditional algorithms that use multi-layer cosine similarity to extract correlation information may produce errors and extract insufficient features, due to the large differences in appearance and posture among objects of a novel class, as well as the similarity in texture and shape among different categories. To address the above issue, we propose a Complementary Feature-Enhanced Network (CFENet). Specifically, we propose a correlation complementary extraction module (CCEM) to facilitate long-range information interaction between query features and support features in the intermediate layer, which contains detailed information. The generated multi-channel correlation information complements the prior information obtained through cosine similarity comparison. In addition, we propose a multi-branch feature enhancement module to capture long-range dependencies in aggregated features, which are composed of prior correlation information and query features. The module effectively suppresses noise in the aggregated features and enhances the query target object feature from both global and local perspectives in a complementary way. Experiments on the PASCAL-5^i and COCO-20^i datasets validate the effectiveness of our proposed method.
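As a rough illustration of the cosine-similarity prior mentioned above, the sketch below computes, for every query location, its maximum cosine similarity to the support foreground features. This is a common formulation in few-shot segmentation and is an assumption here, not the CFENet implementation; the function name `cosine_prior` and the min-max normalization are illustrative choices.

```python
import numpy as np

def cosine_prior(query_feat, support_feat, support_mask, eps=1e-7):
    """Prior map over query pixels from cosine similarity to support foreground pixels.

    query_feat:   (C, Hq, Wq) feature map of the query image
    support_feat: (C, Hs, Ws) feature map of the support image
    support_mask: (Hs, Ws) binary foreground mask (assumed to be non-empty)
    """
    c = query_feat.shape[0]
    q = query_feat.reshape(c, -1)                                     # (C, Nq)
    s = support_feat.reshape(c, -1)[:, support_mask.reshape(-1) > 0]  # (C, Ns)
    # L2-normalize each pixel's feature vector so dot products are cosines.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    sim = s.T @ q                                                     # (Ns, Nq)
    prior = sim.max(axis=0)                                           # best match per query pixel
    # Min-max normalize to [0, 1] so the map can be used as a soft prior.
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(query_feat.shape[1:])

# Toy example: two query pixels, one aligned with the support foreground feature.
query = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])   # shape (2, 1, 2)
support = np.array([[[2.0]], [[0.0]]])           # shape (2, 1, 1)
mask = np.array([[1]])
print(cosine_prior(query, support, mask))
```

A map of this kind captures only a single-channel similarity per pixel, which is the limitation that complementary multi-channel correlation information is meant to address.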

Abstract:
With the rapid growth of activities on the web, large amounts of interaction data on multimedia platforms, including e-commerce, music sharing, and social media, are easily accessible. By discovering various interests of users, recommender systems can improve user satisfaction without accessing overwhelming personal information. Compared to graph-based models, hypergraph-based collaborative filtering has the ability to model higher-order relations besides pair-wise relations among users and items, where the hypergraph structures are mainly obtained from specialized data or external knowledge. However, such well-constructed hypergraph structures are often not readily available. To this end, we first propose a novel framework named HGRec, which can enhance recommendation via automatic hypergraph generation. By exploiting a clustering mechanism based on user/item similarity, we group users and items without additional knowledge for hypergraph structure learning, and we design a cross-view recommendation module to alleviate the combinatorial gaps between the representations of the local ordinary graph and the global hypergraph. Furthermore, we devise a sparse optimization strategy to ensure the effectiveness of hypergraph structures, where a novel integration of the $\ell_{2,1}$-norm and the optimal transport framework is designed for hypergraph generation. We term the HGRec model with the sparse optimization strategy HGRec++. Extensive experiments on public multi-domain datasets demonstrate the superiority of our HGRec++, which achieves average improvements of 8.1% and 9.8% over state-of-the-art baselines in terms of Recall and NDCG, respectively.
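For readers unfamiliar with the norm named above, here is a minimal sketch of the ℓ2,1-norm under the common row-wise convention (sum of the Euclidean norms of the rows). Whether HGRec++ applies it to rows or columns, and to which matrix, is not stated in the abstract, so the convention and the name `l21_norm` are assumptions.

```python
import numpy as np

def l21_norm(matrix) -> float:
    """Row-wise l2,1-norm: the sum over rows of each row's Euclidean norm."""
    m = np.asarray(matrix, dtype=float)
    return float(np.sqrt((m ** 2).sum(axis=1)).sum())

# Two matrices with the same Frobenius norm (5.0): the l2,1-norm is smaller
# when the energy is concentrated in a single row (row-sparse structure).
row_sparse = np.array([[3.0, 4.0], [0.0, 0.0]])
spread_out = np.array([[3.0, 0.0], [0.0, 4.0]])
print(l21_norm(row_sparse), l21_norm(spread_out))
```

Penalizing this norm tends to drive entire rows to zero, which is why it is widely used to encourage structured (group) sparsity.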

Abstract:
The bokeh effect is a natural shallow depth-of-field phenomenon that blurs the out-of-focus parts of a photograph. In recent years, a series of works have proposed automatic and realistic bokeh rendering methods for artistic and aesthetic purposes. They usually employ cutting-edge data-driven deep generative networks with complex training strategies and network architectures. However, these works neglect that the bokeh effect inevitably affects subsequent visual intelligent tasks such as recognition, and their data-driven nature prevents them from studying the influence of bokeh-related physical parameters (i.e., depth of field) on these tasks. To fill this gap, we study a totally new problem, i.e., natural & adversarial bokeh rendering, which consists of two objectives: rendering realistic and natural bokeh and fooling visual perception models (i.e., a bokeh-based adversarial attack). Specifically, we propose the circle-of-confusion predictive network (CoCNet), which takes the all-in-focus image and the depth image as inputs to estimate circle-of-confusion parameters for each pixel; these parameters are employed to render the final image through a well-known physical model of bokeh. Moreover, we propose the adversarial bokeh attack by fixing the CoCNet while optimizing the depth map w.r.t. the visual perception tasks. We are then able to study the vulnerability of deep neural networks to depth variations in the real world. Extensive experiments show that our method produces more realistic bokeh than state-of-the-art methods while fooling powerful deep neural networks with a high accuracy drop.
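To make the circle-of-confusion idea concrete, the sketch below evaluates the classical thin-lens formula for the blur-circle diameter at each depth. This is textbook optics and only an assumed stand-in: in the paper the per-pixel parameters are predicted by CoCNet, and the abstract does not say which physical model is used. The function name and all parameter values are illustrative.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture):
    """Thin-lens circle-of-confusion diameter for each depth value.

    depth:      scene depth per pixel (same length unit as the other arguments)
    focus_dist: distance of the in-focus plane
    focal_len:  focal length of the lens (must be smaller than focus_dist)
    aperture:   aperture diameter
    """
    depth = np.asarray(depth, dtype=float)
    magnification = focal_len / (focus_dist - focal_len)
    return aperture * np.abs(depth - focus_dist) / depth * magnification

# A 50 mm lens at f/2 (25 mm aperture) focused at 2 m; depths in metres.
coc = circle_of_confusion([1.0, 2.0, 4.0], focus_dist=2.0,
                          focal_len=0.05, aperture=0.025)
print(coc)
```

Pixels on the focus plane get a zero-diameter blur circle, and the blur grows as depth departs from the focus distance, which is why perturbing the depth map changes the rendered bokeh.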

Abstract:
Continual learning is a research field of artificial neural networks to simulate human lifelong learning ability. Although a surge of investigations has achieved considerable performance, most rely only on image modality for incremental image recognition tasks. In this article, we propose a novel yet effective framework coined cross-modal Alternating Learning with Task-Aware representations (ALTA) to make good use of visual and linguistic modal information and achieve more effective continual learning. To do so, ALTA presents a cross-modal joint learning mechanism that leverages simultaneous learning of image and text representations to provide more effective supervision. And it mitigates forgetting by endowing task-aware representations with continual learning capability. Concurrently, considering the dilemma of stability and plasticity, ALTA proposes a cross-modal alternating learning strategy that alternately learns the task-aware cross-modal representations to match the image-text pairs between tasks better, further enhancing the ability of continual learning. We conduct extensive experiments under various popular image classification benchmarks to demonstrate that our approach achieves state-of-the-art performance. At the same time, systematic ablation studies and visualization analyses validate the effectiveness and rationality of our method.

Abstract:
Text-to-image person re-identification (ReID) is a common subproblem in the field of person re-identification and image-text retrieval. Recent approaches generally follow the structure of a dual-stream network, extracting image and text features. There is no deep interaction between images and text in this approach, making it difficult for the network to learn a highly semantic feature representation. In addition, for both image data and text data, the feature extraction process is modeled in a regular way, such as using Transformer to extract sequence embeddings. However, this type of modeling disregards the inherent relationships among multimodal input embeddings. A more flexible approach to mining multimodal data, which uniformly treats the data as graphs, is proposed. In this way, the extraction and interaction of multimodal information are accomplished by means of messages passing between graph nodes. First, a unified multimodal feature extraction and fusion network is proposed based on the graph convolutional network, which enables the progression of multimodal information from ‘local’ to ‘global’. Second, an asymmetric multilevel alignment module, which focuses on more accurate ‘local’ information from a ‘global’ perspective, is proposed to progressively divide the multimodal information at each level. Last, a cross-modal representation matching strategy based on similarity distribution and mutual information is proposed to achieve cross-modal alignment. The proposed algorithm in this paper is simple and efficient, and the testing results on three public datasets (CUHK-PEDES, ICFG-PEDES and RSTPReID) show that it can achieve SOTA-level performance.

Abstract:
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the impact of spatial and temporal relation reasoning on each other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.

Abstract:
Advancements in wearable technology and their capacity to interpret user movements, transforming them into interactive actions in virtual environments, have sparked an increased demand for user flexibility within these spaces. A direct outcome of this growing trend is the imperative need for automated cinematography in expansive, open-world scenarios. Nevertheless, the task of interpreting these interactive sequences through automated cinematography in unconstrained environments involves significant computational challenges. In response to this, we introduce the Automated Adaptive Cinematography for Open-world Generative Adversarial Network (AACOGAN), an innovative solution that addresses these issues. Contrary to traditional models, which require comprehensive prior knowledge about scenes, characters, and objects, AACOGAN identifies and models the relationships among user interactions, object positions, and camera movements during the process of user engagement. This novel approach allows the model to function effectively even in open-world scenarios riddled with numerous uncertain factors. In the experimental phase, we developed and employed the MineStory Dataset, designed specifically for automatic cinematography in open-world scenarios. We devised and implemented novel metrics that are more congruent with the distinctive features of open-world scenarios. These metrics provide a more nuanced understanding of the performance and effectiveness of our proposed method. Experimental findings substantiate that AACOGAN significantly enhances automatic cinematography performance within open-world contexts, including an average increase of 73% in the correlation between user interactions and camera trajectories, and an increase of up to 32.9% in the quality of multi-focus scenes. Therefore, AACOGAN emerges as an efficient and innovative solution for creating appropriate camera shots for a myriad of interactive motions in open-world scenarios.

Abstract:
Monocular 3D human pose estimation is an ill-posed problem in computer vision due to its depth ambiguity. Most existing works supplement the depth information by extracting temporal pose features from video frames, and they have made notable progress. However, these approaches divide a long sequence of video frames into multiple short sequences for separate processing, which leads to the loss of complementary information between sequences. Furthermore, the short-term temporal correlation among frames in a sequence is often not fully exploited. To model temporal dependencies efficiently, we propose the frame-padded multiscale transformer approach, which includes a frame-padded video sequence preprocessing step and a multiscale temporal transformer backbone. Our approach addresses the omission of the temporal features of edge frames in existing approaches by padding video frames in the shallow layer. In addition, we extract the temporal information of 3D human poses using a multiscale transformer to enhance the short-term correlation of human pose skeleton keypoints. Extensive experiments validate the effectiveness of our approach on two popular datasets: Human3.6M and MPI-INF-3DHP. The results show that our approach achieves state-of-the-art performance.

Abstract:
Siamese and Transformer trackers have demonstrated exceptional performance in visual object tracking. These methods utilize initial and potentially online templates to locate the target in subsequent frames. Despite their success, these trackers are vulnerable to changes in the target's appearance due to slow template updates and interference from similar objects, resulting from the absence of scene information. To address these issues, we introduce a reference region within our tracker. The reference region is updated rapidly, providing short-term scene information. By associating the initial template, reference region, and current search region, we enhance the tracker's ability to adapt to changes in target appearance and discriminate between the target and other objects. Additionally, we propose a novel Reference-Enhance (RE) module, which aggregates contextually relevant information from the reference region to enhance the template feature. Extensive experiments show our method achieves state-of-the-art performance on six popular visual object tracking benchmarks while running at over 40 FPS.

Abstract:
Visible-infrared Person Re-identification (VI-ReID) aims to retrieve images of pedestrians with the same identity from different modalities and cameras, given a pedestrian image. To reduce modality discrepancy, existing methods often perform hard partitioning to mine more details. However, these methods employ only uniform partitioning without considering pedestrian structure, and thus lose a lot of pedestrian semantic information. To this end, this paper proposes a structural semantic representation reconstruction (SSRR) method to capture pedestrian semantic information by focusing on pedestrian structure. Specifically, based on the fine-grained features obtained by hard partitioning, we carry out structural reconstruction to obtain reconstructed features containing semantic information. By adopting a direct link reconstruction structure, the reciprocal learning of fine-grained features and semantic features is ensured. Semantic features are reconstructed based on fine-grained features, and semantic information in turn helps fine-grained features better capture pedestrian-related details. In addition, a local consistency loss is introduced to ensure the consistency of fine-grained features at the same component location, further enhancing the discriminability of the learned reconstructed representation. Extensive experiments confirm the superiority of our method on two public datasets, SYSU-MM01 and RegDB.

Abstract:
Low-light image enhancement tasks demand an appropriate balance among brightness, color, and illumination. However, existing methods often focus on one aspect of the image without considering this balance, which causes problems such as color distortion and overexposure. This seriously affects both human visual perception and the performance of high-level visual models. In this work, a novel synergistic structure is proposed which can balance brightness, color, and illumination more effectively. Specifically, the proposed method, called the Joint Correcting and Refinement Network (JCRNet), mainly consists of three stages that balance the brightness, color, and illumination of the enhanced image. Stage 1: we utilize a basic encoder-decoder and a local supervision mechanism to extract local information and more comprehensive details for enhancement. Stage 2: cross-stage feature transmission and spatial feature transformation further facilitate color correction and feature refinement. Stage 3: we employ a dynamic illumination adjustment approach to embed residuals between predicted and ground truth images into the model, adaptively adjusting illumination balance. Extensive experiments demonstrate that the proposed method exhibits comprehensive performance advantages over 21 state-of-the-art methods on 9 benchmark datasets. Furthermore, a more persuasive experiment has been conducted to validate the effectiveness of our approach in downstream visual tasks (e.g., saliency detection). Compared to several enhancement models, the proposed method effectively improves the segmentation results and quantitative metrics of saliency detection.

Abstract:
Co-speech gesture synthesis is a practical yet challenging task that aims to generate body motion sequences in line with speech audio. Most existing methods can only generate a gesture sequence with a fixed number of frames, which does not satisfy the high-quality requirements of virtual speech videos in real-world applications. In this paper, we propose a novel Implicit Compositional Generative Network (ICGN) for length-variable co-speech gesture synthesis. In ICGN, the implicit neural representation is captured and optimized for a whole gesture sequence of arbitrary length with temporal embeddings. Moreover, to make the synthesized gestures more realistic and consistent, we compositionally generate the gesture sequence through a well-designed asymmetric two-stream network that effectively captures and utilizes the rich correlations between speech audio and human body motions. In this way, the coarse and fine-grained gestures are synthesized according to the corresponding content-aware and emotion-aware audio components, respectively. Extensive experiments on four widely used benchmarks demonstrate that the proposed method renders realistic human gestures and achieves superior performance against several state-of-the-art methods.

Abstract:
Weakly supervised group activity recognition addresses the dependence on individual-level annotations when understanding scenes involving multiple individuals, which is a challenging task. Existing methods either use trained detectors to extract individual features or utilize attention mechanisms for partial context encoding, followed by integration to form the final group-level representations. However, the detectors require individual-level annotations during the training phase and suffer from mis-detection, and the partial contexts extracted directly from the whole complex scene are too ambiguous without the guidance of concrete semantics. In this article, we investigate the hierarchical structure inherent in group-level labels to extract fine-grained semantics without using detectors for weakly supervised group activity recognition. A multi-hot encoding strategy combined with a semantic encoder is first adopted to obtain the label semantic embeddings. The semantic and visual scene information are then fused through a semantic decoder to obtain activity-specific features. Lastly, we employ multi-label classification and integrate the scores of hierarchical activity labels. Experimental results show that our proposed method achieves state-of-the-art performance on three benchmarks, and its accuracy on the Volleyball dataset exceeds that of the second-best method by 2%.

Abstract:
RGB-Thermal (RGB-T) pedestrian detection aims to locate pedestrians in RGB-T image pairs to exploit the complementation between the two modalities for improving detection robustness in extreme conditions. Most existing algorithms assume that the RGB-T image pairs are well registered, while in the real world, they are not ideally aligned due to parallax or different field-of-view of the cameras. The pedestrians in misaligned image pairs may be located at different positions in two images, which results in two challenges: 1) how to achieve inter-modality complementation using spatially misaligned RGB-T pedestrian patches and 2) how to recognize unpaired pedestrians at the boundary. To address these issues, we propose a new paradigm for unregistered RGB-T pedestrian detection, which predicts two separate pedestrian locations in RGB and thermal images. Specifically, we propose a cross-modality proposal-guided feature mining (CPFM) mechanism to extract two precise fusion features for representing a pedestrian in the two modalities, even if the given RGB-T image pair is unaligned. It enables us to effectively exploit the complementation between the two modalities. With the CPFM mechanism, we build a two-stream dense detector that predicts two pedestrian locations in the two modalities based on the corresponding fusion features mined by the CPFM mechanism. In addition, we design a data augmentation method, named Homography, to simulate the discrepancy in scales and views between images. We also investigate two non-maximum suppression (NMS) methods for post-processing purposes. Favorable experimental results demonstrate the effectiveness and robustness of our method in addressing unregistered pedestrians with different shifts.
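For context on the post-processing step mentioned above, here is a minimal sketch of standard greedy non-maximum suppression. The abstract does not say which two NMS variants the paper investigates, so this generic version is an assumption for illustration only; the function names and the 0.5 IoU threshold are illustrative.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above iou_thr, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_thr]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```

In a detector that outputs separate RGB and thermal locations per pedestrian, a routine like this would have to decide which modality's boxes drive the suppression, which is presumably why more than one NMS strategy is worth comparing.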

Abstract:
The surge in popularity of 3D scene synthesis has driven the development of diverse methods for assessing the quality of synthesized scenes. While subjective assessment methods are widespread, their time-consuming and labor-intensive nature prompts exploration into more efficient objective alternatives. This paper introduces an objective approach to evaluating scene plausibility, aiming to overcome the limitations associated with subjective methods. To underpin our objective evaluation, we present the 3D-SPAD dataset, comprising plausibility scores for 3000 scenes across 46 object categories. Leveraging this dataset, we propose a graph attention-based network designed to accurately estimate scene plausibility. A comprehensive evaluation of our network is conducted through a series of experiments, showcasing its feasibility and reliability.

Abstract:
Multimodal topic detection is an important social media analysis task with a wide variety of real-world applications. However, jointly modeling multimodal data and inferring its topics is challenging due to the semantic gaps between different modalities. Our insight comes from psychological findings pertaining to the hierarchical structure in humans' inherent perception of images and texts. In this paper, we propose a Multimodal Hierarchical Reasoning Network (MHRN) to perform multimodal inference for topic detection. The images and texts are represented in a hierarchical model named the Multimodal Part-whole Aware Graph (MPAG). MHRN then performs reasoning for topic inference based on three modules, which include a Bottom-Up Aggregation (BUA) module for encoding the hierarchical connections and sibling relations in MPAG, a Top-Down Guidance (TDG) module for enriching features of the nodes in MPAG guided by their parents, and a Bottom-Up Cross Aggregation (BUCA) module for capturing and aggregating the cross-modality cues to achieve effective multimodal reasoning. Extensive experiments are conducted on two benchmarks, and the results demonstrate the superiority of our approach.

Abstract:
Images corrupted with degradations often result in a performance drop in downstream image recognition models trained on clean images. Previous image restoration (IR) methods either restore the images without delicately considering the semantic recovery, or the training objectives cannot meet unseen recognition models, leading to poor and non-generalizable performance for various downstream recognition tasks. In this paper, we propose a general Image Restoration framework for Visual Recognition (IRVR), which addresses generalized and effective semantic recovery in image restoration for a range of high-level tasks. Concretely, for better generalization, we train the IR models with semantic recovery as the primary objective, and image regression as a regularization term, respectively, where the primary objective gradient is calibrated with the regularization gradient to ensure the generalization of IR to unseen recognition models. For effectiveness, we introduce an intrinsic semantic consistency constraint to match the semantic statistical distribution between restored and clean image pairs. Our IRVR is recognition-agnostic and orthogonal to IR, making it a plug-and-play component that can be incorporated into existing IR methods without adding any computational cost during inference. Extensive experiments demonstrate the effectiveness and generalization of our IRVR for improving the performance of IR in diverse downstream high-level tasks. The IRVR's ability to accurately recover intrinsic semantics in images is instrumental in high-level machine analysis, which ensures the integrity and authenticity of multimedia content.

Abstract:
Sketch-based melody creation systems enable people to compose melodies by converting human-sketched melody contours into coherent melodies that fit the depicted contours. This remains one of the most intuitive approaches to interactive music creation. However, previous studies remain limited in usability and interpretability, which hinders effective interactions between people and AI. For one thing, these studies require additional complex musical conditions as auxiliary inputs (e.g. chord progressions, contextual melodies, and predetermined rhythms) and support only fixed-length, rule-based melody generation. This makes existing systems less usable, with generated melodies lacking diversity and coherence. Moreover, users without enough musical expertise might find it difficult to define appropriate inputs and to interpret the role of these inputs in guiding melody generation. To address these limitations, we present Drawlody, a novel sketch-based melody creation system with enhanced usability and interpretability. Specifically, Drawlody simplifies user input requirements by excluding all complex musical conditions, using only a simplified melody contour representation named Generalised Melody Contour (GMC) as input. This simplification clarifies the role of user controls, making the system more usable for people without musical training. To guide coherent melody generation from GMC, we propose the FlexMIDI music representation, which simulates the tonal structure of melodies and faithfully explains how human-sketched contours guide melody generation. We employ a CNN-Transformer-based architecture as the foundation model to achieve arbitrary-length melody generation. Drawlody is evaluated by both objective and subjective music quality studies, as well as a usability and interpretability study. The results support its enhanced usability, interpretability, and high-quality melody generation capabilities.

Abstract:
Online schemes and nonlocal similarity are two effective approaches for strengthening robust principal component analysis (RPCA) techniques in video denoising. However, their limitations are also evident. The online scheme is usually highly efficient but lacks consideration of regional appearance information, and thus cannot effectively handle videos with complex dynamics such as object movements. On the other hand, nonlocal similarity is used to better utilize regional information but incurs a heavy computational cost. Moreover, these two techniques are incompatible and difficult to combine. To overcome this barrier and harness the advantages of both approaches, this paper proposes a novel online nonlocal RPCA method. 1) A clustering-based nonlocal strategy (ClusNonlocal) is adopted, which not only greatly reduces the computation cost, but also forms low-dimensional subspaces for online processing. 2) A new weighted RPCA model is proposed, which assigns different importance to different samples and improves the performance of subspace pursuit and video recovery. 3) A multi-level subspace updating scheme and a weighted projection method are proposed, which keep the performance of online video data processing at a high level at all times. A series of video denoising experiments are carried out to demonstrate the overall advantages of our procedure over several others, in terms of both visual quality and running speed.
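For orientation, the classical RPCA model that such methods build on (often called principal component pursuit) can be written as below. This is the standard textbook formulation, not the weighted online model proposed in the abstract. The symbols X (noisy data matrix, e.g. vectorized frames or patches), L (low-rank component), S (sparse component), and λ (trade-off weight) are assumed notation.

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \, \|S\|_{1}
\quad \text{subject to} \quad X = L + S
```

Here \|L\|_{*} is the nuclear norm (the sum of singular values, which promotes low rank) and \|S\|_{1} is the entrywise l1 norm (which promotes sparsity of the outlier or noise term).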

Abstract:
Existing natural image segmentation methods often struggle to obtain accurate contours or over-segment local areas, and their mining of multi-feature image information is insufficient. To address these problems, a learnable tensor graph fusion framework for natural image segmentation (LTGF-NIS) is proposed in this paper. First, we pre-process the original images with an adaptive morphological reconstruction watershed transform and extract multi-feature data; the resulting multi-feature data matrix contains information about the image. Then we design an adaptive weighted tensor affinity graph fusion method (AWTAGF), which learns the higher-order correlation of the multi-feature information through two coupled tensors while achieving a consistent representation of the multi-feature information through adaptive weighted graph fusion. Finally, the obtained affinity graph is partitioned by spectral clustering to realize the segmentation of natural images. We test the affinity graph learning ability and the natural image segmentation performance of the proposed algorithm on several datasets. The experimental results indicate that the segmentation performance of our framework is superior to that of some existing advanced methods.

Abstract:
Recently, parallax attention based stereo image super-resolution (SR) methods, which can better explore cross-view information, have been widely studied. Despite the impressive performance of these methods, almost all of them calculate parallax attention maps at a single low resolution, which will lead to ambiguous stereo correspondence. Besides, the widely used parallax attention module (PAM) cannot handle the illuminance variations in stereo image pairs, and cannot distinguish the contribution of the captured cross-view features to the reconstruction of the target view. To this end, in this paper, we propose a coarse-to-fine cross-view interaction based network (C2FNet) to achieve more accurate cross-view information capturing. Firstly, in C2FNet, a coarse-to-fine cascaded parallax attention structure (C2F-CPAS), which conforms with the human visual mechanism, is constructed to gradually perform parallax attention from the low-resolution to high-resolution level. Thus, richer textures can be used to learn more reliable stereo correspondence. Meanwhile, a multi-level attention transfer loss is designed to further calibrate the accuracy of stereo correspondence at each level. Secondly, we propose a modified PAM (MPAM) to alleviate the limitations of common PAM so that illuminance-robust stereo correspondence can be learned and more important cross-view information can be selected. Extensive experimental results show that our proposed C2FNet outperforms the state-of-the-art methods on various datasets.

Abstract:
Temporal action localization (TAL), which aims to identify and localize actions in long untrimmed videos, is a challenging task in video understanding. Recent studies have shown that the Transformer and its variants are effective at improving the performance of TAL. The success of the Transformer can be attributed to the use of multi-head self-attention (MHSA) as a token mixer to capture long-term temporal dependencies within the video sequence. However, in the existing Transformer architecture, the features obtained by multiple token mixing (i.e., self-attention) heads are treated equally, which neglects the distinct characteristics of different heads and hampers the exploitation of discriminative information. To this end, we present a new method called the adaptive dual selective Transformer (ADSFormer) for TAL in this paper. The key component in ADSFormer is the dual selective multi-head token mixer (DSMHTM), which integrates multiple feature representations from different token mixing heads by adaptively selecting important features across both the head and channel dimensions. Moreover, we also incorporate our ADSFormer into a pyramid structure so that the multi-scale features obtained can be effectively combined to improve TAL performance. Benefiting from the dual selective multi-head token mixer (DSMHTM) and pyramid feature combination, ADSFormer outperforms several state-of-the-art methods on four challenging benchmark datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100 and ActivityNet-1.3.

Abstract:
Point cloud registration suffers from repeated patterns and low geometric structures in indoor scenes. Recent transformer-based methods utilise the attention mechanism to capture global correlations in feature space and improve registration performance. However, for indoor scenarios, global correlation loses its advantages as it cannot distinguish truly useful features from noise. To address this problem, we propose an image-geometry-assisted point cloud registration method by integrating image information into point features and selectively fusing the geometric consistency with respect to reliable salient areas. Firstly, an Intra-Image-Geometry fusion module is proposed to integrate the texture and structure information into the point feature space by the cross-attention mechanism. Initial corresponding superpoints are acquired as salient anchors in the source and target. Then, a selective correlation fusion module is designed to embed the correlations between the salient anchors and points. During training, the saliency location and selective correlation fusion modules exchange information iteratively to identify the most reliable salient anchors and achieve effective feature fusion. The obtained distinctive point cloud features allow for accurate correspondence matching, leading to the success of indoor point cloud registration. Extensive experiments are conducted on 3DMatch and 3DLoMatch datasets to demonstrate the outstanding performance of the proposed approach compared to the state-of-the-art, particularly in those geometrically challenging cases such as repetitive patterns and low-geometry regions.

Abstract:
Stereo image dehazing aims to restore haze-free images by leveraging the complementary information contained in binocular images. Current methods primarily focus on designing image-level modules and pipelines to utilize complementary information between the left and right-view images. However, these image-level cross-view interactions overlook regional differences in haze concentration and stereo image disparity maps. Consequently, we propose a Progressive Stereo Image Dehazing Network via Cross-view Region Interaction, termed PSIDNet, which fully considers the internal characteristics and external manifestation of haze and disparity, and explicitly addresses the stereo image dehazing task by a regional-aware interactive mechanism. Specifically, we divide hazy images into regions and independently interact with left and right-view information at region levels, meaning weights are not shared across regional patches. This approach allows us to treat different regions with different priorities, i.e., concentrate on regional patches with heavier haze concentration and larger disparities, hence enabling more accurate restoration of hazy images. Furthermore, we introduce an effective cross-view region interactive block that extracts information based on the channel dimension of dual views and later adopts matrix multiplication to generate mutual attention maps based on the fused features. Extensive experiments on synthetic and real-scenario datasets demonstrate the efficacy of our method, compared to other related monocular and stereo image dehazing and restoration methods.

Abstract:
Multimodal recommender systems amalgamate multimodal information (e.g., textual descriptions, images) into a collaborative filtering framework to provide more accurate recommendations. While the incorporation of multimodal information could enhance the interpretability of these systems, current multimodal models represent users and items utilizing entangled numerical vectors, rendering them arduous to interpret. To address this, we propose a Disentangled Graph Variational Auto-Encoder (DGVAE) that aims to enhance both model and recommendation interpretability. DGVAE initially projects multimodal information into textual contents, such as converting images to text, by harnessing state-of-the-art multimodal pre-training technologies. It then constructs a frozen item-item graph and encodes the contents and interactions into two sets of disentangled representations utilizing a simplified residual graph convolutional network. DGVAE further regularizes these disentangled representations through mutual information maximization, aligning the representations derived from the interactions between users and items with those learned from textual content. This alignment facilitates the interpretation of user binary interactions via text. Our empirical analysis conducted on three real-world datasets demonstrates that DGVAE significantly surpasses the performance of state-of-the-art baselines by a margin of 10.02%. We also furnish a case study from a real-world dataset to illustrate the interpretability of DGVAE.

Abstract:
Deep learning-based clustering methods, especially those incorporating deep generative models, have recently shown noticeable improvement on many multimedia benchmark datasets. However, existing generative models still suffer from unstable training and vanishing gradients, which results in the inability to learn desirable embedded features for clustering. In this paper, we aim to tackle this problem by exploring the capability of Wasserstein embedding in learning representative embedded features and introducing a new clustering module for jointly optimizing embedding learning and clustering. To this end, we propose Wasserstein embedding clustering (WEC), which integrates robust generative models with clustering. By directly minimizing the discrepancy between the prior and marginal distribution, we transform the optimization problem of Wasserstein distance from the original data space into embedding space, which differs from other generative approaches that optimize in the original data space. Consequently, it naturally allows us to construct a joint optimization framework with the designed clustering module in the embedding layer. Due to the substitutability of the penalty term in Wasserstein embedding, we further propose two types of deep clustering models by selecting different penalty terms. Comparative experiments conducted on nine publicly available multimedia datasets with several state-of-the-art methods demonstrate the effectiveness of our method.

Abstract:
Composed image retrieval is a challenging task in the field of multi-modal learning, aiming at measuring the similarities between target images and query images with modification sentences. Most previous methods either construct feature composition for the query image and modification text or concentrate on extracting cross-modal alignments. However, these methods tend to neglect the negative impacts of the mismatched correspondences between the hybrid-modal query and target, which could be discriminative when comparing similar instances. Besides, localized textual representations are not fully explored when learning similarities between the query and the target. To overcome the above issues, we propose a Negative-Sensitive Framework with Semantic Enhancement (NSFSE) for mining the adaptive boundaries between matched and mismatched samples with comprehensive consideration of positive and negative correspondences. It can optimize the threshold dynamically based on distributions to explore the intrinsic characteristics of positive and negative correlations, which could further facilitate accurate similarity learning. A text-guided attention mechanism after infusing cross-modal affinities on localized word features is exploited in NSFSE to explore latent semantic-related visual similarity and cross-modal similarity simultaneously. Extensive experiments and comprehensive analysis on three representative datasets, CIRR, FashionIQ, and Fashion200K, demonstrate the effectiveness of negative mining of similarity with semantic enhancement in the proposed NSFSE.

Abstract:
Underwater image quality assessment (UIQA) plays a crucial role in monitoring and detecting the quality of acquired underwater images in underwater imaging systems. Currently, the investigation of UIQA encounters two major challenges. First, large-scale UIQA databases for benchmarking UIQA algorithms are still lacking, which greatly restricts the development of UIQA research. Second, there is a shortage of effective UIQA methods that can faithfully predict underwater image quality. To alleviate these two challenges, in this paper, we first construct a large-scale UIQA database (UIQD). Specifically, UIQD contains a total of 5369 authentic underwater images that span abundant underwater scenes and typical quality degradation conditions. Extensive subjective experiments are executed to annotate the perceived quality of the underwater images in UIQD. Based on an in-depth analysis of underwater image characteristics, we further establish a novel baseline UIQA metric that integrates channel and spatial attention mechanisms and a transformer. Channel and spatial attention modules are used to capture the image channel and local quality degradations, while the transformer module characterizes the image quality from a global perspective. A multilayer perceptron is employed to fuse the local and global feature representations and yield the image quality score. Extensive experiments conducted on UIQD demonstrate that the proposed UIQA model achieves superior prediction performance compared with the state-of-the-art UIQA and IQA methods.

Abstract:
Intuitively, relations among objects assist a model in performing inference under constrained environments. However, the top-down information flow in the Feature pyramid network (FPN) dilutes the relation features contained in the non-adjacent layers. Such a defect reduces the accuracy of detectors, especially for small or obscured objects. To adequately exploit the relations among object instances, we propose the relation-aware feature pyramid network (RaFPN), a simple but effective balanced multi-scale feature module for dense image prediction. RaFPN models the relations among objects by computing the similarity between pixels located on cross-scale features. The result is then delivered to FPN to guide the detector in completing accurate inference. Specifically, we first generate a pair of cross-scale aggregated features based on the channel importance of the output features from FPN. After that, the relation among the cross-scale objects is extracted by a bi-directional interaction mechanism. Finally, relation features are injected directly into each layer of the feature pyramid to avoid dilution. In this way, the relation among instances can adequately guide the detector for dense prediction. Our RaFPN pushes the performance bound of Faster RCNN by 2.0 AP (average precision), outperforming the recent state-of-the-art FPN-based improvements. Notably, for dense prediction tasks such as instance, semantic, and panoptic segmentation, our method brings consistent boosts to them as well.

Abstract:
Detecting objects in large-scale drone-view images is notoriously challenging due to their uneven distribution and the scale variation caused by shooting angles. Common approaches improve drone-view object detection through two-step detection (i.e., detecting sub-regions first) and multi-scale input. However, all these methods suffer from onerous computational costs due to their high model complexity and input resolution. In this paper, we propose a novel one-step detector, called SDPDet, to enable effective object learning in drone-view images. In particular, a Scale-separated Activation Pyramid (SAP) serves to focus on the regions with objects aggregated at each scale, and a Scale-separated Learnable Proposals (SLP) mechanism learns proposal boxes and corresponding features on these regions. With this design, the number of learnable proposals can be adjusted dynamically at each scale separately, which facilitates learning objects of various distributions and scales at lower computational cost. Experiments demonstrate that SDPDet significantly outperforms the state-of-the-art one-step detectors on three widely-used benchmarks. On the most challenging VisDrone dataset, SDPDet with ResNet50 gains 5.4% AP and 6.9% AP_s improvements while running 1.9× faster than previous models.

Affiliations: College of Computer and Information Science, College of Software, Southwest University, Chongqing, China; School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China; School of Computer Science, the University of Sydney, Darlington, NSW, Australia; School of Mathematical and Computer Sciences, Shangrao Normal University, Shangrao, China; Faculty of Information Technology, Monash University, Clayton, VIC, Australia; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China; School of Computer Science, Sichuan University, Chengdu, China

Abstract:
Knowledge distillation (KD) is a prevalent model compression technique in deep learning, aiming to leverage knowledge from a large teacher model to enhance the training of a smaller student model. It has found success in deploying compact deep models in intelligent applications like intelligent transportation, smart health, and distributed intelligence. Current knowledge distillation methods primarily fall into two categories: offline and online knowledge distillation. Offline methods involve a one-way distillation process, transferring unvaried knowledge from teacher to student, while online methods enable the simultaneous training of multiple peer students. However, existing knowledge distillation methods often face challenges where the student may not fully comprehend the teacher's knowledge due to model capacity gaps, and there might be knowledge incongruence among outputs of multiple students without teacher guidance. To address these issues, we propose a novel reciprocal teacher-student learning inspired by human teaching and examining through forward and feedback knowledge distillation (FFKD). Forward knowledge distillation operates offline, while feedback knowledge distillation follows an online scheme. The rationale is that feedback knowledge distillation enables the pre-trained teacher model to receive feedback from students, allowing the teacher to refine its teaching strategies accordingly. To achieve this, we introduce a new weighting constraint to gauge the extent of students' understanding of the teacher's knowledge, which is then utilized to enhance teaching strategies. Experimental results on five visual recognition datasets demonstrate that the proposed FFKD outperforms current state-of-the-art knowledge distillation methods.
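To make the offline (forward) distillation step concrete, here is a minimal sketch of the classic temperature-scaled distillation loss between teacher and student logits. It illustrates generic knowledge distillation only; the FFKD feedback mechanism and its weighting constraint are not specified in the abstract and are not reproduced here. The function names and the default `temperature` are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# A student that matches the teacher incurs (near) zero loss;
# a student with reversed preferences incurs a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In practice this term is usually combined with a cross-entropy loss on the ground-truth labels; the relative weighting is a design choice not given in the abstract.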

Abstract:
It is challenging to eliminate the domain shift between seen and unseen classes in multimodal zero-shot learning tasks due to the underlying disparity between the data distributions in the seen and unseen domains. In this paper, we propose a progressive placeholder learning network with mixup hallucination and an alternating mixer, denoted as MHAM, to maintain embedding spaces for unseen classes. Utilizing mixup hallucination (MH) on the visual and textual features obtained by BERT and a vision transformer, MHAM generates visual and textual hallucinated representations with pseudo class embeddings as placeholders for the unseen classes. Furthermore, a number of alternating mixer (AM) blocks are stacked to obtain modality-shared representations for the seen classes and hallucinated representations of progressive placeholders for the unseen classes. In particular, modality-shared representations are obtained by a mixer in an AM block by reversing the dimensionality of the modality-specific and raw representations to model intermodal interactions. MHAM exploits a freezing strategy by fixing the weights over the unseen classes in the last fully connected layer; this step acts as a projection from the raw and modality-shared representations to the embedding space of the seen and unseen classes. Experiments conducted on zero-shot datasets and news event datasets demonstrate the superior performance of the proposed MHAM method.
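The mixup operation underlying mixup hallucination can be sketched as a convex combination of two feature vectors. This shows the generic mixup interpolation only, under the assumption that hallucinated representations are formed by blending features of seen classes. How MHAM derives pseudo class embeddings or chooses the mixing coefficient is not stated in the abstract, and the name `mixup` and the parameter `lam` are illustrative.

```python
from typing import List


def mixup(x_a: List[float], x_b: List[float], lam: float) -> List[float]:
    """Return lam * x_a + (1 - lam) * x_b, elementwise."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b):
        raise ValueError("feature vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


# Blend two toy feature vectors; the same coefficient would typically be
# applied to the corresponding class embeddings to form a placeholder label.
hallucinated = mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
```

In the original mixup formulation the coefficient is commonly drawn from a Beta distribution; a fixed value is used here to keep the sketch deterministic.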

Abstract:
Multi-view learning is a promising research field that aims to enhance learning performance by integrating information from diverse data perspectives. Due to the increasing interest in graph neural networks, researchers have gradually incorporated various graph models into multi-view learning. Despite significant progress, current methods face challenges in extracting information from multiple graphs while simultaneously accommodating specific downstream tasks. Additionally, the lack of a subsequent refinement process for the learned graph leads to the incorporation of noise. To address the aforementioned issues, we propose a method named generative essential graph convolutional network for multi-view semi-supervised classification. Our approach integrates the extraction of multi-graph consistency and complementarity, graph refinement, and classification tasks within a comprehensive optimization framework. This is accomplished by extracting a consistent graph from the shared representation, taking into account the complementarity of the original topologies. The learned graph is then optimized through downstream-specific tasks. Finally, we employ a graph convolutional network with a learnable threshold shrinkage function to acquire the graph embedding. Experimental results on benchmark datasets demonstrate the effectiveness of our approach.
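The abstract does not give the form of the learnable threshold shrinkage function. One common choice for such a function is the soft-thresholding operator, sketched below under that assumption. The threshold `tau` is a plain argument here, whereas in a trained network it would be a learnable parameter, and the function names are hypothetical.

```python
import math
from typing import List


def soft_threshold(x: float, tau: float) -> float:
    """Shrink x toward zero by tau; values with |x| <= tau become exactly 0."""
    return math.copysign(max(abs(x) - tau, 0.0), x)


def shrink_adjacency(adj: List[List[float]], tau: float) -> List[List[float]]:
    """Apply soft-thresholding entrywise, e.g. to sparsify a learned graph."""
    return [[soft_threshold(v, tau) for v in row] for row in adj]


# Weak edges (|weight| <= tau) are removed, strong edges are reduced by tau.
graph = [[1.5, 0.3], [-1.5, 0.0]]
refined = shrink_adjacency(graph, tau=0.5)
```

Applied to a learned adjacency matrix, this kind of shrinkage suppresses small, likely noisy edge weights, which is consistent with the refinement goal the abstract describes.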

Abstract:
Data sparsity poses a persistent challenge in Recommender Systems (RS), driving the emergence of Cross-Domain Recommendation (CDR) as a potential remedy. However, most existing CDR methods struggle to avoid transferring domain-specific information, which is perceived as noise in the target domain. Additionally, they primarily concentrate on inter-domain information transfer, disregarding the comprehensive exploration of data within each domain. To address these limitations, we propose SUCCDR (Separating User features with Compound samples), a novel approach that tackles data sparsity by leveraging both cross-domain knowledge transfer and comprehensive intra-domain analysis. Specifically, to ensure the exclusion of noisy domain-specific features during the transfer process, user preferences are separated into domain-invariant and domain-specific features through three efficient constraints. Furthermore, the unobserved items are leveraged to generate compound samples that intelligently merge observed and unobserved potential user-item interactions, utilizing a simple yet efficient attention mechanism to enable a comprehensive and unbiased representation of user preferences. We evaluate the performance of SUCCDR on two real-world datasets, Douban and Amazon, and compare it with state-of-the-art single-domain and cross-domain recommendation methods. The experimental results demonstrate that SUCCDR outperforms existing approaches, highlighting its ability to effectively alleviate the data sparsity problem.

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, China; Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong, China; Department of Electronic and Information Engineering, Photonics Research Center, The Hong Kong Polytechnic University, Hong Kong, China

Abstract:
Incomplete multi-view clustering (IMVC) aims to leverage complementary information from multi-view data with missing instances to enhance clustering performance. Many existing IMVC methods exhibit limitations in effectively exploiting hidden information and addressing distribution differences between views and modules. To address these challenges, we present a novel IMVC framework that leverages the proposed stack feature-based matrix completion to impute the missing instances, enhancing the exploitation of underlying information. We also incorporate graph consensus to integrate graph structures learned from both completed and observed data. Additionally, we introduce correntropy-induced metric as a flexible measurement to adaptively assign different constraints to various views and modules. Furthermore, we derive an efficient iterative algorithm based on Fenchel conjugate and accelerated block coordinate update (BCU) to solve the joint learning problem. Experimental results on eight benchmark datasets demonstrate the superior performance of our method compared to state-of-the-art IMVC methods across various metrics.

Abstract:
Prolonged incorrect sitting undoubtedly damages physical health. Recognizing bad sitting posture has been of particular interest recently due to the prevailing Internet of Healthcare Things (IoHT). While various sitting posture recognition systems based on wearable devices and cameras have been designed, they exhibit two obvious weaknesses. First, sensors attached to the body inconvenience users, while cameras consume considerable energy and risk leaking user privacy. Second, most of these systems require massive training samples to build models, and the recognition performance of certain models on new user data with significant sample distribution differences remains poor. In this work, we propose SmartSit, the first-ever robust sitting posture recognition system with smartphone acoustic sensing. We start by designing a signal detection algorithm to determine the boundary of the sitting posture signal through a series of signal transformation methods. Then we construct the sitting posture recognition module MG-Reptile by modifying a meta-learning method to combine the Distributed Measurement Strategy (DMS) and a Generative Adversarial Network (GAN). We show that the designed system generalizes well even with only a few training samples. The observed testing results further validate the effectiveness and robustness of SmartSit.

Abstract:
Video Language Grounding (VLG) is one of the most challenging cross-modal video understanding tasks. This task aims to localize a target moment semantically corresponding to a given language query in an untrimmed video. Many existing VLG methods rely on the proposal-based framework. Despite the dominant performance achieved, they usually focus on interacting a few internal frames with the query to score segment proposals, and struggle with long-range dependencies when the proposal feature is limited. Meanwhile, adjacent proposals share similar visual semantics, making it hard for VLG models to align the accurate semantics of video-query contents and degrading the ranking performance. To remedy the above limitations, we propose VLG-CRF by introducing conditional random fields (CRFs) to handle the discrete yet indistinguishable proposals. Specifically, VLG-CRF consists of two cascaded CRF-based modules. AttentiveCRFs is developed for multi-modal feature fusion to better integrate temporal and semantic relations between modalities. We also devise a new variant of ConvCRFs to capture the relations among discrete segments and rectify the predicted scores so that relatively high prediction scores cluster in a range. Experiments on three benchmark datasets, i.e., Charades-STA, ActivityNet-Caption, and TACoS, show the superiority of our method, which achieves state-of-the-art performance.

Abstract:
Multimodal sentiment analysis remains a big challenge due to the lack of effective fusion solutions. An effective fusion is expected to obtain the correct semantic representation for all modalities while thoroughly exploring the contribution of each modality. In this paper, we propose a dominant SIngle-Modal SUpplementary Fusion (SIMSUF) approach to perform effective multimodal fusion for sentiment analysis. SIMSUF is composed of three major components: a dominant modality supplementary module, a modality enhancement module, and a multimodal fusion module. The dominant modality supplementary module determines the dominant modality by estimating the mutual dependence between every two modalities, and the dominant modality is then adopted to supplement other modalities for representative feature learning. To further explore the modality contribution, we propose a two-branch modality enhancement module, where one branch learns a common representation distribution for multiple modalities, while a modality-specific enhancement branch performs semantic difference enhancement and distribution difference enhancement for each modality. Finally, a dominant modality leading fusion module is designed to fuse the multimodal representations of the two branches for sentiment analysis. Extensive experiments are conducted on the CMU-MOSEI and CMU-MOSI datasets. The experimental results confirm that our approach is superior to the state-of-the-art approaches.

Abstract:
Few-shot action recognition in videos is challenging because the lack of supervision makes it extremely difficult to generalize well to unseen actions. To address this challenge, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt-tune a powerful pre-trained vision-language model for few-shot classification. To that end, we first collect a large-scale corpus of language descriptions of actions, defined as text proposals, to build an action knowledge base. The text proposals are collected by filling in a handcrafted sentence template with an external action-related corpus or by extracting action-related phrases from captions of Web instruction videos. Next, we feed these text proposals to a pre-trained vision-language model along with video frames to generate matching scores of the proposals for each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training computational cost to 0.1% of the existing methods.

Abstract:
Multi-modality pre-trained models (PTMs) have considerably boosted the performance on a broad range of computer vision topics. Still, they have not been purposefully explored in open set recognition (OSR) scenarios when applying PTMs to downstream recognition tasks. Directly fine-tuning or prompt-tuning PTMs on closed-set classification tasks inevitably suffers from data bias and tends to learn target class-irrelevant co-occurring contextual information, which leads to over-confident predictions on unknown samples. In this paper, we propose a simple yet effective approach, termed Anti-Associative Prompt Tuning (A^2Pt), toward learning compact and accurate class-related representation with few class-irrelevant associations from context using multi-modal priors. Specifically, a cross-modal guided activation module is adopted to refine the class-aware representation and suppress the associations from co-occurring contexts by involving text-modal information. We further design an anti-association calibration module to obtain compact class-aware and class-irrelevant representations, respectively, by introducing two additional objective functions. Extensive experiments on publicly available benchmarks, including CIFAR series, TinyImageNet, and ImageNet-21K-P, show that the proposed A^2Pt achieves substantial and consistent performance gains compared with both SOTA OSR and PTM prompt tuning approaches.

Abstract:
Large pre-trained vision-language models, such as CLIP [Radford et al. 2021], have demonstrated remarkable performance in few-shot image classification. To facilitate the rapid adaptation of CLIP in downstream tasks with limited visual samples, two primary frameworks have been proposed. The first framework centers on the image encoder and introduces a trainable visual classifier after the backbone to generate logits for each object class. Nevertheless, this framework heavily depends on limited visual features extracted by the pre-trained visual encoder, which can result in over-fitting issues. The second framework aims to optimize the text encoder by using trainable soft language prompts and computing logits for each class based on the similarity between image features and optimized prompt features. However, this framework encounters the issue of imperfect alignment between the representations extracted by the image and text encoders, making it difficult to fine-tune the language prompts using visual samples. This paper proposes a Multi-Modal Prototype Regularization (MMPR) method for CLIP-based few-shot fine-tuning for image classification. MMPR can address the challenges of effectively utilizing both image and text features. MMPR fine-tunes a classifier and regularizes its weights using both image-based (ImgPR) and text-based (TexPR) prototypes. ImgPR represents the mean of image representations within the same class, derived from the image encoder, to distill specific visual distribution knowledge for classifier adaptation. TexPR represents the hand-crafted prompt associated with the class, derived from the text encoder, to incorporate general encyclopedic knowledge and mitigate visual over-fitting. MMPR significantly leverages both image and text information without increasing computational complexity during the inference stage compared to existing methods. Experimental results on various challenging public benchmarks demonstrate the superiority of the proposed MMPR method over state-of-the-art methods.
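The image-based prototype described above (the mean of image representations within a class) can be sketched as follows. The squared-distance penalty used to pull classifier weights toward the prototypes is an assumption made for illustration only; the abstract does not state the regularizer's form, and all function names here are hypothetical.

```python
def class_mean_prototypes(features, labels):
    """Average the feature vectors of each class (an ImgPR-style prototype)."""
    sums, counts = {}, {}
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def prototype_penalty(weights, prototypes):
    """Assumed regularizer: summed squared distance between each class
    weight vector and its prototype (not confirmed by the source)."""
    total = 0.0
    for label, prototype in prototypes.items():
        total += sum((w - p) ** 2 for w, p in zip(weights[label], prototype))
    return total
```

In a full pipeline the features would come from the frozen image encoder, and a text-based prototype from the text encoder could be penalized in the same way.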

Abstract:
Self-supervised learning has not been extensively investigated in the context of point cloud analysis. Current frameworks predominantly rely on point cloud reconstruction. Given only 3D coordinates, such approaches tend to learn local geometric structures and contours but struggle to comprehend high-level semantic content. Consequently, they achieve unsatisfactory performance in downstream tasks such as classification and segmentation. To fill this gap, we propose a generic Contour-Perturbed Reconstruction Network (CP-Net), which effectively guides self-supervised reconstruction to learn semantic content in the point cloud, and thus promotes the discriminative power of the point cloud representation. First, we introduce a concise contour-perturbed augmentation module for point cloud reconstruction. With the guidance of geometry disentangling, we divide the point cloud into contour and content components. Subsequently, we perturb the contour components and preserve the content components of the point cloud. As a result, the self-supervised learner can effectively focus on semantic content by reconstructing the original point cloud from the perturbed one. Next, we use this perturbed reconstruction as an assistant branch to guide the learning of the basic reconstruction branch via a distinct dual-branch consistency loss. In this way, our CP-Net not only captures structural contours but also learns semantic content for discriminative downstream tasks. Finally, we perform extensive experiments on a number of point cloud benchmarks. Part segmentation results demonstrate that our CP-Net (81.5% mean Intersection over Union) outperforms the previous self-supervised models and narrows the gap with the fully-supervised methods. For classification, we obtain results competitive with the fully-supervised methods on ModelNet40 (92.5% accuracy) and ScanObjectNN (87.9% accuracy).

Abstract:
Proper inference of semantics is necessary for realistic image inpainting. Most image inpainting methods use deep generative models, which require large image datasets to predict and generate content. However, predicting the missing regions and generating coherent content is difficult due to limited control. Existing approaches include image-guided or text-guided image inpainting, but none of them has taken both image and text as the guidance signals, as far as we know. To fill this gap, we propose a multi-modality guided (MMG) image inpainting approach based on the diffusion model. This MMGInpainting method uses both image and text as guidance for generating content within the target area for inpainting, effectively integrating the semantic information conveyed by the guiding image or text into the content of the inpainted region. To construct MMGInpainting, we start by enhancing the U-Net backbone with a customized Nonlinear Activation Free Network (NAFNet). This adapted NAFNet incorporates an Anchored Stripe Attention mechanism, which utilizes anchor points to effectively model global contextual dependencies. To regulate inpainting, we use a Semantic Fusion Encoder to guide the inverse process of the diffusion model. The process is iteratively executed to denoise and generate the desired inpainting result. Additionally, we explore how different modes of meaning interact and coordinate to offer users useful guidance for a more manageable inpainting procedure. Experimental results demonstrate that our approach produces faithful results adhering to the guiding information, while significantly improving computational efficiency.

Abstract:
Multimodal recommendation is an emerging task with the goal of improving the effectiveness of the recommendation system by utilizing multimodal data (images, texts, etc.). Most previous methods have struggled to mine item semantic relationships while guaranteeing accurate modeling of user modality preferences, resulting in low recommendation accuracy. To address this issue, this paper proposes a novel and effective Self-suPervised duAl preference enhanCing nEtwork for multimodal recommendation, named SPACE, which further mines user preferences towards historical interactions and multimodal features of items to obtain more precise user and item representations. Specifically, we design an interaction preference enhancing module to learn both interactive and latent semantic relationships between users and items. Then, a modality preference enhancing module is established by introducing self-supervised learning (SSL), which aims to strengthen the role of the dominant modality-specific representation of items. Finally, the enhanced interaction and modality representations are fused, and the recommendation performance is largely improved by utilizing dual joint prediction. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that the proposed SPACE model outperforms the state-of-the-art multimodal recommendation methods.

Abstract:
Existing representation learning approaches predominantly design models empirically without rigorous mathematical guidelines, neglecting interpretability in terms of modeling. In this work, we propose an optimization-derived representation learning network that embraces both interpretability and extensibility. To ensure interpretability at the design level, we adopt a transparent approach in customizing the representation learning network from an optimization perspective. This involves modularly stitching together components to meet specific requirements, enhancing flexibility and generality. Then, we convert the iterative solution of the convex optimization objective into the corresponding feed-forward network layers by embedding learnable modules. These optimization-derived layers are seamlessly integrated into a deep neural network architecture, allowing for training in an end-to-end fashion. Furthermore, extra view-wise weights are introduced for multi-view learning to discriminate the contributions of representations from different views. The proposed method outperforms several advanced approaches on semi-supervised classification tasks, demonstrating its feasibility and effectiveness.

Abstract:
Multi-modal data presents a promising opportunity for improving multimedia recommendation models, but it also introduces task-irrelevant noise that can reduce model robustness. In this paper, we propose a robust multi-modal recommendation approach that accounts for different levels of task-irrelevant noise across modalities. We explicitly consider the uncertainty associated with each modality and perform stochastic sampling-based fusion according to the precision of different modalities, which serves as a measure of uncertainty. The influence of noisy modalities with high uncertainty is removed, filtering out task-irrelevant noise, and therefore a noise-robust multi-modal recommendation is achieved. Moreover, the stochastic sampling strategy intrinsically considers and simulates scenarios with absent modalities during multi-modal fusion. Consequently, it incorporates additional randomness into the training process, which enables the model to handle missing modalities. Furthermore, the proposed fusion approach integrates the noise robustness of the Product-of-Experts (PoE) framework when modeling with Gaussian distributions, along with the flexibility of the Mixture-of-Experts (MoE) technique to represent diverse distributions of latent variables. This integration allows the proposed approach to achieve noise-robust modeling with non-Gaussian variables. Specifically, we derive a solvable evidence lower bound for the proposed variational mixture of stochastic experts (VMoSE) auto-encoder, where both Gaussian and Student-t distributions are used to model the latent variables. Constraints are added to match the similarities between the ID embeddings and the multi-modal joint embeddings by utilizing an expectation-maximization (EM)-style algorithm for better model optimization. Extensive experiments demonstrate the effectiveness of the proposed method in multi-modal fusion and its robustness to modality noise and missing modalities.
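The role of precision in Gaussian Product-of-Experts fusion can be shown with a small worked sketch. This is the standard closed form for a product of one-dimensional Gaussian experts, not the paper's stochastic sampling-based VMoSE fusion. The function name and the scalar setting are illustrative assumptions.

```python
def product_of_gaussian_experts(means, variances):
    """Fuse 1-D Gaussian experts by multiplying their densities.

    The fused precision is the sum of the expert precisions, and the
    fused mean is the precision-weighted average of the expert means,
    so a noisy (high-variance) modality contributes little.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("means and variances must be non-empty and aligned")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision
```

For example, fusing a confident expert (mean 0, variance 1) with a noisy one (mean 10, variance 100) yields a fused mean close to 0, which illustrates why precision can serve as a measure of each modality's reliability.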

Abstract:
Video prediction is a challenging spatiotemporal prediction task that generates future frames based on historical observations. Although recently proposed deep learning-based methods significantly outperform legacy approaches, there still exist gaps between prediction and ground truth, primarily rooted in edge and motion blurring. On the one hand, since conventional performance metrics like Mean Square Error (MSE) and Structure Similarity Index Measure (SSIM) cannot decently evaluate this deficiency, we design a 3D Frequency Loss (3DFL) metric to better assess the similarity of predicted video frames. On the other hand, edge and motion blurring is mainly attributed to the predictive model's insufficient attention to high spatial frequency arising from rapid pixel value variations at object edges, and it is observed that shallow networks are more adept at capturing high spatial frequency information. Therefore, aiming to alleviate edge and motion blurring, we propose a novel video prediction model termed SDFNet that can extract and integrate both spatially encoded shallow and deep-level features. To accommodate SDFNet's multi-branch input structure, a frequency adaptive translator (FATranslator) is derived, which leverages involution operators to adaptively extract inter-frame temporal dependencies from different spatial encoding layers, and further mitigates motion blurring. Extensive experiments demonstrate that our proposed model achieves significant improvements in prediction accuracy and temporal consistency over the current state-of-the-art models on various benchmarks. The results highlight the importance of spatial frequency modeling for enhancing video prediction performance, contributing to the advancement of multimedia technologies.

Abstract:
Unsupervised cross-domain 3D model retrieval aims to retrieve unlabeled 3D models (target domain) using labeled 2D images (source domain). Domain adaptation approaches have shown impressive performance for cross-domain 3D model retrieval. However, conventional methods typically represent samples from different domains as deterministic points, overlooking the diversity in sample characteristics and relationships. This makes it challenging to achieve a robust representation of both samples and categories. To address the above challenges, we propose dual-stage uncertainty modeling (DSUM) for unsupervised cross-domain 3D model retrieval, which utilizes Gaussian distributions to effectively model the uncertainty characteristics of both samples and classes and to obtain robust, domain-invariant representations. Specifically, in the multi-view uncertainty encoding stage, we discard the conventional pooling operations and utilize uncertainty modeling among multiple views to fuse the common and specific information of 2D images and 3D models. In the cross-domain feature alignment stage, we adopt the Gaussian distribution of samples belonging to the same category, which maintains the sample diversity and helps eliminate the domain discrepancy. Our method achieves improvements of 2.61% and 2.65% in terms of FT on two cross-domain datasets, respectively, verifying its superiority through extensive qualitative and quantitative experiments.

Abstract:
Optical flow estimation is a fundamental task in computer vision. The all-pairs correlation volume has enabled state-of-the-art performance in many optical flow estimation methods. However, all-pairs correlations provide only local matching clues, and lack global context, which could lead to mismatches in textureless and occluded regions. In this paper, we propose a novel all-pairs correlation volume aggregation (APCA) method which includes two key innovations. The first is a cost volume splitting and reassembling approach which partitions the full cost volume into smaller blocks and re-arranges those blocks to allow the use of 2D and 3D convolutions for cost volume aggregation. The second is hierarchical aggregation which performs 2D convolutions within blocks for local matching aggregation and 3D convolutions across blocks for global matching aggregation. We further design a novel optical flow estimation network APCAFlow based on APCA. APCAFlow achieves comparable performance to the most advanced approach, FlowFormer, but with significantly lower complexity. Specifically, APCAFlow reduces the model parameters, inference time, and memory consumption by 24.1%, 35.5%, and 21.6%, respectively, compared to FlowFormer. Furthermore, APCA can be easily integrated into several existing all-pairs cost volume-based methods for performance improvement.

Abstract:
Emotion recognition is essential in the diagnosis and rehabilitation of various mental diseases. In the last decade, electroencephalogram (EEG)-based emotion recognition has been intensively investigated due to its promising accuracy and reliability, and the graph convolutional network (GCN) has become a mainstream model to decode emotions from EEG signals. However, the electrode relationship, especially long-range electrode dependencies across the scalp, may be underutilized by GCNs, although such relationships have been proven to be important in emotion recognition. The small receptive field makes shallow GCNs only aggregate local nodes. On the other hand, stacking too many layers leads to over-smoothing. To solve these problems, we propose the pyramidal graph convolutional network (PGCN), which aggregates features at three levels: local, mesoscopic, and global. First, we construct a vanilla GCN based on the 3D topological relationships of electrodes, which is used to integrate second-order local features. Second, we construct several mesoscopic brain regions based on prior knowledge and employ mesoscopic attention to sequentially calculate the virtual mesoscopic centers to focus on the functional connections of mesoscopic brain regions. Finally, we fuse the node features and their 3D positions to construct a numerical relationship adjacency matrix to integrate structural and functional connections from the global perspective. Experimental results on four public datasets indicate that PGCN enhances the relationship modelling across the scalp and achieves state-of-the-art performance in both subject-dependent and subject-independent scenarios. Meanwhile, PGCN makes an effective trade-off between enhancing network depth and receptive fields while suppressing the ensuing over-smoothing.

Abstract:
As a pivotal branch of intelligent human-computer interaction, visual dialog is a technically challenging task that requires artificial intelligence (AI) agents to answer consecutive questions based on image content and dialog history. Despite considerable progress, visual dialog still suffers from two major problems: (1) how to design flexible cross-modal interaction patterns instead of over-relying on expert experience and (2) how to infer underlying semantic dependencies between dialogues effectively. To address these issues, an end-to-end framework employing dynamic interaction and hybrid graph reasoning is proposed in this work. Specifically, three major components are designed and the practical benefits are demonstrated by extensive experiments. First, a dynamic interaction module is developed to automatically determine the optimal modality interaction route for multifarious questions, which consists of three elaborate functional interaction blocks endowed with dynamic routers. Second, a hybrid graph reasoning module is designed to explore adequate semantic associations between dialogues from multiple perspectives, where the hybrid graph is constructed by aggregating a structured coreference graph and a context-aware temporal graph. Third, a unified one-stage visual dialog model with an end-to-end structure is developed to train the dynamic interaction module and the hybrid graph reasoning module in a collaborative manner. Extensive experiments on the benchmark datasets of VisDial v0.9 and VisDial v1.0 demonstrate the effectiveness of the proposed method compared to other state-of-the-art approaches.

Affiliations: Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing, China; Video Technology Research and Development Center, CCTV International Network Company Ltd., Beijing, China; Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, China; Institute of Digital Media, Peking University, Beijing, China

Abstract:
Vehicle behavior analysis has gradually developed by utilizing trajectories and motion features to characterize on-road behavior. However, the existing methods analyze the behavior of each vehicle individually, ignoring the interaction between vehicles. According to the theory of interactive cognition, vehicle-to-vehicle interaction is an indispensable feature for future autonomous driving, just as interaction is universally required for traditional driving. Therefore, we place the vehicle behavior analysis in the context of the vehicle interaction scene, where the self-vehicle should observe the behavior category and degree of the other-vehicle that is about to interact with itself, in order to predict whether the other-vehicle will pass through the intersection first or later, and then decide to pass through or wait. Inspired by the interactive cognition, we develop a general framework of Structured Vehicle Behavior Analysis (StruVBA) and derive a new model of Structured Fully Convolutional Networks (StruFCN). Moreover, both Intersection over Union (IoU) and False Negative Rate (FNR) are adopted to measure the similarity between the predicted behavior degree and the ground truth. Experimental results illustrate that the proposed method achieves higher prediction accuracy than most existing methods, while predicting vehicle behavior with richer visual meaning. In addition, it also provides an example of modeling the interaction between vehicles and a verification for interaction cognition theory as well.

Abstract:
The stereo event-intensity camera setup is widely applied to leverage the advantages of both event cameras with low latency and intensity cameras that capture accurate brightness and texture information. However, such a setup commonly encounters cross-modality parallax that is difficult to eliminate solely with stereo rectification, especially for real-world scenes with complex motions and varying depths, causing artifacts and distortion for existing Event-based Video Frame Interpolation (E-VFI) approaches. To tackle this problem, we propose a novel Stereo Event-based VFI (SE-VFI) network (SEVFI-Net) to generate high-quality intermediate frames and corresponding disparities from misaligned inputs consisting of two consecutive keyframes and event streams emitted between them. Specifically, we propose a Feature Aggregation Module (FAM) to alleviate the parallax and achieve spatial alignment in the feature domain. We then exploit the fused features to accomplish accurate optical flow and disparity estimation, and achieve better interpolated results through flow-based and synthesis-based ways. We also build a stereo visual acquisition system composed of an event camera and an RGB-D camera to collect a new Stereo Event-Intensity Dataset (SEID) containing diverse scenes with complex motions and varying depths. Experiments on public real-world stereo datasets, i.e., DSEC and MVSEC, and our SEID dataset, demonstrate that our proposed SEVFI-Net outperforms state-of-the-art methods by a large margin.

Abstract:
Few-shot point cloud classification is currently an under-explored problem which aims to learn a point cloud classifier for novel categories given a few annotated training data. Most existing methods achieve classification by matching a query point cloud to the most similar support category at the global representation level. However, due to the complicated structure of the point cloud and scarce available training data, the global representations of the point clouds and categories are of low quality, limiting the matching accuracy. Therefore, in this paper, we propose the Component-Aware Matching Network (CMNet) that matches the point clouds at the component level in addition to the global level. Specifically, we construct a component set for each query point cloud and support category and develop a metric to measure the similarity between two component sets. The final prediction is the weighted sum of the global and component matching probabilities. Besides, we carefully devise a component matching pretraining scheme for CMNet to enhance its ability to extract component features, further improving its performance. To evaluate the effectiveness of our design, we conduct comprehensive experiments on three benchmarks, namely ModelNet-FS, ShapeNet-FS and Shrec-FS. As a result, CMNet consistently outperforms the existing methods with significant margins in all the experiments of the three benchmarks and sets new state-of-the-art performance.
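The final prediction step described above, a weighted sum of the global and component matching probabilities, can be sketched as follows. The convex-combination form, the `weight` parameter, and the function names are illustrative assumptions; the abstract does not specify how the weights are chosen.

```python
def fuse_matching_probabilities(global_probs, component_probs, weight=0.5):
    """Combine per-category probabilities from the two matching levels.

    `weight` scales the global-level term and (1 - weight) the
    component-level term; this convex form is an assumption.
    """
    if len(global_probs) != len(component_probs):
        raise ValueError("both levels must score the same categories")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * g + (1.0 - weight) * c
            for g, c in zip(global_probs, component_probs)]


def predict_category(fused_probs):
    """Return the index of the support category with the highest fused score."""
    return max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

With global scores `[0.5, 0.3, 0.2]` and component scores `[0.2, 0.6, 0.2]`, equal weighting favors the second category, showing how component-level evidence can overturn a weak global match.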

Abstract:
Recently, many underwater image enhancement (UIE) methods have been proposed. Although much progress has been made, they still face two issues: (1) There exists a significant region-wise quality difference in a single underwater image due to the underwater imaging process, especially in regions with different scene depths. However, existing methods neglect this internal characteristic, yielding inferior performance; (2) Due to the unique acquisition approach, underwater image acquisition tools usually capture multiple images in the same or similar scenes. The underwater images to be enhanced in practical usage are highly correlated. However, when processing a single image, existing methods ignore the rich external information provided by related images. There is still room for improvement in their performance. This paper proposes an internal-external representation learning (UIERL) network to better perform UIE tasks with internal and external information, simultaneously. In the internal representation learning stage, a depth-based region feature guidance network is designed, including a region segmentation based on scene depth to sense regions with different quality levels, and a region-wise space encoder for region-specific feature learning, which provides an effective guidance for global features to guide intra-image differentiated enhancement. In the external representation learning stage, we first propose an external information extraction network to mine rich external information from related images. Then, internal and external features interact with each other via the proposed external-assist-internal module and internal-assist-external module, fully exploiting the rich internal and external information to better enhance a single image. Extensive experiments demonstrate the superiority of our UIERL.

Abstract:
In this paper, we focus on estimating the six degrees of freedom (6DoF) pose of a face from a single RGB image, which is an important but under-investigated problem in 3D face applications such as face reconstruction, forgery detection and virtual try-on. This problem is different from traditional face pose estimation and 3D face reconstruction since the distance from camera to face should be estimated, which cannot be directly regressed due to the non-linearity of the pose space. To solve the problem, we follow Perspective-n-Point (PnP) and predict the correspondences between 3D points in canonical space and 2D facial pixels on the input image to solve the 6DoF pose parameters. In this framework, the central problem of 6DoF estimation is building the correspondence matrix between a set of sampled 2D pixels and 3D points, and we propose a Correspondence Learning Transformer (CLT) to achieve this goal. Specifically, we build the 2D and 3D features with local, global, and semantic information, and employ self-attention to make the 2D and 3D features interact with each other and build the 2D–3D correspondence. Besides, we argue that 6DoF estimation is related not only to the face appearance itself but also to the external facial context, which contains rich information about the distance to the camera. Therefore, we extract global-and-local features from the integration of face and context, where the cropped face image with a smaller receptive field concentrates on the small distortion caused by perspective projection, and the whole image with a large receptive field provides shoulder and environment information. Experiments show that our method achieves a 2.0% improvement of MAE_r and ADD on ARKitFace and a 4.0%/0.7% improvement of MAE_t on ARKitFace/BIWI.

Abstract:
Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, recently researchers proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with Angular Reconstructive Text embeddings (ART), generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gaps between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods. It also achieves competitive performance compared to recent weakly-supervised VMR methods.

Abstract:
Outfit collocation requires considering the interrelationship and adaptability among the attributes of component items. However, with the numerous and diverse attributes of fashion items, accurately capturing attribute features and modeling the complex relationships between attributes become the key challenges. To address these challenges, we propose a novel scheme, Decoupling-driven Multi-level Attribute Parsing, for interpretable outfit collocation. First, we decouple a series of attribute features from the item's visual feature through full supervision, which can improve the robustness of the model in processing both relevant and irrelevant attributes of items. Furthermore, we employ a deep deconvolution neural network with attention mechanisms to reconstruct the decoupled attribute features into a visual image that is close to the original item image. This ensures that all attribute features can be combined to contain complete item information. Next, graph attention networks are constructed to parse multi-level attribute compatibility relationships from three perspectives: intra-attribute, inter-attribute, and item integration relationships. Finally, we use multi-layer perceptrons to fuse the score distributions of the three and output the outfit compatibility score. Experiments conducted on the IQON3000 dataset demonstrate that our model outperforms existing state-of-the-art methods and exhibits good interpretability.

Abstract:
Due to the limited number of stable image feature descriptors and the simplistic concatenation approach to hash generation, existing hashing methods have not achieved a satisfactory balance between robustness and discrimination. To this end, a novel perceptual hashing method is proposed in this paper using feature fusion of fractional-order continuous orthogonal moments (FrCOMs). Specifically, two robust image descriptors, i.e., fractional-order Chebyshev Fourier moments (FrCHFMs) and fractional-order radial harmonic Fourier moments (FrRHFMs), are used to extract global structural features of a color image. Then, the canonical correlation analysis (CCA) strategy is employed to fuse these features during the final hash generation process. Compared to direct concatenation, CCA excels in eliminating redundancies between feature vectors, resulting in a shorter hash sequence and higher authentication performance. A series of experiments demonstrate that the proposed method achieves satisfactory robustness, discrimination and security. Particularly, the proposed method exhibits better tampering detection ability and robustness against combined content-preserving manipulations in practical applications.

Abstract:
It is very challenging to fully use cross-view information for stereo image super-resolution. Previous methods using pixel-based parallax-attention mechanisms do not consider neighborhood pixels. Also, they typically use convolutions for basic feature extraction, which may not be as effective as modern self-attention mechanisms in transformers. To address these limitations, we propose an efficient hybrid feature interaction network for stereo image super-resolution. Specifically, we propose a shifted cross-view interaction block that integrates neighborhood pixels and imposes constraints on the disparity range during cross-view interactions. In addition, we propose a hybrid feature interaction block consisting of local and global interaction branches for extracting intra-view features efficiently. In this block, we propose a design that incorporates lightweight attention connections and a partial downsampling operation to enhance spatial and channel feature interaction with high efficiency. Additionally, a dilated efficient channel attention mechanism is proposed to obtain cross-channel interactions within features. Experimental results evaluated on various metrics (PSNR, SSIM, and LPIPS) demonstrate that the proposed method achieves state-of-the-art stereo image super-resolution performance at relatively low computational cost. Moreover, the super-resolution images obtained by the proposed method achieve the smallest stereo matching errors compared to other methods.

Abstract:
Unsupervised video anomaly detection (UVAD) has gained significant attention due to its label-free nature. Typically, UVAD methods can be categorized into two branches, i.e. the one-class classification (OCC) methods and fully UVAD ones. However, the former may suffer from data imbalance and high false alarm rates, while the latter relies heavily on feature representation and pseudo-labels. In this paper, a novel feature reconstruction and disruption model (FRD-UVAD) is proposed for effective feature refinement and better pseudo-label generation in fully UVAD, based on cascade cross-attention transformers, a latent anomaly memory bank and an auxiliary scorer. The clip features are reconstructed using the space-time intra-clip information, as well as cross-inter-clip knowledge. Moreover, instead of blindly reconstructing all training features as OCC methods do, a new disruption process is proposed to cooperate with the feature reconstruction simultaneously. Using the collected pseudo anomaly samples, it is able to emphasize the feature differences between normal and abnormal events. Additionally, a pre-trained UVAD scorer is utilized as a different criterion for anomaly prediction, which further refines the pseudo-labels. To demonstrate its effectiveness, comprehensive experiments and detailed ablation studies are conducted on three video benchmarks, namely CUHK Avenue, ShanghaiTech and UCF-Crime. Our proposed model (FRD-UVAD) achieves the best AUC performance (91.23%, 80.14%, and 82.12%) on all three datasets, surpassing other state-of-the-art OCC and fully UVAD methods. Furthermore, it obtains the lowest false alarm rate with a lower scene dependency, compared with other OCC methods.

Abstract:
Self-supervised skeleton action recognition has gained notable attention for its reduced reliance on annotated data. Contrastive learning methods, in particular, have emerged as prominent approaches. These works typically utilize a spatial-temporal backbone to extract features from action sequences for contrast in the feature space. Yet, they often rely on average pooling for temporal feature aggregation, neglecting the intricate higher-order temporal dynamics of the sequences. In this work, we introduce Koopman Temporal Contrastive Learning (KTCL), a Koopman theory inspired contrastive learning framework, which focuses on the localized latent dynamics of the sequence by learning discriminative linear system dynamics. Given an action sequence, we first map it into a new space where the temporal evolution becomes linear. A dynamics-oriented contrastive loss is used to make the dynamics of positive (or negative) samples more similar (or dissimilar). To tackle the diverse dynamics across different action phases within one sequence, we further introduce segment-level localized linear dynamics, accompanied by a cross-matching mechanism for alignment. Additionally, a cross-order contrastive loss is proposed to further amplify the effect of contrast across features of different orders. Intensive experiments on four benchmark datasets show that the proposed methods achieve performance superior to that of competing methods.
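
The idea of linear latent dynamics in the abstract above can be illustrated with a minimal sketch. It assumes the sequence has already been mapped to 2-D latent states (the learned mapping is not reproduced here), fits a linear operator K by ordinary least squares so that z[t+1] ≈ K z[t], and compares two operators with a Frobenius distance. The function names, the 2-D restriction, and the distance are illustrative assumptions, not the KTCL formulation.

```python
import math


def rollout(k, z0, steps):
    """Generate states z[0..steps] with z[t+1] = K z[t] (2-D case)."""
    states = [list(z0)]
    for _ in range(steps):
        z = states[-1]
        states.append([k[0][0] * z[0] + k[0][1] * z[1],
                       k[1][0] * z[0] + k[1][1] * z[1]])
    return states


def fit_linear_dynamics(states):
    """Least-squares K minimising sum_t ||z[t+1] - K z[t]||^2.

    Closed form: K = (sum z[t+1] z[t]^T) (sum z[t] z[t]^T)^-1.
    """
    a = [[0.0, 0.0], [0.0, 0.0]]  # sum of z[t+1] z[t]^T
    b = [[0.0, 0.0], [0.0, 0.0]]  # sum of z[t] z[t]^T
    for cur, nxt in zip(states[:-1], states[1:]):
        for i in range(2):
            for j in range(2):
                a[i][j] += nxt[i] * cur[j]
                b[i][j] += cur[i] * cur[j]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    if abs(det) < 1e-12:
        raise ValueError("states do not span the 2-D latent space")
    b_inv = [[b[1][1] / det, -b[0][1] / det],
             [-b[1][0] / det, b[0][0] / det]]
    return [[sum(a[i][m] * b_inv[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]


def dynamics_distance(k1, k2):
    """Frobenius distance between two operators (hypothetical similarity measure)."""
    return math.sqrt(sum((k1[i][j] - k2[i][j]) ** 2
                         for i in range(2) for j in range(2)))
```

Under these assumptions, two sequences generated by the same operator yield a distance near zero, which is the kind of signal a dynamics-oriented contrastive objective could pull together for positive pairs.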

Abstract:
A wonderful piece of music is the essence and soul of dance, which motivates the study of automatic music generation for dance. To create appropriate music from dance, cross-modal correlations between dance and music, such as rhythm and style, should be considered. However, existing dance-to-music methods have difficulties in achieving rhythmic alignment and stylistic matching simultaneously. Additionally, the diversity of generated samples is limited due to the lack of available paired data. To address these issues, we propose DanceComposer, a novel dance-to-music framework, which generates rhythmically and stylistically consistent multi-track music from dance videos. DanceComposer features a Progressive Conditional Music Generator (PCMG) that gradually incorporates rhythm and style constraints, enabling both rhythmic alignment and stylistic matching. To enhance style control, we introduce a Shared Style Module (SSM) that learns cross-modal features as stylistic constraints. This allows the PCMG to be trained on extensive music-only data and diversifies the generated pieces. Quantitative and qualitative results show that our method surpasses the state-of-the-art in overall music quality, rhythmic consistency, and stylistic consistency.

Abstract:
Based on deterministic single-point embedding, most extant image-text retrieval methods only focus on the match of ground truth while suffering from one-to-many correspondence, where besides annotated positives, many similar instances of another modality should be retrieved by a given query. Recent solutions of probabilistic embedding and rectangle mapping still encounter some drawbacks, despite their promising effectiveness at multiple matches. Meanwhile, the exploration of one-to-many correspondence is still insufficient. Therefore, this paper proposes a novel geometric representation to Estimate the Semantics of heterogeneous data via Sector Embedding (dubbed ESSE). Specifically, a given image/text can be projected as a sector, where its symmetric axis represents mean semantics and the aperture estimates uncertainty. Further, a sector matching loss is introduced to better handle the multiplicity by considering the sine of included angles as distance calculation, which encourages candidates to be contained by the apertures of a query sector. The experimental results on three widely used benchmarks, CUB, Flickr30K and MS-COCO, reveal that sector embedding can achieve competitive performance on multiple matches and also improve the traditional ground-truth matching of the baselines. Additionally, we verify the generalization to video-text retrieval on two extensively used datasets, MSRVTT and MSVD, and to text-based person retrieval on CUHK-PEDES. This superiority and effectiveness also demonstrate that the bounded property of the aperture can better estimate semantic uncertainty when compared to prior remedies.
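
As a rough geometric illustration of the sector idea, the sketch below treats a query as an axis vector plus an aperture angle, measures the included angle between the axis and a candidate vector, and reports both the sine of that angle and whether the candidate falls inside the aperture. The abstract does not give the actual loss, so the function names and the half-aperture containment rule are assumptions made for illustration only.

```python
import math


def included_angle(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_value = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp rounding error
    return math.acos(cos_value)


def sine_distance(axis, candidate):
    """Sine of the included angle; grows with the angle only up to 90 degrees."""
    return math.sin(included_angle(axis, candidate))


def is_contained(axis, aperture, candidate):
    """Assumed rule: a candidate is inside the sector if its angle to the
    symmetric axis does not exceed half of the aperture."""
    return included_angle(axis, candidate) <= aperture / 2.0
```

With this reading, a wider aperture (more uncertainty) admits more candidates, which matches the abstract's description of the aperture as an uncertainty estimate.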

Abstract:
Recent advanced image restoration (IR) methods typically stack homogeneous operators hierarchically in the UNet architecture. To achieve higher accuracy, these models are now going deeper and more complex, making them resource-intensive. After comprehensively reviewing different operators within modern networks, we provide an in-depth analysis of their individual favorable properties and invent a novel efficient IR network by redesigning the UNet architecture (RUN) with heterogeneous operators. Specifically, we propose three heterogeneous operators for different relational interactions concerning the specificity of different hierarchical features of the UNet architecture. First, the spatial self-attention block (SSA Block) processes high-resolution top-level features by modeling pixel interactions from the spatial dimension. Second, the channel self-attention block (CSA Block) performs channel recalibration and information transmission for the bottom-level features with rich channels. Finally, a simple and efficient convolution block (Conv Block) is used to facilitate middle-order information propagation, which complements the self-attention mechanism to achieve local-global coupling. Based on these designs, our RUN enables more comprehensive information dissemination and interaction regardless of topological distance, thus achieving superior performance while maintaining desirable computational budgets. Extensive experiments show that our RUN achieves state-of-the-art results for a variety of IR tasks, including image deblurring, image denoising, image deraining, and low-light image enhancement.

Abstract:
Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features in a cascaded manner has been proposed as an alternative approach. However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of the spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although the frame label is highly related to the video label, it is overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF) that adopts frame labels to enhance the discriminative ability of spatial features and guide temporal fusion. By assigning each frame a video label, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and the label distribution of spatial features is utilized to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. Our LG-DSTF achieves state-of-the-art results on FER benchmarks.
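
The confidence measure described above can be sketched as follows. For a frame with predicted class distribution p over K classes, the cross entropy against a uniform distribution is H(u, p) = -(1/K) Σ log p_k, which is smallest (log K) when p is uniform and grows as p becomes peaked. The normalisation of confidences into weights by their sum and the weighted-average fusion are assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def frame_confidence(probs):
    """Cross entropy H(u, p) between a uniform distribution u and prediction p."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p + eps) for p in probs) / len(probs)


def fusion_weights(frame_probs):
    """Turn per-frame confidences into weights that sum to 1 (assumed normalisation)."""
    confidences = [frame_confidence(p) for p in frame_probs]
    total = sum(confidences)
    return [c / total for c in confidences]


def fuse(features, weights):
    """Weighted sum of per-frame feature vectors."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

In this sketch a frame whose prediction is close to uniform receives the smallest weight, so ambiguous frames contribute less to the fused video representation.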

Abstract:
In this paper, we present a novel approach named correlation-guided distribution and geometry alignments (CDGA) for heterogeneous domain adaptation. Unlike existing methods that typically combine feature alignment and domain alignment into a single objective function, our proposed CDGA separates the two alignments into distinct steps. The two adaptation steps are: paired canonical correlation analysis (PCCA) and distribution and geometry alignments (DGA). In the PCCA step, CDGA focuses on maximizing the within-category correlation between source and target samples to produce the dimension-aligned feature representations for the next adaptation step. In the DGA step, CDGA is responsible for learning a classifier that incorporates both distribution and geometry alignments. Furthermore, during this step, the highly confident pseudo labeled samples are carefully selected for the next iteration of PCCA, establishing a beneficial coupling between PCCA and DGA to improve the adaptation performance in an iterative manner. Experimental results on various visual cross-domain benchmarks demonstrate that CDGA achieves remarkable performance compared to the existing shallow heterogeneous domain adaptation methods and even exhibits superiority over the state-of-the-art neural network-based approaches.

Abstract:
A remarkable number of backdoor attack methods have been proposed in the literature on deep neural networks (DNNs). However, existing methods have not sufficiently addressed truly imperceptible backdoor attacks that are both visually invisible and label-consistent. In this paper, we propose a new backdoor attack method where the labels of the backdoor images are perfectly aligned with their content, ensuring label consistency. Additionally, the backdoor trigger is meticulously designed, allowing the attack to evade DNN model checks and human inspection. Our approach employs an auto-encoder (AE) to conduct representation learning of benign images and interferes with salient classification features to increase the dependence of backdoor image classification on backdoor triggers. To ensure visual invisibility, we implement a method inspired by image steganography that embeds trigger patterns into the image using the DNN and enables sample-specific backdoor triggers. We conduct comprehensive experiments on multiple benchmark datasets and network architectures to verify the effectiveness of our proposed method under the metrics of attack success rate and invisibility. The results also demonstrate satisfactory performance against a variety of defense methods.

Abstract:
Low-light images often suffer from severe detail loss in darker areas and non-uniform illumination distribution across distinct regions. Thus, structure modeling and region-specific illumination manipulation are crucial for high-quality enhanced image generation. However, previous methods encounter limitations in exploring robust structure priors and lack adequate modeling of illumination relationships among different regions, resulting in structure artifacts and color deviations. To alleviate these limitations, we propose a Segmentation-Guided Framework (SGF) which integrates the constructed robust segmentation priors to guide the enhancement process. Specifically, SGF first constructs a robust image-level edge prior based on the segmentation results of the Segment Anything Model (SAM) in a zero-shot manner. Then, we generate a lighted-up region-aware feature-level prior by incorporating region-aware dynamic convolution. To adequately model long-distance illumination interactions across distinct regions, we design a segmentation-guided transformer block (SGTB), which utilizes the lighted-up region-aware feature-level prior to guide self-attention calculation. By arranging the SGTBs in a symmetric hierarchical structure, we derive a segmentation-guided enhancement module that operates under the guidance of both the image-level and feature-level priors. Comprehensive experimental results show that our SGF performs remarkably in both quantitative evaluation and visual comparison.

Abstract:
In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results.

Abstract:
Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs with a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or Q-former from BLIP-2. Such networks project the image feature once and do not consider the interaction between the image and the human inputs. Hence, the obtained visual information without being connected to human intention may be inadequate for LLMs to generate intention-following responses, which we refer to as static visual information. To alleviate this issue, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It can allow the LLM to request the desired visual information aligned with various human instructions, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network to provide the basic perception of an image for LLMs. It also contains additional modules responsible for acquiring requests from LLMs, performing request-based visual information seeking, and transmitting the resulting interacted visual information to LLMs, respectively. In this way, LLMs act to understand the human query, deliver the corresponding request to the request-based visual information interaction module, and generate the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, demonstrating that it significantly improves zero-shot performances on various multimodal tasks compared to previous methods, with fewer parameters. Moreover, we also verify its effectiveness and scalability on various language models and video understanding, respectively.

Affiliations: Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Department of Artificial Intelligence, School of Computing, Kyung Hee University, Yongin-si, South Korea; Center for Future Media, and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Center for Future Multimedia and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
The vulnerability of deep neural networks to adversarial examples has raised huge concerns about the security of these algorithms. Black-box adversarial attacks have received a lot of attention as an influential method for evaluating model robustness. While various sophisticated adversarial attack methods have been proposed, the success rate in the black-box scenario still needs to be improved. To address these issues, we develop an Adaptive Multi-scale Degradation-based Attack method called AMDA. The intuitive motivation behind our approach is that different models tend to have similar attention regions for low-scale images. Specifically, AMDA uses degraded images to generate perturbations at different scales and fuses these perturbations to generate adversarial examples that are insensitive to model changes. Furthermore, we design an adaptive multi-scale perturbation fusion that evaluates the transferability of perturbations at different scales based on noise and adaptively allocates fusion weights to prioritize strong transferability attacks and avoid being compromised by local optima. Extensive experimental results on the ImageNet, CIFAR-100, and CIFAR-10 datasets demonstrate that the proposed AMDA algorithm exhibits competitive performance for both normally trained models and defense models.

Abstract:
Supervised human action recognition methods based on skeleton data have achieved impressive performance recently. However, many current works emphasize the design of different contrastive strategies to gain stronger supervised signals, ignoring the crucial role of the model's encoder in encoding fine-grained action representations. Our key insight is that a superior skeleton encoder can effectively exploit the fine-grained dependencies between different skeleton information (e.g., joint, bone, angle) in mining more discriminative fine-grained features. In this paper, we devise an innovative hierarchical aggregated graph neural network (HA-GNN) that involves several core components. In particular, the proposed hierarchical graph convolution (HGC) module learns the complementary semantic information among joint, bone, and angle in a hierarchical manner. The designed pyramid attention fusion mechanism (PAFM) fuses the skeleton features successively to compensate for the action representations obtained by the HGC. We use the multi-scale temporal convolution (MSTC) module to enrich the expression capability of temporal features. In addition, to learn more comprehensive semantic representations of the skeleton, we construct a multi-task learning framework with simple contrastive learning and design the learnable data-enhanced strategy to acquire different data representations. Extensive experiments on NTU RGB+D 60/120, NW-UCLA, Kinetics-400, UAV-Human, and PKUMMD datasets prove that the proposed HA-GNN without contrastive learning achieves state-of-the-art performance in skeleton-based action recognition, and it achieves even better results with contrastive learning.

Abstract:
Melody-to-lyrics generation, which is based on syllable-level generation, is an intriguing and challenging topic in the interdisciplinary field of music, multimedia, and machine learning. Many previous research projects generate word-level lyrics sequences due to the lack of alignments between syllables and musical notes. Moreover, controllable lyrics generation from melody is less explored but important for helping humans generate diverse desired lyrics. In this work, we propose a controllable melody-to-lyrics model that is able to generate syllable-level lyrics with user-desired rhythm. An explicit n-gram (EXPLING) loss is proposed to train the Transformer-based model to capture the sequence dependency and alignment relationship between melody and lyrics and predict the lyrics sequences at the syllable level. A prior attention mechanism is proposed to enhance the controllability and diversity of lyrics generation. Experiments and evaluation metrics verify that our proposed model generates higher-quality lyrics than previous methods and that interacting with users for controllable and diverse lyrics generation is feasible. We believe this work provides valuable insights into human-centered AI research in music generation tasks. The source codes for this work will be made publicly available for further reference and exploration.

Abstract:
Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point clouds. Experiments demonstrate that our 3D-VLAP achieves comparable results with current fully supervised methods, meanwhile alleviating the data annotation pressure.
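
The 3D-to-2D positional correspondence mentioned above can be sketched with the standard pinhole camera model: a world point is moved into the camera frame with the extrinsic rotation R and translation t, then projected to pixel coordinates with the intrinsic focal lengths and principal point. The abstract does not state which camera model is used, so the distortion-free pinhole model and the function name are assumptions.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3-D world point to pixel coordinates (u, v).

    rotation is a 3x3 row-major matrix and translation a length-3 vector
    (extrinsics); fx, fy, cx, cy are the intrinsics. Returns None when the
    point lies behind the camera.
    """
    # Camera-frame coordinates: X_c = R X_w + t
    cam = [sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
           for i in range(3)]
    if cam[2] <= 0:
        return None
    # Perspective division followed by the intrinsic mapping
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)
```

Applying such a projection to every point of a 3D instance gives the image region it covers, which is the kind of alignment that lets a 2D visual-linguistic model be queried about a 3D object.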

Abstract:
Deep learning techniques are increasingly integrated into rescaling-based video compression frameworks and have shown great potential in improving compression efficiency. However, existing methods achieve limited performance because 1) they treat context priors generated by codec as independent sources of information, ignoring potential interactions between multiple priors in rescaling, which may not effectively facilitate compression; 2) they often employ a uniform sampling ratio across regions with varying content complexities, resulting in the loss of important information. To address the above two issues, this paper proposes a spatial multi-prior driven resolution rescaling framework for intra-frame coding, called MP-RRF, consisting of three sub-networks: a multi-prior driven network, a downscaling network, and an upscaling network. First, the multi-prior driven network employs complexity and similarity priors to smooth the unnecessarily complicated information while leveraging similarity and quality priors to produce high-fidelity complementary information. This interaction of complexity, similarity and quality priors ensures redundancy reduction and texture enhancement. Second, the downscaling network discriminatively processes components of different granularities to generate a compact, low-resolution image for encoding. The upscaling network aggregates a complementary set of contextual multi-scale features to reconstruct realistic details while combining variable receptive fields to suppress multi-scale compression artifacts and resampling noise. Extensive experiments show that our network achieves a significant 23.84% Bjøntegaard Delta Rate (BD-Rate) reduction under all-intra configuration compared to the codec anchor, offering the state-of-the-art coding performance.

Abstract:
Domain adaptive object detection (DAOD) aims to develop a detector trained on labeled source domains to identify objects in unlabeled target domains. A primary challenge in DAOD is the domain shift problem. Most existing methods learn domain-invariant features within single domain embedding space, often resulting in heavy model biases due to the intrinsic data properties of source domains. To mitigate the model biases, this paper proposes VLDadaptor, a domain adaptive object detector based on vision-language models (VLMs) distillation. Firstly, the proposed method integrates domain-mixed contrastive knowledge distillation between the visual encoder of CLIP and the detector by transferring category-level instance features, which guarantees the detector can extract domain-invariant visual instance features across domains. Then, VLDadaptor employs domain-mixed consistency distillation between the text encoder of CLIP and detector by aligning text prompt embeddings with visual instance features, which helps to maintain the category-level feature consistency among the detector, text encoder and the visual encoder of VLMs. Finally, the proposed method further promotes the adaptation ability by adopting a prompt-based memory bank to generate semantic-complete features for graph matching. These contributions enable VLDadaptor to extract visual features into the visual-language embedding space without any evident model bias towards specific domains. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on Pascal VOC to Clipart adaptation tasks and exhibits high accuracy on driving scenario tasks with significantly less training time.

Abstract:
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos, where only a fraction of videos in the training set have labels. Recent studies have shown the potential of contrastive learning in unsupervised representation learning using unlabelled data. However, learning the representation of each frame by unsupervised contrastive learning for action segmentation remains an open and challenging problem. In this paper, we propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations for SS-TAS. Specifically, for representation learning, SMC is first used to explore intra- and inter-information variations in a unified and contrastive way, based on action-specific semantic information and temporal information highlighting relations between actions. Then, the NCA module, which is responsible for enforcing spatial consistency between neighbourhoods centered at different frames to alleviate over-segmentation issues, works alongside SMC for semi-supervised learning (SSL). Our SMC outperforms the other state-of-the-art methods on three benchmarks, offering improvements of up to 17.8% and 12.6% in terms of Edit distance and accuracy, respectively. Additionally, the NCA unit results in significantly better segmentation performance in the presence of only 5% labelled videos. We also demonstrate the generalizability and effectiveness of the proposed method on our Parkinson's Disease Mouse Behaviour (PDMB) dataset.

Abstract:
Image-Text Matching (ITM) aims to establish the correspondence between images and sentences. ITM is fundamental to various vision and language understanding tasks. However, there are limitations in the way existing ITM benchmarks are constructed. The ITM benchmark collects pairs of images and sentences during construction. Therefore, only samples that are paired at collection are annotated as positive. All other samples are annotated as negative. Many correlations are missed in these samples that are annotated as negative. For example, a sentence matches only one image at the time of collection. Only this image is annotated as positive for the sentence. All other images are annotated as negative. However, these negative images may contain images that correspond to the sentences. These mislabeled samples are called false negatives. Existing ITM models are optimized based on annotations containing mislabels, which can introduce noise during training. In this paper, we propose an ITM framework integrating Language Guidance (LG) for correcting false negatives. A language pre-training model is introduced into the ITM framework to identify false negatives. To correct false negatives, we propose language guidance loss, which adaptively corrects the locations of false negatives in the visual-semantic embedding space. Extensive experiments on two ITM benchmarks show that our method can improve the performance of existing ITM models. To verify the performance of correcting false negatives, we conduct further experiments on ECCV Caption. ECCV Caption is a verified dataset where false negatives in annotations have been corrected. The experimental results show that our method can recall more relevant false negatives.

Abstract:
Presently, to obtain a more accurate density map and crowd number, existing methods often count by combining training RGB images and depth images. However, these methods are not ideal for capturing and fusing complementary features in RGB-D. Therefore, to solve the above problems, we propose a collaborative cross-modal attention network named CCANet for accurate RGB-D crowd counting. CCANet is mainly composed of the collaborative cross-modal attention module (CCAM) and the collaborative cross-modal fusion module (CCFM). Specifically, CCAM focuses on adaptive, interleaved RGB-D information through channel and spatial cross-modal attentions to fully capture complementary features in different modes. CCFM can adaptively integrate these features by weighing the importance of the above complementary features. A large number of experiments on the ShanghaiTechRGBD and MICC benchmarks have proven the effectiveness of CCANet in RGB-D crowd counting. In addition, our CCANet is generally applicable to multimodal crowd counting and has achieved superior counting performance on the RGBT-CC benchmark.

Abstract:
The study of identifying human subgroups from videos is a significant topic, which has received a lot of attention in multiple disciplines. So far, however, there has been little consideration about combining it with relevant conceptions in network science. Therefore, this article proposes a novel method for the automatic identification of human subgroups in dynamic pedestrian flows. The spatial proximity and temporal continuity are combined to calculate the interaction intensity between pedestrians, by which a time-dependent pedestrian flow network is constructed. Based on the objective function of weighted partition density, the optimal threshold is used to determine community structures that correspond to human subgroups in frame images. Numerical experiments demonstrate that our method achieves high identification accuracy under various evaluation datasets, and exhibits better performance than existing methods in terms of different crowd densities, various numbers of subgroup members, and certain levels of trajectory noise. Furthermore, this work provides valuable implications for the understanding of subgroup behaviors and the modeling of subgroup movements.

Abstract:
Fake news detection has received continuous attention in recent years as more and more people post and read news online. To enable fake news detection, existing studies usually assume that labeled posts are provided for two classes (true or false) so that the model can learn a discriminative classifier from the labeled data. However, this assumption may not hold in reality, as most users may only label a small number of posts in a single category that they are interested in. Furthermore, most existing methods fail to mask the noise or irrelevant context (i.e., regions or words) between different modalities to assist in strengthening the correlations between relevant contexts. To tackle these issues, we present a curriculum-based multi-modal masked transformer network (CMMTN) for positive unlabeled multi-modal fake news detection by jointly modeling the inter-modality and intra-modality relationships of multi-modal information and masking the irrelevant context between modalities. In particular, we adopt BERT and ResNet to obtain better representations for texts and images, respectively. Then, the extracted features of images and texts are fed into a multi-modal masked transformer network to fuse the multi-modal content and mask the irrelevant context between modalities by calculating the similarity between inter-modal contexts. Finally, we design a curriculum-based PU learning method to handle the positive and unlabeled data. Extensive experiments on three public real-world datasets demonstrate the effectiveness of the CMMTN.

Abstract:
Previous studies have shown that there is a strong correlation between radiologists' diagnoses and their gaze when reading medical images. The extent to which gaze is attracted by content in a visual scene can be characterised as visual saliency. There is a potential for the use of visual saliency in computer-aided diagnosis in radiology. However, little is known about what methods are effective for diagnostic images, and how these methods could be adapted to address specific applications in diagnostic imaging. In this study, we investigate 20 state-of-the-art saliency models including 10 traditional models and 10 deep learning-based models in predicting radiologists' visual attention while reading 196 mammograms. We found that deep learning-based models represent the most effective type of methods for predicting radiologists' gaze in mammogram reading; and that the performance of these saliency models can be significantly improved by transfer learning. In particular, an enhanced model can be achieved by pre-training the model on a large-scale natural image saliency dataset and then fine-tuning it on the target medical image dataset. In addition, based on a systematic selection of backbone networks and network architectures, we proposed a parallel multi-stream encoded model which outperforms the state-of-the-art approaches for predicting saliency of mammograms.

Abstract:
Group activity analysis has attracted remarkable attention recently due to its widespread applications in security, entertainment and the military. This article targets learning group activity representations with self-supervision, which differs from the majority of methods that rely heavily on manually annotated labels. Moreover, existing Self-Supervised Learning (SSL) methods for videos are sub-optimal for generating such representations because of the complex context dynamics in group activities. In this article, an end-to-end framework termed Contextualized Relation Predictive Model (Con-RPM) is proposed for self-supervised group activity representation learning with predictive coding. It involves the Serial-Parallel Transformer Encoder (SPTrans-Encoder) to model the context of spatial interactions and temporal variations, and the Hybrid Context Transformer Decoder (HConTrans-Decoder) to predict the future spatio-temporal relations guided by holistic scene context. Additionally, to improve the discriminability and consistency of prediction, we introduce a united loss integrating group-wise and person-wise contrastive losses at the frame level as well as the adversarial loss at the global sequence level. Consequently, our Con-RPM learns robust group representations by explicitly describing the temporal evolution of individual relationships and scene semantics. Extensive experimental results on downstream tasks indicate the effectiveness and generalization of our model in self-supervised learning, and present state-of-the-art performance on the Volleyball, Collective Activity, VolleyTactic, and Choi's New datasets.

Abstract:
Weakly-supervised phrase grounding aims to localize a specific region in an image that corresponds to the given textual phrase, where the mapping between noun phrases and image regions is not available in the training stage. Previous methods typically exploit an additional proxy task (e.g., phrase reconstruction or image-phrase alignment) to provide supervision for training, because region-level annotations are lacking in the weakly-supervised setting. However, there exists a significant gap in optimization objectives between the proxy tasks and the target grounding task, which may result in inefficient optimization of the target model. Therefore, in this paper, we propose a novel dual reinforcement learning framework to directly optimize the phrase grounding model. Specifically, we consider the duality of phrase grounding and phrase generation tasks. These two tasks form a closed loop that can provide quality feedback signals to measure the performance of each other. In this way, we can measure the correctness of the localized regions and thus optimize the grounding model directly. We design two reward functions to quantify the feedback signals and train the models via reinforcement learning. In addition, to relieve the training difficulty of our framework, we present a heuristic algorithm to generate pseudo region-phrase pairs to warm-start our models. We perform experiments on two popular phrase grounding datasets, ReferItGame and Flickr30K Entities, and the results demonstrate that our method outperforms the previous methods by a large margin.

Abstract:
Dialogue emotion detection is always challenging due to human subjectivity and the randomness of dialogue content. In a conversation, the emotion of each person often develops via a cumulative process, which can be influenced by many elements of uncertainty. Much commonsense knowledge, such as experiential or habitual knowledge, influences people's emotions imperceptibly. In the process of conversation, this commonsense knowledge can be used to enrich the semantic information of each utterance and improve the accuracy of emotion recognition. In this paper, we propose a growing graph model for dialogue emotion detection based on local and global retrieval from the external knowledge atlas ATOMIC. The model can effectively represent a dialogue as a process variable in a sequence, and the graph also represents the correlation among utterances. In particular, 1) we introduce a commonsense knowledge graph for linking the commonsense knowledge retrieved from the external knowledge atlas ATOMIC, which can effectively add auxiliary information to improve the representation of each utterance. 2) We propose a novel self-supervised learning method for extracting the latent topic of each dialogue. Based on this design, we also propose an effective optimization mechanism that makes the representation (embedding) of the latent topic more distinguishable for the next operation. 3) Finally, the cross-attention module is utilized to combine the utterances' features and the latent conversation topic information. The attention mechanism can effectively use topic information to supplement the representation of utterances and improve recognition performance. The model is tested on three popular datasets in dialogue emotion detection and is empirically demonstrated to outperform the state-of-the-art approaches. Meanwhile, to further demonstrate the performance of our approach, we also build a long dialogue dataset in which the average length of each conversation is over 50 utterances. The final experimental results on this dataset also demonstrate the superior performance of our approach.

Abstract:
As an inevitable phenomenon in real-world applications, data imperfection has emerged as one of the most critical challenges for multimodal sentiment analysis. However, existing approaches tend to overly focus on a specific type of imperfection, leading to performance degradation in real-world scenarios where multiple types of noise exist simultaneously. In this work, we formulate the imperfection as missing modality features during training and propose a noise imitation based adversarial training framework to improve the robustness against various potential imperfections at inference time. Specifically, the proposed method first uses temporal feature erasing as the augmentation for constructing noisy instances and exploits the modality interactions through the self-attention mechanism to learn multimodal representations for original-noisy instance pairs. Then, based on the paired intermediate representations, a novel adversarial training strategy with semantic reconstruction supervision is proposed to learn a unified joint representation between noisy and perfect data. For experiments, the proposed method is first verified with missing modality features, the same type of imperfection as in training, and shows impressive performance. Moreover, we show that our approach is capable of achieving outstanding results for other types of imperfection, including modality missing, automatic speech recognition errors and attacks on text, highlighting the generalizability of our model. Finally, we conduct case studies on general additive distribution, which introduce background noise and blur into raw video clips, further revealing the capability of our proposed method for real-world applications.

Abstract:
The higher requirements for deep neural networks are driving researchers to gain a deeper understanding of the internals of neural networks. The class activation map (CAM) based methods can provide a convincing interpretation of the features extracted by the neural network from both visual and quantitative perspectives. However, the existing CAM methods do not take into account that the non-target region also contains target-related activation, which results in the generated saliency map containing noise from unrelated regions. In addition, the soft mask with continuous values not only contains more non-target regions for gradient-free CAM, but also disturbs the characteristics and distribution of the target region. This paper proposes a novel CAM method named Bipolar Information CAM (BI-CAM) to interpret convolutional neural networks (CNNs) and graph convolutional networks (GCNs). Firstly, dual-stream information is proposed to precisely quantify the relationship between the target region and the non-target region for an image/graph. Secondly, binary reformation is also proposed to generate a hard mask that can retain the original features and regions. Finally, we propose to use concise and effective Point-wise Mutual Information (PMI) to measure the quantitative relationship between the image and the local region with respect to the label. The experimental results show that the proposed BI-CAM achieves significantly better performance in the faithfulness evaluation, from the perspectives of both visualization and quantitative analysis, than other competitive interpretation methods.
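As a point of reference for the PMI term mentioned above, the following is a minimal sketch of the standard point-wise mutual information definition, PMI(x; y) = log(p(x, y) / (p(x) p(y))), together with its equivalent conditional form log(p(y | x) / p(y)). The function names are hypothetical, and the abstract does not say how BI-CAM estimates these probabilities from a network, so the inputs here are simply given numbers.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Standard point-wise mutual information from joint and marginal probabilities."""
    if min(p_xy, p_x, p_y) <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_xy / (p_x * p_y))


def pmi_from_conditional(p_y_given_x, p_y):
    """Equivalent form: log of how much observing x changes the probability of y."""
    if min(p_y_given_x, p_y) <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_y_given_x / p_y)
```

A value above zero means that x (for example, a local region) makes the label y more likely than its baseline probability; a value of zero means the two are independent.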

Abstract:
Continuous sign language recognition (CSLR) aims to map a sign video into a sentence of text words in the same order as the signs. Generally, word error rate (WER), i.e., edit distance, is adopted as the main evaluation metric. Since this metric is not differentiable, current deep-learning-based CSLR methods usually resort to connectionist temporal classification (CTC) loss during optimization, which maximizes the posterior probability over the sequential alignment. Due to the optimization gap between CTC loss and WER, the decoded sequence with the maximum probability in CTC may not be the one with the lowest WER. To tackle this issue, we propose a novel prior-aware cross modality augmentation learning method. In our approach, we first generate the pseudo video-text pair by cross modality editing, i.e., substitution, deletion and insertion on the paired real video-text data. To ensure the pseudo data quality, we guide the editing with both a textual grammar prior and a visual pose transition consistency prior. In this way, the generated pseudo video and text sentence follow the underlying distribution of the sign language data, and serve as more genuine hard examples for the cross modality representation learning of our CSLR task. Based on the real and generated pseudo data, we optimize our CSLR framework with three loss terms. We evaluate our approach on popular large-scale CSLR datasets, and extensive experiments demonstrate the effectiveness of our method.
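The WER metric referred to above can be sketched as a word-level Levenshtein distance normalized by the reference length. This is the common textbook definition, not code from the paper; the function names are illustrative, and the three edit operations match the substitution, deletion and insertion mentioned in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of r
                            curr[j - 1] + 1,      # insertion of h
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Because the minimum over edit operations is piecewise constant in the model outputs, this quantity has no useful gradient, which is why a surrogate such as CTC loss is used during training.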

Abstract:
Human pose forecasting, which aims to predict the body poses happening in the future, is an important task in computer vision. However, long-term pose forecasting is particularly challenging because modeling long-range dependencies across the spatial-temporal level is hard for joint-based representation. Another challenge is uncertainty prediction since the future prediction is not a deterministic process. In this article, we present a novel Bayesian Spatial-Temporal Graph Transformer (BSTG-Trans) for predicting accurate, diverse, and uncertain future poses. First, we apply a spatial-temporal graph transformer as an encoder and a temporal-spatial graph transformer as a decoder for modeling the long-range spatial-temporal dependencies across pose joints to generate the long-term future body poses. Furthermore, we propose a Bayesian sampling module for uncertainty quantization of diverse future poses. Finally, a novel uncertainty estimation metric, namely Uncertainty Absolute Error, is introduced for measuring both the accuracy and uncertainty of each predicted future pose. We achieve state-of-the-art performance against other baselines on Human3.6M and HumanEva-I in terms of accuracy, diversity, and uncertainty for long-term pose forecasting. Moreover, our comprehensive ablation studies demonstrate the effectiveness and generalization of each module proposed in our BSTG-Trans.

Abstract:
Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain visual and audio contents and to identify event categories in unconstrained videos. Existing work usually utilizes successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net) in this paper. First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed using the correlation coefficients between the visual and audio feature pairs in all time steps. By extracting highly correlated segments and discarding weakly correlated segments, visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss to learn similar semantic representations for visual and audio global features under the constraints of cosine and L2 similarities. Extensive experiments on the public AVE dataset demonstrate the effectiveness of our proposed CSS-Net. The localization accuracies achieve the best performance of 80.5% and 76.8% in the fully- and weakly-supervised settings, respectively, compared with other state-of-the-art methods.
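The cross-correlation step described above can be illustrated with a small sketch. It assumes each segment is represented by one feature vector per modality, uses the Pearson correlation coefficient for every (visual, audio) time-step pair, and keeps the time steps whose same-time correlation reaches a threshold. The selection rule, the threshold value and all function names are assumptions made for illustration; the abstract does not specify how CASM scores or selects segments.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (std_x * std_y)


def cross_correlation_matrix(visual, audio):
    """Entry [t][s] is the correlation between visual step t and audio step s."""
    return [[pearson(v, a) for a in audio] for v in visual]


def select_consistent_segments(visual, audio, threshold=0.5):
    """Keep time steps whose same-time audio-visual correlation meets the threshold."""
    corr = cross_correlation_matrix(visual, audio)
    steps = min(len(visual), len(audio))
    return [t for t in range(steps) if corr[t][t] >= threshold]
```

In this toy rule only the diagonal of the matrix is used for selection; the off-diagonal entries are where a context-aware measure could additionally compare a segment against other time steps.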

Abstract:
Spatial audio is a crucial component of omnidirectional videos (ODVs), which can provide an immersive experience by enabling viewers to perceive sound sources in all directions. However, most visual attention modeling works for ODVs focus only on visual cues, and audio modality is rather rarely considered. Additionally, the existing audio-visual saliency models for ODVs lack spatial audio location-awareness (i.e. sound source location-agnostic) and audio content attributes discriminability (i.e. audio content attributes-agnostic). To this end, we propose a novel audio-visual perception saliency (AVPS) model with spatial audio location-awareness and audio content attributes-adaptive to efficiently address the problem of fixation prediction in ODVs. Specifically, we first utilize the improved group equivariant convolutional neural network (G-CNN) with eidetic 3D LSTM (E3D-LSTM) to extract spatial-temporal visual features. Then we perceive sound source locations by computing the audio energy map (AEM) of the audio information in ODVs. Subsequently, we introduce SoundNet to extract audio features with multiple attributes. Finally, we develop an audio-visual feature fusion module to adaptively integrate spatial-temporal visual features and spatial auditory information to generate the final audio-visual saliency map. Extensive experiments in three audio modalities validate the effectiveness of the proposed model. Meanwhile, the performance of the proposed model is superior to the other 10 state-of-the-art saliency models.

Abstract:
In this paper, we tackle a special face completion task, facial displacement completion, which can offer a key component for many single image 3D face reconstruction systems. To produce a detailed 3D face with an ear-to-ear complete displacement UV map, we propose a novel Displacement Completion method based on Transformer (DCT). Current transformer based image inpainting methods usually follow a two-stage scheme, which first recovers the masked pixels in low resolution with a transformer, and then replenishes the inpainting result in high resolution with a GAN. Although these methods have achieved great success, they suffer from information loss in two aspects when applied to face completion: 1) The downsampling operation makes the transformer only produce a coarse appearance prior for the GAN, incurring middle and low level information loss. 2) Some meaningful facial semantics can be well captured by the transformer and further benefit the completion, but this has not yet been explored. Motivated by the above considerations, we come up with three key designs in the proposed DCT: PCA tokenization, BERT-style learning, and style modulation. Firstly, we use PCA tokenization to replace the downsampling in the transformer to preserve more meaningful structures. Secondly, we make the transformer simulate the two tasks in BERT, Masked Language Model (MLM) and Next Sentence Prediction (NSP), for both masked pixels and facial attributes recovery. Thirdly, we encode the outcome of the transformer as the latent code to guide an image translation network in the StyleGAN2 modulation way. Experiments on both the FaceScape dataset and in-the-wild data demonstrate DCT's better performance compared with other transformer based or GAN based completion methods.

Abstract:
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make decisions based on the same image regions. We apply this method to both conventional attention and the recent Transformer models and carry out extensive experiments on the VCR benchmark dataset. The results demonstrate that with the attention alignment module, our method achieves a considerable improvement over the baseline methods, evidently revealing the feasibility of the coupling of the two processes as well as the effectiveness of the proposed method.

Abstract:
In tile-based 360° video streaming, users employ a tile rate allocation algorithm to select an appropriate bitrate to maximize the quality of experience (QoE). The preferences and viewports, however, can vary significantly across different users. Since the users independently choose their bitrate according to their own preferences and viewports, it is hard to ensure QoE fairness for users under the constraint of available bandwidth. In this article, we propose a QoE-fairness aware bitrate allocation algorithm for multi-users (QBAM) to reduce the differences in user QoE. According to the trajectory of the user viewpoint and user preferences for video quality, rebuffer time and quality switching, we leverage multi-agent reinforcement learning to train the bitrate allocation strategy. The experimental results show that, compared with the current tile rate allocation algorithm, QBAM effectively improves the QoE fairness.

Abstract:
Joint detection and tracking, which solves two fundamental vision challenges in a unified manner, is a challenging topic in computer vision. In this area, the proper use of spatial-temporal information in videos can help reduce local defects and improve the quality of feature representations. Although modeling low-level (usually pixel-wise) spatial-temporal information has been studied, instance-level spatial-temporal correlations (i.e., relations between semantic regions in which instances have occurred) have not been fully exploited. In comparison, modeling instance-level correlation is a more flexible and reasonable way to enhance feature representations. However, we have found that conventional instance-level relation learning that works for the separate tasks of detection or tracking is not effective in joint tasks in which a variety of scenarios may be presented. To try to resolve this problem, in this study, we effectively exploited instance-level spatial-temporal semantic information for joint detection and tracking via a joint relation learning pipeline with a novel relation learning mechanism called Similarity- and Quality-Guided Attention (SQGA). Specifically, we added task-specific SQGA relation modules before the corresponding task prediction heads to refine the instance feature representation using features of other reference instances in the neighboring frames; these features are aggregated on the basis of relational affinities. In particular, in SQGA, relational affinities were factorized to similarity and quality terms so that fine-grained supervision rules could be applied. Then we added task-specific attention losses for each SQGA relation module, resulting in a better feature aggregation for the corresponding task. Quantitative experiments based on several challenging multi-object tracking benchmarks showed that our approach was more effective than the baselines and provided competitive results compared with recent state-of-the-art methods.

Abstract:
After several years of development, deep synthesis technology has made significant progress in image and video synthesis. Deep forgery represented by Deepfakes, which is used as a tool for disinformation attacks, has become a research hotspot. The current strongly discriminative models can have good performance on specific datasets, even close to 100% accuracy. Unfortunately, since a specific discriminative method only fits a specific data distribution, and different forgery methods or datasets have different data distributions, these methods fail to achieve high performance in cross-dataset detection. In response to this problem and focusing on the actual situation, we adjust the strong generalization detection across the dataset to the generalization detection of unseen fake video. We propose Multi-Crise-Cross Attention and StyleGANv2 Generative Adversarial Network (MCS-GAN). Firstly, we build a Generative Adversarial Network (GAN) framework to learn the distribution of real face data and generate corresponding face images. Secondly, to break the high stitch between the fake region and the background, the model needs to have strong enough feature analysis and pixel restoration capabilities. Therefore, we propose a generator consisting of a Multi-Crise-Cross-Attention (MC) encoder and a StyleGANv2 (SG2) decoder. Finally, to avoid the situation where as long as a face is normal or different faces are abnormal, we set a latent space encoding discriminator and increase the ratio of latent space vector, so as to detect anomalies generated by the forgery operation acting on latent space. We conduct model generalization experiments on videos from the Internet and on some popular deepfake databases. The results show that the accuracy of our method is better than that of the best existing methods.

Abstract:
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves the retrieval accuracy while maintaining the retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and context level by self-supervised contrastive learning. It doesn't require cross-modal embedding interactions during training while maintaining independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pre-training independent-embedding models by 9.1%, 4.2%, and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K and MS-COCO 5K datasets, respectively. The retrieval accuracy also outperforms most existing interactive-embedding models with orders of magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.

Abstract:
Unsupervised domain adaptation (UDA) is extremely effective for transferring knowledge from a label-rich source domain to a label-scarce target domain. Because the target domain is unlabeled and may contain additional novel classes, open-set domain adaptation (ODA) has been suggested as a possible solution to detect these novel classes in the training phase. However, existing ODA methods rely heavily on abundant fully labeled source data, which are expensive to collect in specific applications and may also contain novel classes. In this study, we propose a novel self-labeling framework with prototypical contrastive learning and mutual information maximization to achieve ODA even when the amount of labeled data is very small, which is a new problem setting named few-shot ODA (FODA). We use self-supervised prototypical contrastive learning to train the network to learn the representations of source and target samples and maximize the mutual information between labels and input data to simultaneously recognize known and novel classes in the source and target domains. We evaluated our strategy in several domain adaptation environments and found that our method performed far better than existing approaches.

Abstract:
In this research, we consider the evolving threshold access structure, denoted as (k, ∞), for the size-invariant visual cryptography scheme (SIVCS). The so-called (k, ∞) threshold indicates that the number of participants is supposed to be infinite and that the access structure can be dynamically adjusted at any time by adding or deleting participants. First, the concept and definition of (k, ∞)-SIVCS are described. Shadow construction, constituted by random number generators and their choosing probabilities, for the (k, ∞)-SIVCS is then given. A contrast-maximizing problem for determining the generators and choosing probabilities is built based on the (k, ∞)-SIVCS. A simulated annealing-based algorithm is introduced to solve the optimization problem. The best solution from the simulated annealing-based algorithm forms a feasible (k, ∞)-SIVCS. To further improve the visual quality, a (k, ∞)-SIVCS using Boolean XOR decryption is also presented. Experimental results and comparisons are shown, demonstrating that the proposed techniques are feasible and advanced in terms of shadow size and contrast.
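
To give a feel for why XOR decryption can improve visual quality over stacking (OR), the sketch below uses the classic probabilistic (2, 2) size-invariant construction, not the (k, ∞) scheme or the simulated annealing procedure described in the abstract. The function names and the tiny test image are hypothetical and chosen only for illustration.

```python
import random


def make_shares(secret, rng):
    """Split a binary image (0 = white, 1 = black) into two same-size shares."""
    share_a, share_b = [], []
    for row in secret:
        # Each pixel of share A is an independent random bit.
        row_a = [rng.randint(0, 1) for _ in row]
        # White pixel: share B copies the bit. Black pixel: share B flips it.
        row_b = [bit ^ pixel for bit, pixel in zip(row_a, row)]
        share_a.append(row_a)
        share_b.append(row_b)
    return share_a, share_b


def stack_or(a, b):
    """Physical stacking of transparencies behaves like a pixel-wise OR."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stack_xor(a, b):
    """Boolean XOR decryption of the two shares."""
    return [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


secret = [[0, 1, 1, 0],
          [1, 0, 0, 1]]
share_a, share_b = make_shares(secret, random.Random(0))
recovered_or = stack_or(share_a, share_b)
recovered_xor = stack_xor(share_a, share_b)
```

In this toy case, OR stacking always renders black pixels as black but renders white pixels as black about half of the time, whereas XOR recovers the secret exactly; each share on its own is a uniformly random bit pattern of the same size as the secret.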

Abstract:
Image segmentation can reveal the semantic structure information in an image, which provides helpful guidance for image inpainting. Notably, it can help mitigate the artifacts on the boundaries of different semantic regions during the inpainting process. Existing semantic guidance-based image inpainting provides one-way guidance from the semantic segmentation task to the image inpainting task. There is no feedback from the inpainting results to adjust the guidance process, which causes inferior performance. To tackle this issue, this work proposes mutual dual-task generators to establish the interaction between the image segmentation and image inpainting tasks. Thus, semantic segmentation guides image inpainting and also receives feedback from image inpainting. These two processes interact with each other and progressively improve the inpainting quality. The mutual dual-task generator consists of a shared encoder and mutual decoders with the bidirectional Cross-domain Feature DeNormalization (CFDN) module inside, which hierarchically models the Segmentation-guided image Texture (ST) generation and Texture-guided semantic Segmentation (TS) generation. At the end of the mutual decoders, an Adaptive Attention Fusion (AAF) module is proposed to augment the texture and semantic class affinity between pixels, further refining the inpainted results. Experimental results demonstrate that the proposed mutual dual-task generator pipeline achieves superior inpainting performance over state-of-the-art methods on three public datasets.

Abstract:
In the Face Super-Resolution (FSR) task, it is important to precisely recover facial textures while maintaining facial contours for realistic high-resolution faces. Although several CNN-based FSR methods have achieved great performance, they fail to restore the facial contours due to the limitation of local convolutions. In contrast, Transformer-based methods, which use self-attention as the basic component, excel at modeling long-range dependencies between image patches. However, learning long-range dependencies often deteriorates facial textures due to the lack of locality. Therefore, a question naturally arises: how can the strengths of CNN and Transformer be effectively combined to better reconstruct faces? To address this issue, we propose an Efficient Latent Style guided Transformer-CNN framework for FSR called ELSFace, which can sufficiently integrate the advantages of CNN and Transformer. The framework consists of a Feature Preparation Stage and a Feature Carving Stage. Basic facial contours and textures are generated in the Feature Preparation Stage and separately guided by latent styles, so that facial details are better represented in reconstruction. The CNN and Transformer streams in the Feature Carving Stage restore facial textures and facial contours, respectively, in a parallel recursive way. Considering the neglect of high-frequency features when learning long-range dependencies, we design the High-Frequency Enhancement Block (HFEB) in the Transformer stream. The Sharp Loss is also proposed for better perceptual quality in optimization. Extensive experimental results demonstrate that our ELSFace achieves the best results on all metrics compared to the state-of-the-art CNN and Transformer-based methods on commonly used datasets and real-world tasks. Meanwhile, our ELSFace method has the fewest model parameters and the shortest running time. The codes are released at https://github.com/FVL2020/ELSFace.

Affiliations: School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an, China; School of Electronic Engineering, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an, China; Department of Computer Science and Software Engineering, Swinburne University of Technology, Hawthorn, VIC, Australia; School of Artificial Intelligence, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, China

Abstract:
Learning effective representations from unlabeled data is a challenging task for point cloud understanding. As the human visual system can map concepts learned from 2D images to the 3D world, and inspired by recent multimodal research, we introduce data from point cloud modality and image modality for joint learning. Based on the properties of point clouds and images, we propose CrossNet, a comprehensive intra- and cross-modal contrastive learning method that learns 3D point cloud representations. The proposed method achieves 3D-3D and 3D-2D correspondences of objectives by maximizing the consistency of point clouds and their augmented versions, and with the corresponding rendered images in invariant space. We further distinguish the rendered images into RGB and grayscale images to extract color and geometric features, respectively. These training objectives combine feature correspondences between modalities to combine rich learning signals from point clouds and images. Our CrossNet is simple: we add a feature extraction module and a projection head module to the point cloud and image branches, respectively, to train the backbone network in a self-supervised manner. After the network is pretrained, only the point cloud feature extraction module is required for fine-tuning and directly predicting results for downstream tasks. Our experiments on multiple benchmarks demonstrate improved point cloud classification and segmentation results, and the learned representations can be generalized across domains.

Abstract:
Visual Dialog (VD) requires an agent to answer the current question by engaging in a conversation with humans referring to an image. Despite the recent progress, it is beneficial to introduce external commonsense knowledge to fully understand the given image and dialog history. However, the existing knowledge-based VD models are inclined to rely on severe learning bias brought by commonsense, e.g., the retrieved <bus, capable of, transport people>, <bus, is a, public transport>, and <bus, is a, car> can induce a spurious correlation between the question “What is the bus used for?” and the false answer “City bus”. There are two challenges in making commonsense learning more robust against spurious correlations: 1) how to disentangle the true effect of “good” commonsense knowledge from the whole, and 2) how to estimate and remove the effect of “bad” commonsense bias on answers. In this article, we propose a novel CounterFactual Commonsense learning scheme for the Visual Dialog task (CFC-VD). First, compared with the causal graph of existing VD models, we add one new commonsense node and one new link to multi-modal information from history, question, and image. Since the retrieved knowledge prior is subtle and uncontrollable, we consider it as an unobserved confounder in the commonsense node, which leads to spurious correlations for the answer inference. Then, to remove the effect of the confounder, we formulate it as the direct causal effect of commonsense on answers and remove the direct language effect by subtracting it from the total causal effect via counterfactual reasoning. Experimental results certify the effectiveness of our method on the prevailing Visdial v0.9 and Visdial v1.0 datasets.

Abstract:
While deep generative models have empowered music generation, it remains a challenging and under-explored problem to edit an existing musical piece at fine granularity. In this article, we propose SDMuse, a unified Stochastic Differential Music editing and generation framework, which can not only compose a whole musical piece from scratch, but also modify existing musical pieces in many ways, such as combination, continuation, inpainting, and style transferring. The proposed SDMuse follows a two-stage pipeline to achieve music generation and editing on top of a hybrid representation including pianoroll and MIDI-event. In particular, SDMuse first generates/edits pianoroll by iteratively denoising through a stochastic differential equation (SDE) based on a diffusion model generative prior, and then refines the generated pianoroll and predicts MIDI-event tokens auto-regressively. We evaluate the generated music of our method on ailabs1k7 pop music dataset in terms of quality and controllability on various music editing and generation tasks. Experimental results demonstrate the effectiveness of our proposed stochastic differential music editing and generation process, as well as the hybrid representations.

Abstract:
Current low-light light-field (LF) image enhancement algorithms tend to produce blurry results, owing to (1) the loss of spatial details during enhancement and (2) inefficient exploitation of angular correlations, which help to recover spatial details. Therefore, in this article, we propose a parallel multi-scale network (PMSNet), which attempts to (1) process features of different scales in parallel to aggregate the different contributions of multi-scale features at each layer, thereby fully preserving spatial details, and (2) integrate multi-resolution 3D convolution streams to efficiently utilize angular correlations. Specifically, PMSNet consists of three stages: Stage-I employs multi-scale modules (MSMs) to generate local understanding with the aid of adjacent views. Notably, MSM retains high-resolution feature extraction to minimize the loss of spatial details. Stage-II processes all views to encode global information. Based on the local and global information extracted above, Stage-III utilizes 3D multi-scale modules (3D-MSMs) to efficiently exploit angular correlations. To validate our idea, we comprehensively evaluate the performance of PMSNet on three publicly available datasets. Experimental results show that our method is superior to the current state-of-the-art methods, achieving an average PSNR of 24.76 dB.

Abstract:
This article discusses the limitations of single- and two-modal salient object detection (SOD) methods and the emergence of multi-modal SOD techniques that integrate Visible, Depth, or Thermal information. However, current multi-modal methods often rely on simple fusion techniques, such as addition, multiplication, and concatenation, to combine the different modalities, which is ineffective for challenging scenes, such as low illumination and cluttered backgrounds. To address this issue, we propose a novel multi-modal feature fusion network (MFFNet) for V-D-T salient object detection, whose two key components are the triple-modal deep fusion encoder and the progressive feature enhancement decoder. The MFFNet's triple-modal deep fusion (TDF) module is designed to integrate the features of the three modalities and explore their complementarity by utilizing mutual optimization during the encoding phase. In addition, the progressive feature enhancement decoder consists of the weighted context-enhanced feature (WCF) module, the region optimization (RO) module, and the boundary perception (BP) module to produce region-aware and contour-aware features. After that, a multi-scale fusion (MF) module is proposed to integrate these features and generate high-quality saliency maps. We conduct extensive experiments on the VDT-2048 dataset, and our results show that the proposed MFFNet outperforms 12 state-of-the-art multi-modal methods.

Abstract:
Visual question answering (VQA) is a prevalent task in the real world and plays an essential role in helping the blind understand the physical world. However, due to real-world complexity, VQA test samples may come from a different distribution than the training data, resulting in unavoidable performance degradation. A similar issue also exists in the image recognition field, where one of the most recent effective solutions is test-time adaptation (TTA). TTA adapts a trained model at test time using only test samples, which provides a new idea for alleviating the analogous issue in VQA. However, naively introducing existing TTA methods (e.g., test-time entropy minimisation) into VQA is imperfect and achieves only marginal performance gain. The reason is that prior methods do not consider the special nature of the VQA problem and ignore that 1) the biased samples in the dataset may have negative effects on test-time model adaptation, and 2) the model may have captured the biases in the dataset. In this paper, we propose Test-time Debiased Self-supervised (TDS) learning objectives for VQA model adaptation. Specifically, we minimise the entropy for the unbiased test samples. To identify these samples, we construct a negative sample for each test sample and regard the test sample as unbiased if the output answers differ when the test sample and its counterpart negative sample are fed into the VQA model. Meanwhile, we also remove samples with high prediction entropy from adaptation, making the test-time gradients more reliable. To hinder the model from excessively fitting the superficial correlations of the biased samples, we adopt the biased samples and the counterpart negative samples to assist the adaptation. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate the effectiveness of our TDS.
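
The sample-selection rule described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the model's answer distributions for each test sample and for its negative counterpart are already available as probability lists, it does not model how negative samples are constructed, and the names `select_for_adaptation`, `entropy_loss`, and the threshold `max_entropy` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def argmax(probs):
    """Index of the most probable answer."""
    return max(range(len(probs)), key=probs.__getitem__)


def select_for_adaptation(orig_probs, neg_probs, max_entropy):
    """Return indices of test samples treated as unbiased and reliable."""
    selected = []
    for i, (p, q) in enumerate(zip(orig_probs, neg_probs)):
        # Unbiased (per the abstract's criterion): the answer changes
        # when the negative counterpart is fed to the model instead.
        answer_changed = argmax(p) != argmax(q)
        # Reliable: drop samples whose prediction entropy is too high.
        confident = entropy(p) <= max_entropy
        if answer_changed and confident:
            selected.append(i)
    return selected


def entropy_loss(orig_probs, selected):
    """Mean entropy over the selected samples (the quantity to minimise)."""
    if not selected:
        return 0.0
    return sum(entropy(orig_probs[i]) for i in selected) / len(selected)


# Hypothetical answer distributions over three candidate answers.
orig = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.4, 0.3, 0.3]]
neg = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
chosen = select_for_adaptation(orig, neg, max_entropy=0.8)
loss = entropy_loss(orig, chosen)
```

Here only the first sample is kept: the second keeps the same answer on its negative counterpart (treated as biased), and the third changes its answer but is too uncertain to yield a reliable gradient.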

Abstract:
Deep learning-based methods have achieved remarkable success with powerful modeling capabilities. However, the weights of these models are learned over the entire training dataset, which inevitably leads to the neglect of sample-specific properties in the learned enhancement mapping. This situation causes ineffective enhancement in the testing phase for samples that differ significantly from the training distribution. In this paper, we introduce external memory to form an external memory-augmented network (EMNet) for low-light image enhancement. The external memory aims to capture the sample-specific properties of the training dataset to guide the enhancement in the testing phase. Benefiting from the learned memory, more complex distributions of reference images in the entire dataset can be “remembered” to facilitate more adaptive adjustment of the testing samples. To further augment the capacity of the model, we take the transformer as our baseline network, which specializes in capturing long-range spatial redundancy. Experimental results demonstrate that our proposed method achieves promising performance and outperforms state-of-the-art methods. Notably, the proposed external memory is a plug-and-play mechanism that can be integrated with any existing method to further improve the enhancement quality. Further cases of integrating external memory with other image enhancement methods are qualitatively and quantitatively analyzed. The results further confirm the effectiveness of our proposed memory mechanism when combined with existing enhancement methods.

Abstract:
Self-supervised video representation learning leaves out heavy manual annotation by automatically excavating supervisory signals. Although contrastive learning based approaches exhibit superior performances, pretext task based approaches still deserve further study. This is because the pretext tasks exploit the nature of data and encourage feature extractors to learn spatiotemporal logic by discovering dependencies among video clips or cubes, without manual engineering on data augmentations or manual construction of contrastive pairs. To utilize chronological property more effectively and efficiently, this work proposes a novel pretext task, named serial restoration of shuffled clips (SRSC), disentangled by an elaborately designed task network composed of an order-aware encoder and a serial restoration decoder. In contrast to other order based pretext tasks that formulate clip order recognition as a one-step classification problem, the proposed SRSC task restores shuffled clips into the right order in multiple steps. Owing to the excellent elasticity of SRSC, a novel taxonomy of curriculum learning is further proposed to equip SRSC with different pre-training strategies. According to the factors that affect the complexity of solving the SRSC task, the proposed curriculum learning strategies can be categorized into task based, model based and data based. Extensive experiments are conducted on the subdivided strategies to explore their effectiveness and noteworthy laws. Compared with existing approaches, this work demonstrates that the proposed approach achieves state-of-the-art performances in pretext task based self-supervised video representation learning and a majority of the proposed strategies further boost the performance of downstream tasks. For the first time, the features pre-trained by the pretext tasks are applied to video captioning by feature-level early fusion, and enhance the input of existing approaches as a lightweight plugin.

Abstract:
Multimodal Conditional Image Synthesis (MCIS) aims to generate images according to inputs of different modalities and their combinations, which allows users to describe their requirements in complementary ways, e.g., segmentation for shapes and text for attributes. Despite satisfying results in MCIS, a non-trivial issue is neglected. Some modalities are fully optimized and dominate the generation, while other modalities are sub-optimized and fail to contribute their complementary information. We term this phenomenon Modality Bias. Our analysis reveals that generative models have a greedy nature. Specifically, the modality that has a smaller semantic gap with the synthesized modality will be greedily incorporated and thus takes a larger proportion in synthesis. The main idea of previous works on Modality Bias is to punish the greedy nature, which hurts the performance of dominant modalities and impedes their contribution to multimodal synthesis. Instead, we propose to utilize the greedy nature by setting dominant modalities as guidance for sub-optimized modalities through a coordinated feature space, named Coordinated Knowledge Mining. Afterwards, the improved uni-modalities are aggregated by fusing coordinated features to further boost the performance of multimodal image synthesis, called Coordinated Knowledge Fusion. Extensive experiments show that our method not only increases uni-modal performance by a large margin, but also promotes multimodal image synthesis by fully utilizing complementary information from different modalities.

Abstract:
Referring expression comprehension (REC) is a cross-modal matching task that aims to localize the target object in an image specified by a text description. Most existing approaches for this task focus on identifying only objects whose categories are covered by training data. This restricts their generalization to unseen categories and practical usage. To address this issue, we propose a domain adaptive network called CLIPREC for zero-shot REC, which integrates the Contrastive Language-Image Pretraining (CLIP) model for graph-based REC. The proposed CLIPREC is composed of a graph collaborative attention module with two directed graphs: one for objects in an image and the other for their corresponding categorical labels. To carry out zero-shot REC, we leverage the strong common image-text feature space from the CLIP model to correlate the two graphs. Furthermore, a multilayer perceptron is introduced to enable feature alignment so that the CLIP model is adapted to the expression representation from the language parser, resulting in effective reasoning from expressions involving both seen and unseen object categories. Extensive experimental and ablation results on several widely-adopted benchmarks show that the proposed approach performs favorably against state-of-the-art approaches for zero-shot REC.

Abstract:
Particle flow (PF) is a method originally proposed for single-target tracking. It has recently been used to address the weight degeneracy problem of the sequential Monte Carlo probability hypothesis density (SMC-PHD) filter for audio-visual (AV) multi-speaker tracking. In a recent method based on non-zero particle flow (NPF), i.e., the AV-NPF-SMC-PHD filter, the particle flow is calculated using only the measurements near the particle, assuming that the target is detected. This, however, can be problematic when occlusion happens and the occluded speaker may not be detected. To address this issue, we propose a new method where the labels of the particles are estimated using the likelihood function, and the particle flow is calculated in terms of the selected particles with the same labels. As a result, the particles associated with detected speakers and undetected speakers are distinguished based on the particle labels. With this novel method, named AV-LPF-SMC-PHD, the speaker states can be estimated as the weighted mean of the labelled particles, which is computationally more efficient than using a clustering method as in the AV-NPF-SMC-PHD filter. The proposed algorithm is compared systematically with several baseline tracking methods using the AV16.3, AVDIAR, and CLEAR datasets, and is shown to offer improved tracking accuracy with a lower computational cost.
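
The state-extraction step mentioned above, estimating each speaker's state as the weighted mean of the particles that carry its label, might look like the following minimal sketch. It assumes labels and weights have already been produced by the filter and does not model the likelihood-based labelling or the particle flow itself; the tuple layout, the 2D position state, and the name `estimate_states` are hypothetical.

```python
from collections import defaultdict


def estimate_states(particles):
    """Weighted mean position per label.

    particles: iterable of (label, weight, (x, y)) tuples.
    Returns a dict mapping each label to its estimated (x, y) state.
    """
    # Per label: [total weight, weighted sum of x, weighted sum of y].
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for label, weight, (x, y) in particles:
        acc = sums[label]
        acc[0] += weight
        acc[1] += weight * x
        acc[2] += weight * y
    # Skip labels whose particles carry no weight at all.
    return {
        label: (wx / w, wy / w)
        for label, (w, wx, wy) in sums.items()
        if w > 0.0
    }


# Hypothetical particle set for two speakers.
particles = [
    ("speaker_1", 0.25, (0.0, 0.0)),
    ("speaker_1", 0.75, (4.0, 8.0)),
    ("speaker_2", 1.0, (10.0, 10.0)),
]
states = estimate_states(particles)
```

A single pass over the particles suffices here, which is consistent with the abstract's remark that this is cheaper than running a clustering step to extract the states.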

Abstract:
Existing depth super-resolution (DSR) methods typically utilize an additional high-resolution (HR) color image of the same scene as assistance to recover the low-resolution (LR) depth map. Although these color-guided methods have achieved impressive progress, they are prone to under-utilizing and mis-utilizing the color image. In this article, we investigate these problems in depth and propose a novel DSR framework to alleviate them. Specifically, we propose a Cross-scale and Cross-modality Aggregation Network (C^2ANet) to learn abundant and accurate complementary information from color images to help recover the degraded depth map. Our C^2ANet can simultaneously extract multi-scale representations from color images with parallel network hierarchies, and effectively aggregate cross-scale and cross-modality contexts to boost HR representations in each hierarchy. Then, to use the guidance color image appropriately, we further design a Feature Aggregation Module (FAM) to adaptively select and fuse task-relevant features, which consists of (1) a feature alignment block to learn transformation offsets and align upsampled features with targeted HR features, and (2) a feature fusion block based on a cross-attention mechanism to maintain strong structural context and suppress texture distraction. Experimental results on synthetic and real-world benchmark datasets demonstrate the superiority of our proposed method in comparison with other state-of-the-art DSR methods.

Abstract:
RGB-D salient object detection (SOD) focuses on utilizing the complementary cues of the RGB and depth modalities to detect and segment salient regions. However, many proposed methods train their models in a simple multi-modal manner, ignoring the differences between these two modalities in their contribution to saliency detection. Furthermore, the quality of depth datasets varies significantly between individuals and is another important factor affecting model performance. To address the aforementioned issues, this article proposes a novel depth-guided fusion network framework (DGFNet) for the RGB-D SOD task. To avoid the influence of low-quality depth maps on RGB-D SOD, we design a depth map enhancement algorithm that jointly models saliency detection and depth estimation to improve the quality of depth. Also, we propose a depth attention mechanism to encode valuable spatial information for SOD, which is then used in the depth-guided fusion (DGF) module to guide the fusion of cross-modality features at each level. Extensive experiments on seven commonly tested datasets demonstrate that our DGFNet outperforms 23 state-of-the-art RGB-D-based SOD methods.

Abstract:
With the assistance of Convolutional Neural Networks (CNNs), Image Quality Assessment (IQA) models have made great progress in evaluating both simulated distortion and authentic distortion. However, most of the existing IQA models only learn the features of distorted images, and thus do not make full use of the available feature representation of other domains. Furthermore, the common multi-scale fusion strategies are relatively simple, such as downsampling and concatenating, which further limits the prediction performance. To this end, we propose a novel blind image quality index with cross-domain interaction and cross-scale integration, which is designed based on the combination of CNN and Transformer. First, the hierarchical spatial-domain and gradient-domain representations are obtained through a typical CNN architecture. Then, based on the proposed gradient-query cross-attention, these two types of features are fully interacted in the Cross-Domain Interaction (CDI) module. To represent the distortion information more comprehensively, the Cross-Scale Integration (CSI) module is proposed to combine the information between different scales progressively. Finally, the quality score is obtained through a simple regression module. The experimental results on five public IQA databases of both simulated and authentic scenes show that the proposed model outperforms the compared state-of-the-art metrics. In addition, cross-database experiments show that the proposed model has strong generalization performance.

Abstract:
Learning-based in-loop filters (ILFs) have recently been widely deployed in video codecs to remove compression artifacts and obtain better-quality reconstructed videos. However, in existing codecs, the impact of the learning-based ILF is not considered in the Rate-Distortion optimization (RDO) process. With the learning-based ILF, the set of coding parameters selected by the conventional RDO process may no longer be the best one, and the best overall Rate-Distortion (R-D) performance cannot be guaranteed. In this article, we propose a joint RDO (JRDO) for video coding and learning-based in-loop filtering, which incorporates the effect of the learning-based ILF on the reconstructed video into the RDO process, aiming to achieve the best overall R-D performance of the reconstructed video after in-loop filtering. Furthermore, to realize the proposed JRDO in a standardized video codec, we propose practical strategies to efficiently estimate the effect of the learning-based ILF during the RDO process, i.e., to efficiently estimate the distortion of the reconstructed block after in-loop filtering. Extensive experiments demonstrate that the proposed joint RDO is standard-compliant and can improve the R-D performance without increasing the decoding time. In addition, joint RDO retains its advantage across various ILFs, indicating the generality of the proposed work.
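
The difference between conventional and joint RDO can be illustrated with the standard Lagrangian cost J = D + λR, where the only change is which distortion enters the cost. The sketch below is a toy under stated assumptions: the candidate modes, their rate and distortion numbers, and the name `pick_mode` are all hypothetical, and the post-filter distortion is simply given rather than estimated as the abstract proposes.

```python
def pick_mode(candidates, lam, use_filtered):
    """Choose the candidate minimising the Lagrangian cost J = D + lam * R.

    candidates: list of dicts with keys 'name', 'rate', 'dist_pre'
    (distortion before in-loop filtering) and 'dist_post' (after filtering).
    use_filtered: if True, use the post-filter distortion (joint RDO);
    otherwise use the pre-filter distortion (conventional RDO).
    """
    key = "dist_post" if use_filtered else "dist_pre"
    best = min(candidates, key=lambda c: c[key] + lam * c["rate"])
    return best["name"]


# Hypothetical coding modes for one block.
candidates = [
    {"name": "A", "rate": 10, "dist_pre": 100, "dist_post": 90},
    {"name": "B", "rate": 12, "dist_pre": 105, "dist_post": 70},
]
conventional_choice = pick_mode(candidates, lam=2.0, use_filtered=False)
joint_choice = pick_mode(candidates, lam=2.0, use_filtered=True)
```

With these numbers, the conventional cost favours mode A (120 versus 129), while the joint cost favours mode B (94 versus 110) because the filter benefits B far more. This is the kind of decision reversal that motivates accounting for the filter inside the RDO loop.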

Abstract:
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL), yet it has been criticized for learning inefficiency. We believe that insufficient utilization of training signals is responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch with the disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual-branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooted in orthogonal perspectives on training efficiency improvement, DM and JD cooperatively accelerate training convergence without sacrificing the model's generalization ability. Concretely, DM can train ViT with fewer effective training epochs (up to 3.7× less time-consuming) while reporting competitive performance. With JD, our DMJD clearly improves linear probing classification accuracy, by up to 3.4%. On fine-grained downstream tasks such as semantic segmentation and object detection, our DMJD also presents superior generalization compared with state-of-the-art SSL methods.
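
One simple way to picture the disjoint regulation is to shuffle the token indices once and hand each view its own slice as the masked set. The sketch below is a simplified reading of the idea rather than the DMJD sampling procedure: it assumes the views can be made fully disjoint, i.e., that the number of views times the masking ratio does not exceed 1, and the name `disjoint_masks` is hypothetical.

```python
import random


def disjoint_masks(num_tokens, mask_ratio, num_views, rng):
    """Sample masked-token index sets that do not overlap across views.

    Each view masks int(num_tokens * mask_ratio) tokens, so the per-view
    masking rate is unchanged while more tokens get reconstructed overall.
    """
    per_view = int(num_tokens * mask_ratio)
    if per_view * num_views > num_tokens:
        # Simplifying assumption of this sketch: full disjointness is required.
        raise ValueError("masked sets cannot be fully disjoint at this ratio")
    order = list(range(num_tokens))
    rng.shuffle(order)
    # Consecutive slices of one permutation are disjoint by construction.
    return [
        set(order[v * per_view:(v + 1) * per_view])
        for v in range(num_views)
    ]


masks = disjoint_masks(num_tokens=16, mask_ratio=0.25, num_views=4,
                       rng=random.Random(0))
```

With 16 tokens, a 25% masking ratio, and four views, every token is masked in exactly one view, so each token serves as a reconstruction target once per image instead of only a quarter of the tokens being used.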

Abstract:
Self-supervised representation learning has proven constructive for skeleton-based action recognition. For better performance, existing methods mainly focus on 1) multi-modal data augmentations and 2) triplet contrastive samples construction. However, designing these strategies is always heuristic and difficult. Instead of exploring more similar strategies, this paper addresses this issue from a different view and proposes a novel Contrastive Spatio-Temporal Clustering (CSTC) module. CSTC constructs a supervised signal (pseudo-label) of action sequences in an online clustering manner, and it is complementary to the recent data augmentations or triplet contrastive samples construction strategies. Specifically, CSTC can be formulated as an optimal transport problem. We introduce spatio-temporal regularizations into the original optimal transport term to guide the pseudo-label generation, i.e., a semantic regularization learned by frame index is proposed to constrain the frame order, and a prior normal distribution regularization based on the sampling characteristics of samples is proposed to maintain the dependability of spatial cluster assignments. Furthermore, to enhance the learning of latent features, we propose a Bidirectional Cross-modal Clustering Consistency Objective (B3CO) to enforce cluster assignment consistency for different modalities of the same sample. Last, since fusing spatial and temporal clustering losses directly during back-propagation would confuse the learned dimension-specific semantics, we propose a simple yet effective training strategy to fix this by training the model using these two losses alternately. By integrating the above designs into the MoCo framework, we propose a Contrastive Spatio-Temporal Clustering Network (CSTCN), which can excavate cross-modal discriminative spatio-temporal features in the clustering space.
Experimental results on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets show that CSTCN achieves state-of-the-art performance in both single- and multi-modal models, especially in the KNN and semi-supervised evaluation protocols. Besides, the key module CSTC shows good generalization capability, and achieves consistent performance improvement on the basis of several state-of-the-art methods which focus on data augmentations and triplet contrastive samples construction.

Abstract:
Recent years have witnessed great progress in audio-driven talking head animation. Among these methods, the 3D-based ones better preserve the 3D consistency of the generated head and produce more natural results compared with 2D-based approaches. However, most 3D-based methods employ 3D morphable face models as the intermediate representation and involve multi-stage training, which may lead to error accumulation. To alleviate this problem, in this article, we propose a fully end-to-end talking head animation method, which implicitly grasps the 3D structures by learning a conditional Neural Radiance Field (NeRF). As NeRF has proven to be an effective tool for 3D modeling, one can learn dynamic neural radiance fields conditioned on audio signals for talking head synthesis. Furthermore, we argue that audio signals cannot fully drive a lifelike talking head. When people are talking, they usually show many spontaneous facial movements like blinks and brow movements, which makes talkers natural and real. These movements cannot be fully driven by the audio signals since they are highly unrelated to the audio. Therefore, we incorporate motion information as another driving factor and develop an audio-motion dual-driven NeRF model to take a step toward more lifelike talking head synthesis. On this basis, as audio and motion mainly affect different regions of the human face, we propose a Spatially-adaptive Dual-driven NeRF (SD-NeRF), which fuses these two driven factors with a spatially-adaptive cross-attention mechanism. Quantitative and qualitative results demonstrate that, with finer facial controls, our method produces more realistic talking head videos compared with existing advanced works.

Abstract:
Advanced compression technologies have succeeded in exploring compact representations of intricate video contents, but their dramatic increase in computational complexity causes severe challenges when deploying new-generation video codecs. In this article, we revisit the essential compression characteristics of IEEE1857.10/AVS3 from the perspective of reducing the computational cost of different application scenarios as well as its merits in terms of accommodating parallel implementation in high-performance multicore computing platforms by leveraging the scalable video technology (SVT) architecture. A hybrid acceleration scheme is constructed to extract texture and contextual information for pruning massive encoding candidates, while the visual quality of the reconstructed images is well considered in terms of adaptive coding budgets. Furthermore, we have carefully studied the compression performance of existing AVS3 coding tools within different technical parameters and rate-distortion granularity, where the parametric encapsulation and resource scheduling derived a series of preset configurations capable of offering satisfactory trade-offs for various applications. Therefore, we propose the first and fastest AVS3 standard-compliant encoder capable of processing video signals in real time at up to 8K resolution, which may hopefully further benefit the emerging 8K-UHD video industry.

Abstract:
This work aims to estimate a high-quality depth map from a single RGB image. Due to the lack of depth clues, making full use of the long-range correlation and local information is critical for accurate depth estimation. To this end, we introduce an uncertainty rectified cross-distillation between the Transformer and convolutional neural network (CNN) to achieve a comprehensive depth estimator. Specifically, we utilize the depth estimates from the Transformer branch and CNN branch as pseudo labels to teach each other. At the same time, the pixel-wise depth uncertainty is modeled to mitigate the negative impact of noisy pseudo labels. To avoid the large capacity gap induced by the strong Transformer branch deteriorating the cross-distillation, we transfer the feature maps from the Transformer to the CNN and develop coupling units to assist the weak CNN branch in leveraging the transferred features. Furthermore, we introduce CutFlip, a surprisingly simple yet highly effective data augmentation technique, which forces the model to focus on more valuable depth reasoning clues apart from the vertical image position. Extensive experiments demonstrate that our model, termed URCDC-Depth, outperforms previous state-of-the-art approaches on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets, with no additional computational burden in the evaluation phase. The source code is available at https://github.com/ShuweiShao/URCDC-Depth.

Abstract:
Recent years have witnessed the popularity of integrating Siamese networks into RGBT tracking for fast tracking. However, these trackers mostly utilize the feature information of the last output layer and ignore the benefits of multi-layer information. In addition, they often adopt feature-level fusion for different modalities but fail to explore the strength of decision-level fusion, which may easily decrease their flexibility and independence. In this article, a novel multi-layer attention aggregation Siamese network on the decision level is proposed for robust RGBT tracking. To be specific, a hierarchical channel attention Siamese network is built to recalibrate the extracted multi-layer features from RGB and thermal infrared images. This can focus on more discriminative features to learn robust feature representation. Then, a depth-wise correlation operation is performed to produce RGB and thermal response maps, respectively. To better exploit and utilize the complementary RGB and thermal information, a contribution-aware aggregation network is designed to adaptively aggregate them. Lastly, a classification and regression network is adopted to complete the bounding box prediction. Extensive experiments on four large-scale RGBT benchmarks demonstrate outstanding tracking ability over other state-of-the-art trackers.

Abstract:
In recent years, considerable progress has been witnessed in person re-identification (Re-ID). However, in a more realistic long-term scenario, the appearance shift arising from clothes-changing inevitably deteriorates the conventional methods that heavily depend on the clothing color. Although the current clothes-changing person Re-ID methods introduce external human knowledge (i.e., contour, mask) and sophisticated feature decoupling strategies to alleviate the clothing shift, they still face the risk of overfitting to clothing due to the limited clothing diversity of the training set. To more efficiently and effectively promote clothes-irrelevant feature learning, we present a novel joint Identity-aware Mixstyle and Graph-enhanced Prototype method for clothes-changing person Re-ID. Specifically, by treating clothes-changing as a fine-grained domain/style shift, the identity-aware mixstyle (IMS) is proposed from the perspective of domain generalization, which mixes the instance-level feature statistics of samples within each identity to synthesize novel and diverse clothing styles, while retaining the correspondence between synthesized samples and the latent label space. By incorporating the IMS module, more diverse styles can be exploited to train a clothing-shift robust model. To further reduce the feature discrepancy caused by clothing variations, the graph-enhanced prototype constraint (GEP) module is proposed to explore the graph similarity structure of style-augmented samples across the memory bank to build informative and robust prototypes, which serve as powerful exemplars for better clothing-irrelevant metric learning. The two modules are integrated into a joint learning framework and benefit each other. Extensive experiments conducted on clothes-changing person Re-ID datasets validate the superiority and effectiveness of our method. In addition, our method also shows good universality and corruption robustness on other Re-ID tasks.
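
The mixing of instance-level feature statistics within each identity can be sketched as follows. This is only an illustration in the spirit of MixStyle-type augmentation, not the authors' implementation: the function name `identity_aware_mixstyle`, the Beta-distributed mixing weight, and the random same-identity partner selection are all assumptions that the abstract does not specify.

```python
import numpy as np


def identity_aware_mixstyle(feats, ids, alpha=0.1, eps=1e-6, rng=None):
    """Mix per-instance channel statistics between samples of the same identity.

    feats: array of shape (B, C, H, W); ids: integer array of shape (B,).
    The Beta(alpha, alpha) mixing weight is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Instance-level statistics: one mean and std per sample and channel.
    mu = feats.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(feats.var(axis=(2, 3), keepdims=True) + eps)
    normed = (feats - mu) / sigma

    # Pick a partner for every sample from the samples sharing its identity.
    partner = np.arange(len(ids))
    for pid in np.unique(ids):
        idx = np.flatnonzero(ids == pid)
        partner[idx] = rng.permutation(idx)

    lam = rng.beta(alpha, alpha, size=(len(ids), 1, 1, 1))
    mixed_mu = lam * mu + (1.0 - lam) * mu[partner]
    mixed_sigma = lam * sigma + (1.0 - lam) * sigma[partner]
    # Re-style the normalized content with the mixed statistics.
    return normed * mixed_sigma + mixed_mu


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 3, 2, 2))
    out = identity_aware_mixstyle(feats, np.array([0, 0, 1, 1]), rng=rng)
    print(out.shape)
```

Because partners are drawn only from the same identity, the synthesized style stays attached to the original label; an identity with a single sample in the batch is left unchanged.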

Abstract:
Person re-identification (ReID) has achieved great improvement under supervised settings, but suffers from considerable degradation when large distribution shifts between training and testing sets exist. Domain generalization (DG ReID) emerges to promote the generalization ability of models, overcoming the distribution shift between source domains and unseen target domains. Among most prior methods in DG ReID, instance normalization (IN) serves as a promising solution for removing domain-specific information; however, it simultaneously damages the discriminative ability. In this article, we propose a new normalization method called Cluster-Instance Normalization (CINorm) to extract information from clusters for information compensation. The relations between samples in a batch can be mined to establish evolving clusters with aggregated samples during the forward training process. In this way, high intra-cluster congregation can eliminate the impacts of outliers to avoid overfitting, and high inter-cluster variances can synthesize diverse novel statistics to compensate discriminative information. Therefore, a Relation-Aware Normalization (RANorm) with a Dynamic ReCalibration (DRC) module is designed to integrate normalized features between evolving clusters and instances efficiently. Furthermore, a novel Group-based Triplet (G-Triplet) loss is proposed to divide a batch into multiple groups with greater compactness for hard-pair mining. Extensive experiments show that our method outperforms state-of-the-art algorithms on multiple DG benchmarks by a large margin. The proposed method can also achieve superior performance on image classification tasks under DG settings without using domain labels.

Abstract:
Face-swapping technology is widely used in daily life, and users place increasingly high demands on it. Most current face-swapping methods struggle to generate high-definition face images. StyleGAN can generate high-definition face images, but face-swapping with StyleGAN remains challenging. First, the target image must be mapped to the latent space of StyleGAN. Many methods map the input image to a new latent space for face-swapping, because identity features are complex and difficult to map directly to specific latent space layers. Face-swapping is thus completed in the remapping process, which consumes excess computing resources for reconstruction. Moreover, the generated image struggles to preserve the original image's color, face attributes, background, and other attributes. We propose a new method that edits only the code in the w+ latent space of StyleGAN to complete face-swapping and generate high-definition face images. We propose a GAN inversion method to improve the effect of face swapping, which combines the advantages of convolutional networks in extracting texture features with the benefits of transformers in extracting structure features. In the latent space of StyleGAN, the low-level feature layers are dominated by structure information, and the high-level feature layers are dominated by texture information. Furthermore, we propose latent space selection, through which the neural network can learn disentangled representations of identity information in the latent space. Finally, we improve the post-processing of face swapping to keep the image's background. Because our method completes face-swapping by editing the w+ space, high-quality face images can be generated and substantial computing resources are saved on image reconstruction. At the same time, our method better preserves other attributes during face swapping.

Abstract:
Reconstructing a 3D face from a single image is a crucial task in numerous multimedia applications. Face images with ground-truth 3D face shapes are scarce, so unsupervised deep learning methods, which rely primarily on the free supervision signal derived from the visual disparity between the input image and the rendered counterpart of the predicted 3D face, have proven superior for reconstructing 3D faces. However, it is challenging for such techniques to decouple the dynamic 3D face properties such as pose or expression from a single 2D image, especially when similar local visual appearance changes can be caused by both pose and expression motion, resulting in imprecise 3D face reconstruction. In this article, a novel cycle-consistency in dynamic 3D face characteristics is introduced as a free supervisory signal for learning accurate 3D face shapes from unlabeled facial images. The main idea of cycle-consistency is to explicitly inject the head pose or facial expression variation between video frames into a face image, and then to extract and reverse the injected variation in order to reconstruct the face image to its original state. In our model, a CNN with multiple branches is proposed to disentangle 3D face properties like identity, expression, pose, and texture from 2D facial images, one branch for each 3D face property. During training, our model learns to completely decouple the dynamic 3D face properties (pose and expression) to be useful for performing cycle-consistent face reconstruction. Extensive experiments demonstrate the superiority of our approach. On the challenging AFLW2000-3D, MICC Florence, and NoW datasets, our method outperforms or is on par with the state of the art.

Abstract:
Image-based fashion design with AI techniques has attracted increasing attention in recent years. We focus on the reference-based fashion design task, where we aim to combine a reference appearance image and a clothing image to generate a new fashion clothing image. Although existing diffusion-based image translation methods have enabled flexible style transfer, it is often difficult to transfer the appearance of the image realistically during reverse diffusion. When the referenced appearance domain greatly differs from the source domain, it often leads to the collapse in the translation. To tackle this issue, we present a novel diffusion model-based unsupervised structure-aware transfer method, namely DiffFashion. Our method is free of model tuning and structure-preserving and has high flexibility in transferring from images with large domain gaps. Specifically, based on the optimal transport properties, we keep a shared latent across the clothing image and reference appearance image to bridge the gap between the two domains in the denoising process, and the latent of the reference image is gradually adapted to the clothing domain. Simultaneously, the structure is transferred from the source clothing to the output fashion image with mixed guidance, including pre-trained Vision Transformer (ViT) guidance and a foreground mask guidance, to further preserve the structure and appearance semantics from source and reference images. Our experimental results show that the proposed method outperforms state-of-the-art baseline models, generating more realistic images in the fashion design task.

Abstract:
Multi-Modal Emotion Recognition in Conversations (MMERC) is an increasingly active research field that leverages multi-modal signals to understand the feelings behind each utterance. Modeling contextual interactions and multi-modal fusion lie at the heart of this field, with graph-based models recently being widely used for MMERC to capture global multi-modal contextual information. However, these models generally mix all modality representations in a single graph, and utterances in each modality are fully connected, potentially ignoring three problems: 1) the heterogeneity of the multi-modal context, 2) the redundancy of contextual information, and 3) over-smoothing of the graph networks. To address these problems, we propose a Structure Aware Multi-Graph Network (SAMGN) for MMERC. Specifically, we construct multiple modality-specific graphs to model the heterogeneity of the multi-modal context. Instead of fully connecting the utterances in each modality, we design a structure learning module that determines whether edges exist between the utterances. This module reduces redundancy by forcing each utterance to focus on the contextual ones that contribute to its emotion recognition, acting like a message propagating reducer to alleviate over-smoothing. Then, we develop the SAMGN via Dual-Stream Propagation (DSP), which contains two propagation streams, i.e., intra- and inter-modal, performed in parallel to aggregate the heterogeneous modality information from multi-graphs. DSP also contains a gating unit that adaptively integrates the co-occurrence information from the above two propagations for emotion recognition. Experiments on two popular MMERC datasets demonstrate that SAMGN achieves new State-Of-The-Art (SOTA) results.

Abstract:
Transformers have been used for 3D human pose estimation with excellent performance; however, most transformers focus on encoding the global spatio-temporal correlation of all joints in the human body and there are few studies on the local spatio-temporal correlation of each joint in the human body. In this article, we propose a Global and Local Spatio-Temporal Encoder (GLSTE) to model the spatio-temporal correlation. Specifically, a Global Spatial Encoder (GSE) and a Global Temporal Encoder (GTE) are constructed to capture the global spatial information of all joints in a single frame and the global temporal information of all frames, respectively. A Local Spatio-Temporal Encoder (LSTE) is constructed to capture the spatial and temporal information of each joint in the local N frames. Furthermore, we propose a parallel attention module with weight sharing to better incorporate spatial and temporal information into each node simultaneously. Extensive experiments show that GLSTE outperforms state-of-the-art methods with fewer parameters and less computational overhead on two challenging datasets: Human3.6M and MPI-INF-3DHP. Especially in the evaluation on the Human3.6M dataset, the results of our method with 27 frames as input are better than the vast majority of recent SOTA methods with 81 and 243 frames as input, which indicates that the model can learn more useful information with smaller inputs.

Abstract:
With the rapid development of deep generative models (such as Generative Adversarial Networks and Diffusion models), AI-synthesized images are now of such high quality that humans can hardly distinguish them from pristine ones. Although existing detection methods have shown high performance in specific evaluation settings, e.g., on images from seen models or on images without real-world post-processing, they tend to suffer serious performance degradation in real-world scenarios where testing images can be generated by more powerful generation models or combined with various post-processing operations. To address this issue, we propose a Global and Local Feature Fusion (GLFF) framework to learn rich and discriminative representations by combining multi-scale global features from the whole image with refined local features from informative patches for AI-synthesized image detection. GLFF fuses information from two branches: the global branch to extract multi-scale semantic features and the local branch to select informative patches for detailed local artifacts extraction. Due to the lack of a synthesized image dataset simulating real-world applications for evaluation, we further create a challenging fake image dataset, named DeepFakeFaceForensics (DF^3), which contains 6 state-of-the-art generation models and a variety of post-processing techniques to approach the real-world scenarios. Experimental results demonstrate the superiority of our method to the state-of-the-art methods on the proposed DF^3 dataset and three other open-source datasets.

Abstract:
Multimodal representation aims to integrate information from multiple modalities to improve overall performance. Recent works utilizing pairwise interactions have been proposed to deal with the long-range inter-modal and intra-modal dependencies in modeling multimodal data. However, these works usually feature high model complexity, and they are not robust to noisy multimodal data. To address these problems, we propose a novel multimodal representation method that learns private and hub representations of modalities. These representations and their connections form a star graph, a basis for Star Graph-based Interaction (SGI). SGI not only captures the long-range dependencies in multimodal data but also has two natural properties. Firstly, the number of modal interactions increases linearly with the number of modalities, which is computationally efficient compared with the square increase rate of pairwise interactions in previous works. Secondly, the indirect modal interactions through the hub representation in SGI (rather than the direct pairwise interactions between modalities) ensure the model's robustness to noisy modalities. Experiments on five benchmark datasets demonstrate that our new SGI representation (SGIR) achieves state-of-the-art performance on various multimodal tasks, and our qualitative and quantitative analyses show the excellent generalization ability of SGIR. Further experiments reveal that SGIR still outperforms widely used baseline models when modalities are corrupted by low levels of noise.
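
The linear-versus-quadratic claim can be checked with a short counting sketch. It assumes that one "interaction" corresponds to one undirected edge, with each modality connected once to the hub in the star graph and every unordered pair of modalities connected in the pairwise scheme; the function names are hypothetical.

```python
def pairwise_interactions(num_modalities):
    """Number of unordered modality pairs: m * (m - 1) / 2."""
    return num_modalities * (num_modalities - 1) // 2


def star_interactions(num_modalities):
    """Number of hub-to-modality edges in a star graph: m."""
    return num_modalities


if __name__ == "__main__":
    for m in (2, 3, 5, 10):
        print(m, pairwise_interactions(m), star_interactions(m))
```

For 10 modalities, this gives 45 pairwise interactions against 10 star-graph interactions, and the gap widens as more modalities are added.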

Abstract:
For robust and effective tracking, most efforts strive to design a powerful representation target model, while we are inspired by the idea of “knowing oneself and knowing others” to focus on both the target and non-target features. In this work, we propose a unit correlation with interactive feature tracker (UCIF), which utilizes feature interaction and independent correlation operation to improve robustness and effectiveness. Specifically, we first propose a feature integration network, in which the feature enhancement module concentrates on enhancing the tracker's representation ability for both target and non-target. The feature interaction module is in charge of completing the interactive learning between target and non-target features. Then, considering the potential risk of blurring spatial information in regular correlation operation, a unit correlation network is presented, where the convolution sampling strategy can integrate the target features as well as reduce the computation costs. The unit kernel for correlation operation can protect the target spatial information. The channel ranking module suppresses background interference via weight assignment. Extensive experiments are conducted on both the short-term and long-term challenging benchmarks, including OTB2015, NFS, UAV123, TrackingNet, GOT-10k, TLP, LaSOT and VOT-LT2019. Our tracker achieves remarkable performance in robustness and effectiveness.

Abstract:
Domain adaptation has been extensively explored as a means of transferring knowledge from the labeled source domain to the unlabeled target domain with disparate data distributions. However, the absence of target annotations and significant domain discrepancies pose a great challenge to transfer knowledge directly from source domain to target domain. To address this challenge, we propose a Progressive Fourier Adversarial Domain Adaptation (PFADA) framework, an effective and versatile framework which can generalize across multiple domain adaptation tasks. Firstly, we propose a Fourier-based style transfer strategy to generate a Fourier intermediate domain that incorporates source images with target domain-specific styles, while preserving the domain-invariant representations of the source data. Secondly, we introduce a progressive adversarial domain adaptation approach that utilizes the Fourier intermediate domain to facilitate the learning of domain-invariant representations. Finally, we present cross-domain semantic alignment and discriminative enhancement approach, which effectively guides the learning of discriminative cross-domain representations utilizing labeled source and intermediate domain data. Extensive experimental evaluations consistently validate the superior performance of the proposed method across diverse visual tasks, encompassing multiple domain adaptive image classification and retrieval scenarios.
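
A Fourier-based style transfer of this kind is often realized by keeping the phase spectrum of the source image and borrowing low-frequency amplitude from a target image. The sketch below follows that common recipe as an assumption; the abstract does not give PFADA's exact rule, and the function name `fourier_style_transfer` and the window parameter `beta` are hypothetical.

```python
import numpy as np


def fourier_style_transfer(src, tgt, beta=0.1):
    """Replace the low-frequency amplitude of `src` with that of `tgt`.

    src, tgt: float arrays of identical shape (H, W) or (H, W, C).
    beta: half-size of the swapped window as a fraction of H and W.
    """
    f_src = np.fft.fft2(src, axes=(0, 1))
    f_tgt = np.fft.fft2(tgt, axes=(0, 1))
    amp_src, pha_src = np.abs(f_src), np.angle(f_src)
    amp_tgt = np.abs(f_tgt)

    # Move the zero frequency to the centre so the window is easy to index.
    amp_src = np.fft.fftshift(amp_src, axes=(0, 1))
    amp_tgt = np.fft.fftshift(amp_tgt, axes=(0, 1))
    h, w = src.shape[:2]
    bh, bw = int(h * beta), int(w * beta)
    ch, cw = h // 2, w // 2
    amp_src[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1] = \
        amp_tgt[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1]
    amp_src = np.fft.ifftshift(amp_src, axes=(0, 1))

    # Source phase (content) combined with the mixed amplitude (style).
    mixed = amp_src * np.exp(1j * pha_src)
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.random((8, 8))
    target = rng.random((8, 8)) + 1.0
    print(fourier_style_transfer(source, target, beta=0.1).shape)
```

The phase carries most of the spatial structure, so the output keeps the source content (and its labels) while its global appearance moves toward the target domain; a larger `beta` transfers more of the target's amplitude spectrum.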

Abstract:
Skeleton-based action recognition is crucial for natural human-computer interaction, dynamic behavior analysis, and behavior surveillance. The key challenge is to effectively capture the intrinsic local-global clues of the activity. However, it remains challenging to efficiently leverage multidimensional information related to joints' local visual appearances, global spatial relationships, and coherent temporal cues. To address this challenge, we propose a joints-centered spatial-temporal feature-fused framework for action recognition, which exploits skeleton-based graph diffusion and convolution. Specifically, we employ Partial Differential Equation (PDE) based skeleton graph diffusion to automatically activate and diffuse the salient appearance features of joints. This approach simultaneously integrates the joints' appearance clues and their hierarchical relationships at both the super-pixel level and structure level. The diffused appearance-related features of the joints are further fused with skeleton-related spatial-temporal features, and the resulting fused features are fed into a skeleton convolution network for action recognition. Our method was extensively evaluated on two public datasets (NTU-RGBD and UWA3D), and the results demonstrate the improved accuracy and effectiveness of our approach. Our code will be public.

Affiliations: Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China; Regional Medical Center for National Institute of Respiratory Diseases, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, China; AI Lab at Lenovo Research, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Department of Medical Oncology, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University in Hangzhou, Hangzhou, China

Abstract:
In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods have overlooked the importance of cross-modal alignment in joint image-text reconstruction, resulting in insufficient cross-modal interaction. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction, while a Global and Local Alignment (GLA) module is designed to assist self-supervised paradigm in obtaining semantic representations with rich domain knowledge. Furthermore, we introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction and fuse the multi-modal representations adequately. Experimental results demonstrate that the proposed unified approach outperforms previous methods in all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.

Abstract:
The emerging trend of AR/VR places great demands on 3D content. However, most existing software requires expertise and is difficult for novice users to use. In this paper, we aim to create sketch-based modeling tools for user-friendly 3D modeling. We introduce Reality3DSketch with a novel application of an immersive 3D modeling experience, in which a user can capture the surrounding scene using a monocular RGB camera and can draw a single sketch of an object in the real-time reconstructed 3D scene. A 3D object is generated and placed in the desired location, enabled by our novel neural network with the input of a single sketch. Our neural network can predict the pose of a drawing and can turn a single sketch into a 3D model with view and structural awareness, which addresses the challenge of sparse sketch input and view ambiguity. We conducted extensive experiments on synthetic and real-world datasets and achieved state-of-the-art (SOTA) results in both sketch view estimation and 3D modeling performance. According to our user study, our method of performing 3D modeling in a scene is >5x faster than conventional methods. Users are also more satisfied with the generated 3D model than with the results of existing methods.

Abstract:
Deep neural networks (DNNs) have achieved state-of-the-art performance on face recognition (FR) tasks in the last decade. In real scenarios, the deployment of DNNs requires taking various face accessories into consideration, like glasses, hats, and masks. In the COVID-19 pandemic era, wearing face masks is one of the most effective ways to defend against the novel coronavirus. However, DNNs are known to be vulnerable to adversarial examples with a small but elaborated perturbation. Thus, a facial mask with adversarial perturbations may pose a great threat to the widely used deep learning-based FR models. In this paper, we consider a challenging adversarial setting: targeted attack against FR models. We propose a new stealthy physical masked FR attack via adversarial style optimization. Specifically, we train an adversarial style mask generator that hides adversarial perturbations inside style masks. Moreover, to ameliorate the phenomenon of sub-optimization with one fixed style, we propose to discover the optimal style given a target through style optimization in a continuous relaxation manner. We simultaneously optimize the generator and the style selection for generating strong and stealthy adversarial style masks. We evaluated the effectiveness and transferability of our proposed method via extensive white-box and black-box digital experiments. Furthermore, we also conducted physical attack experiments against local FR models and online platforms.

Abstract:
Single image reflection removal (SIRR) aims at eliminating unwanted interference caused by the reflection of transparent or smooth surfaces and obtaining an estimation of a clear transmission layer. Existing data-driven methods typically rely on decomposing the observed image into transmission and reflection layers, which neglects the physical generation principles of an image with reflections, thus leading to unsatisfactory results, especially in strong reflection regions. To address this issue, in this work, we analyze the imaging process of reflection images from a physical perspective and conclude that a physical quantity, the illuminance of the reflection layer, determines the reflection intensity. Then a two-stage reflection intensity-guided network (RINet) is proposed for reflection removal and transmission recovery. The key to the first stage is the parallel modules that generate the reflection intensity map and the transmission layer. In the second stage, besides utilizing this intensity map as guidance, we additionally calculate the gradient field as another prior to facilitate the final reflection removal. Specifically, we design a dual-flow joint learning module (JLM) comprising a transmission recovery branch and a gradient optimization branch that jointly optimizes image structures and details by exploiting the interactions between transmission and gradient features. In particular, guided by the reflection intensity map, the transmission recovery branch can dynamically focus on removing reflections. Equipped with the two-stage framework, our RINet constitutes a divide-and-conquer process to achieve effective transmission recovery and reflection removal. Experimental results on public datasets demonstrate the superiority of the proposed method over recent state-of-the-art methods.

Abstract:
The rapid proliferation of social media data has led to the widespread dissemination of multi-modal fake news, prompting researchers to develop novel detection methods. Most fake news detection approaches mine the rich context information, including the text and image content of news and associated comments. However, existing methods are often insufficient to filter out irrelevant contexts, such as noisy words, redundant image regions, and spam comments, which may introduce noise into the model. Particularly, these approaches struggle to handle comments, which often contain the most severe noise. In many cases, only a minuscule portion of comments is relevant to the news. To overcome these limitations, our research introduces a novel Comment-Context Dual Collaborative Masked Transformer Network (C^2DCMTN). To handle the irrelevant contexts, we propose a Multi-modal Masked Transformer Network. This network extends the traditional Transformer with a mask mechanism capable of dynamically obscuring irrelevant multi-modal context information. To effectively deal with comments, which can suffer from more severe noise issues, we have designed a Comment-Context Encoder that focuses solely on the most crucial comments. Comprehensive experiments on two publicly available real-world datasets confirm that C^2DCMTN outperforms state-of-the-art methods.

Abstract:
Detection transformers such as DETR (Carion et al., 2020) have recently exhibited promising performance for many object detection tasks, but the generalization ability of those methods is still quite limited for cross-domain adaptation scenarios. To address the cross-domain issue, a straightforward method is to perform token alignment with adversarial training in transformers. However, its performance is often unsatisfactory because the tokens in detection transformers are quite diverse and represent different spatial and semantic information. In this paper, we propose a new method for cross-domain detection transformers called spatial-aware and semantic-aware token alignment (SSTA). Specifically, we take advantage of the characteristics of cross-attention as used in the detection transformer and propose spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA) strategies to guide the token alignment across domains. For spatial-aware token alignment, we extract the information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries. For semantic-aware token alignment, we inject the category information into the cross-attention map and construct domain embedding to guide the learning of a multi-class discriminator to model the category relationship and achieve category-level token alignment during the entire adaptation process. We conduct extensive experiments on several widely-used benchmarks, and the results clearly show the effectiveness of our proposed approach over existing state-of-the-art methods.

Abstract:
Deep learning has been widely studied for processing and understanding multimedia data, and it does help improve performance. Recent research has shown that deep models are vulnerable to images containing adverse weather corruptions, leading to a safety risk for numerous safety-critical systems (e.g., autonomous driving systems). There are two problems with the current situation. First, collecting data under different weather scenarios is highly difficult in practice. Second, the performance degrades significantly when the training and test data are from different distributions, as exemplified by the weather-corrupted test data. As a result, it is challenging to train a model without access to images containing variations of various weather conditions, and it is difficult to make a trained model generalize to unknown data under different weather conditions. In this paper, we introduce a Counterfactual Representation Learning (CRL) method to address these problems. Without access to training data including weather condition variations, our CRL makes the model resistant to unseen test data that has been corrupted by weather condition variations. Our basic idea is inspired by the perspective of counterfactual regularization. We build a causal model that introduces a counterfactual variable to eliminate the unobserved characteristics brought about by weather conditions. In particular, such a counterfactual variable is approximated by randomly shuffled features, echoing the previous empirical observation that the shuffling technique can perturb the shape details while preserving the local textures. We use information theoretic representation learning to encourage the neural networks to learn more powerful and robust features, which consist of two components. We conduct experiments on five benchmark datasets, namely, CIFAR-100-C, ImageNet-C, KITTI-C, BDD100K, and CityScapes-C, all of which contain weather corruption. The results of our experiments show that our proposed method can not only be a plug-and-play technique but also work nicely for both object recognition and detection.

Abstract:
Multi-view subspace clustering aims to cluster the data lying in a union of subspaces with low dimensions. The commonly used spectral clustering performs the final clustering based on an n × n affinity graph, which suffers from relatively high time and space complexity. Some existing works have chosen key anchors with a uniform sampling strategy or K-means for dealing with large-scale datasets. However, few of them pay attention to the physical meaning of cluster representation in the column of the dataset for learning informative anchors, which is independent from the instance representation. In this paper, we propose efficient dual multi-view clustering (EDMC) with relatively low complexity. To be specific, EDMC makes full use of cluster representation space in the column of the dataset to help produce informative anchors, which has a clear physical meaning and is independent of instance representation in the row. It simultaneously explores the cluster and instance subspace representations to learn anchors for large-scale datasets. We perform anchor learning and efficient multi-view clustering in a unified framework and then adopt an alternating optimization strategy for solving the formulated problem. Extensive experiments performed on different datasets in terms of several metrics validate the superiority of the proposed method.
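
The complexity argument above can be made concrete with a small piece of arithmetic. The sketch below only counts stored affinity entries for a full n × n graph versus an n × m anchor graph; the function name, the dataset size, and the anchor count are hypothetical values chosen for illustration and are not taken from the paper.

```python
def affinity_entries(n, m=None):
    """Count stored entries: full n x n graph, or n x m anchor graph if m is given."""
    return n * n if m is None else n * m


# Hypothetical sizes, for illustration only.
n, m = 100_000, 500
full = affinity_entries(n)         # entries in the full affinity graph
anchored = affinity_entries(n, m)  # entries in the anchor graph
ratio = full // anchored           # equals n // m
```

With these assumed sizes the anchor graph stores n / m times fewer entries, which is why anchor-based methods scale to large datasets where a full graph does not.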

Affiliations: AI Research Center, Xiamen Meiya Pico Information Company, Ltd., Xiamen Meiya Pico Information Security Research Institute Company, Ltd., Xiamen, China; Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Fujian Key Lab for Intelligent Processing and Wireless Transmission of Media Information, College of Physics and Information Engineering, Fuzhou University, Fuzhou, China; School of Artificial Intelligence, Anhui University, Hefei, China

Abstract:
Long-term (also called Clothing-Change) person re-identification (CC-reID) aims at confirming the identity of pedestrians captured at diverse locations and/or times. Current CC-reID methods heavily rely on ID features learned by the CNN architecture. However, with limited receptive fields, it is hard for a CNN to effectively explore some unique but discriminative ID features (e.g., hair style, tattoo and accessories) from small body regions. Compared with CNN, Transformer has certain merits in exploring more diverse ID-unique features and retaining more details through the multi-head self-attention design and the removal of the down-sampling operation. In this paper, a two-stream hybrid Convolution-Transformer Network (CT-Net) is proposed for CC-reID by combining both CNN and Transformer in parallel in an end-to-end learning scheme. Specifically, CT-Net contains a CNN-based stream (C-Stream) and a Transformer-based stream (T-Stream). Compared with using C-Stream only, T-Stream is used to encourage the C-Stream to explore more detailed ID-unique features when the clothing information is not reliable in CC-reID. Specifically, a Feature Supplement Module (FSM) is proposed to transfer features learned by T-Stream to C-Stream from low-level to high-level for mining more ID-unique features. In order to further enhance the discriminability and complementarity of ID features learned by our CT-Net, we also introduce a hierarchical supervision with bilinear pooling (HSBP). Experimental results demonstrate that CT-Net performs favorably against the state-of-the-art methods over three CC-reID benchmarks. Meanwhile, CT-Net also demonstrates good generalization ability by achieving comparable performance on traditional person re-ID datasets such as Market-1501 and DukeMTMC-reID.

Abstract:
Source-Free Domain Adaptation (SFDA) task aims to transfer knowledge from a labeled source domain to a label-scarce target domain, in which the source data can not be accessed but only a pre-trained source model and unlabeled target data are available during adaptation. Previous methods for source model adaptation rely on hypothesis transfer learning that trains the feature extractor to learn target features aligned to the distribution of source features while freezing the source classifier. However, reusing only the source classifier without exploring the comprehensive knowledge of the source model can lead to biased feature alignment. To this end, we propose a novel method called Transformer-bAsed thorouGh Source HypOthesis Transfer (TagSHOT) framework to effectively unleash the thorough knowledge potential of pre-trained source hypothesis. Specifically, our approach delves into the correlation coefficient among CLS/patch tokens across different Transformer layers, uncovering the concealed insights within the pre-trained source model and constructing a comprehensive source hypothesis. By tailoring the target feature alignment to this thorough source hypothesis, our model facilitates the adaptation of a broader range of classification-related knowledge to the target domain. Furthermore, we introduce a Salient Token Extension (STE) module, designed to capture the target-specific discriminative information by propagating the salient information among tokens. This mechanism enriches our model's ability to understand and incorporate target-specific nuances. Extensive experiments have been conducted to validate the effectiveness of our method, which outperforms state-of-the-art approaches by a large margin.

Abstract:
Domain adaptive LiDAR point cloud segmentation aims to learn an effective target segmentation model from labelled source data and unlabelled target data, which has attracted increasing attention in recent years due to the difficulty in point-cloud annotation. It remains a very open research challenge as point clouds of different domains often have clear distribution discrepancies with variations in LiDAR sensor configurations, environmental conditions, occlusions, etc. We design a simple yet effective spatial consistency training framework that can learn superior domain-invariant feature representations from unlabelled target point clouds. The framework exploits three types of spatial consistency, namely, geometric-transform consistency, sparsity consistency, and mixing consistency which capture the semantic invariance of point clouds with respect to viewpoint changes, sparsity changes, and local context changes, respectively. With a concise mean teacher learning strategy, our experiments show that the proposed spatial consistency training outperforms the state-of-the-art significantly and consistently across multiple public benchmarks.
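
The mean teacher strategy mentioned above generally keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update on plain lists of floats; the function name `ema_update` and the momentum value are assumptions for illustration, since the abstract does not specify them.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher weights moved toward the student by an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy weights; a momentum of 0.5 is used only to make the drift easy to follow.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

In a consistency-training setup the student would be trained to match the teacher's predictions on transformed inputs, while the teacher is updated only through this averaging step.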

Abstract:
Weakly-supervised Video Anomaly Detection (W-VAD) aims to detect abnormal events in videos given only video-level labels for training. Recent methods relying on multiple instance learning (MIL) and self-training achieve good performance, but they tend to focus on learning easy abnormal patterns while ignoring hard ones, e.g., unusual driving trajectory or over-speeding driving. How to detect hard anomalies is a critical but largely ignored problem in W-VAD. To tackle this challenge, we propose a novel framework, termed Abnormal Ratios guided Multi-phase Self-training (ARMS), for W-VAD. It includes a new abnormal ratio-based MIL (AR-MIL) loss and a new multi-phase self-training paradigm. The AR-MIL loss guides the learning of hard anomalies by enforcing a minimum ratio of abnormal snippets in an abnormal video and no abnormal snippets in a normal video. Our multi-phase self-training paradigm sequentially performs bootstrapping, hard anomaly mining, and adaptive self-training so as to address pseudo labeling on easy anomalies, detect hard anomalies, and set adaptive abnormal ratios for different videos in a unified framework. Experimental results on three benchmark datasets, i.e., ShanghaiTech, UCF-Crime, and XD-Violence, show that ARMS outperforms all previous state-of-the-art methods and has a great advantage in detecting hard anomalies.
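
One possible reading of a ratio-guided MIL objective is sketched below. It is not the paper's AR-MIL loss, whose exact form the abstract does not give. The assumption here is that, for an abnormal video, the top ceil(ratio × T) snippet scores are pushed toward 1, and for a normal video every snippet score is pushed toward 0. The function name, the `min_ratio` parameter, and the cross-entropy form are all hypothetical.

```python
import math


def ratio_guided_mil_loss(scores, is_abnormal, min_ratio=0.1):
    """Hypothetical ratio-guided MIL loss on per-snippet anomaly scores in (0, 1)."""
    eps = 1e-7  # avoids log(0)
    if is_abnormal:
        # Assumed: at least min_ratio of the snippets should score as abnormal.
        k = max(1, math.ceil(min_ratio * len(scores)))
        top = sorted(scores, reverse=True)[:k]
        return -sum(math.log(s + eps) for s in top) / k
    # Assumed: a normal video should contain no abnormal snippets.
    return -sum(math.log(1.0 - s + eps) for s in scores) / len(scores)


# An abnormal video with two confident snippets is penalized less than one with a single one.
confident = ratio_guided_mil_loss([0.9, 0.9, 0.1, 0.1], True, min_ratio=0.5)
sparse = ratio_guided_mil_loss([0.9, 0.1, 0.1, 0.1], True, min_ratio=0.5)
```

Under this reading, raising the ratio forces the model to score more than just the single most obvious snippet, which is one way hard anomalies could receive a training signal.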

Abstract:
Label assignment (LA) is one of the essential phases in the object detection paradigm and aims to classify samples as foreground or background. Current LA strategies generally discriminate samples by explicit thresholds and then calculate weighted losses based on their significances. However, existing methods mostly neglect to consider the importance of samples comprehensively due to the uneven distribution of objects and the limitations of detector structures. In this paper, we propose a hierarchical equalization loss (HEL) by reconsidering the underlying factors affecting sample weights. First, we mitigate sample imbalance at three progressive levels. (1) Task level. We propose task-reconciled weights (TRW) to overcome the effects caused by inter-task inconsistencies (i.e., the inherent differences of classification and localization). (2) Instance level. We propose instance-aware normalization (IAN) for reconstructing the distribution of sample weights within an instance to suppress environmental noise. (3) Pyramid level. We propose hierarchical modulation (HM) to alleviate the unbalanced distribution of multi-scale objects on feature pyramids. Then, we stack the above three mechanisms and formulate the effective weighted loss. Moreover, we propose a staggered candidate bag construction (SCBC) mechanism to further improve the robustness of our method. Without adding any extra overhead, HEL can improve the performance of representative detectors by an impressive margin. Equipped with HEL, a single “ResNet-50+FPN+Head” detector can achieve a performance of 41.9 AP on COCO under 1× schedule, outperforming other existing LA methods. Extensive experiments conducted on multiple backbones and datasets demonstrate the effectiveness of our method.

Abstract:
By training first with a large base dataset, Few-Shot Class-Incremental Learning (FSCIL) aims at continually learning a sequence of few-shot learning tasks with novel classes. There are mainly two challenges in FSCIL: the overfitting issue of novel classes with limited labeled samples and the catastrophic forgetting of previously seen classes. The current protocol of FSCIL is built by mimicking the general class-incremental learning setting by building a unified framework, while the existing frameworks for FSCIL on this protocol are always biased toward the classes in the base dataset because the dominant performance of the deep model is decided by the size of the training dataset. Moreover, it is difficult to handle the stability-plasticity constraint in a unified FSCIL framework. To solve these issues, we rethink the configuration of FSCIL with the open-set hypothesis by reserving the possibility in the first session for incoming categories. To find a better decision boundary between closed space and open space, a Hyperbolic Reciprocal Point Learning module (Hyper-RPL) is built on Reciprocal Point Learning with hyperbolic neural networks. Besides, when learning novel categories from limited labeled data, we incorporate a hyperbolic metric learning (Hyper-Metric) module into the distillation-based framework to alleviate the overfitting issue and better handle the trade-off between the preservation of old knowledge and the acquisition of new knowledge. Finally, comprehensive assessments of the proposed configuration and modules on three benchmark datasets are carried out to validate the effectiveness, and state-of-the-art results are achieved.

Abstract:
Automatic font generation is a challenging and time-consuming task, particularly in languages that consist of large amounts of characters with complicated structures. Typical component-wise font generation methods decompose the source character into components and search for them from the reference glyph set as candidate components. These candidate components are then utilized to learn the local styles of the target glyph. However, these methods overlook that the same component at different locations may have different profiles. When the candidate components locate differently from their corresponding components in the target glyph, the style of a generated glyph will look inconsistent. It is observed that for arbitrary components at two specific locations, the deformation patterns are similar. Driven by this, we present a location-aware component-deformable font generation method. Specifically, we search for candidate components and their corresponding deformative component pairs from the reference glyph set. Each deformative component pair can accurately depict how to deform the candidate component to the desired profile in the target glyph. Hence, we introduce a location-dependent deformation module to perform component warping. In this way, we significantly improve the component deformation ability. Lastly, we integrate deformed components into target glyphs while enforcing their styles to be consistent with the reference ones. Extensive experiments demonstrate that our method produces target-font consistent glyphs and outperforms the state-of-the-art on both seen and unseen fonts.

Abstract:
Visual question answering (VQA) has been intensively studied as a multimodal task, requiring efforts to bridge vision and language for correct answer inference. Recent attempts have developed various attention-based modules for solving VQA tasks. However, the performance of model inference is largely bottlenecked by visual semantic comprehension. Most existing detection methods rely on bounding boxes, remaining a serious challenge for VQA models to comprehend and correctly infer the causal nexus of contextual object semantics in images. To this end, we propose a finer model framework without bounding boxes in this work, termed Looking Out of Instance Semantics (LOIS) to address this crucial issue. LOIS can achieve more fine-grained feature descriptions to generate visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules: 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from different visual features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important word-related questions. Experimental results on four benchmark VQA datasets prove that our proposed method has favorable performance in improving visual reasoning capability.

Affiliations: Institute of Data Science, National University of Singapore, Singapore; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; School of Computing and Information Systems, Singapore Management University, Singapore; School of Computing, National University of Singapore, Singapore; School of Information Systems Technology and Design, Singapore University of Technology and Design, Singapore; Business School, National University of Singapore, Singapore; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Abstract:
Learning discriminative and robust representations is important for facial expression recognition (FER) due to subtly different emotional faces and their subjective annotations. Previous works usually address one representation solely because these two goals seem to be contradictory for optimization. Their performances inevitably suffer from challenges from the other representation. In this article, by considering this problem from two novel perspectives, we demonstrate that discriminative and robust representations can be learned in a unified approach, i.e., DR-FER, and mutually benefit each other. Moreover, we make it with the supervision from only original annotations. Specifically, to learn discriminative representations, we propose performing masked image modeling (MIM) as an auxiliary task to force our network to discover expression-related facial areas. This is the first attempt to employ MIM to explore discriminative patterns in a self-supervised manner. To extract robust representations, we present a category-aware self-paced learning schedule to mine high-quality annotated (easy) expressions and incorrectly annotated (hard) counterparts. We further introduce a retrieval similarity-based relabeling strategy to correct hard expression annotations, exploiting them more effectively. By enhancing the discrimination ability of the FER classifier as a bridge, these two learning goals significantly strengthen each other. Extensive experiments on several popular benchmarks demonstrate the superior performance of our DR-FER. Moreover, thorough visualizations and extra experiments on manually annotation-corrupted datasets show that our approach successfully accomplishes learning both discriminative and robust representations simultaneously.

Abstract:
Recently, the approaches of linguistic modeling for scene text recognition have gradually become mainstream, mainly consisting of a vision model (VM), a language model (LM), and an optional fusion module. These methods typically use LM and fusion modules to refine the results of VM-based predictions iteratively. However, the VM mainly consists of a Transformer on top of ResNet. It means the attention mechanism is only applied to the high layer of the VM, ignoring the internal image dependencies in the dense features at multiple scales. Therefore, the results in the VM become the performance bottleneck. Meanwhile, the visual and language features of these methods reside in their own space. In this way, it ignores the alignment before fusion, leading to a failure to achieve maximum information interaction. To address these issues, we propose Visual cOllaboration and duaL-stream fusion for scene TExt Recognition, VOLTER for short. Firstly, a multi-stage Local-Global Collaboration Vision Model (LGC-VM) is constructed to focus on both local and global features at multiple scales, breaking vision bottlenecks to provide a better vision prediction. Secondly, to explicitly align the feature space of VM and LM, we introduce a Vision-Language Contrastive (VLC) module by encouraging positive vision-language pairs to have similar representations. Moreover, a Dual-Stream Feature Enhancement (DSFE) module is proposed for the unidirectional interaction of visual-language features to alleviate the alignment problem of different modalities and execute fusion further. Extensive experiments on benchmark datasets demonstrate that the proposed framework can achieve state-of-the-art performance.

Abstract:
Taking the steganalytic discriminators as the adversaries, the existing Generative Adversarial Networks (GAN)-based steganographic approaches learn the implicit cost functions to measure the embedding distortion for steganography. However, the steganalytic discriminators in these approaches are trained by the stego-samples with insufficient diversity, and their network structures offer very limited representational capacity. As a result, these steganalytic discriminators will not exhibit robustness to various steganographic patterns, which leads to learning suboptimal cost functions, thus compromising the anti-steganalysis capability. To address this issue, we propose a novel GAN-based steganographic approach, in which the Diversified Inverse-Adversarial Training (DIAT) strategy and the Steganalytic Feature Attention (SteFA) structure are designed to train a robust steganalytic discriminator. Specifically, the DIAT strategy provides the steganalytic discriminator with an expanded feature space by generating diversified adversarial stego-samples; the SteFA structure enables the steganalytic discriminator to capture more diverse steganalytic features by employing the channel-attention mechanism on higher-order statistics. Consequently, the steganalytic discriminator can build a more precise decision boundary to make it more robust, which facilitates learning a superior steganographic cost function. Extensive experiments demonstrate that the proposed steganographic approach achieves promising anti-steganalysis capability over state-of-the-art methods under the same embedding payloads.

Abstract:
Weakly-supervised temporal action localization aims to localize action instances from untrimmed videos with only video-level labels. Due to the lack of frame-wise annotations, most methods embrace a localization-by-classification paradigm. However, the large supervision gap between classification and localization hinders models from obtaining accurate snippet-wise classification sequences and action proposals. We propose a snippet-to-prototype contrastive consensus network (SPCC-Net) to simultaneously generate feature-level and label-level supervision information to narrow the supervision gap between classification and localization. Specifically, the network adopts a two-stream framework incorporating the optical flow and fusion streams to fully leverage the motion and complementary information from multiple modalities. Firstly, the snippet-to-prototype contrast module is executed within each stream to learn prototypes for all categories and contrast them with action snippets to guarantee intra-class compactness and inter-class separability of snippet features. Secondly, for generating accurate label-level supervision information through complementary information of multimodal features, the multi-modality consensus module ensures not only category consistency through knowledge distillation but also semantic consistency through contrastive learning. Finally, we introduce the auxiliary multiple instance learning (MIL) loss to alleviate the issue that existing MIL-based methods only localize sparse discriminative snippets. Extensive experiments are conducted on two public datasets, THUMOS-14 and ActivityNet-1.3, to demonstrate the superior performance of our method over state-of-the-art methods.

Abstract:
Along with the concern of similarity measures in linear space, supervised online hash methods have been applied to the retrieval task. However, they ignore multi-dimensional space semantic mining and association characteristics, which causes quantization errors in the hash codes: 1) The similarity relation of discretized data needs to be considered in different spaces; 2) Latent semantic features need to be continuously embedded into hash code learning; 3) The correlation between the structure similarity and discrete hash matrices needs to be continuously optimized. To tackle these challenges, this paper proposes a novel Extensible Max-min Collaborative Retention Online Hash retrieval method based on mini-batch training data (EMCROH). It mainly includes the Max-min Bayesian Similarity Sparse Latent Hash module (MBSSLH), and the Repetition Collaborative Projection Learning module (RCPL). Specifically, MBSSLH is a max-min optimization model. Firstly, to explore the semantic similarity of multi-dimensional space, we propose a novel linear and nonlinear semantic similarity discrimination mechanism based on the log maximum likelihood similarity estimation with Euclidean space and minimize the input batch data features with a common projection matrix. Moreover, to further mine the potential semantic information of the discretization, we also propose a robust sparse discrete latent semantic information extraction submodule based on double latent factors. RCPL can extend the data externally using the repetition collaborative projection matrix with a robustness regularization constraint. Finally, a novel max-min embedding iterative step is proposed to solve the batch discrete optimization problem based on Augmented Lagrange Multipliers (ALM) with Alternating Direction Minimization (ADM). Extensive experiments on several well-known large databases demonstrate that EMCROH outperforms the state-of-the-art hash methods.

Abstract:
Hand pose and shape estimation plays an important role in numerous applications. A cost-effective and practical approach is to perform accurate hand estimation from a single RGB image, but this task is challenging due to ubiquitous hand self-occlusion and hand-object interaction occlusions. In this paper, we propose a novel SPMHand network to alleviate the effect of occlusions, inspired by the process by which humans infer the whole hand when the hand is occluded. The proposed SPMHand consists of two main modules to generate hand segmentations as guidance and conduct hand regressions in a progressive multi-path manner. The segmentation-guided deocclusion module enables the network to “see” the occluded hand by inferring the whole hand segmentation. Specifically, the visible hand segmentation is first obtained and then a hand morphology attention block is introduced to infer the whole hand segmentation by fusing the visible information with the learned hand features. The progressive multi-path regression module is designed to gradually regress the fine hand with intermediate supervisions. Features from deep to shallow are utilized for the hand regressions from coarse to fine. Subsequently, the structure feature, joint heatmaps and segmentations that provide guidance for deocclusion are embedded and fused for the final fine hand regression. Experiments on four challenging datasets illustrate that the proposed SPMHand outperforms the state of the art in both real-world and synthetic scenes, especially in the presence of severe hand-object occlusions.

Abstract:
Efficient analysis of point clouds holds paramount significance in real-world 3D applications. Currently, prevailing point-based models adhere to the PointNet++ methodology, which involves embedding and abstracting point features within a sequence of spatially overlapping local point sets, resulting in noticeable computational redundancy. Drawing inspiration from the streamlined paradigm of pixel embedding followed by regional pooling in Convolutional Neural Networks (CNNs), we introduce a novel, uncomplicated yet potent architecture known as PointGL, crafted to facilitate efficient point cloud analysis. PointGL employs a hierarchical process of feature acquisition through two recursive steps. First, the Global Point Embedding leverages straightforward residual Multilayer Perceptrons (MLPs) to effectuate feature embedding for each individual point. Second, the novel Local Graph Pooling technique characterizes point-to-point relationships and abstracts regional representations through succinct local graphs. The harmonious fusion of one-time point embedding and parameter-free graph pooling contributes to PointGL's defining attributes of minimized model complexity and heightened efficiency. Our PointGL attains state-of-the-art accuracy on the ScanObjectNN dataset while exhibiting a runtime that is more than 5 times faster and utilizing only approximately 4% of the FLOPs and 30% of the parameters compared to the recent PointMLP model.

Abstract:
Text-to-image synthesis aims to generate high-quality realistic images conditioned on text descriptions. The main challenge of this task lies in deeply and seamlessly integrating image and text information. Thus, in this paper, we propose a deep multimodal fusion generative adversarial network (DMF-GAN) that allows effective semantic interactions for fine-grained text-to-image generation. Specifically, through a novel recurrent semantic fusion network, DMF-GAN could consistently manipulate global assignment of text information among isolated fusion blocks. With the assistance of a multi-head attention module, DMF-GAN could model word information from different perspectives and further improve the semantic consistency. In addition, a word-level discriminator is proposed to provide the generator with fine-grained feedback related to each word. Compared with current state-of-the-art methods, our proposed DMF-GAN could efficiently synthesize realistic and text-aligned images and achieve better performance on challenging benchmarks.

Abstract:
Deep supervised hashing techniques have exhibited remarkable efficiency in cross-modal retrieval tasks, because they enable the transformation of data from different modalities into compact binary codes that preserve semantic similarity structures. Nonetheless, existing methods often rely on pairwise or triplet relationships within known (or in-distribution) semantics during training, failing to capture the comprehensive ranking information inherent in web data that encompasses diverse concepts. In addition, these methods are vulnerable to out-of-distribution (OOD) semantic data when applied in realistic scenarios, resulting in suboptimal performance. In this paper, we propose ranking distribution preserving hashing (RDPH) to address these problems. We present a novel ranking loss, a differentiable surrogate that maximizes the NDCG metric for cross-modal retrieval. This loss incorporates two target ranking distributions derived from the ideal NDCG scores of samples and the cosine similarity of features. These distributions encourage RDPH to generate hash codes that approximate the desired inter-modal and intra-modal ranking distributions. To enhance the robustness of the hash codes against OOD data, RDPH leverages the CLIP paradigm to acquire OOD-resilient intermediate representations. Besides, we utilize the outlier exposure strategy to enhance the discriminative ability of OOD for hash codes under supervision by constructing auxiliary pseudo-OOD data from known data in feature space. Experiments on three datasets demonstrate that the proposed method achieves state-of-the-art performance on regular retrieval tasks and good results on simulated real-world retrieval tasks.
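
For readers unfamiliar with the NDCG metric mentioned above, the following is a minimal sketch of how it is commonly computed for one ranked retrieval list. It is not the RDPH ranking loss, which is a differentiable surrogate; the exponential gain `2^rel - 1` and the function names are assumptions chosen for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain using the common 2^rel - 1 gain."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances_in_ranked_order, k=None):
    """NDCG of a ranked list, given the graded relevance of each returned item."""
    ranked = list(relevances_in_ranked_order)
    ideal = sorted(ranked, reverse=True)  # best possible ordering of the same items
    if k is not None:
        ranked, ideal = ranked[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Relevance grades of the retrieved items, in the order they were returned.
    print(ndcg([3, 2, 1, 0]))   # a perfectly ordered list scores 1.0
    print(ndcg([0, 1, 2, 3]))   # the reversed list scores below 1.0
```

Because sorting is not differentiable, a loss that targets NDCG has to replace the hard ranking with a smooth approximation, which is the role of the surrogate described in the abstract.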

Abstract:
Perceptual image quality is related to content and distortion. Distortion classification is a common way to learn distortion information. How to extract distortion information consistent with human perception is a problem to be solved. Besides, the joint effect on image quality caused by the interplay of content and distortion has not been fully studied. In this paper, a novel Content Distortion Interaction Network (CDINet) is proposed for blind image quality assessment. Distortion representations are guided by content representations to learn quality-aware representations. CDINet consists of four components: a Distortion-Aware Module (DAM), a Content-Aware Module (CAM), an Asymmetric Content-Distortion Interaction (ACDI) module, and a quality regression module. The content representation and distortion representation are extracted respectively and fused interactively in CDINet. Specifically, with the assistance of image restoration, distortion representation consistent with human perception is learned. To further improve the ability in distortion representation, the DAM is used to construct the differences between the distorted image and its reference image. The proposed ACDI module enables the interaction of content and distortion representations to occur at different levels with less computational cost. Since the proposed CDINet considers the joint impact on image quality caused by the interplay of content and distortion, the predicted image qualities highly align with human perception. Comprehensive experiments on 8 benchmark datasets demonstrate that the proposed CDINet effectively extracts quality-aware representation, achieving state-of-the-art performance in evaluating both synthetically and authentically distorted images.

Abstract:
Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that the details, structures and semantics in the input can thus be well preserved. Extensive experiments on the Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-art methods.

Abstract:
Remote photoplethysmography (rPPG) is an important technique for detecting human vital signs and has received extensive attention. For a long time, researchers have focused attention on supervised methods that rely on large amounts of labeled data. These methods are limited by their need for large amounts of data and the difficulty of acquiring ground truth physiological signals. To address these issues, several self-supervised methods based on contrastive learning have been proposed. However, they focus on contrastive learning between samples, which neglects inherent self-similar priors in physiological signals and seems to have a limited ability to cope with noise. In this article, a linear self-supervised reconstruction task was designed for extracting the inherent self-similar priors in physiological signals. In addition, a specific noise-insensitive strategy was explored for reducing the interference of motion and illumination. The framework proposed in this article, rPPG-MAE, demonstrates excellent performance even on the challenging VIPL-HR dataset. We also evaluate the proposed method on two public datasets, namely, PURE and UBFC-rPPG. The results show that our method not only outperforms existing self-supervised methods but also outperforms state-of-the-art (SOTA) supervised methods. One important observation is that the quality of the dataset appears to be more important than the size of the dataset used in self-supervised pretraining of the rPPG.

Abstract:
With the advent of multi-modal data, multi-modal hashing has received increasing attention because it enables complementary multi-modal fusion and supports fast multimedia retrieval. Nevertheless, the “coarse-grained” modality weighting strategy widely used in existing methods always ignores the distinctive contributions of different features and is troubled by parameter adjustment. Besides, traditional supervised methods usually adopt “hard semantic” that reflects the logical relationship between data and labels, but fails to capture the degree to which categories describe the data. To solve these problems, we propose a multi-Facet weIghting aSymmetric Multi-modal Hashing based on latent semantic distribution (FISMH) approach, which is divided into supervised paradigm SFISMH and unsupervised paradigm UFISMH. First, we design a Multi-facet Weighted Multi-modal Fusion module that utilizes both modality- and feature-wise weights to achieve multi-modal fusion, where the weight learning requires no additional parameter adjustment. Then, we design a Latent Semantic Distribution based Asymmetric Hash Learning module, which utilizes the pair-wise similarity and semantic distribution to guide hash learning, and avoids the challenging pair-wise factorization through asymmetric form. The semantic distribution is learned from the inherent information of feature space, which can further preserve the intra-class relationships. Finally, a discrete hash optimization is developed to reduce quantization and directly learn hash codes. The main difference between SFISMH and UFISMH is that the former utilizes category information while the latter explores the underlying data structure when constructing the pair-wise similarity. Extensive experiments demonstrate that both SFISMH and UFISMH outperform existing supervised and unsupervised multi-modal hashing methods, showcasing their exceptional performance.

Abstract:
The most important effect of the video hashing technique is to support fast retrieval, which benefits from the high efficiency of binary calculation. Current video hashing approaches are thus mainly targeted at learning compact binary codes to represent video content accurately. However, they may overlook the generation efficiency for hash codes, i.e., designing lightweight neural networks. This article proposes an Efficient Unsupervised Video Hashing (EUVH) method, which is not only for computing compact hash codes but also for designing a lightweight deep model. Specifically, we present an MLP-based model, where the video tensor is split into several groups and multiple axial contexts are explored to separately refine them in parallel. The axial contexts are referred to as the dynamics aggregated from different axial scales, including long/middle/short-range dependencies. The group operation significantly reduces the computational cost of the MLP backbone. Moreover, to achieve compact video hash codes, three structural losses are utilized. As demonstrated by the experiments, the three structures are highly complementary for approximating the real data structure. We conduct extensive experiments on three benchmark datasets for the unsupervised video hashing task and show the superior trade-off between performance and computational cost of our EUVH over state-of-the-art methods.
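
The "high efficiency of binary calculation" that makes hashing attractive for retrieval can be illustrated with a small sketch: once items are represented as binary codes, similarity search reduces to XOR plus a bit count. This is a generic illustration, not part of EUVH; the helper names and the toy 4-bit codes are assumptions.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python integer code."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of differing bits: XOR the codes, then count the set bits."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest in Hamming space."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming_distance(query_code, database_codes[i]))


if __name__ == "__main__":
    query = pack_bits([1, 0, 1, 1])
    database = [pack_bits([0, 0, 0, 0]),
                pack_bits([1, 0, 1, 1]),
                pack_bits([1, 0, 0, 1])]
    print(rank_by_hamming(query, database))  # [1, 2, 0]
```

The retrieval cost is therefore dominated by cheap integer operations, which is why the abstract shifts attention to the cost of generating the codes in the first place.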

Abstract:
Language-driven action localization aims to search for a video segment in an untrimmed video that is semantically relevant to an input language query. This task is challenging since language queries describe diverse actions with different motion characteristics and semantic granularities. Some actions, such as “the person takes off their shoes, and goes to the door”, are characterized by complex motion relationships, while others, such as “a person is standing holding a mirror in one hand”, are distinguished by salient body postures. In this paper, we propose a dynamic pathway between an exploitation module and an exploration module for query-aware feature learning to handle the diversity of actions. The exploitation module works in a coarse-to-fine manner: it first learns the feature of general motion relationships to search the coarse segment of the target action and then learns the feature of subtle motion changes to predict the refined action boundaries. The exploration module functions in a point-to-area diffusion fashion: it first learns the feature of sub-action patterns to search the salient postures of the target action and then learns the feature of temporal dependency to expand the posture frames to the action segment. The exploitation module and the exploration module are dynamically and adaptively selected to learn comprehensive representations of diverse actions to improve the action localization accuracy. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method performs better than existing methods.

Abstract:
Image-text matching is vitally important in the field of multi-modal intelligence. Recent approaches decompose images and texts into local fragments and then perform region-word alignment. As a result, the image-text relevance score is given by aggregating semantic similarities between matched region-word pairs. Despite its effectiveness, this strategy fails to express data relations exactly. From the perspective of the text side, text words decomposed from a concise language sentence usually have limited contextual information, which can result in semantically identical but actually false text-region alignments. From the perspective of the image side, semantic ambiguity, in which multiple objects share the same semantic meaning, can further exacerbate this problem. In this manuscript, we introduce a mutually Textual and Visual Refinement Network (TVRN) to tackle the inaccurate cross-modal alignment problem. In a nutshell, TVRN improves inter-modal matching by improving contextual information in sentences while reducing semantic ambiguity in images to capture the maximized relevant relations. More specifically, we develop a new module that integrates visual contextual clues into the text modality to generate informational text features with richer geometric contexts. Mutually, we further design a semantic alignment enhancement module that leverages consensus affinity of local image and text features to guide deeper semantic image embedding with the supervision of global image vectors. At the image-text matching stage, similarities at the local and global levels are integrated to capture coarse-grained and fine-grained interactions between vision and language. A large number of experiments on the Flickr30K and MS-COCO benchmarks demonstrate that TVRN is superior to existing methods.

Abstract:
Existing image inpainting methods leverage convolution-based downsampling approaches to reduce spatial dimensions. This may result in information loss from corrupted images where the available information is inherently sparse, especially for the scenario of large missing regions. Recent advances in self-attention mechanisms within transformers have led to significant improvements in many computer vision tasks including inpainting. However, limited by the computational costs, existing methods cannot fully exploit the efficacy of long-range modelling capabilities of such models. In this paper, we propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel mask-aware pixel-shuffle downsampling module (MPD) to preserve the visible information extracted from the corrupted image while maintaining the integrity of the information available for high-level inferences made within the model. Moreover, we propose a Spatially-activated Channel Attention Layer (SCAL), an efficient self-attention mechanism interpreting spatial awareness to model the corrupted image at multiple scales. To further enhance the effectiveness of SCAL, motivated by recent advances in speech recognition, we introduce a sandwich structure that places feed-forward networks before and after the SCAL module. We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets, CelebA, CelebA-HQ, Places2, and Dunhuang.
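
To make the contrast with convolution-based downsampling concrete, the sketch below shows plain pixel-shuffle downsampling (also called pixel unshuffle or space-to-depth) on a single-channel grid. It is only the generic rearrangement, not the mask-aware MPD module of HINT, whose mask handling is not described in the abstract; the function name and the nested-list representation are assumptions.

```python
def pixel_unshuffle(image, r):
    """Rearrange an H x W grid into r*r channels of size (H/r) x (W/r).

    Every input value is kept exactly once, so spatial resolution drops
    without discarding any pixel.
    """
    height, width = len(image), len(image[0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    channels = []
    for dy in range(r):          # row offset inside each r x r block
        for dx in range(r):      # column offset inside each r x r block
            channels.append([[image[y][x] for x in range(dx, width, r)]
                             for y in range(dy, height, r)])
    return channels


if __name__ == "__main__":
    image = [[0, 1, 2, 3],
             [4, 5, 6, 7],
             [8, 9, 10, 11],
             [12, 13, 14, 15]]
    for channel in pixel_unshuffle(image, 2):
        print(channel)
```

Since the operation is a pure permutation of values, any visible pixel in a sparsely observed image survives the downsampling step, which is the property the abstract emphasizes.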

Abstract:
Few-shot open-set recognition, as a new paradigm, leverages a limited amount of supervised data to identify specific Remote Sensing (RS) scene categories and generalize to novel ones. However, the data bias induced by the small sample size not only causes severe overfitting within base classes, but also impairs the capacity for inference to identify RS scenes in hitherto unobserved categories. Furthermore, owing to environmental influences, RS images frequently manifest notable intra-class disparities and comparatively low inter-class distinctions, intensifying the challenge in obtaining suitable classifiers. To address the above issues, we investigate the utilization of a Multi-modal Foundational Model (MFM) infused with essential domain knowledge to mitigate the generalization limitations encountered in few-shot scenarios. Recognizing that existing MFMs with a visual-text dual-branch structure are primarily tailored for natural scenes, we propose a custom Frequency Distribution-based Multi-modal Fine-Tuning strategy (FreqDiMFT) in a parameter-efficient manner. More specifically, within the vision branch, we address the high inter-class similarity and intra-class diversity in RS images by embedding the local-global frequency distribution information to facilitate the recognition of RS scenes. To further amplify the model's generalization ability post transfer, we introduce an adaptive feature refinement module designed for Transformers, proficient in filtering redundant features resulting from domain disparities. To mitigate the domain drift on the textual branch, we adopt an input format that combines basic templates with domain expertise from the RS domain to generate more discriminative class prototypes. To fully verify the effectiveness of our FreqDiMFT in a more practical setting, we collect a Large-Scale hybrid dataset (LSRS). Extensive experiments demonstrate that, even with a scant number of training samples, our strategy yields advanced performance compared to state-of-the-art models.

Abstract:
Information diffusion prediction aims at predicting the target users in the information diffusion path on social networks. Prior works mainly focus on the observed structure or sequence of cascades, trying to predict which users the cascade will passively infect. In this study, we argue that user intent understanding is also a key part of information diffusion prediction. We thereby propose a novel Multi-scale Context-enhanced Dynamic Attention Network (MCDAN) to predict which user will most likely join the observed current cascades. Specifically, to consider the global interactive relationship among users, we take full advantage of user friendships and global cascading relationships, which are extracted from the social network and historical cascades, respectively. To refine the model's ability to understand the user's preference for the current cascade, we propose a multi-scale sequential hypergraph attention module to capture the dynamic preference of users at different time scales. Moreover, we design a contextual attention enhancement module to strengthen the interaction of user representations within the current cascade. Finally, to engage the user's own susceptibility, we construct a susceptibility label for each user based on user susceptibility analysis and use the rank of this label for auxiliary prediction. We conduct experiments over four widely used datasets and show that MCDAN significantly outperforms the state-of-the-art models. The average improvements are up to 5.41% in terms of Hits@100 and 8.47% in terms of MAP@100, respectively.
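
The Hits@100 and MAP@100 figures above are ranking metrics. The sketch below shows one common way to compute them when each prediction step has a single ground-truth next user, in which case average precision reduces to the reciprocal rank of that user within the top k. Whether MCDAN's evaluation follows exactly this protocol is an assumption, and the function names and toy user IDs are illustrative.

```python
def hits_at_k(ranked_lists, targets, k):
    """Fraction of predictions whose true user appears in the top-k candidates."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def map_at_k(ranked_lists, targets, k):
    """Mean average precision at k with one relevant user per prediction.

    With a single relevant item, AP@k is 1/rank if the item is in the
    top k and 0 otherwise.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            total += 1.0 / (top_k.index(target) + 1)
    return total / len(targets)


if __name__ == "__main__":
    ranked = [["u1", "u2", "u3"], ["u4", "u5", "u6"]]
    truth = ["u2", "u6"]
    print(hits_at_k(ranked, truth, 2))  # 0.5
    print(map_at_k(ranked, truth, 2))   # 0.25
```

Hits@k only checks membership in the top k, while MAP@k also rewards placing the true user nearer the top, which is why the two improvements reported in the abstract differ.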

Abstract:
Heavy haze leads to severely degraded visual quality for images, and thus the performance of high level image-based tasks such as object detection is deteriorated. It is necessary and important to design an effective dehazing method for the computer vision system. It is well known that image haze is a function of depth and binocular images can predict the depth. Existing binocular dehazing methods conduct disparity estimation and dehazing jointly to enhance each other. However, a small error in disparity gives rise to a large variation in depth and in the estimation of haze-free images. To alleviate the problem, we propose a plain binocular image dehazing network in this paper, called BidNet, to dehaze both the left and right images simultaneously. BidNet does not explicitly perform disparity estimation that is time-consuming and well-known to be challenging. Instead, we design a stereo transformation module to mine the relationship and correlation between binocular images, making the best of varying information of cross views. Additionally, we design a Stereo Foggy Cityscapes dataset extended from the Foggy Cityscapes dataset for training the proposed BidNet. Extensive experimental results demonstrate that BidNet significantly outperforms the SOTA dehazing methods on the synthetic stereo foggy datasets as well as in real stereo foggy scenes. Experimental results show that jointly dehazing binocular image pairs is mutually beneficial, which is better than only dehazing left images. Furthermore, when applying BidNet to preprocess foggy inputs, large improvements are obtained in the performance of object detection, instance segmentation, semantic segmentation, and stereo-based 3D object detection.
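
The claim that a small disparity error causes a large variation in depth follows from the standard stereo relation depth = focal length × baseline / disparity, and haze depends on depth through the widely used atmospheric scattering model, in which transmission decays as exp(-beta × depth). The sketch below works through this arithmetic; the camera parameters and the scattering coefficient are hypothetical values chosen only for illustration and are not taken from the paper.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Stereo triangulation: depth is inversely proportional to disparity."""
    return focal_px * baseline_m / disparity_px


def transmission(depth_m, beta):
    """Transmission in the atmospheric scattering model, t = exp(-beta * depth)."""
    return math.exp(-beta * depth_m)


def hazy_pixel(clear_value, depth_m, beta, airlight):
    """Hazy intensity I = J * t + A * (1 - t) for one pixel."""
    t = transmission(depth_m, beta)
    return clear_value * t + airlight * (1.0 - t)


if __name__ == "__main__":
    focal_px, baseline_m, beta = 2000.0, 0.2, 0.01  # hypothetical values
    for disparity in (40.0, 39.0, 4.0, 3.0):
        depth = depth_from_disparity(disparity, focal_px, baseline_m)
        print(disparity, round(depth, 2), round(transmission(depth, beta), 3))
```

With these numbers a one-pixel error at a disparity of 40 px shifts the depth by about a quarter of a metre, whereas the same error at 4 px shifts it by more than 30 m, and the estimated transmission shifts accordingly. This sensitivity for distant regions is the motivation the abstract gives for avoiding explicit disparity estimation.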

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task because the different imaging principles of visible and infrared images bring about a huge modality discrepancy. Existing methods primarily address this issue by generating intermediate images to align modality features and establish connections between the visible and infrared modalities. However, the quality of these generated images is often unstable, limiting the effectiveness of such approaches. To overcome this limitation, we propose a novel method called modality shared-specific features cooperative separation. It consists of two key modules: the saliency response module and the cooperative separation module, aimed at alleviating the modality gap. The saliency response module incorporates a location attention mechanism and local features to construct contextual connections and extract local saliency information. Then, the cooperative separation module employs a more concise dual-MLP generator to effectively separate shared-specific features. Additionally, we introduce a shared feature refinement mechanism in both the generator and discriminator. By coordinating the shared-specific features, our method achieves secondary separation and extracts purer modality-shared features without specific information. Extensive experiments conducted on the SYSU-MM01 and RegDB public datasets demonstrate that our proposed method performs excellently in VI-ReID.

Abstract:
Single-image dehazing is a challenging task that requires both local details and global distribution. Existing methods face challenges in color imbalance and inconsistent details when predicting a haze-free image, because of their limitations in generalization from a specific setting (physics-based methods), capturing global information (CNN-based methods) and capturing detailed local information (ViT-based methods). In response to these challenges, we propose a balanced low-cost hybrid network called LHNetV2 based on LHNetV1. The key insight of LHNetV2 is the effective fusion of different features, and a series of novel approaches is proposed to increase the running speed of the original LHNetV1. Firstly, building upon the Feature-aware Information Fusion method, we preserve the original Physical Embedding and Architecture Aggregation components in LHNetV1. Next, to overcome the speed bottleneck of LHNetV1, we enhance the calculation method of attention in the ViT sub-network and streamline the cross-stage interaction strategy in the CNN main-network. Finally, we introduce a dynamic adversarial loss function to bolster both the training stability and performance of LHNetV2. The experiments are extensively conducted on mainstream datasets, and the results demonstrate that LHNetV2 achieves the best balance between the performance and the running speed in single-image dehazing.

Abstract:
In recent years, multimodal topic models have gained significant attention in various tasks involving short texts. Despite their impressive results, most models rely on bag-of-words assumptions for each modality, neglecting the intrinsic word relations in the textual modality and the underlying object relations in the visual modality. To address this limitation, we propose a novel approach that represents each document modality as a graph, harnessing the word relations and the visual object relations to guide the topic extraction process. Our approach is grounded in the insight that, in the textual modality, words with specific relations, such as co-occurrence relations, semantic relations and syntactic relations, are more likely to be assigned to the same topic. Similarly, in the visual modality, the relations between objects, such as spatial relations and contextual relations, can also provide valuable information for topic extraction. By leveraging graph-based representations, our model captures the inherent associations between words and visual objects, resulting in the generation of more coherent and interpretable topics. To infer the model's parameters, we develop an effective algorithm that integrates neural variational inference and contrastive learning. The experimental results on three datasets verify the effectiveness of our proposed model in terms of topic coherence, topic diversity and mean average precision, confirming that incorporating word relations and object relations through graph-based representations significantly enhances the quality of the extracted topics.

Abstract:
Transformer-based semantic segmentation has developed rapidly. Vision transformers (ViTs) rely on a self-attention mechanism that employs all image patches to compute long-range dependencies. ViT considers all tokens equally important for self-attention calculation. Nevertheless, it has been shown that image tokens contribute differently to the final prediction. In this paper, we propose a token filtration method to select informative tokens. These informative tokens are then used to reweight the token sequence so that the transformer can focus on important image tokens for more accurate prediction. Meanwhile, due to the lack of local information, transformer-based segmentation usually yields incomplete object structures and coarse boundaries. To this end, a segmentation refinement method is introduced to refine transformer segmentation results. The refinement method integrates transformer outputs with convolutional features of the input image to generate refined predictions. Finally, we introduce the token filtration and refinement network (TFRNet), which adopts the proposed token filtration method and the refinement method to improve segmentation performance. We evaluate the proposed TFRNet on the ADE20K and Cityscapes datasets. Experimental results show that the proposed method outperforms other state-of-the-art approaches.

Abstract:
Conventional domain adaptive (DA) methods for person re-identification (ReID) face knowledge transfer challenges when labeled data from the source domain cannot be accessed due to privacy constraints. Although the methods operating under source-absent DA settings attempt to address this challenge by using models pretrained on the source domain in their mutual teaching frameworks, failing to capture domain divergence in scenarios in which the source data are completely inaccessible can simultaneously introduce issues related to mutual convergence. In response, we introduce an adversarial attacks-driven mutual teaching (AAMT) framework as an innovative and applicable source-free DA person ReID scheme. Specifically, we first carefully develop a perturbation generator to generate source-style adversarial examples by leveraging a pretrained source model. Then, these diverse adversarial examples are employed to attack the mutual teaching model, implicitly measuring the domain divergence. Accordingly, we design a contrastive learning loss to enlarge the differences between the training pairs and further mitigate the mutual convergence issue. Extensive experiments demonstrate that AAMT outperforms the existing methods under both conventional and source-absent DA settings, achieving state-of-the-art performance.

Abstract:
Unsupervised adversarial domain adaptation (ADA) aims to learn domain-invariant features by confusing a domain discriminator. As training goes on, the feature distributions of source and target samples become increasingly aligned and indistinguishable. The discrimination capability of the domain discriminator w.r.t. those aligned samples deteriorates because the domain label of each sample remains fixed throughout the learning process, and thus the discriminator cannot effectively drive further feature learning. A recently proposed method named Re-enforceable Adversarial Domain Adaptation (RADA) [1] tends to re-energize the domain discriminator during training by using dynamic domain labels. Specifically, RADA sets up a heuristic criterion and uses it to relabel the well-aligned target domain samples as source domain samples on the fly. In our study, we identify a critical problem of RADA: it is a heuristic domain data re-partition solution that does not explicitly serve the adaptation task itself, which makes the criterion for which samples should be relabeled hard to decide. To address the problem, we revisit the domain relabeling process from the perspective of prompt tuning and introduce meta-optimized learnable prompts into RADA to replace some hand-crafted designs in the dynamic relabeling process; the resulting scheme is named RADA-prompt. In particular, we employ a meta-prompter module, which learns to adaptively relabel the samples based on the objective of serving the UDA task. To train the meta-prompter, we leverage a domain alignment measurement and a classification measurement as the meta-optimization objective. Extensive experiments on multiple unsupervised domain adaptation benchmarks demonstrate the effectiveness and superiority of RADA-prompt, which also achieves state-of-the-art performance.

Abstract:
Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks, such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features into a joint feature space. However, multi-modal clue features are usually rather weakly correlated due to the semantic gap from the modality-specific uniqueness. As a result, these methods are not suitable for person clustering. In this article, we propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues. The distribution representation of a clue is a vector consisting of the relation between this clue and all other clues from all modalities, thus being modality agnostic and good for person clustering. Accordingly, we introduce a graph-based method to construct distribution representation and employ a cyclic update policy to refine distribution representation progressively. Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and VoxCeleb2 multi-view clustering dataset, respectively.

Abstract:
Recent mask proposal models have significantly improved the performance of open-vocabulary semantic segmentation. However, the use of a ‘background’ embedding during training in these methods is problematic as the resulting model tends to over-learn and assign all unseen classes as the background class instead of their correct labels. Furthermore, they ignore the semantic relationship of text embeddings, which arguably can be highly informative for open-vocabulary prediction as some classes may have close relationships with other classes. To this end, this article proposes novel class enhancement losses to bypass the use of the ‘background’ embedding during training, and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking the similarity scores. To further capture the relationship between base and novel classes, we propose an effective pseudo label generation pipeline using the pretrained vision-language model. Extensive experiments on several benchmark datasets show that our method achieves overall the best performance for open-vocabulary semantic segmentation. Our method is flexible, and can also be applied to the zero-shot semantic segmentation problem.

Abstract:
Generative neural radiance fields (NeRF) have brought image generation into the 3D era, delivering impressive generation quality and 3D consistency, especially in the face generation domain. Building upon pre-trained generative NeRF models, 3D-aware image editing has been explored and has achieved promising performance via manipulating semantic maps or attributes. However, a more flexible editing interface, text, remains under-explored in the context of 3D-aware image editing. In this work, we leverage the Contrastive Language-Image Pre-training (CLIP) model to achieve 3D-aware image editing in pre-trained generative NeRF models given a target text prompt. To achieve accurate and controllable geometry editing, we propose MorphNeRF, a learnable morphing network that morphs the 3D geometry of images toward the target descriptions via generative NeRF. Different from prior studies that achieve image editing by manipulating latent codes or directly finetuning pre-trained models, morphing the geometry can better preserve the texture of the source image and facilitate the control of editing strength by adjusting the weight of morphing maps explicitly. Extensive experiments and comparisons show that the proposed MorphNeRF achieves superior image editing performance.

Abstract:
Image dehazing is a pivotal preliminary step in the advancement of robust intelligent surveillance systems. However, it is an extremely challenging ill-posed problem, as it faces severe information degradation when accurately restoring the clean image from its haze-polluted counterpart. This paper proposes a novel Progressive Negative Enhancing (PNE) contrastive learning mechanism to fully exploit various types of negative information, thereby facilitating the traditional positive-oriented objective function for image dehazing. The proposed method can progressively update the negative samples during model training, to steadily squeeze the restored image towards its desired clean target from various directions. Furthermore, considering the image dehazing task as a many-to-one feature mapping problem, we also make an early effort to enhance the robustness of the dehazing model under varying haze densities. Specifically, a novel density-variational dehazing network is proposed to be optimized under the consistency-regularized framework using the proposed PNE learning mechanism. The consistency regularization ensures consistent output given multi-level degraded hazy images, thereby significantly enhancing the robustness of the model in dealing with various hazy scenarios. Extensive experiments demonstrate that the proposed method exhibits superior performance over existing state-of-the-art methods. It achieves average PSNR boosts of 0.60 dB, 0.28 dB and 0.82 dB on dehazing, deraining and desnowing tasks, respectively.

Abstract:
We investigate rate-distortion-computing optimized live 360° video streaming to heterogeneous mobile VR clients in 5G networks. The client population comprises devices that feature single (LTE) or dual (LTE/NR) cellular connectivity. The content is compressed using scalable 360° tiling at the origin and sent towards the clients over a single backbone network link. A mobile edge server then adapts the incoming streaming data to the individual clients and their respective down-link transmission rates using formal rate-distortion-computing optimization. Single connectivity clients are served by the edge server a baseline representation/layer of the content adapted to their down-link transmission capacity and device computing capability. A dual connectivity client is served in parallel a baseline content layer on its LTE connectivity and a complementary viewport-specific enhancement layer on its NR connectivity, synergistically adapted to the respective down-links' transmission capacities and its computing capability. We formulate two optimization problems to conduct the operation of the edge server in each case, taking into account the key system components of the delivery process and induced end-to-end latency, aiming to maximize the immersion fidelity delivered to each client. We explore respective geometric programming optimization strategies that compute the optimal solutions at lower complexity. We rigorously analyze the computational complexity of the two optimization algorithms we formulate. In our evaluation, we demonstrate considerable performance gains over multiple assessment factors relative to two state-of-the-art techniques. We also examine the robustness of our approach to inaccurate user navigation prediction, transient NR link loss, dynamic LTE bandwidth variations, and diverse 360° video content. Finally, we contrast our results over five popular video quality metrics. 
The paper makes a community contribution by publicly sharing a dataset that captures the rate-quality trade-offs of the 360° video content used in our evaluation, for multiple contemporary quality metrics, to stimulate further studies and follow up work.

Abstract:
Currently, existing action recognition methods mainly use a data-driven method to extract spatio-temporal representations of actions for recognition. However, this method may face performance bottlenecks. At the same time, existing action recognition methods are easily affected by the bias of scene information and object information in videos. In order to explore the essential causal relationship between factors and remove bias in action recognition, we introduce the theory of causal inference into the field of action recognition and propose a Knowledge-based Hierarchical Causal Inference Network (KHCIN) to help us step toward a new direction of inference in action recognition. First, we construct a Knowledge-based Hierarchical Causal Graph (KHCG) to structurally represent the scene, object and motion knowledge of a video. Then, in the model inference stage, we perform factual causal inference on a video on the constructed KHCG, and then deploy counterfactual inference on the Direct Content Hierarchy (DCH) and Indirect Interaction Hierarchy (IIH) in the KHCG. For DCH, we intervene in the model at the decision level to highlight bias errors in the model predictions. For the IIH, we focus on intervening in the feature modelling process. The biased interactions are revealed by interrupting the information communication in the feature space. By comparing the results of factual and counterfactual inference, we can easily expose the biased information in the original representations and eliminate them. Driven by counterfactual causal inference, our approach can significantly improve the performance of action recognition while improving model explainability. Extensive experiments demonstrate the effectiveness of this method. We hope that KHCIN can provide some new ideas for better introduction of causal inference theory in the action recognition community in the future.

Abstract:
Existing volumetric neural rendering techniques, such as Neural Radiance Fields (NeRF), face limitations in synthesizing high-quality novel views when the camera poses of input images are imperfect. To address this issue, we propose a novel 3D reconstruction framework that enables simultaneous optimization of camera poses, dubbed CBARF (Cascaded Bundle-Adjusting NeRF). In a nutshell, our framework optimizes camera poses in a coarse-to-fine manner and then reconstructs scenes based on the rectified poses. It is observed that the initialization of camera poses has a significant impact on the performance of bundle-adjustment (BA). Therefore, we cascade multiple BA modules at different scales to progressively improve the camera poses. Meanwhile, we develop a neighbor-replacement strategy to further optimize the results of BA in each stage. In this step, we introduce a novel criterion to effectively identify poorly estimated camera poses. Then we replace them with the poses of neighboring cameras, thus further eliminating the impact of inaccurate camera poses. Once camera poses have been optimized, we employ a density voxel grid to generate high-quality 3D reconstructed scenes and images in novel views. Experimental results demonstrate that our CBARF model achieves state-of-the-art performance in both pose optimization and novel view synthesis, especially in the presence of large camera pose noise.

Abstract:
Generating choreography from music poses a significant challenge. Conventional dance generation methods are limited by only being able to match specific dance movements to music with corresponding rhythms, restricting the utilization of existing dance sequences. To address this limitation, we propose a method that generates a label, based on a probability distribution function derived from music features, that can be applied to music segments of varying lengths. By using the Kullback-Leibler divergence, we assess the similarity between music segments based on these labels. To ensure adaptability to different musical rhythms, we employ a cubic spline method to represent dance movements. This approach allows us to control the speed of a dance sequence by resampling it, enabling adaptation to varying rhythms based on the tempo of newly input music. To evaluate the effectiveness of our method, we compared the dances generated by our approach with those generated by other neural network-based and conventional methods. Quantitative evaluations demonstrated that our method outperforms these alternatives in terms of dance quality and fidelity.
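As an illustration of the label comparison described in this abstract, the following minimal sketch builds a fixed-length distribution label from a segment of any length and compares two labels with the Kullback-Leibler divergence. The histogram-based label, the bin count, the smoothing constant `eps`, and the symmetrised variant are assumptions made for the example; the abstract does not specify how the authors derive the distribution or whether they symmetrise the divergence.

```python
import math


def histogram_label(values, n_bins=4, lo=0.0, hi=1.0):
    """Turn a feature sequence of any length into a fixed-length distribution.

    Hypothetical helper: a normalised histogram stands in for the
    'probability distribution function derived from music features'.
    """
    if not values:
        raise ValueError("segment must contain at least one feature value")
    counts = [0] * n_bins
    for v in values:
        # Clamp into [lo, hi], then map to a bin index in [0, n_bins - 1].
        t = (min(max(v, lo), hi) - lo) / (hi - lo)
        counts[min(int(t * n_bins), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # eps keeps the logarithm finite when q has an empty bin.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Average of both directions, so the score does not depend on order."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Two segments of different lengths still yield comparable labels.
short_label = histogram_label([0.1, 0.2, 0.9])
long_label = histogram_label([0.1, 0.15, 0.2, 0.25, 0.85, 0.9])
score = symmetric_kl(short_label, long_label)
```

A lower score indicates more similar segments under this stand-in label; identical labels give a score of zero.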

Abstract:
We propose a viseme subword modeling (VSM) approach to improve the generalizability and interpretability of deep neural network based lip reading. A comprehensive analysis of preliminary experimental results reveals the complementary nature of the conventional end-to-end (E2E) and proposed VSM frameworks, especially concerning speaker head movements. To increase lip reading accuracy, we propose hybrid viseme subwords and end-to-end modeling (HVSEM), which exploits the strengths of both approaches through multitask learning. As an extension to HVSEM, we also propose collaborative viseme subword and end-to-end modeling (CVSEM), which further explores the synergy between the VSM and E2E frameworks by integrating a state-mapped temporal mask (SMTM) into joint modeling. Experimental evaluations using different model backbones on both the LRW and LRW-1000 datasets confirm the superior performance and generalizability of the proposed frameworks. Specifically, VSM outperforms the baseline E2E framework, while HVSEM outperforms VSM in a hybrid combination of VSM and E2E modeling. Building on HVSEM, CVSEM further achieves impressive accuracies of 90.75% and 58.89%, setting new benchmarks for both datasets.

Abstract:
Object detection aims to classify objects of interest within an image and pinpoint their positions using predicted rectangular bounding boxes. However, classification and localization tasks are heterogeneous, not only spatially misaligned but also differing in properties and feature requirements. Modern detectors commonly share the spatial region and detection head for both tasks, making it challenging to achieve optimal performance on both and resulting in inconsistent accuracy. Specifically, the predicted bounding box may have higher classification confidence but lower localization quality, or vice versa. To tackle this issue, the spatial decoupling mechanism via general deformable RoI pooling is first proposed. This mechanism separately pursues the favorable regions for classification and localization, and subsequently extracts the corresponding features. Then, the semi-decoupled head is designed. Compared to the decoupled head that utilizes independent classification and localization networks, potentially leading to excessive decoupling and compromised detection performance, the semi-decoupled head enables the networks to mutually enhance each other while concentrating on their respective tasks. In addition, the semi-decoupled head also introduces a redundancy suppression module to filter out redundant task-irrelevant information of features extracted by separate networks and reinforce task-related information. By combining the spatial decoupling mechanism with the semi-decoupled head, the proposed detector achieves an impressive 43.7 AP in the Faster R-CNN framework with ResNet-101 as the backbone network. Without bells and whistles, extensive experimental results on the popular MS COCO dataset demonstrate that the proposed detector surpasses the baseline by a significant margin and outperforms some state-of-the-art detectors.

Affiliations: College of Artificial Intelligence and Big Data, Hefei University, Hefei, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; School of Electrical and Mechanical Engineering, University of Adelaide, Adelaide, SA, Australia; School of Artificial Intelligence, Anhui University, Hefei, China; Zhuhai UM Science and Technology Research Institute, FST University of Macau, Macau

Abstract:
In recent years, the results of view-based 3D shape recognition methods have saturated, and models with excellent performance cannot be deployed on memory-limited devices due to their large number of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which largely reduces the number of parameters while preserving model performance as much as possible. Specifically, to enhance the capabilities of smaller models, we design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, the view-level ViT first establishes relationships between view-level features. Additionally, to capture deeper features, we employ the grouping module to enhance view-level features into group-level features. Finally, the group-level ViT aggregates group-level features into complete, well-formed 3D shape descriptors. Notably, in both ViTs, we introduce spatial encoding of camera coordinates as innovative position embeddings. Furthermore, we propose two compressed versions based on GMViT, namely GMViT-simple and GMViT-mini. To enhance the training effectiveness of the small models, we introduce a knowledge distillation method throughout the GMViT process, where the key outputs of each GMViT component serve as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the parameter size by 8 and 17.6 times, respectively, and improve shape recognition speed by 1.5 times on average, while preserving at least 90% of the recognition performance.

Abstract:
Corruptions due to data perturbations and label noise are prevalent in the datasets from unreliable sources, which poses significant threats to model training. Despite existing efforts in developing robust models, current learning methods commonly overlook the possible co-existence of both corruptions, limiting the effectiveness and practicability of the model. In this paper, we develop an Effective and Robust Adversarial Training (ERAT) framework to simultaneously handle two types of corruption (i.e., data and label) without prior knowledge of their specifics. We propose a hybrid adversarial training surrounding multiple potential adversarial perturbations, alongside a semi-supervised learning based on class-rebalancing sample selection to enhance the resilience of the model for dual corruption. On the one hand, in the proposed adversarial training, the perturbation generation module learns multiple surrogate malicious data perturbations by taking a DNN model as the victim, while the model is trained to maintain semantic consistency between the original data and the hybrid perturbed data. It is expected to enable the model to cope with unpredictable perturbations in real-world data corruption. On the other hand, a class-rebalancing data selection strategy is designed to fairly differentiate clean labels from noisy labels. Semi-supervised learning is performed accordingly by discarding noisy labels. Extensive experiments demonstrate the superiority of the proposed ERAT framework.

Abstract:
Continual learning is a promising machine learning paradigm to learn new tasks while retaining previously learned knowledge over streaming training data. Till now, rehearsal-based methods, keeping a small part of data from old tasks as a memory buffer, have shown good performance in mitigating catastrophic forgetting for previously learned knowledge. However, most of these methods typically treat each new task equally, which may not adequately consider the relationship or similarity between old and new tasks. Furthermore, these methods commonly neglect sample importance in the continual training process and result in sub-optimal performance on certain tasks. To address this challenging problem, we propose Relational Experience Replay (RER), a bi-level learning framework, to adaptively tune task-wise relationships and sample importance within each task to achieve a better ‘stability’ and ‘plasticity’ trade-off. As such, the proposed method is capable of accumulating new knowledge while consolidating previously learned old knowledge during continual learning. Extensive experiments conducted on three benchmark image datasets (CIFAR-10, CIFAR-100, and Tiny ImageNet) and two text datasets (20News and DBpedia) show that the proposed method can consistently improve the performance of all baselines and surpass current state-of-the-art methods.

Abstract:
In the field of video depth estimation, significant strides have been made with deep learning-based multi-view stereo approaches. However, existing studies struggle to produce consistently accurate depth maps that account for both multi-view geometry and temporal consistency from monocular video contents. To overcome this limitation, we introduce CMVDE, an innovative video depth estimation framework that leverages a multi-view geometric-temporal coupling approach in an end-to-end manner. Our proposed geometric consistency module efficiently generates multi-view geometric features by employing mutual cross-view epipolar attention between adjacent video frames. Additionally, it compresses these features using the novel multi-scale feature compressor, producing an effective input tensor for the subsequent module. Moreover, our framework enhances temporal consistency across consecutive video frames with the temporal consistency module based on convolutional LSTM [1] leveraging previous depth information as geometric guidance. Compared to state-of-the-art models, our approach achieves superior performance in depth quality and consecutive consistency on the ScanNet [2] and 7-Scenes [3] datasets, surpassing previous multi-view video depth estimation methods.

Abstract:
In this paper, we propose a new task named incremental few-shot action recognition (IFSAR), which aims to learn new action classes incrementally with limited samples. Existing few-shot class incremental learning methods are mainly designed for image datasets and cannot be directly applied to action recognition due to the complicated temporal evolution and spatial structure in videos. Besides, because of the incremental and few-shot setting, the catastrophic forgetting and overfitting problems are further intensified in the video domain. To address the above issues, we propose a spatiotemporal orthogonal projection capsule network (STOP), which employs a spatiotemporal attention routing mechanism and an orthogonal projection capsule layer for effective IFSAR. The former can effectively encode spatial and temporal transformation information and explore the action part-whole relationships to prevent catastrophic forgetting, while the latter is further designed to maintain a sufficient distance between the prototypes of old and novel classes to avoid overfitting by considering spatial-temporal features. Extensive experimental results demonstrate that the proposed method outperforms a series of state-of-the-art approaches on UCF-101, Kinetics-100, and HMDB-51 datasets.

Abstract:
Single-object tracking generally advances by incrementally determining the tracked target's position through interactions between the search region and the template. However, the template provides less information than does the search region in terms of both temporal cues and spatial resolution. To alleviate this imbalance, we introduce an anthropic tracking framework, MATrack (Mutual Affinity Tracker), which explicitly strengthens weak template information and implicitly reduces background clutter through interactions between multiple templates and the search region. Additionally, we propose a coarse-to-fine localization approach that combines the benefits of corner-based and center-based methods. This approach enables us to simultaneously update the most recent state and background information without two-stage training. MATrack achieves state-of-the-art performance on multiple test benchmarks, including GOT-10k, LaSOT, TrackingNet, OTB-100, UAV123, and NFS30. Among these benchmarks, MATrack-320’s performance stands out, particularly on the short-term tracking dataset GOT-10k, where it achieves an average overlap (AO) of 77.3. We also conduct comprehensive quantitative and qualitative evaluations to demonstrate that our method significantly outperforms other state-of-the-art approaches.

Abstract:
Various applications require realistic, artifact-free, and animatable 3D avatars. However, traditional 3D morphable models (3DMMs) produce animatable 3D heads but fail to capture accurate geometries and details, while existing deep implicit functions have been shown to achieve realistic reconstructions but suffer from artifacts and struggle to yield 3D heads that are easy to animate. To reconstruct high-fidelity, artifact-less, and animatable 3D heads from single-view images, we leverage semantics to bridge the best properties of 3DMMs and deep implicit functions and propose SeIF—a semantic-constrained deep implicit function. First, SeIF derives fine-grained semantics from a standard 3DMM (e.g., FLAME) and samples a semantic code for each query point in the query space to provide a soft constraint to the deep implicit function. The reconstruction results show that this semantic constraint does not weaken the powerful representation ability of the deep implicit function while significantly suppressing artifacts. Second, SeIF predicts a more accurate semantic code for each query point and utilizes the semantic codes to uniformize the structure of reconstructed 3D head meshes with the standard 3DMM. Since our reconstructed 3D head meshes have the same structure as the 3DMM, 3DMM-based animation approaches can be easily transferred to animate our reconstructed 3D heads. As a result, SeIF can reconstruct high-fidelity, artifact-less, and animatable 3D heads from single-view images of individuals with diverse ages, genders, races, and facial expressions. Quantitative and qualitative experimental results on seven datasets show that SeIF outperforms existing state-of-the-art methods by a large margin.

Abstract:
Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field; however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propose integrating deep features from pre-trained visual models with a statistical analysis model into a Multi-scale Deep Feature Statistics (MDFS) model for achieving opinion-unaware BIQA (OU-BIQA), thereby eliminating the reliance on human rating data and significantly improving training efficiency. Specifically, we extract patch-wise multi-scale features from pre-trained vision models, which are subsequently fitted into a multivariate Gaussian (MVG) model. The final quality score is determined by quantifying the distance between the MVG model derived from the test image and the benchmark MVG model derived from the high-quality image set. A comprehensive series of experiments conducted on various datasets show that our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models. Furthermore, it shows improved generalizability across diverse target-specific BIQA tasks.
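The scoring step in this abstract (fit an MVG to patch features, then measure the distance to a benchmark MVG) can be sketched as follows. The abstract does not state which distance is used, so this example assumes the pooled-covariance form familiar from opinion-unaware metrics such as NIQE, D = sqrt((mu_a - mu_b)^T ((S_a + S_b)/2)^{-1} (mu_a - mu_b)). The function names, the `ridge` regulariser, and the toy two-dimensional features are all hypothetical.

```python
import math


def mean_vector(rows):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]


def covariance(rows, mu):
    """Unbiased sample covariance matrix of the feature vectors."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [r[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b]
    return [[c / (n - 1) for c in row] for row in cov]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mvg_distance(feats_a, feats_b, ridge=1e-6):
    """Distance between the MVG fits of two feature sets (assumed form)."""
    mu_a, mu_b = mean_vector(feats_a), mean_vector(feats_b)
    cov_a, cov_b = covariance(feats_a, mu_a), covariance(feats_b, mu_b)
    d = len(mu_a)
    # Pooled covariance; the small ridge keeps the system well conditioned.
    pooled = [[(cov_a[i][j] + cov_b[i][j]) / 2 + (ridge if i == j else 0.0)
               for j in range(d)] for i in range(d)]
    diff = [mu_a[i] - mu_b[i] for i in range(d)]
    y = solve(pooled, diff)
    return math.sqrt(max(0.0, sum(diff[i] * y[i] for i in range(d))))


# Toy features: a reference set and two shifted copies of it.
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slightly_shifted = [[x + 0.1, y + 0.1] for x, y in reference]
strongly_shifted = [[x + 2.0, y + 2.0] for x, y in reference]
near = mvg_distance(reference, slightly_shifted)
far = mvg_distance(reference, strongly_shifted)
```

In this toy setup a larger shift away from the reference statistics produces a larger distance, which is the behaviour a quality score of this kind relies on.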

Abstract:
Traditional fine-grained image classification typically relies on large-scale training samples with annotated ground truth. However, some fine-grained categories in the real world have few available images, and the existing few-shot models have difficulty in distinguishing the subtle differences among them. Moreover, the intra-class distances of some fine-grained categories may be very large while the inter-class ones are small, so the distinguishing features of each category differ across tasks. To address these challenges, we propose a novel network (FicNet) using multi-frequency neighborhood (MFN) and double-cross modulation (DCM). MFN captures the multi-frequency structure representation that is irrelevant to the background by integrating the spatial and frequency domain information, and then reduces the intra-class distance. DCM modulates the representation by global context and inter-class relationships, which enables both support and query features to have complete targets and respond to the same parts, and then accurately identifies subtle inter-class differences. Comprehensive experiments on three fine-grained benchmark datasets for two few-shot tasks verify that FicNet has excellent performance compared to the state-of-the-art methods. Notably, it obtains classification accuracies of 93.17% and 95.36% on the “Caltech-UCSD Birds” and “Stanford Cars” datasets, respectively, surpassing the benchmarks set by general fine-grained image classification methods.

Abstract:
Currently, there is a growing scholarly and industrial interest in micro-video-centric research. Within these domains, multi-label learning has emerged as a fundamental yet attractive subject. Existing methods primarily place emphasis on feature representations of individual micro-videos, while neglecting latent interdependencies between instance and label domains. To address this problem, in this paper, we propose a novel self-attentive deep consistent matrix factorization (SADCMF) method, which jointly explores dual-domain hierarchical representations and their inherent dependencies for micro-video multi-label classification. Specifically, SADCMF includes three primary characteristics. 1) A dual-domain deep collaborative factorization module is developed to explore the first-stage representations of instance features and the discriminative embeddings of label semantics in a mutually beneficial manner. 2) A correlation-driven self-attentive factorization module is devised to acquire the label-aware attentive outputs, which are further combined with original features through a residual structure to enrich the second-stage feature representations. 3) A dual-stream representation consistency module ensures unidirectional and bidirectional representation consistency while narrowing the discrepancies between the two-stage representations to improve the generalization ability of our method. Extensive experiments conducted on two publicly available micro-video multi-label datasets demonstrate its superior performance in comparison with state-of-the-art methods.

Abstract:
Most existing shallow semi-supervised domain adaptation (SSDA) algorithms are based mainly on the framework adopting the maximum mean discrepancy (MMD) criterion, which is unstable and easily becomes stuck in a poor local minimum. Moreover, existing SSDA methods typically assume that the influence of the source domain is equivalent to that of the target domain, which is unreasonable and severely limits their performance. To address such drawbacks, we propose a novel SSDA framework derived from simple least squares regression (LSR) in a joint transductive and inductive learning paradigm, named transferable LSR (TLSR). Specifically, TLSR first learns domain-shared features using transfer component analysis (TCA) in a transductive paradigm. Then, TLSR augments the TCA features into the raw sample feature, formulating them into a block-diagonal matrix and training them in an inductive learning paradigm. This joint transductive and inductive learning paradigm helps alleviate the negative impacts of the MMD criterion of TCA but preserves the useful learned domain-shared knowledge. Moreover, the proposed block-diagonal input structure helps to separate the learned projections into independent domain-specific parts. Owing to the block-diagonal input structure, the influence of each domain can be reweighted, leading to significant improvements in performance. The experimental results demonstrate that the proposed TLSR outperforms the other shallow state-of-the-art competitors in 68 out of 90 cross-domain tasks.

Abstract:
Point clouds are rapidly gaining popularity in many practical applications, and point cloud quality assessment (PCQA) is an important research topic that helps us measure and improve the visual experience in applications using point clouds. Research on full-reference (FR) PCQAs has recently made impressive progress, and research on no-reference (NR) PCQAs has also gradually increased. However, the performance of the prior NR PCQA methods still suffers from weak generalization ability and lower accuracy than the FR metrics in general. In this work, we propose a two-stage sampling method that can reasonably represent a whole point cloud, making it possible to efficiently calculate the point cloud quality. For quality prediction, we designed a twin-attention-based transformer PCQA model (3DTA), which uses the data of the two-stage sampling method as input and directly outputs the predicted quality score. Our model is accurate and widely applicable, and it has a simple and flexible structure. Experimental results show that in most cases, the proposed 3DTA model substantially outperforms the benchmark NR methods. The accuracy of the proposed method is competitive even against that of the FR method, which makes 3DTA a strong candidate for the PCQA task, regardless of the reference availability.

Abstract:
This paper investigates the domain of automatic music generation (AMG) and its capacity to produce music that is aligned with user preferences. The incorporation of user music preference (UMP) awareness in AMG technology has the potential to reduce reliance on musicians and domain experts while encouraging users to engage in activities that promote human health and potential. Current research in AMG has been limited to the qualitative control of a constrained set of attributes in the generated music such as selecting a genre from a given list. This constraint makes it challenging to develop music that is both aligned with UMP and suitable for practical text query-based applications. To address this challenge, we propose to apply deep-graph-networks on music community data, jointly modeling UMP and music features. Moreover, users' textual descriptions of expected music can be transformed into graphs that are compatible with UMPs. Node embeddings representing user queries' connotation are extracted to condition the music generator. The results on objective and subjective metrics demonstrate a significant improvement in UMP accuracy by 31.3%, UMP-aware AMG by 63.5%, and text-to-music AMG effectiveness by 76.5%. Our detailed analysis indicates that the generated music aligns best with queries comprised of short sentences and commonly used words.

Abstract:
Due to the unprecedented power of text-to-image diffusion models, customizing these models to generate new concepts has gained increasing attention. Existing works have achieved some success on real-world concepts, but fail on the concepts of anime characters. We empirically find that such low quality comes from the newly introduced identifier text tokens, which are optimized to identify different characters. In this paper, we propose AnimeDiff which focuses on customized image generation of anime characters. Our AnimeDiff directly binds anime characters with their names and keeps the embeddings of text tokens unchanged. Furthermore, when composing multiple characters in a single image, the model tends to confuse the properties of those characters. To address this issue, our AnimeDiff incorporates a Cut-and-Paste data augmentation strategy that produces multi-character images for training by cutting and pasting multiple characters onto background images. Experiments are conducted to prove the superiority of AnimeDiff over other methods.
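
The Cut-and-Paste idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cut_and_paste`, the list-of-rows image representation, the binary masks, and the uniformly random placement are all assumptions made for illustration.

```python
import random


def cut_and_paste(background, characters, rng=None):
    """Paste each (patch, mask) pair onto a copy of `background`.

    Images are lists of rows of pixel values. A mask value of 1 marks a
    character pixel; 0 marks transparent area that keeps the background.
    Returns the composite image and one (top, left, height, width) box
    per pasted character.
    """
    rng = rng or random.Random()
    canvas = [row[:] for row in background]  # leave the input untouched
    bg_h, bg_w = len(canvas), len(canvas[0])
    boxes = []
    for patch, mask in characters:
        p_h, p_w = len(patch), len(patch[0])
        if p_h > bg_h or p_w > bg_w:
            raise ValueError("character patch is larger than the background")
        # Hypothetical placement rule: uniform random position that fits.
        top = rng.randint(0, bg_h - p_h)
        left = rng.randint(0, bg_w - p_w)
        for y in range(p_h):
            for x in range(p_w):
                if mask[y][x]:
                    canvas[top + y][left + x] = patch[y][x]
        boxes.append((top, left, p_h, p_w))
    return canvas, boxes
```

Characters pasted later overwrite earlier ones where they overlap. A real pipeline would likely also control scale and overlap, which this sketch leaves out.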

Abstract:
As a special subset of multi-view settings for 3D human pose estimation, stereoscopic settings show promising applications in practice since they are not ill-posed but can be as mobile as monocular ones. However, when there are only two views, the problems of occlusions and “double counting” (ambiguity between symmetric joints) pose greater challenges that are not addressed by previous approaches. To address this, we propose a novel framework that detects limb orientations in field form and incorporates them explicitly with joint features. Two modules are proposed to realize the fusion. At the 3D level, we design compound triangulation as an explicit module that produces the optimal pose using 2D joint locations and limb orientations. The module is derived by reformulating triangulation in 3D space and extending it with the optimization of limb orientations. At the 2D level, we propose a parameter-free module named co-fixing that enables joint and limb features to fix each other, alleviating the impact of “double counting.” Features from both parts are first used to infer each other via simple convolutions and are then fixed by the inferred ones, respectively. We test our method on two public benchmarks, Human3.6M and Total Capture, where it achieves state-of-the-art performance in stereoscopic settings and comparable results on common 4-view benchmarks.

Abstract:
Low-light image enhancement is a challenging task due to the limited visibility in dark environments. While recent advances have shown progress in integrating CNNs and Transformers, inadequate local-global perceptual interactions still impede their application in complex degradation scenarios. To tackle this issue, we propose BiFormer, a lightweight framework that facilitates local-global collaborative perception via bilateral interaction. Specifically, our framework introduces a core CNN-Transformer collaborative perception block (CPB) that combines local-aware convolutional attention (LCA) and a global-aware recursive Transformer (GRT) to simultaneously preserve local details and ensure global consistency. To promote perceptual interaction, we adopt a bilateral interaction strategy for both local and global perception, which involves local-to-global second-order interaction (SoI) in the dual domain, as well as a mixed-channel fusion (MCF) module for global-to-local interaction. The MCF is also a highly efficient feature fusion module tailored for degraded features. Extensive experiments conducted on low-level and high-level tasks demonstrate that BiFormer achieves state-of-the-art performance. Furthermore, it exhibits a significant reduction in model parameters and computational cost compared to existing Transformer-based low-light image enhancement methods.

Affiliations: State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China; State Key Laboratory of Integrated Services Networks, School of Cyber Engineering, Xidian University, Xi'an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China; Sydney AI Center, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW, Australia

Abstract:
Prompt learning has emerged as a thriving parameter-efficient fine-tuning technique for adapting pre-trained vision-language models (VLMs) to various downstream tasks. However, existing prompt learning approaches still exhibit limited capability for adapting foundational VLMs to specific domains that require specialized and expert-level knowledge. Since this kind of specific knowledge is primarily embedded in the pre-defined text labels, we infer that foundational VLMs cannot directly interpret semantically meaningful information from these specific text labels, which causes the above limitation. From this perspective, this paper additionally models text labels with learnable tokens and casts this operation into the traditional prompt learning framework. By optimizing label tokens, semantically meaningful text labels are automatically learned for each class. Nevertheless, directly optimizing text labels still leaves two critical problems, i.e., insufficient optimization and biased optimization. We further address these problems by proposing Modality Interaction Text Label Optimization (MITLOp) and Color-based Consistency Augmentation (CCAug), respectively, thereby effectively improving the quality of the optimized text labels. Extensive experiments indicate that our proposed method achieves significant improvements in VLM adaptation on specific domains.

Abstract:
Recently, benefitting from the rapid development of deep learning technology, research on salient object detection has achieved great progress. However, the performance of existing cutting-edge saliency models relies on large network size and high computational overhead. This makes them ill-suited to real-world applications, especially practical platforms with low cost and limited computing resources. In this paper, we propose a novel lightweight saliency model, namely the Attention-guided Densely Multi-scale Network (ADMNet), to tackle this issue. Firstly, we design the multi-scale perception (MP) module to acquire different contextual features by using different receptive fields. Building on the MP module, we construct the encoder of our model, where each convolutional block adopts a dense structure to connect MP modules. In this way, our model can provide powerful encoder features for the characterization of salient objects. Secondly, we equip the decoder blocks with a dual attention (DA) module. In particular, in the DA module, the binarized coarse saliency inference of the decoder block (i.e., a hard spatial attention map) is first employed to filter out interference cues from the decoder feature, and then, by introducing large receptive fields, the enhanced decoder feature is used to generate a soft spatial attention map, which further purifies the fused features. In this way, the deep features are steered to pay more attention to salient regions. Extensive experiments on five challenging public datasets, including ECSSD, DUT-OMRON, DUTS-TE, HKU-IS, and PASCAL-S, clearly show that our model achieves performance comparable with state-of-the-art saliency models while running at 219.4 fps on a GPU and 1.76 fps on a CPU for a 368×368 image with only 0.84 M parameters.
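
The hard spatial attention step of the DA module can be sketched as follows. This is a minimal illustration, not the ADMNet code: the function name `hard_attention_filter`, the nested-list tensor layout, and the 0.5 binarization threshold are assumptions. The subsequent soft attention map built with large receptive fields is not covered.

```python
def hard_attention_filter(feature, coarse_saliency, threshold=0.5):
    """Suppress decoder features outside the coarse salient region.

    feature: list of channels, each a list of rows of floats.
    coarse_saliency: list of rows of saliency scores in [0, 1].
    threshold: assumed binarization cut-off (not given in the abstract).
    Returns the filtered feature and the binary (hard) attention map.
    """
    # Binarize the coarse saliency inference into a hard attention map.
    hard_mask = [
        [1.0 if score >= threshold else 0.0 for score in row]
        for row in coarse_saliency
    ]
    # Apply the same spatial mask to every channel.
    filtered = [
        [
            [value * keep for value, keep in zip(f_row, m_row)]
            for f_row, m_row in zip(channel, hard_mask)
        ]
        for channel in feature
    ]
    return filtered, hard_mask
```

For example, a one-channel feature `[[[1.0, 2.0], [3.0, 4.0]]]` with coarse saliency `[[0.9, 0.1], [0.5, 0.4]]` keeps only the left column.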

Abstract:
Existing continual image classification methods demonstrate that samples from all sequences of continual classification tasks contain common (task-invariant) features and class-specific (task-variant) features that can be decoupled for classification tasks. However, the existing feature decomposition strategies only focus on individual tasks while neglecting the essential cues that the relationship between different tasks can provide, thereby hindering the improvement of continual image classification results. To address this issue, we propose an Adversarial Contrastive Continual Learning (ACCL) method that decouples task-invariant and task-variant features by constructing all-round, multi-level contrasts on sample pairs within individual tasks or from different tasks. Specifically, three constraints on the distribution of task-invariant and task-variant features are included, i.e., task-invariant features across different tasks should remain consistent, task-variant features should exhibit differences, and task-invariant and task-variant features should differ from each other. In addition, we design an effective contrastive replay strategy that makes full use of the replay samples in the construction of sample pairs, further alleviating the forgetting problem and modeling cross-task relationships. Through extensive experiments on continual image classification tasks on CIFAR100, MiniImageNet and TinyImageNet, we show the superiority of our proposed strategy, which improves accuracy and yields better visualized outcomes.

Abstract:
Cross-technology communication is essential for the Internet of Multimedia Things (IoMT) applications, enabling seamless integration of diverse media formats, optimized data transmission, and improved user experiences across devices and platforms. This integration drives innovative and efficient IoMT solutions in areas like smart homes, smart cities, and healthcare monitoring. However, this integration of diverse wireless standards within cross-technology multimedia communication increases the susceptibility of wireless networks to attacks. Current methods lack robust authentication mechanisms, leaving them vulnerable to spoofing attacks. To mitigate this concern, we introduce DeepSpoof, a spoofing system that utilizes deep learning to analyze historical wireless traffic and anticipate future patterns in the IoMT context. This innovative approach significantly boosts an attacker's impersonation capabilities and offers a higher degree of covertness compared to traditional spoofing methods. Rigorous evaluations, leveraging both simulated and real-world data, confirm that DeepSpoof significantly elevates the average success rate of attacks.

Abstract:
Infrared and visible cross-modal registration and fusion can generate more comprehensive representations of object and scene information. Previous frameworks primarily focus on addressing the modality disparities and the impact of preserving diverse modality information on the performance of registration and fusion tasks among different static image pairs. However, these frameworks overlook the practical deployment on real-world devices, particularly in the context of video streams. Consequently, the resulting video streams often suffer from instability in registration and fusion, characterized by fusion artifacts and inter-frame jitter. In light of these considerations, this paper proposes a unified registration and fusion scheme for video streams, termed RCVS. It utilizes a robust matcher and a spatial-temporal calibration module to achieve stable registration of video sequences. Subsequently, RCVS incorporates a fast lightweight fusion network to provide stable fusion video streams for infrared and visible imaging. Additionally, we collect an infrared and visible video dataset, HDO, which comprises high-quality infrared and visible video data captured across diverse scenes. Our RCVS exhibits superior performance in video stream registration and fusion tasks, adapting well to real-world demands. Overall, our proposed framework and HDO dataset offer the first effective and comprehensive benchmark in this field, solving stability and real-time challenges in infrared and visible video stream fusion while assessing different solution performances to foster development in this area.

Abstract:
The objective of text-driven 3D indoor scene generation is to automatically generate and arrange the objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Existing approaches are mainly guided by specific object categories and room layout to generate and position objects like furniture within 3D indoor scenes. However, few methods harness the potential of the text description to precisely control both spatial relationships and object combinations. Consequently, these methods lack a robust mechanism for determining accurate object attributes necessary to craft a plausible 3D scene that maintains consistent spatial relationships in alignment with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), which is a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. Firstly, we propose a memory-augmented network to help the model decide the attributes of the objects, such as 3D coordinates, rotation and size, which improves the consistency of the object spatial relations with text descriptions. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Secondly, we develop a prior attribute prediction network to learn how to generate a complete scene with suitable and reasonable object compositions. This prior attribute prediction network adopts a pre-training strategy to extract composition priors from existing scenes, which enables the organization of multiple objects to form a reasonable scene and keeps the object relations according to the text descriptions. We conduct experiments on three different room types (bedroom, living room, and dining room) on the 3D-FRONT dataset. 
The results of these experiments underscore the accuracy of our method in governing spatial relationships among objects, showcasing its superior flexibility compared to existing techniques.

Abstract:
Face forgery detection has attracted much attention due to the ever-increasing social concerns caused by facial manipulation techniques. Recently, identity-based detection methods have made considerable progress and are especially suitable for the celebrity protection scenario. However, they still suffer from two main limitations: (a) the generic identity extractor is not specifically designed for forgery detection, leading to a non-negligible Identity Representation Bias toward forged images; (b) existing methods only analyze the identity representation of each image individually, but ignore the query-reference interaction for inconsistency exploiting. To address these issues, a novel Inconsistency Exploiting based Identity Rectification Network (IEIRNet) is proposed in this paper. Firstly, for identity bias rectification, the IEIRNet follows an effective two-branch structure. Besides the Generic Identity Extractor (GIE) branch, an essential Bias Diminishing Module (BDM) branch is proposed to eliminate the identity bias through a novel Attention-based Bias Rectification (ABR) component, thereby acquiring the ultimate discriminative identity representation. Secondly, for query-reference inconsistency exploiting, an Inconsistency Exploiting Module (IEM) is applied in IEIRNet to comprehensively exploit the inconsistency clues from both spatial and channel perspectives. In the spatial aspect, an innovative region-aware kernel is derived to activate the local region inconsistency with deep spatial interaction. Afterward, in the channel aspect, a co-attention mechanism is utilized to model the channel interaction meticulously and accordingly highlight the channel-wise inconsistency with adaptive weight assignment and channel-wise dropout. Our IEIRNet has shown effectiveness and superiority in various generalization and robustness experiments.

Abstract:
Semi-Supervised Object Detection (SSOD) has shown remarkable results by leveraging image pairs with a teacher-student framework. An excellent strong augmentation method can generate richer images and alleviate the influence of noise in pseudo-labels. However, existing data augmentation methods for SSOD do not consider instance-level information, thus, they cannot make full use of unlabeled data. Besides, the current teacher-student framework in SSOD solely relies on pseudo-labeling techniques, which may disregard some uncertain information. In this article, we introduce a new method called Elaborate Teacher which generates and exploits image pairs in a more refined manner. To enrich strongly augmented images, a novel data augmentation method called Information-Aware Mixup Representation (IAMR) is proposed. IAMR utilizes the teacher model's predictions as prior information and considers instance-level information, which can be seamlessly integrated with existing SSOD data augmentation methods. Furthermore, to fully exploit the information in unlabeled data, we propose the Enhanced Scale Consistency Regularization (ESCR), which considers the consistency from both semantic space and feature space. Elaborate Teacher introduces a fresh data augmentation method, complemented by consistency regularization, which boosts the performance of semi-supervised object detectors. Extensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate the effectiveness of our method in leveraging unlabeled image information. Our method consistently outperforms the baseline method and improves mAP by 11.6% and 9.0% relative to the supervised baseline method when using 5% and 10% of labeled data on MS-COCO, respectively.

Abstract:
Cardiac magnetic resonance imaging (CMRI) can help experts quickly diagnose cardiovascular diseases. Due to the patient's breathing and slight movement during the magnetic resonance imaging scan, the obtained CMRI may be severely blurred, affecting the accuracy of clinical diagnosis. To address this issue, we propose the quadratic conditional diffusion model for blind CMRI super-resolution (DBSR). Specifically, we propose a conditional blur kernel noise predictor, which predicts the blur kernel from low-resolution images by the diffusion model, transforming the unknown blur kernel in low-resolution CMRI into a known one. Meanwhile, we design a novel conditional CMRI noise predictor, which uses the predicted blur kernel as prior knowledge to guide the diffusion model in reconstructing high-resolution CMRI. Furthermore, we propose a cascaded residual attention network feature extractor, which extracts feature information from CMRI low-resolution images for blur kernel prediction and SR reconstruction of CMRI images. Extensive experimental results indicate that our proposed DBSR achieves better blind super-resolution reconstruction results than several state-of-the-art baselines.

Abstract:
As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on a rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at token level for auto-regressive sounding video generation. AudioSet-Cap, a human annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on Kinetics and VAS datasets.

Abstract:
Weakly-supervised Temporal Action Localization (W-TAL) aims to train a model to localize all action instances potentially from different classes in an untrimmed video, using a training dataset that has video-level action class labels but has no detailed annotations on the start and end timestamps of action instances. We propose to solve the W-TAL problem from the feature learning aspect, with a new architecture, termed F3-Net, which includes (1) a Feature Weakening (FW) module that can identify and randomly weaken either the most discriminative action or the most discriminative background features over the training iterations to force the network to precisely localize the action instances in both discriminative and ambiguous action-related frames, without spreading to the background intervals; (2) a Feature Contextualization (FC) module that can infer the global contexts among video segments and attentionally fuse them with the local contexts from individual video segments to generate more representative features; and (3) a Feature Discrimination (FD) module that can highlight the most discriminative video segments/classes corresponding to each class/segment, respectively, for localizing multiple action instances from different classes within a video. Experimental results on THUMOS14 and ActivityNet1.3 demonstrate the state-of-the-art performance of our F3-Net, and the FW and FC are also effective plug-in modules to improve other methods. This project will be available at https://moniruzzamanmd.github.io/F3-Net/

Affiliations: Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei, China

Abstract:
Nighttime person Re-ID (person re-identification in the nighttime) is a very important and challenging task for visual surveillance but it has not been thoroughly investigated. Under the low illumination condition, the performance of person Re-ID methods usually sharply deteriorates. To address the low illumination challenge in nighttime person Re-ID, this article proposes an Illumination Distillation Framework (IDF), which utilizes illumination enhancement and illumination distillation schemes to promote the learning of Re-ID models. Specifically, IDF consists of a master branch, an illumination enhancement branch, and an illumination distillation module. The master branch is used to extract the features from a nighttime image. The illumination enhancement branch first estimates an enhanced image from the nighttime image using a nonlinear curve mapping method and then extracts the enhanced features. However, nighttime and enhanced features usually contain data noise due to unstable lighting conditions and enhancement failures. To fully exploit the complementary benefits of nighttime and enhanced features while suppressing data noise, we propose an illumination distillation module. In particular, the illumination distillation module fuses the features from two branches through a bottleneck fusion model and then uses the fused features to guide the learning of both branches in a distillation manner. In addition, we build a real-world nighttime person Re-ID dataset, named Night600, which contains 600 identities captured from different viewpoints and nighttime illumination conditions under complex outdoor environments. Experimental results demonstrate that our IDF can achieve state-of-the-art performance on two nighttime person Re-ID datasets (i.e., Night600 and Knight).

Abstract:
We present an automatic generation pipeline of interactive nonlinear video for online apparel shopping navigation. Our approach was inspired by Google's “Messy Middle” theory, which suggests that people mentally are faced with two tasks—exploration and evaluation—before purchasing online. Given a set of apparel product presentation videos, our navigation UI organizes them to optimize users' product exploration and automatically generates interactive videos for users' product evaluation. To support automatic methods, we proposed a video clustering similarity (CSIM) and a camera movement similarity (MSIM), as well as a comparative video generation algorithm for product recommendation, presentation, and comparison. To evaluate our pipeline's effectiveness, we conducted several user studies. The results showed that our pipeline can help users complete the consumption process more efficiently, making it easier for them to understand and choose a product.

Abstract:
The Class Activation Map (CAM) is widely used to generate pseudo-labels for Weakly Supervised Semantic Segmentation (WSSS), but it does not adequately model foreground-independent information, which makes it prone to false-positive pixels. In this paper, we propose a Wave-like Class Activation Map (WaveCAM) from the perspective of representation fusion and dynamic representation aggregation to alleviate the above problem. Specifically, our WaveCAM includes foreground-aware representation modeling that enhances the perception of foreground information, foreground-independent representation modeling that enhances the perception of foreground-independent information, and a representation-adaptive fusion module that fuses the two representations. Both representations are expressed as wave functions with amplitude and phase to dynamically aggregate representations and extract semantic information after initialization, and they are fused through the adaptive fusion module to obtain an output containing rich semantic information. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets validate that our WaveCAM can be easily embedded into both multi-stage and end-to-end WSSS, achieving state-of-the-art performance.

Abstract:
Unsupervised person re-identification (Re-ID) aims to learn discriminative representations for person retrieval from unlabeled data. Currently, state-of-the-art techniques accomplish this task by using instance contrastive learning, which contrasts the similarities of the instances in different views. However, existing contrastive methods only focus on the positive effects of inter-instance relationships, while neglecting the negative effects of intra-instance redundancy information. This redundancy information can generate invalid or spurious intra-class relationships during the instance contrasting process, which enlarges the intra-class gaps and increases the noisy pseudo-labels. To address this issue, we propose a discriminative identity-feature exploring and differential aware learning (DiDAL) framework to learn more discriminative intra-identity representations. Specifically, the DiDAL extracts intra-instance salient features by synthetic complementary attention, and further explores the discriminative identity features by modeling the relationship among these salient features based on graph neural networks. This strategy aims to reduce the intra-instance redundancy information. Moreover, DiDAL explores hard instances by leveraging the extracted intra-instance salient features, and matches an anchor with multiple hard positive instances to enhance the robustness of the model to noisy pseudo-labels. Extensive experiment results on two widely used person re-identification datasets and a vehicle re-identification dataset demonstrate the superiority of the proposed method compared with existing state-of-the-art methods.

Abstract:
Enhancing low-light image visibility is a critical task in computer vision since it helps to improve input for high-level algorithms. High-quality images typically have clear structural information. In previous studies, due to the lack of proper structural guidance, restored images had some problems, such as unclear structural areas and overexposed or underexposed local areas. To address the above problems, in this paper, we introduce a coefficient of variation (COV) with excellent performance in maintaining structural information, and then we propose a low-light image enhancement method that utilizes the COV to extract structural information from images. First, we apply a traditional retinex model to estimate both reflectance and illumination. Second, we use the COV to indicate the degree of dispersion of the input sample, which enables us to obtain a robust structure-distinguishing weight map for low-light images. The weight map is adaptively divided to obtain a structural weight map, which is then used to enhance the gradient image. This process is applied before the reflectance layer of the retinex model. Finally, the result is obtained by using the block coordinate descent method. According to extensive experiments, outstanding results can be achieved by our proposed method in terms of both subjective and objective evaluation metrics in comparison with other state-of-the-art methods. The source code is available at our website.
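
As a rough illustration of why the COV can act as a structure indicator, the sketch below computes the standard statistical coefficient of variation (standard deviation divided by mean) over a sliding window. The abstract does not give the exact formulation used in the method, so the window shape, the `radius` and `eps` parameters, and the function name `local_cov` are assumptions.

```python
import math


def local_cov(image, radius=1, eps=1e-6):
    """Per-pixel coefficient of variation (std / mean) in a square window.

    image: list of rows of non-negative intensities.
    radius: half-width of the window (window is clipped at image borders).
    eps: small constant that avoids division by zero in dark regions.
    """
    height, width = len(image), len(image[0])
    cov = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            variance = sum((v - mean) ** 2 for v in window) / len(window)
            cov[y][x] = math.sqrt(variance) / (mean + eps)
    return cov
```

Flat regions give values near zero, while windows that straddle an edge give large values. Because the deviation is normalized by the local mean, the response depends on relative contrast, not absolute brightness, which is one plausible reason such a measure suits low-light inputs.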

Abstract:
Emotion recognition in conversations (ERC), the task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies on the textual modality but ignore the significance of multimodal information. Different from emotion recognition in textual conversations, capturing intra- and inter-modal interactions between utterances, learning weights between different modalities, and enhancing modal representations play important roles in multimodal ERC. In this paper, we propose a transformer-based model with self-distillation (SDT) for the task. The transformer-based model captures intra- and inter-modal interactions by utilizing intra- and inter-modal transformers, and learns weights between modalities dynamically by designing a hierarchical gated fusion strategy. Furthermore, to learn more expressive modal representations, we treat soft labels of the proposed model as extra training supervision. Specifically, we introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality. Experiments on IEMOCAP and MELD datasets demonstrate that SDT outperforms previous state-of-the-art baselines.
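
The use of hard and soft labels as supervision for each modality can be sketched with a generic distillation loss. This is not the SDT training code: the function names, the temperature of 2.0, the mixing weight `alpha`, and the temperature-squared scaling of the soft term (a common convention in knowledge distillation) are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(modality_logits, fused_logits, label,
                           temperature=2.0, alpha=0.5):
    """Loss for one modality branch supervised by hard and soft labels.

    modality_logits: logits from a single-modality branch (the student).
    fused_logits: logits from the full multimodal model (the teacher).
    label: index of the ground-truth emotion class (the hard label).
    """
    # Hard-label term: cross-entropy against the ground-truth class.
    hard_loss = -math.log(softmax(modality_logits)[label])
    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    teacher = softmax(fused_logits, temperature)
    student = softmax(modality_logits, temperature)
    soft_loss = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return (1 - alpha) * hard_loss + alpha * temperature ** 2 * soft_loss
```

When a modality branch already matches the fused model, the soft term vanishes and only the hard-label cross-entropy remains.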

Abstract:
Real-world images captured in remote sensing, image or video retrieval, and outdoor surveillance are often degraded due to poor weather conditions, such as rain and mist. These conditions introduce artifacts that make visual analysis challenging and limit the performance of high-level computer vision methods. In time-critical applications, it is vital to develop algorithms that automatically remove rain without compromising the quality of the image contents. This article proposes a novel approach called QSAM-Net, a quaternion multi-stage multiscale neural network with a self-attention module. The algorithm requires fewer parameters, by a factor of 3.98, than its real-valued counterpart and state-of-the-art methods while improving the visual quality of the images. Extensive evaluation and benchmarking on synthetic and real-world rainy images demonstrate the effectiveness of QSAM-Net. Its small parameter count makes the network suitable for edge devices and applications requiring near real-time performance. Furthermore, the experiments show that the improved visual quality of images also leads to better object detection accuracy and training speed.

Abstract:
With the emergence of large-scale multi-modal foundation models, significant improvements have been made towards Visual Question Answering (VQA) in recent years via the “Pre-training and Fine-tuning” paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is defined as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer the common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of distilled knowledge based on the task-specific training data is questionable due to the bias between the training and test data. An ideal way is to adopt the pre-training data to distill the common knowledge shared by the training and OOD test samples, which, however, is impracticable due to the huge size of the pre-training data. Based on the above considerations, in this article, we propose a method, named Pre-training-like Knowledge Distillation (PKD), to imitate the pre-training feature distribution and leverage it to distill the common knowledge, which can improve the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, which are learned under the supervision of image-text matching loss and feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. In the meantime, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model and the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.

Abstract:
Multispectral pedestrian detection has shown many advantages in a variety of environments, particularly poor illumination conditions, by leveraging visible-thermal modalities. However, in-depth insight into distinguishing the complementary content of multimodal data and exploring the extent of multimodal feature fusion is still lacking. In this paper, we propose a novel multispectral pedestrian detector with multiscale cross-modal homogeneity enhancement and confidence-aware feature fusion. RGB and thermal streams are constructed to extract features and generate candidate proposals. During feature extraction, multiscale cross-modal homogeneity enhancement is proposed to enhance single-modal features using the separated homogeneous features via modal interactions. Homogeneity features encode the semantic information of the scene and are extracted from the RGB-thermal pairs by employing a channel attention mechanism. Proposals from two modalities are united to obtain multimodal proposals. Then, confidence measurement fusion is proposed to achieve multispectral feature fusion in each proposal by measuring the internal confidence of each modality and the interaction confidence between modalities. In addition, a confidence transfer loss function is designed to focus more on hard-to-detect samples during training. Experimental results on two challenging datasets demonstrate that the proposed method achieves better performance compared to existing methods.

Abstract:
Focused plenoptic cameras can record spatial and angular information of the light field (LF) simultaneously with higher spatial resolution than traditional plenoptic cameras, which facilitates various applications in computer vision. However, existing plenoptic image compression methods are ineffective on the captured images because of the complex micro-textures generated by the microlens relay imaging and the long-distance correlations among the microimages. In this article, a lossy end-to-end learning architecture is proposed to compress focused plenoptic images efficiently. First, a data preprocessing scheme is designed according to the imaging principle to remove the ineffective sub-aperture image pixels in the recorded light field and align the microimages to a rectangular grid. Then, a global attention module with a large receptive field is proposed to capture the global correlation among the feature maps using pixel-wise vector attention computed in the resampling process. Also, a new image dataset consisting of 1910 focused plenoptic images with content and depth diversity is built for training and testing. Extensive experimental evaluations demonstrate the effectiveness of the proposed approach. It outperforms the intra coding of HEVC and VVC by an average bitrate reduction of 62.57% and 51.67%, respectively, on the 20 preprocessed focused plenoptic images. It also achieves an 18.73% bitrate saving and generates perceptually pleasant reconstructions compared to state-of-the-art end-to-end image compression methods, which greatly benefits the applications of focused plenoptic cameras. The dataset and code are publicly available at https://github.com/VincentChandelier/GACN.

Affiliations: Metaverse Institute, School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China; Department of Radiology, Second Affiliated Hospital of South China University of Technology, Guangzhou, China; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Department of Computing, the School of Design, and the Research Institute for Sports Science and Technology, The Hong Kong Polytechnic University, Kowloon, Hong Kong

Abstract:
Multimodal image fusion plays an essential role in medical image analysis and application, where computed tomography (CT), magnetic resonance (MR), single-photon emission computed tomography (SPECT), and positron emission tomography (PET) are commonly-used modalities, especially for brain disease diagnoses. Most existing fusion methods do not consider the characteristics of medical images, and they adopt similar strategies and assessment standards to natural image fusion. While distinctive medical semantic information (MS-Info) is hidden in different modalities, the ultimate clinical assessment of the fusion results is ignored. Our MsgFusion first builds a relationship between the key MS-Info of the MR/CT/PET/SPECT images and image features to guide the CNN feature extractions using two branches and the design of the image fusion framework. For MR images, we combine the spatial domain feature and frequency domain feature (SF) to develop one branch. For PET/SPECT/CT images, we integrate the gray color space feature and adapt the HSV color space feature (GV) to develop another branch. A classification-based hierarchical fusion strategy is also proposed to reconstruct the fusion images to persist and enhance the salient MS-Info reflecting anatomical structure and functional metabolism. Fusion experiments are carried out on many pairs of MR-PET/SPECT and MR-CT images. According to seven classical objective quality assessments and one new subjective clinical quality assessment from 30 clinical doctors, the fusion results of the proposed MsgFusion are superior to those of the existing representative methods.

Abstract:
Current RGB-D salient object detection (RGB-D SOD) methods mainly develop a generalizable model trained by binary cross-entropy (BCE) loss based on convolutional or Transformer backbones. However, they usually exploit convolutional modules to fuse multi-modality features, with little attention paid to capturing the long-range multi-modality interactions for feature fusion. Furthermore, BCE loss does not explicitly explore intra- and inter-pixel relationships in a joint embedding space. To address these issues, we propose a cross-modality interaction parallel-transformer (CIPT) module, which better captures the long-range multi-modality interactions, generating more comprehensive fusion features. Besides, we propose a pixel-level contrastive learning (PCL) method that improves inter-pixel discrimination and intra-pixel compactness, resulting in a well-structured embedding space and a better saliency detector. Specifically, we propose an asymmetric network (TPCL) for RGB-D SOD, which consists of a Swin V2 Transformer-based backbone and a designed lightweight backbone (LDNet). Moreover, an edge-guided module and a feature enhancement (FE) module are proposed to refine the learned fusion features. Extensive experiments demonstrate that our method achieves excellent performance against 15 state-of-the-art methods on seven public datasets. We expect our work to facilitate the exploration of applying Transformer and contrastive learning for RGB-D SOD tasks.

Abstract:
In this work, we introduce FARP-Net, an adaptive local-global feature aggregation and relation-aware proposal network for high-quality 3D object detection from pure point clouds. Our key insight is that learning adaptive local-global feature aggregation from an irregular yet sparse point cloud and generating superb proposals are both pivotal for detection. Technically, we propose a novel local-global feature aggregation layer (LGFAL) that fully exploits the complementary correlation between local features and global features, and fuses their strengths adaptively via an attention-based fusion module. Furthermore, we incorporate a lightweight feature affine module (LFAM) into LGFAL to map the local features into a normal distribution, thus acquiring fine-grained features of each local region in a weight-sharing manner. During object proposal generation, we propose a weighted relation-aware proposal module (WRPM) that uses an objectness-aware formalism to weigh the relation importance among object candidates for a clear and principal context, thereby facilitating the generation of high-quality proposals. The WRPM challenges the traditional practice of extracting contextual information among all object candidates, which is inefficient as object candidates are always noisy and redundant. Experimentally, FARP-Net delivers superior performance on two widely used benchmarks with fewer parameters, 64.0% mAP@0.25 on the SUN RGB-D dataset and 70.9% mAP@0.25 on the ScanNet V2 dataset. We further validate that the proposed LGFAL and WRPM can be integrated into both indoor and outdoor detectors to boost performance.

Abstract:
Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this article, we propose a unified cross-modal representation learning framework VatLM (Visual-Audio-Text Language Model). The proposed VatLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VatLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VatLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR), and visual speech recognition (VSR) tasks. Results show that the proposed VatLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VatLM is capable of aligning different modalities into the same space.

Abstract:
Camera and 3D LiDAR sensors have become indispensable devices in modern autonomous driving vehicles. The camera provides fine-grained texture and color information in 2D space, while LiDAR captures more precise and longer-range distance measurements of the surrounding environment. The complementary information from these two sensors makes the fusion of the two modalities a desirable option. However, two primary challenges hinder the performance of camera-LiDAR fusion: how to effectively fuse the information from the two modalities, and how to precisely align them given the weak spatiotemporal synchronization problem. This article proposes a coarse-to-fine LiDAR and camera fusion-based network, named LIF-Seg, for LiDAR segmentation. For the first challenge, unlike previous works that fuse the point cloud and image information in a one-to-one manner, the proposed method introduces a simple but effective early-fusion strategy to fully utilize the contextual information of images. For the second challenge, to tackle the weak spatiotemporal synchronization problem, an offset rectification approach is designed to align the features of the two modalities. The cooperation of these two components leads to effective camera-LiDAR fusion. Experimental results on the nuScenes dataset show that LIF-Seg outperforms existing methods by a large margin. Ablation studies and analyses further illustrate that LIF-Seg can effectively address the weak spatiotemporal synchronization problem.

Abstract:
High dynamic range (HDR) images require tone-mapping to be viewed on low dynamic range (LDR) displays. The performance of tone-mapping algorithms can be evaluated through a subjective study in which participants rank or score tone-mapped images (TMIs) according to their preference. Subjective evaluation can be painstakingly slow; therefore, several quantitative metrics have been proposed for objective evaluation. This article presents a new robust metric that uses 16 features, measuring the loss of color, contrast, brightness, and structure, extracted from the test TMI and the reference HDR image. The effect of these attributes on image quality is investigated, and they are combined into a single score in the [0, 1] range describing the quality of the TMI. We validate the performance of the proposed metric by comparing it with 24 existing state-of-the-art metrics. The study uses two subjective datasets of TMIs, including one existing benchmark dataset and a newly proposed dataset comprising HDR images of a variety of scenes, and a dataset of traditional images not generated through tone-mapping. In these studies, our method shows the highest correlation with subjective scores for both datasets of TMIs and ranks second for the dataset of traditional images.

Abstract:
In a real fish group, only a few key individuals play a dominant role; it is therefore reasonable to infer group activities from the relationships between individual actions. However, the complex underwater environment and the rapid, similar movements of individual fish tend to produce indistinct action characteristics and adhesion of the data distribution, which makes it difficult to infer the relationships between individual actions directly with a graph convolutional network (GCN). Therefore, this article proposes a graph convolution vector calibration (GCVC) network for fish group activity recognition through individual action relationship reasoning. To improve the reasoning ability of the GCN, an activity feature vector calibration module is designed to resolve the data adhesion and the mismatch between the estimated and true distributions. The idea is to first compute the statistics of the original data distribution and then make each dimension of the activity feature vector follow a Gaussian distribution, so as to generate a better distribution for similar categories. In addition, we also produced a fish activity dataset to verify the performance of the proposed algorithm. The experimental results show that GCVC achieves a group activity recognition accuracy of 93.33% and a Macro-F1 of 93.25%, which are 19.21% and 24.2% higher than before, respectively. With GCVC, the corrected activity feature vector distribution is more consistent and the data adhesion is reduced, so the model can achieve more fully supervised learning.
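
For readers unfamiliar with the Macro-F1 score reported above, the following minimal sketch computes it as the unweighted mean of per-class F1 scores. This is the standard definition of the metric, not code from the GCVC work; the function name and the toy labels in the usage note are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With true labels [0, 0, 1, 1] and predictions [0, 1, 1, 1], the per-class F1 scores are 2/3 and 4/5, so the Macro-F1 is 11/15. Because every class contributes equally, rare activity classes weigh as much as frequent ones.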

Abstract:
Practical applications with visual question answering (VQA) systems are challenging, and recent research has aimed at investigating this important field. Many issues related to real-world VQA applications must be considered. Although existing methods have focused on adding external knowledge and other descriptive information to assist in reasoning, they are limited by the impact of information retrieval errors on downstream tasks and the misalignment of the aggregated information. Thus, the overall performance of these models must be improved. To address these challenges, we propose a novel VQA model that utilizes a differentiated pretrained model to represent the input information and connects the input data with three external knowledge components through a common feature space. To combine the information in the three feature spaces, we propose an information aggregation strategy that employs a weighted score to aggregate the information in the relation and entity spaces in the answer prediction process. The experimental results show that our method achieves good performance in fact-based and zero-shot VQA tasks and achieves state-of-the-art performance with the ZS-F-VQA dataset.

Abstract:
Existing correlation filter (CF) tracking methods are vulnerable to boundary effects, vague target information, and heuristic model updating, as these limitations degrade the detection ability of the learned filter. In response, this article starts from basic CF learning and presents a novel distractor-aware template-coupled correlation filter (DATC-CF) that exploits the spatial-temporal appearance context of the target, aiming to improve the discriminative ability of the learned filter against distractive backgrounds and its descriptive ability in adapting to unexpected scenes. Specifically, the power of spatial context comes from a distractor-aware regularizer weighted by background distractors. By adaptively optimizing the weight of each distractor, our filter training can focus more on the critical distractors. The temporal context is represented by a dynamic template set, and we formulate a template-coupled regularizer that can make use of the commonality over all templates while maintaining a passive filter update under a multi-template learning scheme. DATC-CF integrates the two regularizers and is formulated as a multi-variable joint optimization problem in which a filter ensemble can be learned. With DATC-CF, a multi-model tracking framework, DATC_MM, is developed by maximizing the posterior distribution over the learned filters. For robust tracking, we further apply high-confidence updating and establish a complementary distractor-aware color detector to recover from CF tracking failures. Finally, experiments on several large-scale benchmark datasets demonstrate the effectiveness of the proposed tracking methods against state-of-the-art trackers.

Abstract:
Multi-Beam LiDAR (MBL) sensors sample the real world with discrete 3D point clouds (PC) and have become a major and essential 3D sensing capability for autonomous robots. To ensure accurate point sampling on surfaces, high-resolution MBL sensors (e.g., Ouster OS0-128) are commonly used to collect dense point clouds for robot tasks, including object detection and tracking and simultaneous localization and mapping (SLAM), in applications such as autonomous driving vehicles (ADVs). However, the high cost and the large volume, weight, and energy consumption of such sensors limit their usage in broader applications such as UAV/UGV swarms of small-scale agents with limited payload. Existing studies on Super-Resolution (SR) upsampling of the PC from low-resolution MBL have not considered the geometry semantics of the scenes, thus resulting in suboptimal SR points for downstream subtasks (e.g., SLAM). Thus, this article proposes SGSR-Net, a structure semantics-guided MBL Super-Resolution network. SGSR-Net takes the low-resolution range images of the MBL sensors as input and produces a dense and structure-aware Super-Resolution point cloud from those sparse measurements through a vertical spatial and channel attention-enhanced CNN model coupled with guided Monte Carlo filtering, for indoor LiDAR-SLAM applications. SGSR-Net is validated using datasets collected by a UGV equipped with multiple MBL sensors. The results demonstrate that the proposed CG-LSR (CASE Attention Guided Encoder-Decoder LiDAR Super-Resolution Network) reduces the MAE of the SR points by 12.4%, down to 0.177 m, when compared with the state-of-the-art (SOTA) methods of Shan et al. (2020), Ren et al. (2021), Kwon et al. (2022), and Long and Wang (2022). The indoor SLAM results with SR points produced by SGSR-Net show that the mean and RMSE of the absolute pose error (APE) are decreased by 27% and 30%, down to 0.849 m and 0.902 m, respectively, which significantly improves the indoor SLAM performance and stability of SOTA LiDAR-SLAM systems (i.e., LeGO-LOAM Shan and Englot (2018), Dellenbach et al. (2022), Vizzo et al. (2023), Zhang and Singh (2014)).

Abstract:
Given data with label noise (i.e., incorrect data), deep neural networks would gradually memorize the label noise and impair model performance. To relieve this issue, curriculum learning is proposed to improve model performance and generalization by ordering training samples in a meaningful (e.g., easy to hard) sequence. Previous work takes incorrect samples as generic hard ones without discriminating between hard samples (i.e., hard samples in correct data) and incorrect samples. Indeed, a model should learn from hard samples to promote generalization rather than overfit to incorrect ones. In this article, we address this problem by appending a novel loss function DiscrimLoss, on top of the existing task loss. Its main effect is to automatically and stably estimate the importance of easy samples and difficult samples (including hard and incorrect samples) at the early stages of training to improve the model performance. Then, during the following stages, DiscrimLoss is dedicated to discriminating between hard and incorrect samples to improve the model generalization. Such a training strategy can be formulated dynamically in a self-supervised manner, effectively mimicking the main principle of curriculum learning. Experiments on image classification, image regression, text sequence regression, and event relation reasoning demonstrate the versatility and effectiveness of our method, particularly in the presence of diversified noise levels.

Abstract:
In this article, an effective print-camera (P-C) resistant image watermarking scheme is proposed. To achieve watermark robustness, most existing works try to simulate P-C noise with a sophisticated mathematical model. However, the diversity of P-C noise in the real world is ignored, and the watermarked image may not attain a good balance between high robustness and low distortion. To address this problem, we construct an efficient end-to-end network architecture for watermark embedding and extraction. Specifically, a deep noise simulation network (NSN) is designed to simulate the fusion process of real P-C noise, which helps to generate highly robust watermarked images. Also, a multitask loss function based on just-noticeable-difference (JND) is proposed to constrain the learning of the residual image containing the watermark information, so that the distortion of the generated watermarked image is significantly reduced. Experimental results show that our scheme achieves high robustness against the P-C process while maintaining satisfactory watermark capacity and visual quality of the watermarked image.

Abstract:
The detection performance in crowd scenes is limited by the recall of hard objects (e.g., occluded objects). Such objects must be successfully detected and retained by non-maximum suppression (NMS) while false positives are kept under control. Existing dynamic label assignment algorithms can help recall these objects by adaptively allocating appropriate positive samples; however, they ignore the alignment with the selection rules of NMS. As a result, object detection in crowd scenes remains very sensitive to the NMS threshold setting, and existing methods can only set a low NMS threshold to avoid excessive false positives, which causes some objects to be missed. These methods also generally lack sufficient excitation for positive samples, which hinders the recall of hard instances in crowd scenes. This article proposes a novel dynamic label assignment strategy for object detection in crowd scenes, called non-maximum suppression guided label assignment (NGLA), which aligns the assignment strategy with the NMS process and learns more prominent positive samples. Following NMS, NGLA uses the IoU between each sample and its corresponding best sample to define positive and negative samples. To cooperate with NGLA, an NMS-aware loss is proposed to dynamically assign sample weights when supervising sample predictions, which also considers the IoU with the best sample. In addition, for better classification prediction, a regression-assisted classification branch is designed to help detectors perceive the relation between the regression predictions of each sample and the corresponding best sample. Experiments demonstrate that NGLA outperforms other label assignment methods on CrowdHuman and Citypersons, and is less sensitive to the NMS threshold in crowd scenes.
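
To make the threshold sensitivity discussed above concrete, here is a minimal sketch of standard greedy NMS, the post-processing step that NGLA is designed to align with. It is not the NGLA method itself. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names and the example boxes in the usage note are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box does not
    exceed iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Take two heavily overlapping boxes, (0, 0, 10, 10) and (1, 0, 11, 10), which could be two neighboring pedestrians, plus a distant box (20, 20, 30, 30), with scores 0.9, 0.8, and 0.7. The first two have an IoU of 90/110, about 0.82. A threshold of 0.5 keeps indices [0, 2] and drops the second pedestrian, while a threshold of 0.9 keeps all three. This is the low-threshold recall loss that the abstract describes.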

Abstract:
Video delivery over wireless networks with limited network resources and dynamically changing channel quality is an important challenge, and one of the most promising solutions for tackling this problem is to employ multicast transmissions, which improves network resource utilization efficiency. This article focuses on delivering video concurrently to multiple users over mobile networks leveraging Multicast Broadcast Multimedia Service (MBMS) and Mobile Edge Computing (MEC) technology. We propose a k-means clustering and cooperative bargaining game-based adaptive video multicast solution (KGS) over mobile edge networks, with the goal of providing high-quality video delivery service in an envisaged MBMS service area across multiple cell sites. By taking user subgrouping, resource allocation, and bitrate adaptation into account, we establish a Cooperative Bargaining Game (CBG) based joint optimization model for multiple Multicast Broadcast Synchronized Frequency Network (MBSFN) users in mobile edge networks. Then we transform this model into a two-stage convex optimization problem and a nonlinear integer programming problem. We propose a heuristic approach to solve them and achieve a Pareto optimal video delivery strategy for all users. Finally, the efficiency of the proposed scheme is evaluated through extensive simulations.

Abstract:
For video bit-depth enhancement (VBDE) tasks, inter-frame information is critical for removing false contours and recovering the details in low bit-depth (LBD) videos. However, due to different structural distortions and complex motions in the neighboring frames, it is difficult to effectively utilize inter-frame information. Most algorithms rely on alignment operations to provide information from neighboring frames and suffer from slow inference speed due to the complex alignment module design. Meanwhile, most existing methods perform intra-frame feature extraction and inter-frame information fusion sequentially, and thus fail to fuse spatio-temporal information efficiently. Therefore, in this paper, we propose a two-stage progressive group (TSPG) network to find complementary information related to the target frame without adopting an alignment operation. To achieve intra-frame feature extraction and inter-frame feature fusion simultaneously, we propose a parallel spatio-temporal fusion (PSTF) module with a dual-branch spatial-temporal residual (DSTR) block to focus on more useful temporal information while ensuring a faster inference speed. Extensive experiments on public datasets demonstrate that our proposed multi-stage spatio-temporal fusion network (named MSTFN) can quickly and effectively eliminate false contours and recover high-quality target frames. Furthermore, our method outperforms the state-of-the-art methods in terms of both PSNR and SSIM, and can reach faster inference speeds.

Abstract:
Nowadays, video surveillance systems are widely deployed in public areas. However, in areas beyond the reach of surveillance cameras, it still seems impossible to find suspects by depending only on eyewitness memory. Therefore, technology that can detect particular pedestrians using only text-based attributes, known as text-attribute person search, has attracted considerable attention from academia. Most existing text-attribute person search methods focus on learning better feature representations by designing better network structures or using local information, but lack direct constraints between modalities. This paper proposes a feature-embedding-motivated, graph attention network-based model that optimizes the feature extraction process through its attention mechanism. Meanwhile, this paper studies the effectiveness of the attention mechanism in feature alignment and thus redesigns the cross-attention module, reducing the complexity of the model and constraining the inter-modality gap as far as possible through the self-attention mechanism of the graph attention network. In this way, the method simultaneously offsets the influence of modality-specific features and reduces the number of parameters, thereby improving performance and reducing time costs. In addition, according to the inherent characteristics of attributes, this article introduces a novel embedding space, which effectively enhances the discrimination ability of the model. Extensive experiments on two widely used text-attribute person search benchmarks demonstrate the superiority of our model over state-of-the-art methods.

Abstract:
Multimedia applications are often associated with cross-domain knowledge transfer, where Unsupervised Domain Adaptation (UDA) can be used to reduce the domain shifts. Open Set Domain Adaptation (OSDA) aims to transfer knowledge from a well-labeled source domain to an unlabeled target domain under the assumption that the target domain contains unknown classes. Existing OSDA methods consistently lay stress on the covariate shift, ignoring the potential label shift problem. The performance of OSDA methods degrades drastically under intra-domain class imbalance and inter-domain label shift. However, little attention has been paid to this issue in the community. In this paper, Imbalanced Open Set Domain Adaptation (IOSDA) is explored, where covariate shift, label shift, and category mismatch exist simultaneously. To alleviate the negative effects caused by label shift in OSDA, we propose Open-set Moving-threshold Estimation and Gradual Alignment (OMEGA), a novel architecture that improves existing OSDA methods on class-imbalanced data. Specifically, a novel unknown-aware target clustering scheme is proposed to form tight clusters in the target domain to reduce the negative effects of label shift and intra-domain class imbalance. Furthermore, moving-threshold estimation is designed to generate a specific threshold for each target sample rather than using one for all. Extensive experiments on IOSDA, OSDA and OPDA benchmarks demonstrate that our method significantly outperforms existing state-of-the-art methods.

Abstract:
Recently, with the growing popularity of micro-videos, multi-label learning has attracted increasing attention due to its potential commercial value in different scenarios. However, existing methods place more emphasis on the alignment between explicit semantics and visual features, while neglecting the exploration of interactions at fine-grained semantic levels. To address this problem, we propose a novel dual-domain aligned deep hierarchical matrix factorization (DADHMF) method for micro-video multi-label classification. Specifically, we construct a dual-stream deep matrix factorization framework to explore implicit hierarchical semantics and corresponding intrinsic feature representations in top-down and bottom-up ways, respectively. On this basis, we leverage the intralayer alignment strategy to narrow the semantic gap between label and instance domains by introducing adaptive semantic-aware embeddings. Moreover, we further utilize the inverse covariance estimation module to automatically capture latent semantic correlations, and project the structural information into the semantic-aware embeddings to ensure the stability of the intralayer alignment. Extensive experiments on two available micro-video multi-label datasets demonstrate that our proposed method outperforms the state-of-the-art methods.

Abstract:
In the field of multimodal machine learning, multimodal sentiment analysis task has been an active area of research. The predominant approaches focus on learning efficient multimodal representations containing intra- and inter-modality information. However, the heterogeneous nature of different modalities brings great challenges to multimodal representation learning. In this article, we propose a multi-stage fusion framework to dynamically fine-tune multimodal representations via a hybrid-modal attention mechanism. Previous methods mostly only fine-tune the textual representation due to the success of large corpus pre-trained models and neglect the inconsistency problem of different modality spaces. Thus, we design a module called the Multimodal Shifting Gate (MSG) to fine-tune the three modalities by modeling inter-modality dynamics and shifting representations. We also adopt a module named Masked Bimodal Adjustment (MBA) on the textual modality to improve the inconsistency of parameter spaces and reduce the modality gap. In addition, we utilize syntactic-level and semantic-level textual features output from different layers of the Transformer model to sufficiently capture the intra-modality dynamics. Moreover, we construct a Shifting HuberLoss to robustly introduce the variation of the shifting value into the training process. Extensive experiments on the public datasets, including CMU-MOSI and CMU-MOSEI, demonstrate the efficacy of our approach.

Abstract:
Real-world data typically follows a long-tailed distribution. When the few samples of the tail classes do not cover the underlying distribution well, methods such as class re-balancing strategies and decoupled training struggle to work, and additional knowledge needs to be introduced to recover the underlying distribution of the tail classes. In this work, we observe that the similarity between the variances of the feature distributions increases with the class similarity. We also find that well-represented feature distributions typically contain multiple subcenters, which allows for denser samples at the edges of the distribution and encourages the model to learn more robust decision boundaries. Based on these observations, we propose to calibrate the feature distribution of the tail class by transferring the variance of the feature distribution of the head class, and then sample from the calibrated tail class distribution to generate augmented samples. To coordinate with the tail class calibration method, we also propose label-aware noise suppression (LANS) for reducing the generation of noisy samples and a three-stage training scheme for reshaping decision boundaries and compacting feature learning. Experimental results on iNaturalist2018, ImageNet-LT, CIFAR-10-LT, and CIFAR-100-LT show that our method achieves state-of-the-art performance in most metrics compared to similar approaches.
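The calibration idea in this abstract (borrow the variance of a similar head class, then sample around the tail class) can be sketched roughly as follows. This is only an illustration under stated assumptions: the cosine similarity between class means, the Gaussian sampling, and the names calibrate_and_sample, head_means, and head_covs are hypothetical and are not taken from the paper.

```python
import numpy as np

def calibrate_and_sample(tail_feats, head_means, head_covs, n_samples, rng):
    """Sample augmented tail features using the covariance of the most similar head class."""
    tail_mean = tail_feats.mean(axis=0)
    # Assumed similarity measure: cosine similarity between class means.
    sims = head_means @ tail_mean / (
        np.linalg.norm(head_means, axis=1) * np.linalg.norm(tail_mean) + 1e-12
    )
    nearest = int(np.argmax(sims))
    # Assumed distribution: Gaussian centred on the tail mean with the borrowed covariance.
    samples = rng.multivariate_normal(tail_mean, head_covs[nearest], size=n_samples)
    return nearest, samples

rng = np.random.default_rng(0)
tail_feats = np.array([[1.0, 0.9], [1.1, 1.0], [0.9, 1.1]])   # three tail-class features
head_means = np.array([[1.0, 1.0], [-1.0, 1.0]])              # two head-class means
head_covs = np.stack([0.05 * np.eye(2), 0.5 * np.eye(2)])     # their covariances
nearest, augmented = calibrate_and_sample(tail_feats, head_means, head_covs, 5, rng)
```

Here the first head class is selected because its mean points in the same direction as the tail mean, and five augmented feature vectors are drawn.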

Affiliations: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China; State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, China; Beijing Key Laboratory for Cooperative Vehicle Infrastructure Systems and Safety Control, Beihang University, Beijing, China; State Key Laboratory of Intelligent Technology and the Systems Department of Computer Science and Technology, Tsinghua University, Beijing, China

Abstract:
Pure point-based neural networks have recently shown tremendous promise for point cloud tasks, including 3D object classification, 3D object part segmentation, 3D semantic segmentation, and 3D object detection. Nevertheless, it is a laborious process to construct a network for each task due to the artificial parameters and hyperparameters involved, e.g., the depths and widths of the network and the number of sampled points at each stage. In this work, we propose Auto-Points, a novel one-shot search framework that automatically seeks the optimal architecture configuration for point cloud tasks. Technically, we introduce a set abstraction mixer (SAM) layer that is capable of scaling up flexibly along the depth and width of the network. Each SAM layer consists of numerous child candidates, which simplifies architecture search and enables us to discover the optimum design for each point cloud task pursuant to resource constraint from an enormous search space. To fully optimize the child candidates, we develop a weight-entwinement neural architecture search (NAS) technique that entwines the weights of different candidates in the same layer during supernet training such that all candidates can be extremely optimized. Benefiting from the proposed techniques, the trained supernet allows the searched subnets to be exceptionally well-optimized without further retraining or finetuning. In particular, the searched models deliver superior performances on multiple extensively employed benchmarks, 93.9% overall accuracy (OA) on ModelNet40, 89.1% OA on ScanObjectNN, 87.1% instance average IoU on ShapeNetPart, 69.1% mIoU on S3DIS, 70.4% mAP@0.25 on ScanNet V2, and 64.4% mAP@0.25 on SUN RGB-D.

Abstract:
When dealing with crowds, occlusions, and truncations in complex scenes, existing solutions for multi-person pose estimation remain challenging. This is because all examples are randomly organized and equally treated by previous methods during training, which ignores that examples vary significantly in their difficulty levels. Once trained, hard examples are underutilized due to the high proportion of simple training examples, resulting in poor model robustness for complex scenes. To tackle this, we propose a novel training strategy termed DMH-CL for complex pose estimation, brought from curriculum learning (CL) which mainly addresses easy examples in the early training stage and hard ones in the later stage. Different from typical CL methods, we define easy/hard examples via mining both the dataset-specific statistical difficulty and the multi-model evaluated difficulty. After that, we adopt an annealing arrangement strategy to construct learning courses from easy to hard. Furthermore, we introduce a model learning feedback indicator, i.e., Dynamic Model Hardness (DMH) to conduct course scheduling, and to explicitly explore hard poses and utilize the knowledge learned from easy poses to better handle complex scenes as well. Our DMH-CL is model-agnostic and can be easily applied to various pose estimators including single-stage models and two-stage models, and achieves significant improvements on two challenging benchmarks especially for complex scenes. Notably, it achieves substantial performance gains of 2.6% and 4.6% for hard poses compared to the strong single-stage model PETR on CrowdPose and COCO datasets, respectively. Source codes and models are publicly available online.

Abstract:
Face forgery technology has developed rapidly, causing severe security issues in society. Recently, with the continuous emergence of forgery techniques and types, most forensics methods suffer from the generalization problem. In particular, it is difficult for existing generalized methods to detect fake faces with unseen fake types. The reason is that the distribution gaps among cross-forgery types are too large. In this article, we propose a novel generalized framework to narrow large gaps based on bridging cross-domain alignment to solve this problem. Specifically, our framework consists of three key steps: preventing, bridging and aligning distribution gaps. Firstly, in the feature mining stage, taking advantage of the ability of Instance Normalization (IN) to better tolerate domain gaps, we design Adaptive Batch and Instance Normalization (ABIN) to replace the commonly used BN to adaptively extract features to preliminarily prevent domain gaps. Secondly, we propose to generate bridging samples distributed among the inter-domains to fill large gaps based on progressive linear interpolation operation. Finally, with the help of bridging samples, the cross-domain alignment is performed to better narrow distribution gaps to refine data distribution, which helps to learn a more generalized framework. Extensive experiments show that our proposed framework achieves the state-of-the-art generalized performance.
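The bridging step described in this abstract relies on linear interpolation between domains. A minimal sketch of that operation is given below, assuming feature vectors from two forgery domains and evenly spaced mixing ratios; the function name bridging_samples and the even spacing are illustrative assumptions, and the paper's progressive schedule is not specified in the abstract.

```python
import numpy as np

def bridging_samples(feat_a, feat_b, n_steps):
    """Return n_steps points evenly spaced strictly between feat_a and feat_b."""
    # Drop the endpoints 0 and 1 so that only intermediate samples are produced.
    lams = np.linspace(0.0, 1.0, n_steps + 2)[1:-1]
    return np.stack([(1.0 - lam) * feat_a + lam * feat_b for lam in lams])

feat_a = np.array([0.0, 0.0])   # feature from one domain (toy values)
feat_b = np.array([4.0, 8.0])   # feature from another domain (toy values)
bridges = bridging_samples(feat_a, feat_b, n_steps=3)
```

With these toy inputs the three bridging samples lie at one quarter, one half, and three quarters of the way from feat_a to feat_b.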

Abstract:
There is a growing consensus in the research community that the optimization of low-light image enhancement approaches should be guided by the visual quality perceived by end users. Despite the substantial efforts invested in the design of low-light enhancement algorithms, there has been comparatively limited focus on assessing subjective and objective quality systematically. To mitigate this gap and provide a clear path towards optimizing low-light image enhancement for better visual quality, we propose a gap-closing framework. In particular, our gap-closing framework starts with the creation of a large-scale dataset for Subjective QUality Assessment of REconstructed LOw-Light Images (SQUARE-LOL). This database serves as the foundation for studying the quality of enhanced images and conducting a comprehensive subjective user study. Subsequently, we propose an objective quality assessment measure that plays a critical role in bridging the gap between visual quality and enhancement. Finally, we demonstrate that our proposed objective quality measure can be incorporated into the process of optimizing the learning of the enhancement model toward perceptual optimality. We validate the effectiveness of our proposed framework through both the accuracy of quality prediction and the perceptual quality of image enhancement.

Abstract:
Although significant progress has been made in few-shot learning, most existing few-shot image classification methods require supervised pre-training on a large number of samples of base classes, which limits their generalization ability in real-world applications. Recently, large-scale Vision-Language Pre-trained models (VLPs) have been gaining increasing attention in few-shot learning because they can provide a new paradigm for transferable visual representation learning with easily available text on the Web. However, the VLPs may neglect detailed visual information that is difficult to describe in language, but important for learning an effective classifier to distinguish different images. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative adapted visual features by comprehensively using an implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss. The implicit knowledge distillation is designed to transfer the fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.

Abstract:
With the successful integration of contrastive learning and graph neural networks, graph contrastive learning (GCL) has demonstrated superior performance in graph representation learning. The majority of earlier works tend to utilize dual-view frameworks. However, they incur high computational costs; additionally, we observe that they can hardly obtain robust results, as the training processes swing between two important metrics: alignment and uniformity. We address these problems by designing a novel single-view paradigm called Light Single-view Graph Contrastive Learning (LSGCL). To reduce time consumption, we use a single-view framework. Specifically, the input graph is fed directly into a message-passing pattern encoder by concatenating the row-wise normalized hidden representations. Next, a novel single-view instance discrimination is applied, redefining the anchor, positive, and negative samples. The anchor is the positive sample, and the other nodes are negative. We also discuss why LSGCL can successfully achieve the trade-off between alignment and uniformity. In particular, the obtained representation is perfectly aligned, and visualizations show that the representation can provide cutting-edge value under uniformity. Albeit simple, our LSGCL achieves performance comparable to or better than state-of-the-art methods while incurring only about 20% of the time cost of the state-of-the-art baselines.
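To make the two metrics named in this abstract concrete, the sketch below uses definitions that are common in the contrastive learning literature: alignment as the mean squared distance between an anchor and its positive, and uniformity as the log of the mean Gaussian potential over distinct pairs. Whether the paper uses exactly these forms is an assumption, as are the temperature t=2.0 and the function names.

```python
import numpy as np

def row_normalize(h):
    """Scale each row of h to unit L2 norm."""
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def alignment(z, z_pos):
    """Mean squared distance between each anchor and its positive sample."""
    return float(np.mean(np.sum((z - z_pos) ** 2, axis=1)))

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is more uniform)."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)   # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

hidden = np.array([[3.0, 4.0], [0.0, 2.0], [-1.0, 0.0]])   # toy node representations
z = row_normalize(hidden)
align_single_view = alignment(z, z)   # anchor and positive coincide
unif = uniformity(z)
```

Under these definitions, a setting in which the anchor is its own positive yields an alignment of exactly zero, which matches the abstract's remark that the representation is perfectly aligned.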

Abstract:
The domain shift of crowd scenes significantly hinders the application of crowd counting models in open scenarios. Although domain adaptation methods for crowd counting have bridged this gap to some extent, they ignore one of the significant causes of domain shift, which is the inter-domain data distribution bias. We discover that there exists a connection between the known and unknown distribution, which can be utilized by similarity mining to address the domain shift. However, there are still challenges related to insufficient and inaccurate similarity mining. In this article, a novel Fine-grained Inter-domain Similarity Mining (FSIM) framework is proposed. To comprehensively explore the similar distributions between source and target domains, we propose a Multi-scale Distribution Alignment (MDA) module based on diffusion retrieval. To enhance the reliability of inter-domain similarity mining, we propose a Multi-retrieval Refinement (MR) module based on evidence theory, which serves as an uncertainty measurement method. Eventually, to eliminate the data distribution bias, we perform model retraining using a similar distribution. Extensive experiments conducted on five standard crowd counting benchmarks, SHA, SHB, QNRF, NWPU, and JHU-CROWD++, show that the proposed FSIM has strong generalizability.

Abstract:
Face verification has seen remarkable progress that benefits from large-scale publicly available databases. However, it remains a challenge how to generalize a pretrained face verification model to a new scenario with a limited amount of data. In many real-world applications, the training database only contains a limited number of identities with two images for each identity due to privacy concerns. In this article, we propose to transfer knowledge from a pretrained unmasked face verification model to a new model for verification between masked and unmasked faces, to meet the application requirements during the COVID-19 pandemic. To overcome the lack of intra-class diversity resulting from only a pair of masked and unmasked faces for each identity (i.e., two shots for each identity), a static prototype classification function is designed to learn features for masked faces by utilizing unmasked face knowledge from the pretrained model. Meanwhile, a contrastive constrained embedding function is designed to preserve unmasked face knowledge of the pretrained model during the transfer learning process. By combining these two functions, our method uses knowledge acquired from the pretrained unmasked face verification model to proceed with verification between masked and unmasked faces with a limited amount of training data. Extensive experiments demonstrate that our method can perform better than state-of-the-art methods for verification between masked and unmasked faces in the few-shot transfer learning setting.

Abstract:
The proliferation of 360-degree video applications has brought significant challenges to existing networks. To meet the requirements of high transmission rate, low interaction latency, and high reliability, Mobile Edge Computing (MEC) has emerged as a promising technology that enables caching and processing at network edges. In this article, we present DACOD360, a deadline-aware content delivery system for the 360-degree video streaming over MEC networks. To address the challenges such as unpredictable viewports, uneven cached tiles, concurrent requests, and dynamic bandwidth, we formulate the deadline-aware delivery problem as a long-term integer program model to maximize the Quality of Experience (QoE) under the constraints of network bandwidth, cache capacity, and deadline. This optimization problem is a complex sequential decision that considers both deadline-constrained service quality at the temporal scale and multi-user resource allocation at the spatial scale. To solve it, we decompose the original problem into two sub-problems and solve them iteratively using Deep Reinforcement Learning (DRL) and Cooperative Bargaining Game (CBG). Comprehensive experiments are conducted in a wide variety of environments, and the results demonstrate that our proposed scheme outperforms the state-of-the-art schemes in terms of long-term QoE, traffic reduction, and other metrics.

Abstract:
Clothes-changing person re-identification (Re-ID) aims at learning identity-relevant feature representations among clothing-changed persons. Currently, the state-of-the-art methods accomplish this task by using additional assistance (e.g., silhouettes, sketches, clothes labels, etc.) to explore identity-relevant information. However, humans do not require redundant assistance information to retrieve clothing-changed persons. It is commonly known that humans can recall targets they have seen before with a simple reminder. Inspired by human perception, we propose an association and forgetting learning (AFL) framework for clothes-changing person re-identification. Specifically, on the one hand, during the association learning process, the AFL framework constructs association factors for each identity to simulate the reminders found in human perception. Then, the original instances and the explored hardest positive instances are cross-correlated by the association factors to learn identity-relevant features. On the other hand, the model is forced to forget the identity-irrelevant features by the proposed forgetting learning module, which improves the intra-class compactness. Finally, we further propose a clustering relationship exploration (CRE) module to optimize the cluster distribution of clothes-changing instances, which enables AFL to also be effectively applied in unsupervised settings, improving the universal applicability of the model. Extensive experiment results obtained on clothes-changing person Re-ID datasets under supervised and unsupervised settings demonstrate the superiority of the proposed method over the existing state-of-the-art methods.

Abstract:
Voice spoofing detection is a technique for enhancing the security of automatic speaker verification systems, but existing research still faces problems such as weak detection capability and expensive computation. To address these problems, this work presents a lightweight voice anti-spoofing method that uses an improved one-class learning scheme, DOC-Softmax, and knowledge distillation. The main idea of DOC-Softmax is to learn a feature space where the genuine samples occupy a compact region and the spoofing samples are separated from the bona fide region by a certain margin. A dispersion loss is also introduced so that the spoofing samples cover the whole spoofing space as much as possible. Moreover, a lightweight voice spoofing detection model is designed to speed up inference, and knowledge distillation is employed to improve the representation power of the lightweight model. Without any data augmentation or ensemble learning, a series of experiments are conducted on the LA and PA scenarios of the ASVspoof 2019 dataset, and the experimental results indicate that the proposed method performs better than most existing voice anti-spoofing methods.

Abstract:
Due to the lack of labeled data in many real-world applications, unsupervised domain adaptation has attracted a great deal of attention in the machine learning community through its use of labeled data from source domains. However, how to make full use of the discriminative information from different sources remains a challenge due to various domain gaps. In this article, we propose a domain complementary adaptation method by leveraging the diversity between sources and the discriminability of each source with contrastive learning. In the proposed method, we adopt several branch networks, denoted as domain branch networks, to learn different views of discriminative domain-invariant features from each source. Moreover, an ensemble classification network trained with domain-invariant features from all domain branch networks is adopted to guide the domain branch networks in providing diverse knowledge. We design a domain mutual contrastive loss by forcing the domain branch networks to be different from one another and be consistent with the ensemble classification network to learn diverse domain-invariant features. To further improve the discriminability of domain branch networks, a domain structure-oriented contrastive loss is proposed to learn the discriminative intrinsic neighborhood structure across each source and target domain. Extensive experiments on the Office-31, Office-Home and DomainNet datasets show that the proposed method outperforms state-of-the-art methods.

Abstract:
Video question answering (VideoQA) is an essential task in vision-language understanding, which has attracted considerable research attention recently. Nevertheless, existing works mostly achieve promising performance on short videos with a duration within 15 seconds. For VideoQA on minute-level long-term videos, those methods are likely to fail because they lack the ability to deal with noise and redundancy caused by scene changes and multiple actions in the video. Considering the fact that the question often remains concentrated in a short temporal range, we propose to first locate the question to a segment in the video and then infer the answer using the located segment only. Under this scheme, we propose “Locate before Answering” (LocAns), a novel approach that integrates a question localization module and an answer prediction module into an end-to-end model. During the training phase, the available answer label not only serves as the supervision signal of the answer prediction module, but also is used to generate pseudo temporal labels for the question localization module. Moreover, we design a decoupled alternative training strategy to update the two modules separately. In the experiments, LocAns achieves state-of-the-art performance on three modern long-term VideoQA datasets, NExT-QA, ActivityNet-QA, and AGQA. Its qualitative examples show the reliable performance of the question localization.

Abstract:
The blackbox nature of deep models has prompted a growing interest in explaining their inner workings and decision-making processes. Although backpropagation (BP)-based attribution methods are popular for visual interpretation, existing methods frequently yield implausible outcomes. For example, gradient-based attribution methods tend to highlight irrelevant regions and generate noise, while CAM-based attributions suffer from low resolution and blurry results. These limitations undermine their ability to correctly identify the target objects and fail to provide the desired justification to guarantee the credibility of the model's decision-making. In this article, we analyze plausibility issues in the frequency domain and point out that the plausibility issues correspond to frequency-domain incompleteness, i.e., the frequency-domain representation of explanations lacks low- or high-frequency components. Then, we propose a straightforward yet effective approach, threshold interception and fusion (TIF), to address this issue by fusing multilayer attributions. Our strategy involves collecting attribution results for all neurons and dividing the attribution map into a concept region that represents the current neuron and background regions based on a given threshold value \alpha. We then fuse these concept regions with the neuron weights in each layer and upsample the layer attributions to match the input size. Finally, we obtain the overall attribution by summing the layer attributions pixelwise. Our experiments demonstrate TIF efficacy by consistently enhancing visual performance across a variety of gradient-based attributions. To further demonstrate the ability to provide compact and fine-grained target objects, we directly employ TIF for the weakly supervised semantic segmentation task. Our results illustrate that TIF significantly outperforms existing methods without additional supervision or architectural modifications. 
We also observe an overall TIF improvement in the fidelity metric, suggesting that compactness and fine-graininess are not only plausibility issues but also fidelity issues.
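The fusion pipeline outlined in this abstract (threshold each neuron's attribution map, weight and combine the concept regions per layer, upsample, then sum across layers) might look roughly like the sketch below. Several details are assumptions made only for illustration: the threshold is taken as alpha times each map's maximum, upsampling is nearest-neighbour with integer factors, and all function names and toy values are hypothetical.

```python
import numpy as np

def layer_attribution(neuron_maps, neuron_weights, alpha):
    """Keep the concept region of each neuron map and fuse the maps with neuron weights."""
    # neuron_maps has shape (n_neurons, h, w).
    # Assumed rule: entries below alpha * (per-map maximum) count as background.
    thresh = alpha * neuron_maps.max(axis=(1, 2), keepdims=True)
    concept = np.where(neuron_maps >= thresh, neuron_maps, 0.0)
    return np.tensordot(neuron_weights, concept, axes=1)   # weighted sum, shape (h, w)

def upsample_nearest(layer_map, size):
    """Nearest-neighbour upsampling by integer factors to the input size."""
    fy, fx = size[0] // layer_map.shape[0], size[1] // layer_map.shape[1]
    return np.repeat(np.repeat(layer_map, fy, axis=0), fx, axis=1)

input_size = (4, 4)
maps_deep = np.array([[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]])   # 2 neurons, 2x2
maps_shallow = np.ones((1, 4, 4))                                            # 1 neuron, 4x4
layers = [
    layer_attribution(maps_deep, np.array([1.0, 0.5]), alpha=0.5),
    layer_attribution(maps_shallow, np.array([2.0]), alpha=0.5),
]
# Overall attribution: pixelwise sum of the upsampled layer attributions.
total = sum(upsample_nearest(layer, input_size) for layer in layers)
```

In this toy case the coarse 2x2 layer contributes blocky low-frequency structure and the 4x4 layer contributes a finer map, which is the kind of complementarity the abstract attributes to multilayer fusion.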

Abstract:
Deep image prior (DIP) is an emerging technology that indicates that the structure of an untrained network can serve as an excellent prior for image restoration. It bridges the gap between training-based and training-free methods and exhibits considerable potential in image compressed sensing (CS) reconstruction. In this article, we extend DIP and propose a novel Low-Rank Regularization Video Compressed Sensing Network for CS video reconstruction (dubbed LRR-VCSNet). We explore the application of a low-rank latent tensor with an untrained network for global low-rank regularization on video reconstruction, and the interframe low-rank approximation for framewise nonlocal low-rank regularization in the data space is also exploited. In addition, we design the structure of the untrained network based on the encoder-decoder architecture to improve the performance. Extensive experiments on six standard CIF video sequences show that LRR-VCSNet significantly outperforms traditional video CS methods and achieves competitive results when compared with the state-of-the-art training-based video CS method.
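As a generic illustration of low-rank approximation across frames, the sketch below applies a plain truncated SVD to a matrix whose rows are vectorized frames. This is a standard technique shown for orientation only; the regularizer actually used in the paper is not described in the abstract, and the name low_rank_approx and the toy data are hypothetical.

```python
import numpy as np

def low_rank_approx(frames, rank):
    """Best rank-`rank` approximation (in Frobenius norm) of a frames-by-pixels matrix."""
    u, s, vt = np.linalg.svd(frames, full_matrices=False)
    # Keep only the leading singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

# Toy "video": three frames that are scaled copies of one 4-pixel pattern, so rank is 1.
pattern = np.array([1.0, 0.0, 2.0, 1.0])
frames = np.outer(np.array([1.0, 2.0, 3.0]), pattern)
approx = low_rank_approx(frames, rank=1)
```

Because the toy frames are exact multiples of one pattern, the rank-1 approximation reproduces them; real video frames are only approximately low rank, which is why low rank is used as a regularizer rather than a hard constraint.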

Abstract:
Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods are often limited to independently encoding and classifying each video segment separated from the full video (which can be regarded as the segment-level representations of events). However, they ignore the semantic consistency of the event within the same full video (which can be considered as the video-level representations of events). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module to explore video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE is proposed to obtain the event semantic information at the video level. Furthermore, ISCE takes video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss to encourage the network to filter out the irrelevant segment pairs and a new smooth loss to further increase the gap between different categories of events in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings, thus verifying the effectiveness of our method.

Abstract:
For few-shot object detection, this work proposes a binary similarity detector (BSDet), which realizes a novel similarity-based multiple binary classification and enhances the feature margin between positive and hard negative samples. First, we revisit the classification paradigm, concluding that the multiple binary classification paradigm is more suitable than the multi-class classification paradigm for the few-shot task. Hence, we propose a binary similarity head (BSH) by posing the classification task as multiple binary similarity measurements rather than a multi-class prediction. Second, focusing on the hard negative samples, we propose a feature enhancement module (FEM). During the training phase, the FEM pushes the features of positive and hard negative samples far away from each other, and thus effectively suppresses false positives. Extensive experiments and visualizations indicate that our method achieves state-of-the-art performance on few-shot object detection tasks.

Abstract:
To improve the performance of deep neural networks, the Mixup method has been proposed to alleviate their memorization issues and sensitivity to adversarial samples. This provides networks with better generalization abilities. The learning principle of Mixup is essentially to train deep neural networks for regularization tasks with a convex combination of the original feature vectors and their labels. However, soft labels are generated directly using the mixing ratio without dealing with the uncertain information generated during the mixing process. Therefore, this paper proposes a new data augmentation method based on Mixup and Dempster-Shafer theory called DS-Mixup, which is a regularizer that can express and deal with the uncertainty caused by ambiguity. This method uses interval numbers to generate mass functions of mixed samples to model the distribution of set-valued random variables; then, ambiguous decision spaces are constructed, and soft labels with single-element subsets and multielement subsets are generated to further improve the delineation of decision boundaries during the training process. In addition, an evidence neural network with DS-Mixup is designed in this paper to accomplish recognition or classification tasks. Experimental results obtained on multimedia datasets, including attribute, image, text and signal data, show that the proposed method achieves more effective data augmentation effects and further improves the performance of deep neural networks.
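For reference, the standard Mixup operation that this abstract builds on forms a convex combination of two inputs and of their labels with the same mixing ratio. The sketch below shows only that baseline, not DS-Mixup; the fixed ratio lam=0.7 is chosen for illustration, whereas in practice the ratio is usually drawn from a Beta distribution.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two samples and of their (one-hot) labels."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

x_a, y_a = np.array([1.0, 0.0]), np.array([1.0, 0.0])   # sample of class 0
x_b, y_b = np.array([0.0, 1.0]), np.array([0.0, 1.0])   # sample of class 1
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.7)
```

The resulting soft label is determined entirely by the mixing ratio, which is the behaviour the abstract identifies as ignoring the uncertainty introduced by mixing.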

Abstract:
Current oriented object detection methods mainly utilize a vanilla coordinate-angle representation for bounding box regression, which usually suffers from inconsistency between the bounding box regression losses and prediction errors induced with respect to different rotation angles, aspect ratios, and scales. Therefore, although the existing oriented object detectors have achieved very good performances under coarse evaluation metrics such as AP50, their performance significantly degrades when using stricter evaluation metric such as AP75. To address the abovementioned issues, we propose a new regression method with bounding box vectorization that implicitly represents the shape and orientation of an object with a set of orthogonal vectors. By doing this, the proposed method delicately avoids the inconsistency issues encountered in oriented bounding box regression. During training, we introduce the Tanimoto coefficient to evaluate the similarity of the bounding box vector in a shape- and orientation-aware manner, and we refer to the proposed box-to-vector loss as the B2V loss. In addition to 2D object detection, the proposed method can be easily generalized to 3D scenarios involving orientation estimation, such as autonomous driving. We evaluate the proposed method through extensive experiments conducted on four popular oriented object detection datasets, including both 2D and 3D datasets, where the proposed method significantly outperforms the recently developed state-of-the-art methods when using a more accurate evaluation metric.
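The Tanimoto coefficient mentioned in this abstract has a standard form for real-valued vectors, T(a, b) = a·b / (|a|^2 + |b|^2 - a·b). The sketch below evaluates it on toy 2D vectors; how the paper assembles box vectors and turns the coefficient into the B2V loss is not specified in the abstract, so nothing here should be read as that loss.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two real vectors (assumes they are not both zero)."""
    dot = float(np.dot(a, b))
    return dot / (float(np.dot(a, a)) + float(np.dot(b, b)) - dot)

same = tanimoto(np.array([3.0, 4.0]), np.array([3.0, 4.0]))          # identical vectors
orthogonal = tanimoto(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # different orientation
scaled = tanimoto(np.array([1.0, 0.0]), np.array([2.0, 0.0]))        # same orientation, other length
```

Unlike cosine similarity, the coefficient drops below 1 when two vectors share a direction but differ in length, so it responds to both orientation and size.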

Abstract:
Local feature extraction consists of keypoint detection and local descriptor extraction. Firstly, in keypoint detector learning, existing covariance constraint loss functions cannot constrain the probability distribution shapes in local probability maps that surround keypoints. And existing auxiliary peak loss functions, which are used to alleviate the problem, impair the performance of local feature methods. To solve this problem, we propose a novel Covariant Peak constraint Loss (CP Loss) which is defined as the expectations of local probability maps' position errors. Minimizing our CP Loss can make local probability maps accurately peak at reliable keypoints. Secondly, in descriptor learning, the Neural Reprojection Error (NRE) aims at constraining dense descriptor maps of images. But we argue that only those descriptors of keypoints need to be constrained. Thus, we propose a novel Conditional Neural Reprojection Error (CNRE) that is only conditioned on keypoints. Compared with the NRE, our CNRE can achieve much higher efficiency and produce more keypoint-specific descriptors with better matching performance. We use our CP Loss and CNRE to train a local feature network named as CPCN-Feat. Experimental results show that our CPCN-Feat achieves state-of-the-art performance on four challenging benchmarks.

Abstract:
Partial multi-label learning (PML) needs to address the problem of multi-label learning when the dataset contains redundant information. PML is more challenging than traditional multi-label learning, because PML needs not only to perform the multi-label classification task, but also to reduce the impact of noisy information on the model. Existing PML methods suffer from the following problems. (1) Only a single source of noise is considered. (2) Some methods ignore the label correlations. To solve the above problems, we propose a new dual noise elimination and dynamic label correlation guided partial multi-label learning method (PML-DNDC). Specifically, the hidden ground-truth label matrix is decomposed into two compressed matrices of instance and classifier, which are used to approximate the candidate label matrix, to eliminate the negative effects of label noise on the model. On one hand, the compressed instance matrix maintains local structural consistency with the original instances, eliminating noise in the features. On the other hand, dynamic label correlation guidance is designed to help classifier training by dynamically exploring the potential label correlations, which encourages relevant labels to obtain similar classifiers. After extensive experiments and analyses, we conclude that the proposed PML-DNDC is superior to the state-of-the-art methods.

Abstract:
The growing demands for privacy protection challenge the joint training of one model by leveraging multiple datasets. Federated learning (FL) provides a new way to overcome this challenge and has attracted many research interests, which enables multiple parties to collaboratively train a machine learning model without exchanging their local data. Despite some success, the non-independent and identically distributed (non-IID) data distributions in different parties remain challenging and easily damage the performance of FL methods, specifically for the heterogeneous multimodal data. Existing FL studies on non-IID data settings are often dedicated to the label space, neglecting the non-IID issues in feature space, thus limiting their performance when the parties with non-IID multimodal data. This paper proposes a new Federated learning method via Selective feature Alignment (FedSea) to align representations across multiple parties in the feature space. FedSea uses a domain adversarial learning framework consisting of an affine-transform-based generator and a gradient-reversal-based client discriminator to perform IID transformation and reduce data source distinguishability, respectively. An attention-based mask module and a feature IID confidence quantification method are introduced to effectively address the diverse feature non-IID levels across multimodal data. Comprehensive experiments are conducted on three widely-used public datasets and one large-scale industrial dataset, showing FedSea has: 1) better performance than state-of-the-art FL methods on both multimodal and single-modal datasets; 2) superior feature alignment ability on non-IID datasets, and 3) good model interpretability.

Abstract:
Ship license plate recognition (SLPR) plays an important role in intelligent waterway management, but little attention has been paid to SLPR in the scene text recognition (STR) community. Inspired by various outstanding achievements in STR, and taking into account the intrinsic properties of SLPR, we propose a Multi-Modal and Multi-Attention dynamic fusion network (M^3ANet) for SLPR in this article. Specifically, visual-language joint modeling for SLPR is developed and a channel-spatial-self attention dynamic fusion mechanism is proposed for accuracy boosting. Linguistic information extracted from a ship-name-related corpus is explicitly fused with vision features to establish a multi-modal recognition network, which improves the adaptability of the recognition model to occlusion, background confusion, blur, etc. Gated fully fusion is utilized to fuse visual features re-weighted by multi-attention components, inducing flexible compatibility with multiple types of decoders and more refined recognition decoder inputs. Additionally, to comprehensively mine spatially salient text regions in ship license plate images, we investigate grouped spatial attention. Extensive experiments empirically demonstrate the effectiveness of M^3ANet and its superior performance (93.80% on regular images and 90.34% on irregular images) on two benchmarks.

Abstract:
With a deeper understanding of the security issues in steganography, coverless steganography has become a hotspot due to no modification to the carriers. However, the existing coverless video steganographic algorithms have considered a few types of video attacks. In this paper, a robust coverless video steganography based on the similarity of inter-frames is proposed. First, a public video database is selected and preprocessed to construct a Secret Communication Video Database (SCVD). The similarity score between the first and last frames is calculated for video sorting to utilize the temporal characteristics of videos. After that, the mapping table between the secret information and the SCVD is designed for both senders and receivers. Finally, each secret information segment can be represented by one video sequence in the SCVD according to the mapping table to accomplish the data hiding and extraction. Experimental results show that the proposed method performs much better in capacity, robustness, and security than the state-of-the-art methods. It is worth mentioning that the proposed method overcomes the security issue of transmitting a large amount of auxiliary information in coverless video steganographic algorithms.

Abstract:
Full-reference image dehazing quality assessment (FR-IDQA) evaluates the visual quality of a dehazed image by measuring its differences from a clear reference. Existing FR-IDQA methods are not convincing for two reasons: well-aligned datasets of hazy and clear image pairs are lacking, and limited hand-crafted features make it difficult to simulate the complicated perception of the human visual system (HVS). In this work, we build a real-world image dataset, namely RW-Haze, which comprises natural hazy images and their well-aligned clear references. Each clear image is paired with several hazy images with diverse haze levels from slight to heavy. Meanwhile, the existing FR-IDQA works evaluate the dehazed image quality in a global manner, without considering local haze distributions in the original hazy image. In fact, the perceived haze in a natural hazy image is not uniformly distributed, and the haze density varies with scene depth. Based on this prior observation, we design a haze density-aware convolutional neural network (CNN), namely DehIQA, for FR-IDQA. It adopts transfer learning to alleviate the lack of sufficient labeled data. Specifically, we divide image dehazing assessment into two tasks. The source task is to classify unpaired clear and hazy images, which forces the deep network to learn haze-related features. The target task is image quality assessment, which is achieved by transferring the model trained for the source task to the target task. Considering that the perceived distortion in a dehazed image is also not uniform, we introduce a haze density-aware mechanism into DehIQA, which assigns different weights to different local regions in a dehazed image according to the dark channel of the original hazy image. Extensive experimental results show that DehIQA outperforms the state-of-the-art (SOTA) works on the benchmark dataset and achieves better consistency with human perception.

Abstract:
The growing practice of outsourcing captured photos to the cloud has provided users with convenience while also raising privacy concerns. Traditional image encryption techniques prioritize privacy protection but often compromise usability, which is unacceptable for cloud users. To strike a balance between image privacy and usability, scholars have proposed thumbnail-preserving encryption (TPE), whose cipher image preserves the same thumbnail as the plain image while erasing details beyond the thumbnail, providing visual usability while protecting privacy. Regrettably, most of the proposed TPE schemes are not well-suited for widely used JPEG images, and existing TPE schemes supporting JPEG suffer from drawbacks such as poor visual usability, high expansion rate, and the inability to decrypt without loss. Besides, the retrieval designed for TPE-encrypted images exhibits limited generalization. To address these challenges, we pertinently introduce a TPE based on adaptive deviation embedding (TPE-ADE) for JPEG images, incorporating Huffman coding and reversible data hiding techniques. By leveraging JPEG in-compression encryption, we achieve perfectly reversible TPE that enhances visual usability and reduces expansion rates of TPE-encrypted images. Additionally, we encourage the TPE-encrypted images to resemble low-resolution images (LRIs). Then, the convolutional neural network (CNN) is employed to recognize and retrieve LRIs to verify the functionality of TPE-encrypted images. Also, a teacher-assistant-student (TAS) learning paradigm is proposed to optimize the CNN model, enhancing the performances of recognition and retrieval. Experimental results validate the superiority of our encryption algorithm and the effectiveness of TAS.

Abstract:
Designing learning-based no-reference (NR) video quality assessment (VQA) algorithms for camera-captured videos is cumbersome due to the large number of human annotations of quality. In this work, we propose a semi-supervised learning (SSL) framework exploiting many unlabelled and very limited numbers of authentically distorted labelled videos. Our main contributions are twofold. Leveraging the benefits of consistency regularization and pseudo-labelling, our SSL model generates pairwise pseudo-ranks for the unlabelled videos using a student-teacher model on strong-weak augmented videos. We design the strong-weak augmentations to be quality invariant to use the unlabelled videos effectively in SSL. The generated pseudo-ranks are used along with the limited labels to train our SSL model. Our primary focus in SSL for NR VQA is to learn a mapping from video feature representations to quality scores. We compare various feature extraction methods and show that our SSL framework can lead to improved performance on these features. We present a spatial and temporal feature extraction method based on predicting spatial and temporal entropic differences. We show that these features help achieve robust performance when trained with limited data, providing a better baseline to apply SSL. Extensive experiments on three popular VQA datasets demonstrate that the proposed semi-supervised VQA method improves on the performance of existing methods in terms of correlation with human opinion by approximately 15-20%.

Abstract:
Unsupervised domain adaptive object detection (UDA-OD) is a challenging task that aims to improve the generalization of detectors across domains. Although the existing UDA-OD methods have demonstrated their capabilities, they fail to investigate two critical correlations in the adaptation procedure, i.e., 1) the correlation between the features inside an image and 2) the correlation between the domain-invariant and domain-specific features across domains. To take full advantage of these two correlations, we propose a Cyclic Reconstruction and Decoupling Adaptation (CRADA) framework to efficiently decouple and align the features from different domains. Our CRADA builds graphs for images to capture the correlation between the informative points, and decouples it into two components, one for the domain-specific features and the other for the domain-invariant features. To enhance the qualities of the decoupled features, we also propose a cyclic decoupling-reconstruction-decoupling strategy and a swap-and-reconstruction procedure for the decoupled features of different domains. To make the training procedure easier, we introduce a confidence-guided update scheme for the memory bank and overcome the problem of asymmetric categories in each training batch. We conduct comprehensive experiments to verify the effectiveness of our proposed CRADA.

Abstract:
Driven by the high nonlinearity of deep neural networks, deep hashing has shown great potential in cross-modal retrieval applications, significantly bridging the modality gap. Current deep cross-modal hashing usually utilizes affinity matching or local ranking to capture the local semantic relationships in the learned common space, leading to high neighborhood ambiguity. Moreover, most of these frameworks utilize additional regularization terms or margin thresholds to enhance the overall performance, for which searching the model's hyper-parameters under massive training data incurs a substantial overhead. In this paper, with a novel extension of information-theoretic measures, a new deep cross-modal hashing method, named Deep Neighborhood-preserving Hashing (DNpH), is designed to learn a highly separable discrete space, effectively mitigating the semantic gap across different modalities. Specifically, to minimize neighborhood ambiguity, the Quadratic Spherical Mutual Information (QSMI) is first introduced into deep cross-modal hashing to separate neighbors and non-neighbors well, while it is free of tuning parameters during model training compared with other similarity measures. To optimize the quadratic mutual information loss smoothly, a square clamping method is developed to improve the stability of model optimization and avoid converging to bad local optima. Besides, two transformer encoders are exploited as feature extractors for multi-modal samples to learn informative semantic representations. Finally, we compare our proposed DNpH framework with various state-of-the-art cross-modal hashing methods on four public datasets, and extensive experimental results demonstrate our contributions and show that DNpH outperforms the compared baselines on different evaluation metrics.

Abstract:
Virtual try-on is a promising computer vision topic with a high commercial value wherein a new garment is visually worn on a person with a photo-realistic effect. Previous studies conduct their shape and content inference at one stage, employing a single-scale warping mechanism and a relatively unsophisticated content inference mechanism. These approaches have led to suboptimal results in terms of garment warping and skin reservation under challenging try-on scenarios. To address these limitations, we propose a novel virtual try-on method via progressive inference paradigm (PGVTON) that leverages a top-down inference pipeline and a general garment try-on strategy. Specifically, we propose a robust try-on parsing inference method by disentangling semantic categories and introducing consistency. Exploiting the try-on parsing as the shape guidance, we implement the garment try-on via warping-mapping-composition. To facilitate adaptation to a wide range of try-on scenarios, we adopt a covering more and selecting one warping strategy and explicitly distinguish tasks based on alignment. Additionally, we regulate StyleGAN2 to implement re-naked skin inpainting, conditioned on the target skin shape and spatial-agnostic skin features. Experiments demonstrate that our method has state-of-the-art performance under two challenging scenarios.

Abstract:
Domain shift poses a significant challenge in Cross-Domain Facial Expression Recognition (CD-FER) due to the distribution variation between the source and target domains. Current algorithms mainly focus on learning domain-invariant features through global feature adaptation, while neglecting the transferability of local features across different domains. Additionally, these algorithms lack discriminative supervision during training on target datasets, resulting in deteriorated feature representation in the target domain. To address these limitations, we propose an Adaptive Global-Local Representation Learning and Selection (AGLRLS) framework. The framework incorporates global-local adversarial adaptation and semantic-aware pseudo label generation to enhance the learning of domain-invariant and discriminative feature representation during training. Meanwhile, a global-local prediction consistency learning is introduced to improve classification results during inference. Specifically, the framework consists of separate global-local adversarial learning modules that learn domain-invariant global and local features independently. We also design a semantic-aware pseudo label generation module, which computes semantic labels based on global and local features. Moreover, a novel dynamic threshold strategy is employed to learn the optimal thresholds by leveraging independent prediction of global and local features, ensuring filtering out the unreliable pseudo labels while retaining reliable ones. These labels are utilized for model optimization through the adversarial learning process in an end-to-end manner. During inference, a global-local prediction consistency module is developed to automatically learn an optimal result from multiple predictions. To validate the effectiveness of our framework, we conduct comprehensive experiments and analysis based on a fair evaluation benchmark. The results demonstrate that the proposed framework outperforms the current competing methods by a substantial margin.

Abstract:
Cross-domain few-shot learning (CDFSL) has received great interest for its effectiveness in solving the problem of the shift between source and target domains in few-shot scenarios. To extract more representative features, recent CDFSL works have exploited small-scale unlabeled samples from the target domain during the feature extraction phase. Existing self-supervised CDFSL methods, however, typically fine-tune the weights of the pre-trained model without taking into account the mismatch between source and target domains. To address this shortcoming, we introduce a self-supervised soft weight pruning strategy for cross-domain few-shot classification tasks with unlabeled target data. Starting from a pre-trained network from the source domain, our approach iterates between pruning out the relatively unimportant connections of the network and reactivating the pruned connections in a joint contrastive and L^2-SP regularized training framework. By combining the soft weight pruning strategy and regularization, our method effectively restricts redundant weights while simultaneously learning crucial features for both source and target tasks. Our approach, in comparison to other methods, does not involve any additional modules in the models; however, it can still achieve remarkable performance. Our approach can be efficiently incorporated into a variety of contrastive learning methods in a plug-and-play fashion. Extensive experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing representative cross-domain few-shot methods by a large margin.
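
The interplay of soft pruning and L^2-SP regularization described above can be illustrated with a small sketch. It is not the authors' implementation: the magnitude-based pruning criterion, the function names `soft_prune` and `l2_sp_penalty`, and the parameters `prune_ratio` and `alpha` are assumptions made for illustration. The only part taken as standard is the L^2-SP penalty itself, which penalizes the squared distance between the current weights and the pre-trained starting point.

```python
def l2_sp_penalty(weights, pretrained, alpha=0.01):
    """L2-SP penalty: (alpha / 2) * squared distance to the pre-trained weights."""
    return 0.5 * alpha * sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained))


def soft_prune(weights, prune_ratio):
    """Zero the smallest-magnitude weights (assumed criterion).

    The zeroed entries stay in the list, so a later training step may update
    them again; this is what makes the pruning "soft" in this sketch.
    """
    k = int(len(weights) * prune_ratio)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical toy values: four pre-trained weights and their fine-tuned versions.
pretrained = [0.5, -0.05, 1.2, 0.01]
weights = [0.4, -0.1, 1.0, 0.02]

pruned = soft_prune(weights, prune_ratio=0.5)
penalty = l2_sp_penalty(pruned, pretrained, alpha=1.0)
```

In a full training loop the penalty would be added to the contrastive loss, and pruning would be re-applied between optimization rounds so that previously zeroed connections can be reactivated.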

Abstract:
Distributed learning requires a frequent communication of neural network update data. For this, we present a set of new compression tools, jointly called differential neural network coding (dNNC). dNNC is specifically tailored to efficiently code incremental neural network updates and includes tools for federated BatchNorm folding (FedBNF), structured and unstructured sparsification, tensor row skipping, quantization optimization and temporal adaptation for improved context-adaptive binary arithmetic coding (CABAC). Furthermore, dNNC provides a new parameter update tree (PUT) mechanism, which allows to identify updates for different neural network parameter sub-sets and their relationship in synchronous and asynchronous neural network communication scenarios. Most of these tools have been included into the standardization process of the NNC standard (ISO/IEC 15938-17) edition 2. We benchmark dNNC in multiple federated and split learning scenarios using a variety of NN models and data including vision transformers and large-scale ImageNet experiments: It achieves compression efficiencies of 60% in comparison to the NNC standard edition 1 for transparent coding cases, i.e., without degrading the inference or training performance. This corresponds to a reduction in the size of the NN updates to less than 1% of their original size. Moreover, dNNC reduces the overall energy consumption required for communication in federated learning systems by up to 94%.

Abstract:
Rate control (RC) plays an essential role in video coding. RC algorithms based on more accurate rate-distortion (R-D) models often achieve higher control precision and better R-D performance. The most widely used R-D model for HEVC and VVC is the hyperbolic R-D model, which is equivalent to a linear relationship between \ln R and \ln D. Due to its high accuracy, few studies have attempted to improve the accuracy of the hyperbolic R-D model further. Intuitively, we consider that the accuracy of the hyperbolic R-D model could be further improved by increasing the order of the R-D model. However, this may also increase the number of model parameters to be estimated, which may not benefit the one-pass RC precision. In this paper, we first explicitly note that there is a trade-off between the order of the R-D model and the difficulty in estimating the model parameters in one-pass RC. Then, motivated by the trade-off, we propose high-order R-D models and the corresponding one-pass frame-level RC algorithms for video coding. Finally, we introduce the quadratic R-D model into frame-level RC in the VVC Test Model (VTM-19.2) and provide a content-adaptive model selection between the first-order and second-order R-D models. Experimental results show that the proposed frame-level RC algorithm based on the quadratic R-D model reduces the average frame-level bitrate error by 5.30%, 3.11%, and 13.45% and achieves 0.49%, 0.65%, and 0.35% BD-Rate savings under Low Delay B (LDB), Low Delay P (LDP), and Random Access (RA) configurations, respectively, when compared to the default RC algorithm used in VTM-19.2.
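
As a worked illustration of the relationship stated above, the hyperbolic model and its log-domain form can be written out as follows. The symbols C, K, a, b, and c are notation assumed here for illustration; in particular, the second-order form is only one natural way to raise the model order and is not necessarily the exact parameterization used in the paper.

```latex
% Hyperbolic R-D model and its equivalent first-order (linear) log-domain form
D = C R^{-K}
\quad\Longleftrightarrow\quad
\ln D = \ln C - K \ln R

% An assumed second-order (quadratic) extension in the log domain
\ln D = a (\ln R)^2 + b \ln R + c
```

The first-order model has two parameters to estimate per frame, while the quadratic form has three, which makes concrete the trade-off between model order and estimation difficulty in one-pass rate control.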

Abstract:
As a branch of domain adaptation (DA), multi-source DA (MSDA) is a challenging issue that aims to transfer knowledge from multiple well-labeled source domains to a target domain for target tasks. However, most existing related works focus on single-target domain adaptation, and multiple target domain adaptation is not accounted for. We believe that multiple target domains provide valuable knowledge. Meanwhile, in multi-source and multi-target adaptation scenarios, feature generators with static parameters have difficulty generating deep features of each individual domain. In this article, we propose a Dynamic Generator With Attention (DGWA) method for multi-source and multi-target domain adaptation to adapt domain-agnostic deep features in multi source and multi target domain scenarios. The feature generator with dynamic parameters can dynamically change its parameters with data input from different domains, which greatly improves the generalization of the feature pools. An attention mechanism is used in our DGWA to learn more transferable information from different domains. To demonstrate the performance of DGWA, we conduct extensive experiments on several popular domain adaptation datasets, including the digits, Office+Caltech10, Office-Home, and ImageCLEF-DA datasets. The experimental results demonstrate that our method performs better than state-of-the-art methods.

Abstract:
Deep recognition models aim to recognize targets with various quality levels in uncontrolled application circumstances, and typically low-quality images usually retard the recognition performance dramatically. As such, a straightforward solution is to restore low-quality input images as pre-processing during deployment. However, this scheme cannot guarantee that deep recognition features of the processed images are conducive to recognition accuracy. How deep recognition features of low-quality images can be refined during training to optimize recognition models has largely escaped research attention in the field of metric learning. In this paper, we propose a quality-aware feature refinement framework based on the dedicated quality priors obtained according to the recognition performance, and a novel quality self-distillation algorithm to learn recognition models. We further show that the proposed scheme can significantly boost the performance of the recognition model with two popular deep recognition tasks, including face recognition and person re-identification. Extensive experimental results provide sufficient evidence on the effectiveness and impressive generalization capability of the proposed framework. Moreover, our framework can be essentially integrated with existing state-of-the-art classification loss functions and network architectures, without extra computation costs during deployment.

Abstract:
Improving the robustness of models against feature noise has emerged as one of the most crucial research topics in the field of multimodal sentiment analysis. Recent studies assume that the training instances are free of noise and develop either translation-based or reconstruction-based methods under the guidance of perfect training data for robust testing-time performance. However, such an ideal assumption neglects the potential presence of feature noise in training instances and inevitably results in degradation in scenarios where high-quality training instances are unavailable. In order to achieve robust training with noisy instances, we propose the Meta Noise Adaption (Meta-NA) learning strategy, a meta learning method that accumulates the experience of dealing with various types of feature noise. Specifically, we first formulate the task distribution where each task corresponds to one specific pattern of noise, and propose a feature adaption module added on top of the unimodal encoder in a late-fusion-based architecture. Through a nested online optimization between the auxiliary feature adaption module and the late fusion backbone modules, the proposed method can leverage shared knowledge across different noisy source tasks and learn how to learn from the noisy instances for robust testing performance. Extensive experiments are conducted on two benchmark multimodal sentiment analysis datasets, namely MOSI and CH-SIMS v2. The results demonstrate that our proposed method can rapidly adapt to various unseen types of feature noise and outperforms all baseline methods, particularly when the training instances are limited.

Abstract:
Unsupervised person re-identification (Re-ID) has made significant progress by leveraging valuable pseudo labels from completely unlabeled data. However, the predominant use of pseudo labels heavily relies on clustering results, which may lead to the accumulation of supervision deviation due to inevitable noise. In this paper, we propose a novel framework, namely Dual Knowledge Distillation on Multiview Pseudo Labels (DKD-MPL), to address this challenge. Specifically, the proposed DKD-MPL framework consists of two modules: Global Knowledge Distillation (GKD) and Self-Knowledge Distillation (SKD). In the GKD module, the pseudo labels obtained from the epoch-wise clustering procedure serve as the logits for the teacher model, while the mini-batch query images' pseudo labels act as the logits for the student model. Within the SKD module, we facilitate self-knowledge distillation by considering the pseudo labels generated by positive anchors and query images as two augmentations of the mini-batch data. As a result, DKD-MPL facilitates the exploitation of both global and local complementary knowledge across different views of pseudo labels, thereby mitigating supervision deviation. To demonstrate the effectiveness of DKD-MPL, we provide a theoretical analysis of the proposed loss and conduct extensive experiments on four popular datasets, namely Market-1501, DukeMTMC-reID, MSMT17, and VeRi-776. The results indicate that our method surpasses unsupervised approaches and achieves comparable performance to supervised person Re-ID methods.

Abstract:
Demoiréing is the task of removing moiré patterns, which are commonly caused by the interference between the screen and digital cameras. Although research on single image demoiréing has made great progress, research on video demoiréing has received less attention from the community. Video demoiréing poses a new set of challenges. First, most existing video restoration algorithms rely on multi-resolution pixel-based alignment, which can damage the details of the predicted results. Second, these algorithms are based on flow-based loss or relation-based loss, making it difficult to handle the large motions of adjacent frames while keeping temporal consistency intact. To address these challenges, we present a novel deep learning-based approach called the Deep Temporal Color Embedding network (DTCENet) that employs an invertible network to align color-distorted patches in a patch-based embedding framework. DTCENet preserves details well while eliminating color distortions. Furthermore, we introduce a video-image invertible loss function to effectively handle the color inconsistency problem of adjacent frames. Our approach shows promising results in demoiréing videos, with improved performance over existing state-of-the-art algorithms. Our method achieves an improvement of about 10% in terms of LPIPS and 10.3% in terms of FID compared with recent SOTA methods.

Abstract:
Learning with noisy labels has gained increasing attention because the inevitable imperfect labels in real-world scenarios can substantially hurt deep model performance. Recent studies tend to regard low-loss samples as clean ones and discard high-loss ones to alleviate the negative impact of noisy labels. However, real-world datasets contain not only noisy labels but also class imbalance. The imbalance issue is prone to causing failure in loss-based sample selection, since the under-learning of tail classes also tends to produce high losses. To this end, we propose a simple yet effective method to address noisy labels in imbalanced datasets. Specifically, we propose Class-Balance-based sample Selection (CBS) to prevent the tail class samples from being neglected during training. We propose Confidence-based Sample Augmentation (CSA) for the chosen clean samples to enhance their reliability in the training process. To exploit selected noisy samples, we resort to prediction history to rectify labels of noisy samples. Moreover, we introduce the Average Confidence Margin (ACM) metric to measure the quality of corrected labels by leveraging the model's evolving training dynamics, thereby ensuring that low-quality corrected noisy samples are appropriately masked out. Lastly, consistency regularization is imposed on filtered label-corrected noisy samples to boost model performance. Comprehensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method, especially in imbalanced scenarios.
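
The motivation for class-balanced selection can be made concrete with a minimal sketch. It assumes the simplest possible variant, keeping a fixed fraction of the lowest-loss samples within each class; the function name `class_balanced_select` and the parameter `keep_ratio` are hypothetical, and the actual CBS criterion in the paper may differ.

```python
from collections import defaultdict


def class_balanced_select(losses, labels, keep_ratio=0.5):
    """Return indices of the lowest-loss samples, chosen separately per class.

    Selecting within each class (rather than globally) keeps tail-class
    samples whose losses are high only because the class is under-learned.
    """
    by_class = defaultdict(list)
    for idx, (loss, label) in enumerate(zip(losses, labels)):
        by_class[label].append((loss, idx))

    selected = []
    for items in by_class.values():
        items.sort()  # ascending loss
        k = max(1, int(len(items) * keep_ratio))  # keep at least one per class
        selected.extend(idx for _, idx in items[:k])
    return sorted(selected)


# Hypothetical toy data: class 0 is the head class, class 1 the tail class.
losses = [0.1, 0.2, 0.3, 0.4, 2.0, 2.5]
labels = [0, 0, 0, 0, 1, 1]
selected = class_balanced_select(losses, labels, keep_ratio=0.5)
```

With the toy values above, a global small-loss rule that keeps three samples would pick only head-class samples (indices 0, 1, 2), whereas the per-class rule retains one tail-class sample.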

Abstract:
Automatically generating oral panoramic X-ray reports is highly beneficial for improving the efficiency of dental diagnosis. However, recent solutions adopt holistic methods, resulting in a cursory description of the oral condition. This may lead to reports lacking details, such as specific sites or lesion contours. Therefore, we propose a Multi-Level objective Alignment Transformer (MLAT) network, which integrates all tooth and disease objects into a positional alignment graph to extract fine-grained object-level features. Specifically, we introduce a novel Object-Level Collaborative Encoder (OLCE) module, which uses a positional alignment graph to construct object relationships. OLCE enhances object-level feature extraction by eliminating interference information between pathologically unrelated objects. In addition, we build a high-quality panoramic X-ray image-report dataset consisting of 562 sets of images and reports labeled by 13 experienced dental specialists. Experiments on the collected dataset show that the proposed MLAT significantly outperforms the state-of-the-art baselines by more than 5% in 4 different metrics, including BLEUs, Meteor, Rouge, and BERTScore.

Abstract:
Pedestrian trajectory prediction is a critical technology in the evolution of self-driving cars toward complete artificial intelligence. Over recent years, focusing on the trajectories of pedestrians to model their social interactions has surged with great interest in more accurate trajectory predictions. However, existing methods for modeling pedestrian social interactions rely on pre-defined rules, struggling to capture non-explicit social interactions. In this work, we propose a novel framework named DTGAN, which extends the application of Generative Adversarial Networks (GANs) to graph sequence data, with the primary objective of automatically capturing implicit social interactions and achieving precise predictions of pedestrian trajectory. DTGAN innovatively incorporates random weights within each graph to eliminate the need for pre-defined interaction rules. We further enhance the performance of DTGAN by exploring diverse task loss functions during adversarial training, which yields improvements of 16.7% and 39.3% on metrics ADE and FDE, respectively. The effectiveness and accuracy of our framework are verified on two public datasets. The experimental results show that our proposed DTGAN achieves superior performance and is well able to understand pedestrians' intentions.

Abstract:
Person-agnostic face swapping has gained significant attention in recent years, as it offers the potential to enhance various real-world applications by combining high fidelity and identity consistency. However, conventional face swapping methods often rely on intricate adjustments of different loss functions, leading to instability during both the training and inference stages. In this work, we propose a simple yet effective framework named StableSwap with a reversible autoencoder to modify the face in a shared latent space. Our approach capitalizes on the information-rich image latent codes to tackle the challenges of complex editing tasks, utilizing the abundant details present in both the source and target faces. To ensure an expressive and robust latent space, we employ a latent alignment approach with perceptual and adversarial losses to optimize the autoencoder. Additionally, we devise a multi-stage identity injection module that samples multiple features with different facial priors and incorporates them to guide the latent image manipulation. By leveraging attention-based blocks, we fuse these features and update the latent code in a mask-conditioned manner. Both quantitative and qualitative results on the mainstream benchmarks demonstrate that our StableSwap generates competitive identity-consistent swapped faces compared with state-of-the-art methods. Our method outperforms previous approaches in terms of ID Retrieval (98.68) and FID (2.49), while also exhibiting enhanced stability during model training. Beyond this, our model achieves region-controllable face swapping with the capability to perform more fine-grained operations in latent space.

Abstract:
Recent years have witnessed a growing interest in co-object segmentation and multi-modal salient object detection. Many efforts are devoted to segmenting co-existing objects among a group of images or detecting salient objects from different modalities. Despite the appreciable performance achieved on their respective benchmarks, each of these methods is limited to a specific task and cannot be generalized to other tasks. In this paper, we develop a Unified TRansformer-based framework, namely UniTR, aiming at tackling the above tasks individually with a unified architecture. Specifically, a transformer module (CoFormer) is introduced to learn the consistency of relevant objects or complementarity from different modalities. To generate high-quality segmentation maps, we adopt a dual-stream decoding paradigm that allows the extracted consistent or complementary information to better guide mask prediction. Moreover, a feature fusion module (ZoomFormer) is designed to enhance backbone features and capture multi-granularity and multi-semantic information. Extensive experiments show that our UniTR performs well on 17 benchmarks, and surpasses existing state-of-the-art approaches.

Abstract:
We propose a co-part segmentation method that takes a set of point clouds of the same category as input where neither a ground truth label nor a prior network is required. With difficulties caused by the label absence, we formulate the co-part segmentation task into two subtasks, including superpoint generation and part aggregation. In the first subtask, our superpoint generation network divides each point cloud into homogeneous partitions, each called superpoint, while in the second subtask, these superpoints are further aggregated into a few semantic parts via our part aggregation network. We introduce the coupled attention blocks in the part aggregation network to explicitly enforce semantic consistency in the segmentation by exploiting intra-, inter-, and paired-cloud geometrical information by minimizing the devised intra-, inter-, and paired-cloud losses, respectively. The intra-cloud loss triggers a semantic segmentation in each point cloud, while the inter-cloud loss considers all clouds to enforce their semantic consistency. The paired-cloud loss is designed to ensure that each part of one point cloud can be discriminatively reconstructed from the superpoints of another point cloud. We perform experiments on two benchmark datasets, ShapeNet part and COSEG, and provide quantitative and qualitative results to demonstrate the superiority of our method over existing methods. We also show that the proposed method can help several downstream tasks, including semi-supervised part segmentation and data augmentation for shape classification.

Abstract:
Due to its high computational efficiency and low storage cost, cross-modal hashing retrieval attracts a great deal of attention. However, as heterogeneous data from different modalities often have distinct physical meanings and underlying structures, learning encodings of equal length for different modalities may result in an insurmountable semantic gap. In addition, there are other issues to be addressed in this field, such as how to combine label and sample information to learn hash codes effectively, how to reduce the time consumption caused by computing an n × n similarity matrix, and how to effectively solve the complex discrete optimization problem. To cope with the above challenges, this study proposes a novel model called Scalable Discrete and Asymmetric Unequal Length Hashing (SDAULH). First, SDAULH constructs a novel hash model that utilizes unequal length encoding schemes to narrow the semantic gap between heterogeneous modalities. Second, SDAULH develops a dual semantic embedding learning scheme, which combines pairwise similarity between label and sample data to generate a more discriminative hash code. Third, SDAULH associates with both hash codes and label information by an asymmetric relaxation strategy. Furthermore, SDAULH directly solves the discrete optimization problem by generating discrete hash codes. Experimental results on four benchmark datasets demonstrate the promising performance of SDAULH.

Abstract:
Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (MLLC), which aims to use graph neural networks to capture structural relationships in Semantic-Level Graphs (SLGs) and Class-Level Graphs (CLGs) to rectify erroneous pseudo-labels. Specifically, SLGs represent semantic affinities between pairs of pixel features, and CLGs describe classification consistencies between pairs of pixel labels. With the support of proximate pattern information from graphs, MLLC can rectify incorrectly predicted pseudo-labels and can facilitate discriminative feature representations. We design an end-to-end network to train and perform this effective label correction mechanism. Experiments demonstrate that MLLC can significantly improve supervised baselines and outperforms state-of-the-art approaches in different scenarios on the Cityscapes and PASCAL VOC 2012 datasets. Specifically, MLLC improves the supervised baseline by at least 5% and 2% with DeepLabV2 and DeepLabV3+, respectively, under different partition protocols.

Abstract:
Sign language recognition (SLR) can connect the hearing-impaired and able-bodied communities. SLR methods that exploit the co-action of multiple modalities have garnered attention. However, these methods are much less effective or even fail in recognition when confronted with missing modalities. Therefore, this article proposes MMSLR, a multimodal SLR framework with cross-modal complementary information. The framework comprises three key components: the cross-modal information complementation (CMIC) module, the fusion and prediction module (FPM), and the sign language recognition module (SLRM). The CMIC module is designed with multi-layer, multi-view spatial-temporal detectors to observe different modality features in both temporal and spatial dimensions. Additionally, it utilizes co-training to achieve complementary information among multi-modalities. The FPM integrates cross-modal attention with Canberra distance to eliminate inter-modal redundant information while fusing multimodal features. The Transformer-based SLRM fuses partially obtained modalities from CMIC through bidirectional cross-channel attention. Teacher-Student pairs are constructed to transfer full-modal features from FPM to the above fused modality features. Moreover, experimental results on the provided MM-Sentence and publicly available OH-Sentence, TH-Sentence and USTC-CSL datasets demonstrate that MMSLR achieves state-of-the-art performance.
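
The FPM description names the Canberra distance but does not state it. For reference, a minimal sketch of the standard definition follows. The function name `canberra` and the sample vectors are illustrative assumptions, and how MMSLR combines this distance with cross-modal attention is not specified in the abstract, so that part is not modeled here.

```python
def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|) over all coordinates."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        # By the usual convention, a 0/0 term contributes nothing.
        if denom > 0:
            total += abs(a - b) / denom
    return total

# Hypothetical feature vectors from two modalities.
d = canberra([1.0, 2.0, 0.0], [3.0, 2.0, 0.0])
```

Each term is normalized by the magnitude of the coordinates it compares, so the distance is sensitive to relative rather than absolute differences, which may be why it suits comparing features across modalities with different scales.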

Affiliations: PCA Laboratory, Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China; School of Computer Science, The University of Sydney, Camperdown, NSW, Australia; Science and Technology on Complex System Control and Intelligent Agent Cooperation Laboratory, Beijing, China

Abstract:
Traditional Semi-Supervised Learning (SSL) classification methods focus on leveraging unlabeled data to improve the model performance under the setting where labeled set and unlabeled set share the same classes. Nevertheless, the above-mentioned setting is often inconsistent with many real-world circumstances. Practically, both the labeled set and unlabeled set often hold some individual classes, leading to an intersectional class-mismatch setting for SSL. Under this setting, existing SSL methods are often subject to performance degradation attributed to these individual classes. To solve the problem, we propose a Class-wise Contrastive Prototype Learning (CCPL) framework, which can properly utilize the unlabeled data to improve the SSL classification performance. Specifically, we employ a supervised prototype learning strategy and a class-wise contrastive separation strategy to construct a prototype for each known class. To reduce the influence of the individual classes in unlabeled set (i.e., out-of-distribution classes), each unlabeled example can be weighted reasonably based on the prototypes during classifier training, which helps to weaken the negative influence caused by out-of-distribution classes. To reduce the influence of the individual classes in labeled set (i.e., private classes), we present a private assignment suppression strategy to suppress the improper assignments of unlabeled examples to the private classes with the help of the prototypes. Experimental results on four benchmarks and one real-world dataset show that our CCPL has a clear advantage over fourteen representative SSL methods as well as two supervised learning methods under the intersectional class-mismatch setting.

Abstract:
Cropping box regression algorithms re-frame the images with predicted cropping boxes for better composition quality, which can save considerable manpower and time for massive image retouching work. Yet, recent learning-based cropping box regression algorithms require expert annotations, which limits the scale of training. This consequently incurs a performance bottleneck. To address this issue, previous works seek help from auxiliary datasets of related tasks, e.g., composition classification. However, the domain gap between related tasks and the likewise restricted scale of auxiliary datasets are still limiting factors. Hence, our work provides a novel semi-supervised framework that can learn better re-framing knowledge with unlimited unlabeled data. We make use of the unlabeled data via pseudo-labeling, where the model learns from the pseudo labels generated from a temporal ensemble version of itself. To prevent the model from learning from its own mistakes, a.k.a. the problem of confirmation bias, we propose to rectify the mistakes by fusing multiple candidate pseudo labels into better ones. The fusion procedure is based on the uncertainty estimation for each boundary of the candidate cropping boxes. The multiple candidates are from the proposed aesthetic region proposal network. Extensive experimental results explain how the uncertainty-based pseudo label fusion procedure overcomes the confirmation bias and demonstrate the superiority of our semi-supervised cropping box regression framework.
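
The abstract says candidate pseudo labels are fused using a per-boundary uncertainty estimate but does not give the fusion rule. One plausible reading is inverse-variance weighting applied to each box side independently, sketched below. The function `fuse_boxes`, the `(left, top, right, bottom)` layout, and the variance inputs are all assumptions for illustration, not the authors' procedure.

```python
def fuse_boxes(boxes, variances, eps=1e-8):
    """Fuse candidate boxes side by side with inverse-variance weights.

    boxes:     list of (left, top, right, bottom) candidates
    variances: list of per-side uncertainty estimates, same layout as boxes
    """
    if not boxes or len(boxes) != len(variances):
        raise ValueError("need one variance tuple per candidate box")
    fused = []
    for side in range(4):
        # A boundary with lower estimated uncertainty gets a larger weight.
        weights = [1.0 / (var[side] + eps) for var in variances]
        total = sum(weights)
        fused.append(sum(w * box[side] for w, box in zip(weights, boxes)) / total)
    return tuple(fused)

# Hypothetical candidates: the first box is three times more certain on every side.
fused = fuse_boxes(
    boxes=[(0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 12.0, 12.0)],
    variances=[(1.0, 1.0, 1.0, 1.0), (3.0, 3.0, 3.0, 3.0)],
)
```

Weighting each side separately lets a candidate contribute its reliable boundaries while its unreliable ones are down-weighted, which matches the abstract's emphasis on per-boundary uncertainty.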

Abstract:
The rapid dissemination of fake news and rumors through the Internet and social media platforms poses significant challenges and raises concerns in the public sphere. Automatic detection of fake news plays a crucial role in mitigating the spread of misinformation. While recent approaches have focused on leveraging neural networks to improve textual and visual representations in multi-modal fake news analysis, they often overlook the potential of incorporating knowledge information to verify facts within news articles. In this paper, we present a vision and language model that incorporates knowledge to enhance multi-modal fake news detection. Our proposed model integrates information from large scale open knowledge graphs to augment its ability to discern the veracity of news content. Unlike previous methods that utilize separate models to extract textual and visual features, we synthesize a unified model capable of extracting both types of features simultaneously. To represent news articles, we introduce a graph structure where nodes encompass entities, relationships extracted from the textual content, and objects depicted in associated images. By utilizing the knowledge graph, we establish meaningful relationships between nodes within the news articles. Experimental evaluations on a real-world multi-modal dataset from Twitter demonstrate significant performance improvement by incorporating knowledge information.

Abstract:
Human pose estimation (HPE) has many applications, such as multimedia processing, behavior understanding and human-computer interaction. Most previous studies are subject to constraints such as restricted scenarios and RGB inputs. To relax these constraints and estimate human poses in general scenarios, we present an efficient human pose estimation model (i.e., EHPE) with joint direction cues and Gaussian coordinate encoding. Specifically, we propose an anisotropic Gaussian coordinate coding method to describe the skeleton direction cues among adjacent keypoints. To the best of our knowledge, this is the first time that skeleton direction cues are introduced to the heatmap encoding in the HPE task. Then, a multi-loss function is proposed to constrain the output and prevent overfitting. The Kullback-Leibler divergence is introduced to measure the discrepancy between the predicted label and its ground truth. The performance of EHPE is evaluated on two HPE datasets: MS COCO and MPII. Experimental results demonstrate that EHPE can obtain robust results, and it significantly outperforms existing state-of-the-art HPE methods. Lastly, we extend the experiments to infrared images captured by our research group. The experiments achieved impressive results despite insufficient color and texture information.
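
To make the two ingredients named above concrete, the sketch below builds a heatmap from a Gaussian that is elongated along a given direction and compares two heatmaps with the Kullback-Leibler divergence. It is only an illustration under stated assumptions: the function names, the `sigma_long` and `sigma_short` parameters, the choice of the direction toward an adjacent keypoint as the long axis, and the KL direction (target relative to prediction) are not taken from the EHPE paper.

```python
import math

def anisotropic_heatmap(width, height, center, direction,
                        sigma_long=2.0, sigma_short=1.0):
    """Gaussian heatmap stretched along `direction` (assumed: toward the adjacent keypoint)."""
    cx, cy = center
    norm = math.hypot(direction[0], direction[1])
    if norm == 0:
        raise ValueError("direction must be a non-zero vector")
    ux, uy = direction[0] / norm, direction[1] / norm
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            rx, ry = x - cx, y - cy
            along = rx * ux + ry * uy     # offset along the skeleton direction
            across = -rx * uy + ry * ux   # offset perpendicular to it
            row.append(math.exp(-0.5 * ((along / sigma_long) ** 2
                                        + (across / sigma_short) ** 2)))
        heatmap.append(row)
    return heatmap

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) after normalizing both heatmaps to sum to one."""
    p = [v for row in target for v in row]
    q = [v for row in pred for v in row]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / max(b / sq, eps))
               for a, b in zip(p, q) if a > 0)

# Hypothetical 7x7 target with the limb pointing along +x, and a flat prediction.
target = anisotropic_heatmap(7, 7, center=(3, 3), direction=(1, 0))
uniform = [[1.0] * 7 for _ in range(7)]
loss = kl_divergence(target, uniform)
```

With the direction set to `(1, 0)`, the response two pixels along the limb stays higher than the response two pixels across it, which is how such a heatmap can carry direction information in addition to location.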

Abstract:
Inverse tone mapping, a technique to restore a high dynamic range (HDR) image from a single low dynamic range (LDR) image, exhibits wide versatility since it may be easily applied to any camera device. Besides, the recent advancement in deep learning has produced great performance improvement in the field of inverse tone mapping. However, it remains a difficult task to accurately restore a wide-range HDR image from a single LDR image. A recent study attempts a spatially adaptive exposure value (EV) condition generated from luminance values to create a pseudo-multi-exposure stack. However, by adopting only luminance values as input, the conditioning method cannot precisely reflect the input image information when generating the EV condition, resulting in the loss of color expression. Moreover, there are some concerns regarding how to apply the EV condition to the image feature. Thus, the key idea of this study is to directly adopt image features in generating EV conditions that are adaptive to both color and brightness. To do this, we design a condition generation network with an encoder-decoder structure and propose a novel multi-exposure stack generation network, which bidirectionally synthesizes the image features and EV-conditioned features. Additionally, to better preserve the feature information in the synthesis of the features, we propose a spatially-adaptive feature transformation block. Our proposed method exhibits outstanding results in restoring the multi-exposure stacks for HDR image synthesis. Furthermore, our method achieves state-of-the-art performance compared to existing methods in multi-exposure stack generation and stack-based HDR restoration.

Abstract:
Deep learning (DL)-based Low-dose CT (LDCT) image denoising methods may face the domain shift problem, where data from different domains (i.e., hospitals) may have similar anatomical regions but exhibit different intrinsic noise characteristics. Therefore, we propose a plug-and-play model called Low- and High-frequency Alignment (LHFA) to address this issue by leveraging semantic features and aligning noise distributions of different CT datasets, while maintaining diagnostic image quality and suppressing noise. Specifically, the LHFA model consists of a Low-frequency Alignment (LFA) module that preserves semantic features (i.e., low-frequency components) with fewer perturbations from both domains for reconstruction. Notably, a High-frequency Alignment (HFA) module is proposed to quantify the discrepancy between noise representations (i.e., high-frequency components) in a latent space mapped by an auto-encoder. Experimental results demonstrate that the LHFA model effectively alleviates the domain shift problem and significantly improves the performance of DL-based methods on the cross-domain LDCT image denoising task, outperforming other domain adaptation-based methods.

Abstract:
The computational cost of vision and language pretrained models (VL-PTMs) limits their deployment on resource-constrained devices that require low latency. One existing solution is to apply the early exiting (EE) strategy to accelerate inference. This technique forces the model to predict using only the first few transformer layers. However, these early layers behave differently from the final classifier, inevitably resulting in a performance decline. To counter this limitation, self-distillation has been commonly introduced to enhance the representation abilities of the EE classifiers. This results in a semantic gap, since EE classifiers are directly trained to mimic the outputs of the final classifier without access to the modality-specific behaviors. This study proposes a multimodality self-distillation method for the fast inference of VL-PTMs. To fill the semantic gap between modalities, we split the multimodalities into separate modalities and add them as extra inputs to encourage the effective distillation of each modality. Furthermore, the mean squared error (MSE) is introduced to minimize the distance between feature maps and further enhance the representation ability of the EE classifiers. Experiments show that the proposed method outperforms the previous EE strategies with the same inference time, and performs competitively even if the model exits very early.

Abstract:
Deep Neural Network (DNN)-based video analytics significantly improves recognition accuracy in computer vision applications. Deploying DNN models at edge nodes, closer to end users, reduces inference delay and minimizes bandwidth costs. However, these resource-constrained edge nodes may experience substantial delays under heavy workloads, leading to imbalanced workload distribution. While previous efforts focused on optimizing hierarchical device-edge-cloud architectures or centralized clusters for video analytics, we propose addressing these challenges through collaborative distributed and autonomous edge nodes. Despite the intricate control involved, we introduce EdgeVision, a Multiagent Reinforcement Learning (MARL)-based framework for collaborative video analytics on distributed edges. EdgeVision enables edge nodes to autonomously learn policies for video preprocessing, model selection, and request dispatching. Our approach utilizes an actor-critic-based MARL algorithm enhanced with an attention mechanism to learn optimal policies. To validate EdgeVision, we construct a multi-edge testbed and conduct experiments with real-world datasets. Results demonstrate a performance enhancement of 33.6% to 86.4% compared to baseline methods.

Affiliations: School of Information Science and Engineering, Huaqiao University, Xiamen, China; School of Engineering, Huaqiao University, Quanzhou, China; Department of Computer Science and Information Engineering, Ilan University, Yilan, Taiwan; Machine Learning and I-health International Cooperation Base of Zhejiang Province, Hangzhou Dianzi University, Hangzhou, China; College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China

Abstract:
Screen content coding (SCC) in Versatile Video Coding (VVC) improves the coding efficiency of screen content videos (SCVs) significantly but results in high computational complexity due to the quad-tree plus multi-type tree (QTMT) structure of the coding unit (CU) partitioning. Therefore, we make the first attempt to reduce the encoding complexity from the perspective of CU partitioning for SCC in VVC. To this end, a fast CU partition prediction method is technically developed for VVC-SCC. First, to solve the problem of lacking sufficient SCC training data, SCVs are collected to establish a database containing CUs of various sizes and corresponding partition labels. Second, to determine the partition decision in advance, a novel WA-CNN model is proposed, which is capable of predicting two large CUs for VVC-SCC by adjusting the feature channels based on the size of input CU blocks. Finally, considering the imbalanced proportion of diverse partition decisions, a loss function with the weight that equalizes the contribution of imbalanced data is formulated to train the proposed WA-CNN model. Experimental results show that the proposed model reduces the SCC intra-encoding time by 35.65% to 38.31% with an average of 1.84% to 2.42% BDBR increase.
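
The abstract mentions a loss whose weights equalize the contribution of imbalanced partition decisions, without giving the formula. A common way to achieve this is inverse-frequency class weighting in a cross-entropy loss, sketched below. The function names, the weighting formula `n / (k * count)`, and the toy labels are assumptions for illustration and may differ from the loss actually used to train WA-CNN.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so every class carries equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Weighted mean of -log p(true class); probs[i][c] is the predicted probability."""
    num = 0.0
    den = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        num += -w * math.log(max(p[y], eps))
        den += w
    return num / den

# Hypothetical partition labels: class 0 (e.g. no split) dominates class 1 (e.g. split).
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
probs = [[0.5, 0.5]] * 4
loss = weighted_cross_entropy(probs, labels, weights)
```

In this toy case the three majority samples and the single minority sample each sum to a total weight of 2, so a classifier cannot lower the loss simply by favoring the frequent decision.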

Abstract:
Tracking multiple tiny objects is highly challenging due to their weak appearance and limited features. Existing multi-object tracking algorithms generally focus on single-modality scenes, and overlook the complementary characteristics of tiny objects captured by multiple remote sensors. To enhance tracking performance by integrating complementary information from multiple sources, we propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking). Specifically, we first employ a Transformer-based encoder to embed images from different modalities. Subsequently, we utilize Heterogeneous Graph Transformer to aggregate spatial and temporal information from multiple modalities to generate detection and tracking features. Additionally, we introduce a target re-detection module (ReDet) to ensure tracklet continuity by maintaining consistency across different modalities. Furthermore, this paper introduces the first benchmark VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking) for RGB-T fused multiple tiny object tracking. Extensive experiments are conducted on VT-Tiny-MOT, and the results have demonstrated the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of MOTA (Multiple-Object Tracking Accuracy) and ID-F1 score.

Affiliations: Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, China; Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China; School of Computer and Informatics, Hefei University of Technology, Hefei, China

Abstract:
Instance segmentation in medical imaging plays a crucial role in clinical diagnostic tasks and has shown promising performance in practical applications. In this article, we discuss a more fine-grained instance segmentation task: dental structured instance segmentation based on panoramic radiographs. However, direct segmentation of tooth structures encounters inherent challenges. Traditional instance segmentation networks often fall short in capturing intricate internal features, a problem exacerbated by the frequent blurring found in medical imaging, which can result in a loss of anatomical details. To deal with these problems, we propose a novel framework called DSIS-DPR, which combines a dental structured instance segmentation (DSIS) network with an enhanced diffusion prior refinement (DPR) method. Specifically, our innovatively designed structure-aware network leverages fine-grained feature fusion, acquiring a richer representation of internal anatomical structures. With the integration of adversarial learning, the model is primed to deliver holistic and subtle predictions of tooth structures. Furthermore, taking inspiration from dentists' inherent ability to utilize prior knowledge, such as understanding dental structures to label invisible anatomical structures, we propose a diffusion inpainting method to refine the results of DSIS without additional annotations. Equipped with built-in structure learning, DPR is capable of modifying anomalies within each predicted segmentation, resulting in a more robust and complete structured segmentation result. Meanwhile, we ensure rigorous oversight over the reconstruction of areas affected by abnormalities, ensuring that any introduced adjustments minimally disrupt the well-predicted structured segmentation results. Extensive experiments have demonstrated that our DSIS-DPR outperforms all existing classical instance segmentation networks.

Abstract:
Hyperspectral image (HSI) clustering, which divides all pixels into different clusters, is challenging because of the absence of labels, large spectral variability and complex spatial distribution. The anchor strategy provides an attractive solution to the computational bottleneck of graph-based clustering for large HSIs. However, most existing methods require separate learning procedures and ignore noise as well as spatial information. In this paper, we propose a bipartite graph-based projected clustering (BGPC) method with local region guidance for HSI data. To take full advantage of spatial information, HSI denoising to alleviate noise interference and anchor initialization to construct the bipartite graph are conducted within each generated superpixel. With the denoised pixels and initial anchors, projection learning and structured bipartite graph learning are simultaneously performed in a one-step learning model with a connectivity constraint to directly provide clustering results. An alternating optimization algorithm is devised to solve the formulated model. The advantage of BGPC is the joint learning of projection and bipartite graph with local region guidance to exploit spatial information and linear time complexity to lessen the computational burden. Extensive experiments demonstrate the superiority of the proposed BGPC over the state-of-the-art HSI clustering methods.

Abstract:
Medical vision-language pre-training (Med-VLP) models have recently accelerated the fast-growing medical diagnostics application. However, most Med-VLP models learn task-specific representations independently from scratch, thereby leading to great inflexibility when they work across multiple fine-tuning tasks. In this work, we propose UniDCP, a Unified medical vision-language model with Dynamic Cross-modal learnable Prompts, which can be plastically applied to multiple medical vision-language tasks within a unified model. Specifically, we explicitly construct a unified framework to harmonize diverse inputs from multiple pre-training tasks by leveraging cross-modal prompts for unification, which accordingly can accommodate heterogeneous medical fine-tuning tasks within the same model. Furthermore, we conceive a dynamic cross-modal prompt optimizing strategy that optimizes the prompts within the shareable space for implicitly processing the shareable clinical knowledge. UniDCP is the first Med-VLP model capable of performing all 8 medical uni-modal and cross-modal tasks over 14 corresponding datasets, consistently yielding superior results over diverse state-of-the-art methods.

Abstract:
Category-level 6D object pose estimation aims to estimate the pose and size of unseen objects with known categories. Existing methods mainly focus on capturing geometric features to handle shape variations, and are prone to failure in occlusion and noisy environments. In this paper, we propose TG-Pose, a unified pose estimation framework that delves into topology and geometry to deal with the above issues. To exploit topological properties, we first propose a topological feature predictor and a topological label generator to dig into the underlying structural details from encoded features using persistent homology. Then, the topological and geometric features are employed to facilitate the symmetry reconstruction of the original point cloud to obtain a reliable and coherent object shape, which, in turn, guides the pose estimation. For each object category, we construct geometric and topological templates by leveraging inherent intra-class similarities. These templates enhance the reliability of pose estimation and the completeness of object structure through geometric alignment and topological guidance, especially when handling incomplete objects. Moreover, a pose-aware enhancement strategy is designed to enhance the encoder in learning pose-sensitive features and robustness to noisy point clouds. Experimental results show that TG-Pose outperforms the State-of-the-Art solutions on public benchmarks and achieves better generalization in real-world datasets.

Abstract:
Traditional multi-view algorithms typically require data to be complete, and these algorithms may not be suitable or effective when dealing with incomplete data. As a result, the research field has witnessed the emergence of methods specifically designed for addressing incomplete multi-view clustering. However, most existing incomplete multi-view algorithms primarily emphasize capturing global information while neglecting the importance of local information. To overcome these challenges, we put forward a novel approach, named tensor low-rank graph embedding and learning for one-step incomplete multi-view clustering. We combine the beneficial aspects of graph embedding and tensor low-rank in this method, which not only focuses on local relationships but also provides insights into the global structure. Firstly, the initial similarity graph matrix is constructed using the inter-dependency between views. Secondly, the similarity graph matrix is structured as a third-order tensor and constrained by the tensor nuclear norm. This constraint enables the capturing of higher-order correlations across multiple views. Thirdly, graph embedding is added to obtain a common feature representation. Finally, the algorithm incorporates clustering to determine the optimal clustering labels. We conducted experiments comparing our algorithm with seven different incomplete multi-view methods using four different evaluation metrics. The experimental results indicate that our algorithm achieved the best clustering performance across five datasets under various missing rates. In particular, at a 50% missing rate on the Extended YaleB dataset, the clustering accuracy and normalized mutual information of our method are improved by 28.23% and 28.11%, respectively, compared to the second-best algorithm.

Abstract:
Multisource remote sensing image fusion classification aims to produce accurate pixel-level classification maps by combining complementary information from different sources of remote sensing data. Existing methods based on Convolutional Neural Networks (CNN-based) utilize a patch-based learning framework, which has a high computational cost, leading to poor real-time performance. In contrast, methods based on Fully Convolutional Networks (FCN-based) can process the entire image directly, achieving fast inference. However, FCN-based methods require high computational resources and exhibit shortcomings in feature fusion, hindering practical applications. In this paper, a lightweight FCN-based Progressive Hierarchical Fusion Network (PHFNet) is tailored for multisource remote sensing image classification. PHFNet comprises a pyramid dual-path encoder and a pyramid decoder. In the encoder, cross-source features are hierarchically fused via the adaptive modulation fusion module (AMF), which leverages style calibration for cross-source alignment and promotes the complementarity of the fused features. In the decoder, we introduce an improved convolutional gated recurrent unit (iConvGRU) to progressively integrate the semantic and detailed information of hierarchical features, producing a context-enhanced global representation. In addition, we consider the relation between the channel number, convolutional kernel size, and parameter count to make the model as lightweight as possible. Comprehensive evaluations on three multisource remote sensing datasets demonstrate that PHFNet improves overall accuracy by 1.5% to 2.8% with a low computational overhead compared to state-of-the-art methods.

Abstract:
Compared with adapting a pre-trained backbone to a single image recognition task, multi-task image recognition enables the backbone to perform better when the tasks are related. An interesting research topic in multi-task learning (MTL) is learning the parameter sharing pattern among the involved tasks. Most existing works obtain the sharing pattern while ignoring the task grouping information among the involved tasks. In this work, we aim to build the task parameter sharing pattern by automatically acquiring the task grouping information. The task grouping information, together with the task-specific information, is then utilized to yield the task-adaptive weights. Our method, called Task Grouping prompt-based Adaptive Weight generator (TGAW), consists of a Prompt-based Task Representation (PTR) and a Prompt-based Weight Generator (PWG). The PTR is modeled as task prompts consisting of a task grouping prompt and a task-specific prompt. The task grouping prompt is automatically chosen from a candidate pool for each task, and tasks selecting the same grouping prompt are assigned to the same group. Then, the PWG generates task-adaptive weights based on the task prompts. The experimental results show that TGAW achieves comparable performance with a number of trainable parameters less than 30% of that of the pre-trained backbone.

Abstract:
Partial multi-label learning (PML) is defined as the construction of robust multi-label classification models from a training set where all instances are correlated with a corresponding group of candidate labels that are only partially accurate. Existing PML approaches have attempted to elicit reliable labels by parsing the guideline information of candidate labels. However, the other side information in the labels that describes what the sample does not contain is largely ignored. Moreover, existing PML approaches only focus on distinguishing the noise information and lack the effective use of noise information. To this end, a partial multi-label learning disambiguation approach guided by negative labels and noise information is proposed. Specifically, a negative label information-inducing paradigm is established based on the constructed negative label encoding matrix. Meanwhile, the negative correlation information guides an iterative label propagation process to induce ground-truth labels with high credibility. In addition, the truth and noise labels are formalized in a unified framework by constructing a regularizer. Moreover, the multi-label predictor is induced by discriminating regularization and disambiguation of label-specific features using the identified noise feature information. Extensive experiments on existing and constructed datasets have demonstrated that the negative label information bootstrapping strategy can be more effective in finding truth labels hidden in candidate labels. Moreover, noisy feature information-induced multi-label prediction outperforms state-of-the-art approaches.

Abstract:
The image-text sentiment analysis task has garnered increased attention in recent years due to the surge in user-generated content on social media platforms. Previous research efforts have made noteworthy progress by leveraging the affective concepts shared between the vision and text modalities. However, emotional cues may reside exclusively within one of the modalities, owing to the independent nature of each modality and the potential absence of certain modalities. In this study, we aim to emphasize the significance of modality-independent emotional behaviors, in addition to the modality-invariant behaviors. To achieve this, we propose a novel approach called Crossmodal Translation-Based Meta Weight Adaption (CTMWA). Specifically, our approach involves the construction of a crossmodal translation network, which serves as the encoder. This architecture captures the shared concepts between vision content and text, empowering the model to effectively handle scenarios where either the vision or the textual modality is missing. Building upon the translation-based framework, we introduce the strategy of unimodal weight adaption. Leveraging the meta-learning paradigm, our proposed strategy gradually learns to acquire unimodal weights for individual instances from a few hand-crafted meta instances with unimodal annotations. This enables us to modulate the gradients of each modality encoder based on the discrepancy between modalities during model training. Extensive experiments are conducted on three benchmark image-text sentiment analysis datasets, namely MVSA-Single, MVSA-Multiple, and TumEmo. The empirical results demonstrate that our proposed approach achieves the highest performance across all three benchmark datasets. Furthermore, experiments under modality-missing settings and a case study on reliable sentiment prediction further demonstrate the superior robustness and reliability of the proposed approach.

Abstract:
Recently, person re-identification has gained significant attention from both academic and industry fields due to its potential applications in surveillance and security. However, the security of re-identification systems has not been widely investigated, and they are vulnerable to adversarial attacks, which can significantly degrade their performance. Although numerous sophisticated adversarial training methods have been proposed for image classification, metric analysis systems such as person re-identification have not been fully explored. In this paper, we develop a novel adversarial training framework with a dynamic attack strategy for person re-identification, to further enhance the robustness of the model. Specifically, we gradually increase the perturbation budget during the generation until the generated adversarial examples reach a certain level of attack strength. As the iterations progress, the model becomes more robust, and our framework can generate stronger adversarial examples to continuously explore the robustness bounds of the model. Moreover, to alleviate the conflict between the adversarial robustness and natural generalization of the model, we design a novel performance alignment loss to further constrain the adversarial example generation process, which can make the generated adversarial examples as close as possible to the clean samples in terms of performance. Experiments on two widely used person re-ID benchmark datasets demonstrate the effectiveness and superiority of our proposed method.

Abstract:
Infrared target recognition and anti-interference in complex battlefields is one of the key technologies enabling the precise strike capability of aircraft. Currently, infrared-guided aircraft face complex interference such as natural backgrounds and artificial decoys, leading to a decrease in the performance of infrared target recognition. A particular challenge to infrared target recognition and anti-interference capabilities is the strong interference caused by the combination of target maneuvering and the dense, continuous, and coordinated deployment of infrared decoys. To address extreme cases such as the complete loss of target feature information and the failure to identify the target due to occlusion, we develop an anti-interference recognition method based on a visually inspired Spatio-Temporal Semantic Reasoning Model (STSRM). Firstly, inspired by the functional characteristics of visual semantic reasoning, the STSRM is proposed to simplify the reasoning of relationships among multiple regions into modeling relationships between corresponding region node features in a graph-based module. Secondly, an anti-occlusion target recognition model based on STSRM is constructed, which introduces a reasoning graph module connecting node regions to infer semantic information and predict targets between regions. The test results on the simulated infrared dataset established in this paper indicate that the proposed anti-interference recognition model can make accurate target predictions under large-scale or full occlusion, achieving improvements of 13.9% and 3.1% in mAP and mIoU scores, respectively, over the current advanced method.

Abstract:
LiDAR and cameras are the most commonly used sensors for perceiving road scenes in autonomous driving. Current methods try to fuse the complementary information from the two sensors to boost 3D object detection. However, two pressing problems remain for multi-modality 3D object detection. One is the detection of objects with sparse point clouds. The other is the misalignment of different sensors caused by their fixed physical locations. Therefore, this paper argues that explicitly fusing information from the two modalities under physical misalignment is suboptimal for multi-modality 3D object detection. This paper presents a novel virtual point generation network, VirPNet, to overcome the multi-modality fusion challenges. On the one hand, it completes sparse point cloud objects from the image source and improves the final detection accuracy. On the other hand, it directly detects 3D targets from raw point clouds to avoid the physical misalignment between LiDAR and camera sensors. Different from previous point cloud completion methods, VirPNet fully utilizes the geometric information of pixels and point clouds and simplifies 3D point cloud regression into a 2D distance regression problem through a virtual plane. Experimental results on the KITTI 3D object detection dataset and the nuScenes dataset demonstrate that VirPNet improves the detection accuracy with the help of the generated virtual points.

Abstract:
Depth estimation, which extracts the structural information of a scene, is a key step in various light field (LF) applications. However, most existing depth estimation methods are based on the Lambertian assumption, which limits their application in non-Lambertian scenes. In this paper, we discover a unique transparent cheating problem for non-Lambertian scenes which can effectively spoof depth estimation algorithms based on photo consistency. It arises because the spatial consistency and the linear structure superimposed on the epipolar plane image form new spurious lines. Therefore, we propose centrifugal consistency and centripetal consistency for separating the depth information of multi-layer scenes and correcting the error due to the transparent cheating problem, respectively. By comparing the distributional characteristics and the number of minimal values of photo consistency and centrifugal consistency, non-Lambertian regions can be efficiently identified and initial depth estimates obtained. Then centripetal consistency is exploited to reject the projection from different layers and to address transparent cheating. By assigning decreasing weights radiating outward from the central view, pixels with a concentration of colors close to the central viewpoint are considered more significant. The problem of underestimating the depth of background caused by transparent cheating is effectively solved and corrected. Experiments on synthetic and real-world data show that our method can produce high-quality depth estimation under the transparency and the reflectivity of 90% to 20%. The proposed triple-consistency-based algorithm outperforms state-of-the-art LF depth estimation methods in terms of accuracy and robustness.

Abstract:
Video compression leads to compression artifacts, among which Perceivable Encoding Artifacts (PEAs) degrade user perception. Most existing state-of-the-art Video Compression Artifact Removal (VCAR) methods indiscriminately process all artifacts, thus leading to over-enhancement in non-PEA regions. Therefore, accurate detection and localization of PEAs is crucial. In this paper, we propose the largest-ever Fine-grained PEA database (FPEA). First, we employ the popular video codecs, VVC and AVS3, as well as their common test settings, to generate four types of spatial PEAs (blurring, blocking, ringing and color bleeding) and two types of temporal PEAs (flickering and floating). Second, we design a labeling platform and recruit sufficient subjects to manually locate all the above types of PEAs. Third, we propose a voting mechanism and feature matching to synthesize all subjective labels to obtain the final PEA labels with fine-grained locations. In addition, we provide Mean Opinion Score (MOS) values for all compressed video sequences. Experimental results show the effectiveness of the FPEA database on both VCAR and compressed Video Quality Assessment (VQA). We envision that the FPEA database will benefit the future development of VCAR, VQA and perception-aware video encoders. The FPEA database has been made publicly available.

Abstract:
Video moment retrieval (VMR) aims to locate corresponding moments in an untrimmed video via a given natural language query. While most existing approaches treat this task as a cross-modal content matching or boundary prediction problem, recent studies have started to solve the VMR problem from a reading comprehension perspective. However, the cross-modal interaction processes of existing models are either insufficient or overly complex. Therefore, we reanalyze human behaviors in the document fragment location task of reading comprehension, and design a specific module for each behavior to propose a 3-level human-like moment retrieval framework (Tri-MRF). Specifically, we summarize human behaviors such as grasping the general structures of the document and the question separately, cross-scanning to mark the direct correspondences between keywords in the document and in the question, and summarizing to obtain the overall correspondences between document fragments and the question. Correspondingly, the proposed Tri-MRF model contains three modules: 1) a gist-oriented intra-modal comprehension module is used to establish contextual dependencies within each modality; 2) a content-oriented fine-grained comprehension module is used to explore direct correspondences between clips and words; and 3) a target-oriented integrated comprehension module is used to verify the overall correspondence between the candidate moments and the query. In addition, we introduce a biconnected GCN feature enhancement module to optimize query-guided moment representations. Extensive experiments conducted on three benchmarks, TACoS, ActivityNet Captions and Charades-STA, demonstrate that the proposed framework outperforms state-of-the-art methods.

Abstract:
High-frequency surface wave radar (HFSWR) is a powerful tool for ship detection and surveillance. However, the use of pre-trained deep learning (DL) networks for ship detection is challenging due to the limited training samples in HFSWR and the substantial differences between remote sensing images and everyday images. To tackle these issues, this paper proposes a coarse-to-fine target detection approach that combines traditional methods with DL, resulting in improved performance. The contributions of this work include: 1) a two-stage learning pipeline that integrates spatial-frequency analysis (SFA) with subnet-based neural networks, 2) an automatic linear thresholding algorithm for plausible target region (PTR) detection, and 3) a robust subnet neural network for fine target detection. The advantage of using SFA and subnet network is that the SFA reduces the need for extensive training data, while the subnet neural network excels at localizing ships even with limited training data. Experimental results on the HFSWR-RD dataset affirm the model's superior performance compared to rival algorithms.

Abstract:
Learning to recognize novel concepts from just a few image samples is very challenging as the learned model is easily overfitted on the few data and results in poor generalizability. One promising but underexplored solution is to compensate for the novel classes by generating plausible samples. However, most existing works of this line exploit visual information only, rendering the generated data easy to be distracted by some challenging factors contained in the few available samples. Being aware of the semantic information in the textual modality that reflects human concepts, this work proposes a novel framework that exploits semantic relations to guide dual-view data hallucination for few-shot image recognition. The proposed framework enables generating more diverse and reasonable data samples for novel classes through effective information transfer from base classes. Specifically, an instance-view data hallucination module hallucinates each sample of a novel class to generate new data by employing local semantic correlated attention and global semantic feature fusion derived from base classes. Meanwhile, a prototype-view data hallucination module exploits semantic-aware measure to estimate the prototype of a novel class and the associated distribution from the few samples, which thereby harvests the prototype as a more stable sample and enables resampling a large number of samples. We conduct extensive experiments and comparisons with state-of-the-art methods on several popular few-shot benchmarks to verify the effectiveness of the proposed framework.

Abstract:
Composed image retrieval (CIR) is an emerging and challenging research task that combines two modalities, a reference image and a modification text, into one query to retrieve the target image. In online shopping scenarios, the user would use the modification text as feedback to describe the difference between the reference and the desired image. To handle the task, two main problems must be addressed. One is the localization problem: how to precisely find those spatial areas of the image mentioned by the text. The other is the modification problem: how to effectively modify the image semantics based on the text. However, existing methods merely fuse information coarsely from the two modalities, while the accurate spatial and semantic correspondence between these two heterogeneous features tends to be neglected. Therefore, image details cannot be precisely located and modified. To this end, we consider integrating information from the two modalities more accurately from spatial and semantic aspects. Thus, we propose an end-to-end framework for the CIR task, which contains three key components, i.e., Multi-level Collaborative Localization module (MCL), Differential Semantics Discrimination module (DSD), and Image Difference Enhancement constraints (IDE). Specifically, to solve the localization problem, MCL precisely locates the text to the image areas by collaboratively using text positioning information on multiple image layers. For the modification problem, DSD builds a distribution to evaluate the modification possibility of each image semantic dimension, and IDE effectively learns the modification patterns of text against image embedding based on the distribution. Extensive experiments on three datasets show that the proposed method achieves outstanding performance against the SOTA methods.

Abstract:
The purpose of multi-label image classification is to assign multiple labels to the multiple objects present in one image. Recent research efforts exploit graph convolution networks (GCNs) to learn the label co-occurrence dependencies for enhancing the semantic representation. Although these methods have achieved promising results, they cannot capture the intrinsic correlation between objects in images and do not consider the inter-channel relationship. In addition, the previous methods treat each single image independently and fail to explore the relationship between different images. To address the above challenges, we propose a novel Dual Relation Graph Network (DRGN) model, which adopts a double-branch structure to excavate rich semantic information from intra-image and cross-image relations simultaneously. Specifically, we first develop an intra-image channel-relation mining (ICM) module to mine the inter-channel relationship in features while learning the importance of different channels. Secondly, we design a new GCN-based intra-image spatial-relation exploring (ISE) module to capture the correlation between objects in an individual image. Notably, the ISE module and the ICM module can complement and promote each other from the spatial and channel dimensions of images to improve the correlation between objects in an individual image. Thirdly, we propose a novel GCN-based cross-image semantic learning (CSL) module to learn the semantic relationship between different images in the mini-batch. Through graph reasoning, our CSL module can iteratively refine input image features by acquiring common semantic information from other images in the mini-batch. Extensive experiments on the MS-COCO 2014, PASCAL VOC 2007, and VG-500 datasets demonstrate that the proposed DRGN model outperforms current state-of-the-art methods.

Abstract:
In the field of computer vision, fine-grained image retrieval is an extremely challenging task due to the inherently subtle intra-class object variations. In addition, the high-dimensional real-valued features extracted from large-scale fine-grained image datasets slow the retrieval speed and increase the storage cost. To solve the above issues, existing fine-grained image retrieval methods mainly focus on finding more discriminative local regions for generating discriminative and compact hash codes. These methods achieve limited fine-grained image retrieval performance due to the large quantization errors and the confounding granularities and context of discriminative parts, i.e., the correct recognition of fine-grained objects is mainly attributed to the discriminative parts and their context. To learn robust causal features and reduce the quantization errors, we propose a deep progressive asymmetric quantization (DPAQ) method based on causal intervention to learn compact and robust descriptions for the fine-grained image retrieval task. Specifically, we introduce a structural causal model to learn robust causal features via causal intervention for fine-grained visual recognition. Subsequently, we design a progressive asymmetric quantization layer in the feature embedding space, which can preserve the semantic information and reduce the quantization errors sufficiently. Finally, we incorporate both the fine-grained image classification and retrieval tasks into an end-to-end deep learning architecture for generating robust and compact descriptions. Experimental results on several fine-grained image retrieval datasets demonstrate that the proposed DPAQ method performs the best for the fine-grained image retrieval task and surpasses the state-of-the-art fine-grained hashing methods by a large margin.

Abstract:
Most existing methods for audio classification assume that the vocabulary of audio classes to be classified is fixed. When novel (unseen) audio classes appear, audio classification systems need to be retrained with abundant labeled samples of all audio classes for recognizing base (initial) and novel audio classes. If novel audio classes continue to appear, the existing methods for audio classification will be inefficient and even infeasible. In this work, we propose a method for few-shot class-incremental audio classification, which can continually recognize novel audio classes without forgetting old ones. The framework of our method mainly consists of two parts: an embedding extractor and a classifier, and their constructions are decoupled. The embedding extractor is the backbone of a ResNet based network, which is frozen after construction by a training strategy using only samples of base audio classes. However, the classifier consisting of prototypes is expanded by a prototype adaptation network with few samples of novel audio classes in incremental sessions. Labeled support samples and unlabeled query samples are used to train the prototype adaptation network and update the classifier, since they are informative for audio classification. Three audio datasets, named NSynth-100, FSC-89 and LS-100 are built by choosing samples from audio corpora of NSynth, FSD-MIX-CLIP and LibriSpeech, respectively. Results show that our method exceeds baseline methods in average accuracy and performance dropping rate. In addition, it is competitive compared to baseline methods in computational complexity and memory requirement.
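
The decoupled design described above (a frozen embedding extractor plus a prototype classifier that grows in incremental sessions) can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not from the abstract: embeddings are plain Python lists, each prototype is simply the mean of the support embeddings, and prediction uses cosine similarity. In particular, the prototype adaptation network of the proposed method is replaced here by plain averaging, and the names `PrototypeClassifier`, `add_classes` and `predict` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


class PrototypeClassifier:
    """Hypothetical prototype classifier that can be expanded per session."""

    def __init__(self):
        self.prototypes = {}  # class label -> prototype vector

    def add_classes(self, support):
        """Add classes from {label: [embedding, ...]} without touching old ones.

        Assumption: the prototype is the plain mean of the support
        embeddings (no learned adaptation network).
        """
        for label, embeddings in support.items():
            self.prototypes[label] = mean_vector(embeddings)

    def predict(self, embedding):
        """Return the label whose prototype is most similar to the embedding."""
        return max(self.prototypes,
                   key=lambda label: cosine(embedding, self.prototypes[label]))


if __name__ == "__main__":
    clf = PrototypeClassifier()
    # Base session: many samples per class would be used in practice.
    clf.add_classes({"piano": [[1.0, 0.0], [0.9, 0.1]],
                     "speech": [[0.0, 1.0], [0.1, 0.9]]})
    # Incremental session: a novel class arrives with only two samples.
    clf.add_classes({"drum": [[-1.0, 0.0], [-0.9, -0.1]]})
    print(clf.predict([0.8, 0.2]), clf.predict([-1.0, 0.05]))
```

Because old prototypes are never modified when new classes are added, this toy classifier cannot forget base classes; the trade-off is that prototypes of novel classes estimated from very few samples may be noisy, which is the issue a learned adaptation step is meant to address.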

Abstract:
Exploring sample relationships within each mini-batch has shown great potential for learning image representations. Existing works generally adopt the regular Transformer to model the visual content relationships, ignoring the cues of semantic/label correlations between samples. Also, they generally adopt the ‘full’ self-attention mechanism, which is redundant and sensitive to noisy samples. To overcome these issues, in this paper, we design a simple yet flexible Batch-Graph Transformer (BGFormer) for mini-batch sample representations by deeply capturing the relationships of image samples from both visual and semantic perspectives. BGFormer has three main aspects. (1) It employs a flexible graph model, termed Batch Graph, to jointly encode both visual and semantic relationships of samples within each mini-batch. (2) It explores the neighborhood relationships of samples by borrowing the idea of sparse graph representation, and thus performs robustly w.r.t. noisy samples. (3) It devises a novel specific Transformer architecture that mainly adopts dual structure-constrained self-attention (SSA), together with graph normalization, FFN, etc., to carefully exploit the batch graph information for sample token (node) representations. As an application, we apply BGFormer to metric learning tasks. Extensive experiments on four popular datasets demonstrate the effectiveness of the proposed model.
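
The contrast between ‘full’ self-attention and neighborhood-restricted attention over a mini-batch can be illustrated with a small sketch. It is not the SSA module itself: it assumes identity query/key/value projections, uses a single k-nearest-neighbor graph built from scaled dot-product scores, and omits the semantic graph, graph normalization and FFN. The function name `knn_masked_attention` and its parameters are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def knn_masked_attention(tokens, k):
    """Self-attention over a mini-batch where each sample attends only to
    its k highest-scoring samples (a sparse neighborhood graph).

    tokens: list of n feature vectors, one per sample in the mini-batch.
    k:      neighborhood size; k == n recovers full self-attention.
    Assumption: queries, keys and values are the raw tokens themselves.
    """
    n = len(tokens)
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for i in range(n):
        scores = [dot(tokens[i], tokens[j]) / scale for j in range(n)]
        # Keep the k highest-scoring samples; ties keep the lower index.
        keep = sorted(range(n), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax restricted to the kept neighbors (max-shifted for stability).
        top = max(scores[j] for j in keep)
        weights = {j: math.exp(scores[j] - top) for j in keep}
        total = sum(weights.values())
        outputs.append([
            sum(weights[j] / total * tokens[j][d] for j in keep)
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    batch = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(knn_masked_attention(batch, k=2))  # sparse neighborhood
    print(knn_masked_attention(batch, k=4))  # full self-attention
```

With a small k, samples outside the neighborhood contribute nothing to the aggregated representation, which is the intuition behind the robustness to noisy samples; with k equal to the batch size, every sample mixes in information from all others.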

Abstract:
In the current research, many researchers have focused on instance-level pose tracking, which requires a 3D model of the object in advance, making it challenging to apply in practice. To address this limitation, some researchers have proposed the category-level object pose tracking method. Achieving accurate and speedy monocular category-level pose tracking is an essential research goal. In this article, we propose CatTrack, a new single-stage keypoints-based monocular category-level multi-object pose tracking network. A significant issue in object pose tracking tasks is utilizing the information from the previous frame to guide pose estimation for the next frame. However, as the object poses and camera information in each frame are different, we need to remove irrelevant information and emphasize useful features. To this end, we propose a transformer-based temporal information capture module to leverage the position information of keypoints from the previous frame. Furthermore, we propose a new keypoint matching module to enable the grouping and matching of object keypoints in complex scenes. We have successfully applied CatTrack to the Objectron dataset and achieved superior results in comparison to existing methods. Furthermore, we have also evaluated the generalization of CatTrack and successfully applied it to track the 6D pose of unseen real-world objects.

Abstract:
The video-based non-contact respiration detection technology can be used in many application scenarios to unobtrusively and ubiquitously monitor the physical state of living beings, and various researchers are currently working on this technology. The optical flow method in tandem with crossover point method is rather effective for respiration rate extraction. However, each method has one disadvantage: 1) the redundant feature points in the traditional optical flow method increase the computational effort and reduce the estimation accuracy; and 2) the traditional crossover point method suffers from crossover points unrelated to breathing movements. For these two challenges, two optimization points are proposed in this work: 1) optimize feature point space by combining spatio-temporal information; and 2) use negative feedback design to adaptively remove crossovers that are not related to respiratory movements. The performance of the proposed algorithm is validated by the Large-scale Bedside Respiration Dataset for Intensive Care (LBRD-IC), which is established using the actual surveillance videos acquired from ICU wards. The validity of the above two optimization points is verified by the ablation experiments. The influential analysis of computation time and video resolution on the performance of the proposed algorithm demonstrates that the proposed algorithm can be deployed to various application terminals to monitor the respiration rate of living organisms in real-time and with high accuracy. In addition, field measurements in the ICU ward have shown that our algorithm can measure respiratory signals of the single patient and multiple patients when only one surveillance camera is present.

Abstract:
Compared with RGB images, hyperspectral images (HSIs) offer a distinct advantage in that they can record continuous spectral bands of light reflectance in each pixel, reflecting the physical and chemical characteristics of materials. This capability enables differentiation between objects that may have similar textures but different spectral characteristics. It is desirable to recover spectral information from RGB images to improve semantic segmentation accuracy. Additionally, semantic information can serve as a guide for spectral information recovery, thereby ensuring the quality of the recovered spectral information. The two tasks are mutually beneficial in this regard. In light of these considerations, we propose a multi-task framework that exploits the complementary relationship between spectral recovery and semantic segmentation tasks, comprising a complementary spectral-semantic attentive fusion model (CSSF) that enables the two tasks to mutually facilitate each other by fusing information from both branches. Specifically, the proposed CSSF incorporates a window-based spectral-semantic attentive fusion (WSSAF) module to incorporate recovered spectral information into the segmentation process effectively, and a pixel-shuffle-based fusion (PSF) module to provide semantic guidance for spectral recovery. To evaluate the effectiveness of our approach, we built the first flower hyperspectral image dataset (FHRS) with corresponding segmentation annotations and RGB images. By doing so, we have made the first attempt to explore the complementary relationship between semantic segmentation and spectral recovery. Experimental results on both the FHRS dataset and the publicly available LIB-HSI dataset demonstrate that our proposed method has the ability to enhance both tasks by utilizing their complementary relationship, indicating the generalization ability of our method.

Abstract:
Conventional knowledge graphs (KGs) are composed solely of entities, attributes, and relationships, which poses challenges for enhancing multimodal knowledge representation and reasoning. To address the issue, this article proposes a multimodal deep learning-based approach to build a multimodal knowledge base (MMKB) for better multimodal feature (MMF) utilization. First, we construct a multimodal computation sequence (MCS) model for structured multimodal data storage. Then, we propose multimodal node, relationship, and dictionary models to enhance multimodal knowledge representation. Various feature extractors are used to extract MMFs from text, audio, image, and video data. Finally, we leverage generative adversarial networks (GANs) to facilitate MMF representation and update the MMKB dynamically. We examine the performance of the proposed method by using three multimodal datasets. BOW-, LBP-, Volume-, and VGGish-based feature extractors outperform the other methods, reducing the time cost by at least 1.13%, 22.14%, 39.87%, and 5.65%, respectively. The average time costs of creating multimodal indexes improve by approximately 55.07% and 68.60% exact matching rates compared with the baseline method, respectively. The deep learning-based autoencoder method reduces the search time cost by 98.90% after using the trained model, outperforming the state-of-the-art methods. In terms of multimodal data representation, the GAN-CNN models achieve an average correct rate of 82.70%. Our open-source work highlights the importance of flexible MMF utilization in multimodal KGs, leading to more powerful and diverse applications that can leverage different types of data.

Abstract:
Images are composed of “things” (i.e., structured objects) and “stuff” (i.e., textured surfaces), which have completely different effects on the human visual system (HVS). A good image quality assessment (IQA) method should fully consider the visual salience effects of image structures and the masking effects of image textures. In this article, we propose a perceptual quality analysis model using structure separation and high-order moments (SSHMPQA) in the deep domain. First, we use a total variation (TV) model to separate the perceptual structures in images from their deep feature maps, thereby maintaining meaningful object shapes with texture suppression and defining perceptual structure-aware distances in the deep domain. Then, we use the first- to fourth-order moments to calculate the mean, skewness and kurtosis of the probability distributions of the deep features. On this basis, we define a perceptual texture-aware distance in the deep domain. We then formulate the final model by solving a well-defined perceptual optimization problem. The proposed SSHMPQA model has good interpretability and is data-driven; moreover, the model does not require a complex and long training process because the optimization problem is convex and has an exact analytical solution. To verify the effectiveness of our model, comprehensive experiments are conducted. The experimental results show that the proposed model is superior to other state-of-the-art traditional and deep learning-based full-reference (FR) IQA methods.
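
The moment statistics mentioned in this abstract can be illustrated with a minimal sketch. It assumes a deep feature channel has been flattened into a plain list of values; the function name `moments` and the sample values are hypothetical and not taken from the paper. The sketch also returns the second-order moment (variance), which the abstract does not list explicitly.

```python
def moments(values):
    """Return mean, variance, skewness, and kurtosis of a list of feature values.

    Uses population (biased) central moments; this normalization is an
    assumption, not necessarily the one used by SSHMPQA.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centered) / n  # second central moment (variance)
    m3 = sum(c ** 3 for c in centered) / n  # third central moment
    m4 = sum(c ** 4 for c in centered) / n  # fourth central moment
    if m2 == 0:
        # Constant input: skewness and kurtosis are undefined, report zeros.
        return mean, 0.0, 0.0, 0.0
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return mean, m2, skewness, kurtosis


# Hypothetical flattened feature channel.
features = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, var, skew, kurt = moments(features)
```

A texture-aware distance could then compare these statistics between the reference and distorted feature maps; how the paper combines them is not specified in the abstract.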

Abstract:
Salient object detection (SOD) is an important preprocessing operation for various computer vision tasks. Most existing RGB-D SOD models employ additive or connected strategies to directly aggregate and decode multi-scale features to predict saliency maps. However, due to the large differences between the features of different scales, these aggregation strategies may lead to information loss or redundancy, and few methods explicitly consider how to establish connections between features at different scales in the decoding process, which consequently degrades the detection performance of the models. To this end, we propose a cascaded and aggregated Transformer Network (CATNet) which consists of three key modules, i.e., an attention feature enhancement module (AFEM), a cross-modal fusion module (CMFM) and a cascaded correction decoder (CCD). Specifically, the AFEM is designed on the basis of atrous spatial pyramid pooling to obtain multi-scale semantic information and global context information in high-level features through dilated convolution and a multi-head self-attention mechanism, enhancing high-level features. The role of the CMFM is to enhance and thereafter fuse the RGB features and depth features, alleviating the problem of poor-quality depth maps. The CCD is composed of two subdecoders in a cascading fashion. It is designed to suppress noise in low-level features and mitigate the differences between features at different scales. Moreover, the CCD uses a feedback mechanism to correct and repair the output of the subdecoder by exploiting supervised features, so that the problem of information loss caused by the upsampling operation during the multi-scale feature aggregation process can be mitigated. Extensive experimental results demonstrate that the proposed CATNet achieves superior performance over 14 state-of-the-art RGB-D methods on 7 challenging benchmarks.

Abstract:
Handling geometric distortion is the key to salient object detection (SOD) in 360° omnidirectional images. Most current methods integrate global and local visual cues through the fusion of the 360° equirectangular images and the corresponding 360° cube-map images. Fusion at a single level cannot effectively utilize the information shared between the 360° equirectangular images and the corresponding 360° cube-map images. In this work, we propose a semantics and context feature aggregation network (SCFANet) that fully explores the interactivity between the two types of projection data. Specifically, we use a Vision Transformer (ViT) to capture global visual cues for 360° equirectangular images and a Convolutional Neural Network (CNN) to capture local visual cues for 360° cube-map images. To achieve effective fusion of the two types of projection data, we design a semantic guidance module (SGM), in which semantic features are used to guide the information fusion of the 360° equirectangular images and the corresponding 360° cube-map images at each level. Then, a context fusion module (CFM) containing one local input and two context inputs is designed to integrate multi-scale features, where the local input extracts its own multi-scale information, and the context inputs complement it with fine details and location information. Finally, we use a feature aggregation and refinement module (FARM) to aggregate semantic and context features and adopt a deep supervision strategy for training. Extensive experiments on two public 360° datasets show that our SCFANet exhibits competitive performance compared to other state-of-the-art (SOTA) 360° salient object detection models.

Abstract:
To accurately retrieve similar objects from different domains, domain adaptive retrieval methods are applied to cope with the domain shift problem in information retrieval. However, existing methods still have two problems: a) they fail to filter out low-confidence samples, leading to error accumulation; and b) they ignore the negative effect of domain discrepancy. To address these two issues, we propose an efficient method called Dynamic Confidence Sampling and Label Semantic Guidance Learning (DCS-LSG). First, Dynamic Confidence Sampling (DCS) is employed to dynamically select high-confidence samples from the target domain so as to improve the effectiveness of learning. Second, Label Semantic Guidance (LSG) learning is presented to enhance the label semantics of features during domain adaptive retrieval. In addition, we introduce a Dual-Projection Relaxation (DPR) strategy to learn more effective features on two specific projection spaces. Finally, a two-step hashing strategy is used to generate high-quality hash codes. Experiments on multiple cross-domain retrieval datasets demonstrate that the proposed DCS-LSG achieves a significant performance improvement.

Abstract:
Image compressive sensing (CS), which recovers an unknown image from a small number of its measurements, has become an increasingly popular topic in multimedia technology and applications. For better reconstruction, diverse priors, from the original sparse prior to the new deep prior, have been exploited. Despite its powerful learning capability and satisfactory reconstruction performance, the deep prior is known as a black box and lacks clear interpretability. In this article, we first revisit image CS with different priors and observe that a method with a hand-crafted sparse prior can still outperform state-of-the-art methods with a deep prior or no prior under the same settings, while interpretability is well preserved. Then, to improve the performance of the sparse-prior-based method, we propose a Channel Adaptive Thresholding Network, namely CAT-Net. CAT-Net draws on channel correlation calculation to extend the single thresholding in the iterative soft thresholding algorithm (ISTA) into channel-wise thresholding. The channel adaptive thresholding conducts a soft thresholding operation in each channel of the image features and can be adjusted adaptively to the inputs, which yields more precise reconstruction than a single static threshold. The careful CAT operation can preserve patterns well both in detail and holistically. Experimental results demonstrate that the proposed method outperforms the state-of-the-art image CS methods with both traditional and deep priors.
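
The difference between a single ISTA threshold and channel-wise thresholding can be sketched as follows. This is a minimal illustration, not the CAT-Net implementation: in the paper the thresholds are derived adaptively from channel correlation, whereas here `thetas` is a hypothetical, fixed list with one threshold per channel, and the feature values are made up.

```python
def soft_threshold(x, theta):
    """Standard soft thresholding: shrink x toward zero by theta."""
    if x > theta:
        return x - theta
    if x < -theta:
        return x + theta
    return 0.0


def channelwise_soft_threshold(features, thetas):
    """Apply a separate soft threshold to each channel.

    features: list of channels, each a list of floats.
    thetas:   one non-negative threshold per channel (hypothetical values;
              CAT-Net predicts them from the input instead).
    """
    if len(features) != len(thetas):
        raise ValueError("need exactly one threshold per channel")
    return [
        [soft_threshold(v, theta) for v in channel]
        for channel, theta in zip(features, thetas)
    ]


# Two channels with identical content but different thresholds.
features = [[0.5, -1.5, 2.0], [0.5, -1.5, 2.0]]
thetas = [1.0, 0.25]
out = channelwise_soft_threshold(features, thetas)
```

With a single static threshold both channels would be shrunk identically; per-channel thresholds let a channel carrying fine detail keep small coefficients that another channel discards.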

Abstract:
The traditional discriminative correlation filter (DCF) has gained great popularity due to its high computational efficiency. However, the lightweight framework of DCF cannot guarantee robust performance when the tracker faces appearance variations within the background. These unpredictable appearance variations always distract the filter. Most existing DCF-based trackers either utilize deep convolutional features or incorporate additional constraints to improve tracking robustness. Despite some improvements, both approaches reduce the tracking speed and can only roughly alleviate the distractions caused by appearance variations. In this paper, a novel spatial reliability enhanced learning strategy is proposed to handle the aforementioned problems. By monitoring the variation of the response produced in the detection phase, a dynamic reliability map is generated to indicate the reliability of each background subregion. Then, label adjustment is conducted to suppress the distractions from these unreliable areas. Compared with the conventional constraint-based approach, in which a new term is always added to realize the desired goal, label adjustment is both more efficient and more effective. Moreover, to ensure the accuracy and dependability of the reliability map, an adaptively updated response pool recording reliable historical response values is proposed. Extensive experiments on three challenging unmanned aerial vehicle (UAV) benchmarks, i.e., UAV123@10fps, DTB70 and UAVDT, which together include 243 video sequences, validate the superiority of the proposed method over other state-of-the-art trackers and show remarkable generality in a variety of scenarios. Meanwhile, the tracking speed of 65.2 FPS on a low-cost CPU makes it suitable for real-time UAV applications.

Abstract:
Fully supervised salient object detection (SOD) has made considerable progress based on expensive and time-consuming data with pixel-wise annotations. Recently, to relieve the labeling burden while maintaining performance, some scribble-based SOD methods have been proposed. However, learning precise boundary details from scribble annotations that lack edge information is still difficult. In this article, we propose to learn precise boundaries from our designed synthetic images and labels without introducing any extra auxiliary data. The synthetic image creates boundary information by inserting synthetic concave regions that simulate the real concave regions of salient objects. Furthermore, we propose a novel self-consistent framework that consists of a global integral branch (GIB) and a boundary-aware branch (BAB) to train a saliency detector. GIB aims to identify integral salient objects, whose input is the original image. BAB aims to help predict accurate boundaries, whose input is the synthetic image. These two branches are connected through a self-consistent loss to guide the saliency detector to predict precise boundaries while identifying salient objects. Experimental results on five benchmarks demonstrate that our method outperforms the state-of-the-art weakly supervised SOD methods and further narrows the gap with the fully supervised methods.

Abstract:
This article introduces a novel approach to unsupervised image-to-image translation, aiming to overcome the limitations of existing methods in accurately capturing the shape of the source domain and the style of the target domain. The proposed method, called Semantic Cooperative Shape Perception (SCSP), focuses on enhancing the quality of generated images by addressing two key aspects. Firstly, the SCSP model employs a fusion generator that divides the mapping process into a unique texture part and a shared semantic part. By using different network structures and constraints, each part learns specific information. The unique texture generator emphasizes the style and texture details of the target domain, while the shared semantic generator focuses on the semantic information present in the source domain. This separation enables the sub-generators to extract and restore different aspects of the target domain more effectively. Secondly, a shape perception loss is introduced to improve the similarity of semantic images. It enhances the shared semantic generator's ability to perceive semantic information related to the same object by imposing constraints on the semantic graph of both the generated and input images. Therefore, the proposed method ensures semantic consistency during the translation process, leading to improved authenticity and image quality. Experimental results on four datasets, including horse2zebra, tiger2leopard, summer2winter, and photo2vangogh, demonstrate that the SCSP model achieves state-of-the-art visualization results and favorable evaluation metrics.

Abstract:
Due to the low computational cost of Hamming distance, hashing-based image retrieval has been widely adopted. Therefore, it is becoming increasingly important to quickly generate high-precision hash codes (also called hash features) from images. However, the existing deep hashing methods are vulnerable to image content variations; that is, it is difficult to generate stable and consistent hash codes for similar images. In addition, generating hash codes of different lengths requires retraining the model, which is expensive in training time. To address these problems, this paper proposes a deep hashing network (DHN) with a hybrid attention mechanism and adaptive weighting (HAAW) learning. It mainly consists of a feature extraction module, a feature refinement module, a classification layer, a hash layer and an adaptive weight layer. In particular, the hybrid attention mechanism combines bottom-up pixel saliency and top-down semantic constraints, in which the former is achieved through channel and spatial attention (CSA) and the latter is supervised by classification labels. In this way, it encourages the network to focus on dominant semantic features without being disturbed by irrelevant objects so that semantically similar images can be mapped to similar hash codes. We further propose an adaptive weighting learning algorithm to generate weights for each bit of the hash code generated by the deep network. Then, we directly generate shorter hash codes from the available long hash code according to the importance of bits represented by the weights. This avoids retraining the network for learning hash codes of different lengths. Extensive experiments on the public CIFAR-10, NUS_WIDE and ImageNet datasets show that our method achieves substantial improvements over its counterparts in terms of precision and speed.
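
The idea of deriving a shorter hash code from a long one by bit importance can be sketched as follows. This is an illustrative assumption about the selection rule (keep the highest-weighted bits); the function names, the weights, and the example codes are hypothetical, and in the paper the weights are learned by the adaptive weight layer.

```python
def shorten_code(code, weights, length):
    """Keep the `length` bits with the largest weights, in original bit order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = sorted(ranked[:length])  # restore positional order of kept bits
    return [code[i] for i in keep]


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical per-bit importance weights and two 6-bit codes.
weights = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]
code_a = [1, 0, 1, 1, 0, 0]
code_b = [1, 1, 1, 0, 0, 1]

short_a = shorten_code(code_a, weights, 3)
short_b = shorten_code(code_b, weights, 3)
full_distance = hamming(code_a, code_b)
short_distance = hamming(short_a, short_b)
```

In this toy example the two codes differ only in low-weight bits, so the 3-bit codes coincide even though the 6-bit codes are at Hamming distance 3. No retraining is involved; the same weights serve any target length.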

Abstract:
Unsupervised domain adaptation (UDA) involves the transfer of knowledge from a labelled source domain to an unlabelled target domain. Recent studies have introduced the concept of intermediate domains to handle significant domain discrepancies between source and target domains. Constructing an appropriate intermediate domain is a crucial step in handling scenarios with substantial domain differences. In this article, we propose a novel progressive UDA method called adaptive multi-scale intermediate domain via progressive training (AMPT), which has achieved remarkable effectiveness in alleviating large discrepancies between domains. We design a multi-scale similarity metrics module to solve the issue of different scales across different domains by simultaneously computing the pairwise distance between source domain images and the target domain images on multiple scales. Furthermore, we explore a progressive training strategy that facilitates smooth adaptation of the source domain to the target domain by utilizing the intermediate domain. During the progressive training process, considering a positive feedback mechanism, we iteratively leverage task losses and distillation loss through a dynamic threshold. This ensures that the trained intermediate domain branch progressively constrains the target domain branch, and the intermediate domain can be generated dynamically, leading to a smoother gradual adaptation from the source domain to the target domain. Extensive experimental results demonstrate the superiority of our proposed method AMPT on well-known UDA datasets, including Office-Home, Office-31 and DomainNet.

Abstract:
Estimating the 6-D poses of objects from RGB-D images holds great potential for several applications. However, given that the 6-D pose estimation accuracy is significantly affected by occlusion and noise between the objects in an image, this paper proposes a novel 6-D pose estimation method based on Efficient feature extraction and Point-pair feature matching. Specifically, we develop the Efficient channel attention Convolutional Neural Network (ECNN) and SO(3)-Encoder modules to extract 2-D features from the RGB image and SO(3)-equivariant features from the depth image, respectively. These features are fused in the DenseFusion module to obtain 3-D features in the camera space. Meanwhile, we exploit CAD model priors to obtain 3-D features in the model space through the model feature encoder, and then we globally regress the 3-D features in the camera and model space. According to these features, we generate oriented point clouds in each space, and then conduct point-pair feature matching to obtain pose information. Finally, we perform direct pose regression on the 3-D features in the camera and model space, and the resulting point-pair feature matching pose information is combined with the direct point-wise pose regression information to enhance pose prediction accuracy. Experimental results on three widely used benchmarking datasets demonstrate that our method achieves state-of-the-art performance, particularly for severely occluded scenes.

Abstract:
Objective Image Quality Assessment (IQA) aims to design computational models that can automatically predict the perceived quality of images. The state-of-the-art full-reference IQA metric, Deep Image Structure and Texture Similarity (DISTS), neglects the fact that natural images often consist of local structure and texture, and requires supervised training on an annotated dataset. In this article, we introduce multiple adaptive strategies to improve DISTS, resulting in an opinion-unaware IQA metric, named A-DISTS. Specifically, A-DISTS first uses the dispersion index as a statistical feature to adaptively localize structure and texture regions at different scales. Second, it adaptively assigns the spatial weights between local structure and texture similarity measurements according to the estimated structure or texture probability maps. Finally, it calculates the entropy of the image representation to adaptively weigh the importance of each feature map. As a result, A-DISTS is adapted to local image content and does not require any training. The experimental results demonstrate that the proposed metric correlates well with human ratings in the standard and algorithm-dependent IQA databases, and exhibits competitive performance in the optimization tasks of single image super-resolution, motion deblurring, and multi-distortion removal.
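
The dispersion index used as the statistical feature here is the variance-to-mean ratio. A minimal sketch of computing it over a local patch follows; it assumes non-negative feature values (for example, after a ReLU), and the `eps` guard and the patch values are illustrative assumptions. How A-DISTS maps the index to structure or texture probabilities is not specified in the abstract and is not modeled here.

```python
def dispersion_index(values, eps=1e-12):
    """Variance-to-mean ratio of a patch of non-negative feature values.

    eps is a hypothetical guard against division by zero for all-zero patches.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / (mean + eps)


# Two hypothetical patches with the same mean but different variability.
flat_patch = [2.0, 2.0, 2.0, 2.0]
varying_patch = [0.0, 4.0, 0.0, 4.0]

flat_di = dispersion_index(flat_patch)
varying_di = dispersion_index(varying_patch)
```

Because the index separates patches with equal mean but different variability, it can serve as a per-location cue without any training, which matches the opinion-unaware design described in the abstract.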

Abstract:
Channel pruning can efficiently reduce the computation and memory footprint within a reasonable accuracy drop by removing unnecessary channels from convolutional neural networks (CNNs). Among the various channel pruning approaches, sparsity training is the most popular because of its convenient implementation and end-to-end training. It automatically identifies the optimal network structures by applying regularization to parameters. Although this sparsity training has achieved a remarkable performance in terms of the trade-off between accuracy and network size reduction, it needs to be accompanied by a time-consuming fine-tuning process. Moreover, although activation functions with high performance are being continuously developed, the existing sparsity training does not display remarkable scalability for these new activation functions. To address these problems, this study proposes a novel pruning method, trunk pruning, which can produce a compact network by minimizing the accuracy drop during inference even without the fine-tuning process. In the proposed method, one kernel of the next convolutional layer absorbs all the information of the kernels to be pruned, considering the effects of the batch normalization (BN) shift parameters remaining after the sparsity training. Therefore, it is possible to eliminate the fine-tuning process because trunk pruning can effectively reproduce the output of the unpruned network after the sparsity training by removing the pruning loss. Furthermore, because trunk pruning is a technique that can effectively control only the shift parameters of the BN in the CONV layer, it has the significant advantage of being compatible with all BN-based sparsity training schemes and can address various activation functions.

Abstract:
As one of the classic problems of pattern recognition, online Handwritten Chinese Character Recognition (OLHCCR) has attracted the attention of many researchers. Yet, it remains challenging due to complex glyphs, numerous strokes, and the huge number of categories. Existing methods utilize temporal features or spatial features to recognize handwritten characters, which results in recognition errors for characters written with a non-standard stroke order. This paper proposes a new OLHCCR model based on 1-D Convolution and Two-Streams Transformers. The model has a 1-D Transformer and a Vision Transformer, and the 1-D Transformer contains a 1-D Convolution layer and Transformers; that is, the model has an overall structure of Two-Streams Transformers with 1-D Convolution, and is therefore named C-TST. It can fuse temporal and spatial features of Chinese characters to achieve high recognition accuracy and fast recognition speed. Specifically, each online handwritten Chinese character is represented by a trajectory sequence. The original trajectory sequence is preprocessed to enhance the information density of each trajectory point and the feature differences among trajectory points. Then, the preprocessed result is input into the 1-D convolution layer to obtain shallow temporal features, which are in turn used as the input of the Transformers to capture the temporal features. Simultaneously, a character image is generated by processing the original trajectory sequence and then fed into the Vision Transformer to capture the spatial features. By fusing the captured temporal and spatial features of online handwritten Chinese characters, the proposed C-TST achieves a recognition accuracy of 97.90% on ICDAR-2013 and a state-of-the-art recognition accuracy of 97.38% on IAHCC-UCAS2016.

Abstract:
Just Recognizable Distortion (JRD) refers to the minimum distortion that notably affects the recognition performance of a machine vision model. If a distortion added to images or videos falls within this JRD threshold, the degradation of the recognition performance will be unnoticeable. Based on this JRD property, it will be useful to Video Coding for Machine (VCM) to minimize the bit rate while maintaining the recognition performance of compressed images. In this study, we propose a deep learning-based JRD prediction model for image and video compression. We first construct a large image dataset of Object-Wise JRD (OW-JRD) containing 29 218 original images with 80 object categories, and each image was compressed into 64 distorted versions using Versatile Video Coding (VVC). Secondly, we analyze the distribution of the OW-JRD, formulate JRD prediction as binary classification problems and propose a deep learning-based OW-JRD prediction framework. Thirdly, we propose a deep learning based binary OW-JRD predictor to predict whether an image object is still detectable or not under different compression levels. Also, we propose an error-tolerance strategy that corrects misclassifications from the binary classifier. Finally, extensive experiments on large JRD image datasets demonstrate that the Mean Absolute Errors (MAEs) of the predicted OW-JRD are 4.90 and 5.92 on different numbers of the classes, which is significantly better than the state-of-the-art JRD prediction model. Moreover, ablation studies on deep network structures, object sizes, features, data padding strategies and image/video coding schemes are presented to validate the effectiveness of the proposed JRD model.

Abstract:
Accurate 3D object segmentation in point clouds is a basis for industrial robot applications, such as robot manipulation and digital twin, which require an understanding of the 3D environment. However, the unstructured and disordered nature of point clouds makes it challenging, especially for the incomplete 3D data under a single view in the real-world scenario. To this end, this article proposes a novel 3D object segmentation framework (3DT-Seg) based on Cross-Window Point Transformer (CP-Former). CP-Former captures the long-range dependencies between local windows and latent semantic boundaries to enhance the point-wise features extracted from irregular point clouds via a bidirectional cross-attention mechanism. In addition, a contrastive learning loss and an adaptive dual aggregation strategy are introduced on semantic transition regions during the semantic supervising and instance clustering process, respectively. In this way, the latent boundary information is further utilized to improve the overall segmentation performance. Experiments on the popular benchmark dataset (S3DIS) show the state-of-the-art performance of the proposed approach in terms of semantic and instance segmentation. Furthermore, a real-world point cloud dataset (IP-Cloud) for the robotic grasping task is presented to fully validate the effectiveness of our method in practice, where it also achieves remarkable performance.

Abstract:
Automated radiology report generation aims to generate accurate and radiologist-like descriptions for the patient's images, which can greatly relieve the workload of radiologists. However, due to the data bias and long report problems, medical report generation has been a challenging task. In this article, we propose a Flexible Multi-view Paradigm (FMVP) for medical report generation in a novel observation-to-concept manner. It first makes some medical observations automatically or with the help of a radiologist on the patient's image to obtain patient-related priori knowledge, just as radiologists do in practice. Furthermore, to bridge the gap between pretrain and generation phases, the hierarchical alignment is proposed to jointly conduct the implicit alignment between region-tag and the explicit global alignment of the image-report pair. Finally, a compatible decoder towards decoding the fused multi-view knowledge is proposed to capture more complementary information for the report generation, which breaks the traditional entrenched decoding mechanism guided by visual information. Extensive quantitative and qualitative experiments on the public MIMIC-CXR and IU-Xray datasets show that our model achieves competitive performance compared to state-of-the-art methods.

Abstract:
Image copy-move forgery detection (CMFD) has become a challenging problem due to increasingly powerful editing software that makes forged images increasingly realistic. Existing algorithms that directly connect multiple scales of features in the encoder part may not effectively aggregate contextual information, resulting in poor performance. In this paper, an end-to-end context multiscale cross-fusion network (CMCF-Net) is proposed to detect image copy-move forgery. The proposed network consists of a multiscale feature extraction fusion (MSF) module and a multi-information fusion decoding (MFD) module. Multiscale information is efficiently extracted and fused in the MSF module utilizing stacked-scale feature fusion, which improves the network's forgery localization ability on objects of different scales. The MFD module employs contextual information combination and weighted fusion of multiscale information to guide the network in obtaining relevant clues from correlated information at multiple different scales. Experimental results and analysis have demonstrated that the proposed CMCF-Net achieves the best localization results with higher robustness.

Abstract:
Video-based point cloud compression (V-PCC) is a state-of-the-art moving picture experts group (MPEG) standard for point cloud compression. V-PCC can be used to compress both static and dynamic point clouds in a lossless, near lossless, or lossy way. Many objective quality metrics have been proposed for distorted point clouds. Most of these metrics are full-reference metrics that require both the original point cloud and the distorted one. However, in some real-time applications, the original point cloud is not available, and no-reference or reduced-reference quality metrics are needed. Three main challenges in the design of a reduced-reference quality metric are how to build a set of features that characterize the visual quality of the distorted point cloud, how to select the most effective features from this set, and how to map the selected features to a perceptual quality score. We address the first challenge by proposing a comprehensive set of features consisting of compression, geometry, normal, curvature, and luminance features. To deal with the second challenge, we use the least absolute shrinkage and selection operator (LASSO) method, which is a variable selection method for regression problems. Finally, we map the selected features to the mean opinion score in a nonlinear space. Although we have used only 19 features in our current implementation, our metric is flexible enough to allow any number of features, including future more effective ones. Experimental results on the Waterloo point cloud dataset version 2 (WPC2.0) and the MPEG point cloud compression dataset (M-PCCD) show that our method, namely PCQAML, outperforms state-of-the-art full-reference and reduced-reference quality metrics in terms of Pearson linear correlation coefficient, Spearman rank order correlation coefficient, Kendall's rank-order correlation coefficient, and root mean squared error.
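
The role of LASSO as a feature selector can be illustrated with a small coordinate-descent sketch. It minimizes (1/(2n))·||y − Xw||² + alpha·||w||₁ and treats features whose coefficient is driven exactly to zero as discarded. The data, the `alpha` value, and the function names are hypothetical; the paper's 19 features and its actual solver are not reproduced here.

```python
def soft(x, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(X, y, alpha, iters=100):
    """Cyclic coordinate descent for LASSO regression (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft(rho, alpha) / z if z > 0 else 0.0
    return w


# Hypothetical data: the score depends strongly on feature 0, weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.05, 1.95, -1.95, -2.05]

w = lasso_cd(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
```

Here the weak feature's coefficient is shrunk to exactly zero, so only feature 0 is selected. The selected features would then be passed to the nonlinear mapping to the mean opinion score described in the abstract.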

Abstract:
With the ease of accessing large unlabeled datasets, studies on semi-supervised learning for object detection (SSOD) have become increasingly popular. Among these SSOD studies, the pseudo-labeling method significantly depends on the accuracy of the pseudo-labels; thus, inaccurate annotations must be filtered to prevent performance degradation. This study classifies annotation errors that occur in pseudo-labeling methods as false negative (FN) and false positive (FP), and solutions to address each type of error are proposed using uncertainty information obtained through Gaussian modeling. Network performance is improved by preventing the background learning of the FN objects based on the uncertainty of the network output. In addition, based on the uncertainty of the annotations, low-reliability annotations are filtered out, and the learning reflectivity of FP objects is determined. Considering the network performance improvement and training complexity, the proposed method employs one-phase learning, including a single pseudo-label update, to achieve maximum performance with the minimum learning process. Moreover, an algorithm is proposed for an optimal update point search to increase the expected performance improvement. Experiments on the Pascal VOC, COCO, and Cityscapes datasets show that the SSD network improves accuracy by 3.3%, 4.7%, and 4.1%, respectively, with negligible computational complexity compared to the baseline.

Abstract:
Pedestrian reidentification (ReID) is a challenging task that involves identifying and retrieving specific pedestrians across different cameras and scenes. This problem has significant implications for security surveillance, and has thus received substantial attention in recent years. However, traditional convolutional neural networks (CNNs) have limited receptive fields and cannot capture global information. Moreover, transformer networks, which excel in long-range feature capture, are prone to accuracy degradation due to loss of details. To address these limitations, we propose a transformer-based pedestrian ReID network with double-branch information mutual gain (DIMGNet), which leverages hierarchical parallel levels to support multi-granularity feature information mutual gain. Our model also incorporates an auxiliary camera information (ACI) module to improve feature representation ability. We further embed a cross-attention mechanism into the architecture to enhance mutual gain between multi-granularity features and improve feature discrimination. Finally, we introduce a shuffling technique to increase the robustness of the extracted features. We evaluate the proposed method on several benchmark datasets, including Market-1501 (Zhou et al., 2022), MSMT17 (Wei et al., 2018), DukeMTMC-reID (Ristani et al., 2016), and Occluded-Duke (Miao et al., 2019), achieving mAP values of 90.7%, 68.4%, 83.7%, and 60.6%, respectively. Our method outperforms most state-of-the-art methods, demonstrating the effectiveness of our method.

Abstract:
Point clouds have become increasingly prevalent in representing 3D scenes within virtual environments, alongside 3D meshes. Their ease of capture has facilitated a wide array of applications on mobile devices, from smartphones to autonomous vehicles. Notably, point cloud compression has reached an advanced stage and has been standardized. However, the availability of quality assessment datasets, which are essential for developing improved objective quality metrics, remains limited. In this paper, we introduce BASICS, a large-scale quality assessment dataset tailored for static point clouds. The BASICS dataset comprises 75 unique point clouds, each compressed with four different algorithms including a learning-based method, resulting in the evaluation of nearly 1500 point clouds by 3500 unique participants. Furthermore, we conduct a comprehensive analysis of the gathered data, benchmark existing point cloud quality assessment metrics and identify their limitations. By publicly releasing the BASICS dataset, we lay the foundation for addressing these limitations and fostering the development of more precise quality metrics.

Abstract:
Video-based point cloud compression (V-PCC) is a promising technique for compressing 3D point clouds. V-PCC projects the 3D point cloud into patches and encodes the generated 2D images using state-of-the-art video codecs. To maintain temporal consistency between frames, V-PCC supports global patch packing methods and one notable approach is Global Patch Allocation (GPA), which packs the global matched patches into the same location in each frame across the sequence. Additionally, frames are subdivided into groups (i.e., sub-contexts) to balance packing compactness and patch similarity within the groups. While video coding typically employs a Group of Picture (GOP) as the basic unit for encoding, GPA in V-PCC currently does not consider the reference relationship between images within or between GOPs, resulting in limited similarity between the current and the reference images, ultimately leading to reduced encoding efficiency. This paper presents an improved technique for GPA. We propose a dynamic sub-context and GOP determination technique, enhancing the similarity between images within the same GOP. Furthermore, we introduce a priority-based patch packing (PBPP) technique to reduce differences between frames in adjacent GOPs. Experimental results demonstrate the superiority of our proposed method over the anchor, achieving an average BD-rate savings of 3.09%, 3.04%, and 2.33% for D1-PSNR, D2-PSNR, and Y-PSNR, respectively.
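
The BD-rate savings quoted above are Bjøntegaard delta rates. As a rough illustration of how such a figure is commonly computed, the sketch below uses the usual cubic-fit formulation: fit log-bitrate as a cubic polynomial of PSNR for each codec, integrate both fits over the shared PSNR range, and convert the average log difference to a percentage. It is not code from the paper; the function name `bd_rate` and the rate/PSNR points are hypothetical.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor at equal quality.

    Negative values mean the test codec needs fewer bits than the anchor.
    """
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    psnr_a = np.asarray(psnr_anchor, dtype=float)
    psnr_t = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of PSNR for each rate-distortion curve.
    poly_a = np.polyfit(psnr_a, log_ra, 3)
    poly_t = np.polyfit(psnr_t, log_rt, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(psnr_a.min(), psnr_t.min())
    hi = min(psnr_a.max(), psnr_t.max())
    prim_a = np.polyint(poly_a)
    prim_t = np.polyint(poly_t)
    int_a = np.polyval(prim_a, hi) - np.polyval(prim_a, lo)
    int_t = np.polyval(prim_t, hi) - np.polyval(prim_t, lo)

    # Mean log-rate gap over the interval, converted to a percentage.
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical rate (kbps) and PSNR (dB) points: the test codec reaches the
# same PSNR values as the anchor with 10% fewer bits at every point.
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
anchor_psnr = [30.0, 33.0, 36.0, 39.0]
test_rate = [0.9 * r for r in anchor_rate]

saving = bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr)
print(f"BD-rate: {saving:.2f}%")
```

For point clouds the same computation is applied with D1-PSNR, D2-PSNR, or Y-PSNR as the quality axis.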

Abstract:
Infographics, which usually contain many well-designed visual elements, have significant advantages in delivering information efficiently and accurately. Previous research shows that proportion-related infographics make up the majority of all infographics. However, the creation of proportion-related infographics is difficult for general users. Recently, many researchers have focused on generating infographics from text with a single proportional fact. Our further research found that users tend to create infographics with multiple proportional facts. Existing research lacks modeling of relations between different facts, resulting in poor performance when generating infographics with multiple facts. In this paper, we model the relationship of different proportional facts based on the results of our investigation and design a deep learning-based model to classify them. At the same time, we also optimize the ability to extract multiple proportional facts from text. The experiments show that our model outperforms existing models when visualizing text with multiple proportional facts.

Abstract:
Despite the astounding progress made in semi-supervised learning (SSL) and imbalanced supervised learning (ISL), there has been little attention devoted to the research of imbalanced semi-supervised learning (ISSL). The “Matthew effect”, a phenomenon where a disparity in data representation becomes more severe in a class-imbalanced dataset during training, could be amplified in a semi-supervised setting. In this study, we addressed two key challenges in ISSL: maintaining the reliability of pseudo-labels and ensuring a balanced representation of features. Specifically, we propose a class-aware feature-diffusion constraint and reliable pseudo-labeling (DCRP) framework to address these issues. In the DCRP, we counteract the overconfidence problem of softmax by adding an extra class to the typical K class problem without the need for additional parameters. Moreover, we introduced a flexible class-aware feature diffusion constraint in the feature extractor, promoting a more balanced feature diversity. Experimental validations on various datasets, such as CIFAR10-LT, CIFAR100-LT, SVHN-LT, and Small ImageNet-127, demonstrated consistent improvements in accuracy with our DCRP method. In particular, we achieved a steady improvement in accuracy of approximately 1% under the newly published ACR prototype across most settings.

Abstract:
Few-shot segmentation (FSS) aims to segment the novel class with a few annotated images. Due to CLIP's advantages of aligning visual and textual information, the integration of CLIP can enhance the generalization ability of FSS models. However, even with the CLIP model, the existing CLIP-based FSS methods are still subject to biased prediction towards the base classes, which is caused by class-specific feature-level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias, and formulates diverse tasks in a unified manner by assembling the prior through affinity. Specifically, the class-relevant textual and visual features are first transformed to a class-agnostic prior in the form of a probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) including multiple General Assemble Units (GAUs) is introduced. It considers diverse and plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure the class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features, without relying on any class-specific information. It achieves new state-of-the-art results in the FSS task, with mIoU of 77.6 on PASCAL-5^i and 59.4 on COCO-20^i in the 1-shot scenario. Beyond this, we show that without extra re-training, the proposed PGMA-Net can solve bbox-level and cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS) tasks, leading to an any-shot segmentation framework capable of accommodating diverse weak or pixel annotations.

Abstract:
Video-based person re-identification (Re-ID) is designed to retrieve target pedestrians in video sequences under non-overlapping cameras. At present, mainstream approaches post-process the feature map extracted by the convolutional neural network backbone to obtain a global representation or a fine-grained local representation for higher accuracy. However, they still suffer from challenges, such as information loss for global-based methods and spatio-temporal feature fragmentation for local-based methods. To alleviate these problems, this article proposes a Spatio-Temporal Feature Enhancement (STFE) network from a spatio-temporal comprehensive perspective, combining the advantages of the above methods to obtain more comprehensive information from video tracklets. STFE consists of two main modules: the Feature Space Projection Module (FSPM) and the Global Low-frequency Enhancement Module (GLEM). FSPM mathematically converts continuous video information into a discrete feature space and selectively retains more useful information, thus avoiding spatio-temporal information loss. Meanwhile, FSPM applies global features instead of dividing feature maps spatially, thereby avoiding spatio-temporal feature fragmentation. In addition, GLEM, which is based on a transformer, acts as a broadband low-pass filter to mine richer global comprehensive information. Finally, by combining FSPM with GLEM, STFE can obtain a spatio-temporal comprehensive video representation. Extensive experiments were conducted on two widely-used video Re-ID datasets. The experimental results verify our idea and demonstrate the effectiveness of the proposed STFE with 95.5% Rank-1 accuracy on MARS benchmarks, which surpasses previous state-of-the-art methods by a large margin of +4%.

Abstract:
Previous Sentiment Analysis (SA) studies have demonstrated that exploring sentiment cues from multiple synchronized modalities can effectively improve the SA results. Unfortunately, until now there is no publicly available dataset for multimodal SA of the stock market. Existing datasets for stock market SA only provide textual stock comments, which usually contain words with ambiguous sentiments or even sarcasm words expressing opposite sentiments of literal meaning. To address this issue, we introduce a Fine-grained Multimodal Sentiment Analysis dataset built upon 1,247 Stock Comment videos, called FMSA-SC. It provides both multimodal sentiment annotations for the videos and unimodal sentiment annotations for the textual, visual, and acoustic modalities of the videos. In addition, FMSA-SC also provides fine-grained annotations that align text at the phrase level with visual and acoustic modalities. Furthermore, we present a new fine-grained multimodal multi-task framework as the baseline for multimodal SA on the FMSA-SC.

Abstract:
In the image dehazing task, the haze density is a key feature that affects the performance of dehazing methods. The haze density difference, which has rarely been utilized in previous methods, can guide networks to perceive different global densities and focus on local areas with high density or that are difficult to dehaze. In this paper, we propose a density-aware dehazing method named the Density Feature Refinement Network (DFR-Net), which extracts haze density features from density differences and leverages density differences to refine density features. In DFR-Net, we first generate a proposal image that has a lower overall density than the hazy input, resulting in global density differences. Additionally, the dehazing residual of the proposal image reflects the level of dehazing performance and provides local density differences that indicate localized hard dehazing or high-density areas. Subsequently, we introduce a Global Branch (GB) and a Local Branch (LB) to achieve density awareness. In GB, we use Siamese networks for feature extraction of hazy inputs and proposal images, and we propose a Global Density Feature Refinement (GDFR) module that can refine features by pushing features with different global densities further away. In LB, we explore local density features from the dehazing residuals between hazy inputs and proposal images and introduce an Intermediate Dehazing Residual Feedforward (IDRF) module to update local features and pull them close to clear image features. Sufficient experiments demonstrate that the proposed method outperforms state-of-the-art methods on various datasets.

Abstract:
This paper studies a practical Source-free unsupervised domain adaptation (SFUDA) problem, which transfers knowledge of source-trained models to the target domain, without accessing the source data. It has received increasing attention in recent years, while the prior arts focus on designing adaptation strategies, ignoring that different target samples exhibit different transfer abilities on the source model. Additionally, we observe pixel-wise class prediction is typically accompanied by ambiguity issue, i.e., prediction errors often occur between several confusing classes. In this study, we propose a dual-branch collaborative learning framework that aims to achieve reliable knowledge transfer from important samples to the rest by fully mining confident prototypes in the target data. Concretely, we first partition the target data into confident samples and uncertain samples via a new class-ranking reliability score and then utilize the latent features from the confident branch as guidance to promote the learning of the uncertain branch. For ambiguity issue, we propose a feature relabelling module, which exploits reliable prototypes in the mini-batch as well as in the target data to refine labels of uncertain features. We further deploy the proposed framework to commonly used CNN and state-of-the-art Transformer architectures and reveal the potential to promote the generalization ability of backbone models. Experimental results on both natural and medical benchmark datasets verify that our proposed approach exceeds state-of-the-art SFUDA methods with large margins, and achieves comparable performance to existing UDA methods.

Abstract:
Conventional active learning samples from the unlabeled set by designing efficient algorithms for evaluating sample information. However, information redundancy between candidate sets is often overlooked. This can cause similar data to be labeled repeatedly, producing ineffective gains for the model. In this paper, we propose an Unsupervised Redundant Feature Elimination Active Learning module (URFEAL), which utilizes the information feature coincidence of the unlabeled set to eliminate information-redundant data, thus guaranteeing the validity of each candidate data. URFEAL consists of a feature clusterer and an eliminator. The feature clusterer computes class boundaries based on feature densities to discretize each class of the candidate set, and the eliminator judges data similarity by overlapping degree to eliminate redundant data features. Furthermore, we propose an anti-noise sampling strategy, Outlier Feature Elimination (OFE), in URFEAL to filter mislabeled sets for relabeling in the data sampling stage. We extensively evaluate our method by image classification and perform experimental validation on CIFAR-10, CIFAR-100 and CALTECH-101. The experimental results show that the improvements we make are especially significant for most existing active learning algorithms in the low data stage, which demonstrates the effectiveness and generality of URFEAL.

Abstract:
Multilayer analytic learning plays a crucial role in data mining and representation learning. Nevertheless, most existing methods encounter inefficiencies in latent space encoding, resulting in less effective data representations. Aimed at addressing this limitation, this paper introduces two potent analytic learning methods, the progressive learning-based hierarchical subnet neural network (P-HSNN) and the robust P-HSNN (RP-HSNN). The contributions are as follows. First, two progressive learning strategies based on subnetwork nodes are proposed. Second, the RP-HSNN is a Laplacian matrix-based algorithm, where label information and input representations are utilized simultaneously to optimize the subspace feature. Third, the dimension of the subnetwork node is gradually increased. The global-level representation is formed by combining the features from the subnetworks. The model's convergence is thoroughly demonstrated through rigorous mathematical proof. Experimental analyses across various domains, spanning a wide range of training samples from 2,754 to 1,623,114, confirm the superior performance of the proposed algorithms over state-of-the-art multilayer analytic learning methods.

Abstract:
Text spotting in natural scenes is of increasing interest and significance due to its critical role in several applications, such as visual question answering, named entity recognition and event rumor detection on social media. One of the newly emerging challenging problems is Tattoo Text Spotting (TTS) in images for assisting forensic teams and for person identification. Unlike the generally simpler scene text addressed by current state-of-the-art methods, tattoo text is typically characterized by the presence of decorative backgrounds, calligraphic handwriting and several distortions due to the deformable nature of the skin. This paper describes the first approach to address TTS in a real-world application context by designing an end-to-end text spotting method employing a Hilbert transform-based Generative Adversarial Network (GAN). To reduce the complexity of the TTS task, the proposed approach first detects fine details in the image using the Hilbert transform and the Optimum Phase Congruency (OPC). To overcome the challenges of only having a relatively small number of training samples, a GAN is then used for generating suitable text samples and descriptors for text spotting (i.e., both detection and recognition). The superior performance of the proposed TTS approach, for both tattoo and general scene text, over the state-of-the-art methods is demonstrated on a new TTS-specific dataset (publicly available) as well as on the existing benchmark natural scene text datasets: Total-Text, CTW1500 and ICDAR 2015.

Affiliations: Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; Department of Information Engineering and Computer Science, University of Trento, Povo-Trento, Italy; Huawei, Bantian Campus, Longgang District, Shenzhen, China; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Mathematics and Computer Science, University of Catania, Catania, CT, Italy; Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Abstract:
In the ever-evolving domain of multimedia, the significance of multi-modality understanding cannot be overstated. As multimedia content becomes increasingly sophisticated and ubiquitous, the ability to effectively combine and analyze the diverse information from different types of data, such as text, audio, image, video and point clouds, will be paramount in pushing the boundaries of what technology can achieve in understanding and interacting with the world around us. Accordingly, multi-modality understanding has attracted a tremendous amount of research, establishing itself as an emerging topic. Pre-trained models, in particular, have revolutionized this field, providing a way to leverage vast amounts of data without task-specific annotation to facilitate various downstream tasks.

Abstract:
Recognizing violence in videos is significant for the automatic identification and assessment of violence content to restrict the access to violence for specific audiences such as children. Existing methods focus on violence detection, which is only able to recognize whether there exists violence or not. Differently, this paper handles the problem of video violence rating, which provides a more granular classification of violence levels. However, there is no publicly available database for video violence rating since it asks for fine-grained violence level annotations. Therefore, this paper introduces a large-scale violence rating database, which will be publicly released. Furthermore, we propose a multimodal violence rating model. Different from existing models, our model makes use of the token-based interaction and contrastive learning techniques. The token-based interaction is able to strengthen the feature representations and make full use of multimodal features. The contrastive learning can improve the performance of the model. To evaluate our model, a wide range of experiments are conducted, and experiment results show that our model outperforms existing methods.

Abstract:
Bio-inspired event cameras record a scene as sparse and asynchronous “events” by detecting per-pixel brightness changes. Such cameras show great potential in challenging scene understanding tasks, benefiting from the imaging advantages of high dynamic range and high temporal resolution. Considering the complementarity between event and standard cameras, we propose a multi-modal fusion network (EISNet) to improve the semantic segmentation performance. The key challenges of this topic lie in (i) how to encode event data to represent accurate scene information and (ii) how to fuse multi-modal complementary features by considering the characteristics of two modalities. To solve the first challenge, we propose an Activity-Aware Event Integration Module (AEIM) to convert event data into frame-based representations with high-confidence details via scene activity modeling. To tackle the second challenge, we introduce the Modality Recalibration and Fusion Module (MRFM) to recalibrate modal-specific representations and then aggregate multi-modal features at multiple stages. MRFM learns to generate modal-oriented masks to guide the merging of complementary features, achieving adaptive fusion. Based on these two core designs, our proposed EISNet adopts an encoder-decoder transformer architecture for accurate semantic segmentation using events and images. Experimental results show that our model outperforms state-of-the-art methods by a large margin on event-based semantic segmentation datasets.

Abstract:
Nowadays, capturing cherished moments results in an abundance of photos, which necessitates the selection of the finest one from a pool of akin images, a process both intricate and time-intensive. Thus, series photo selection (SPS) techniques have been developed to recommend the optimal moment from nearly identical photos through the use of aesthetic quality assessment. However, addressing SPS proves demanding due to the subtle nuances within such imagery. Existing approaches predominantly rely on diverse feature types (e.g., color, layout, generic features) extracted from original images to discern the qualified shot, yet they disregard disentangling generality and specificity at the feature level. This study aims to detect subtle aesthetic distinctions among akin photos. We propose a feature separation model that captures all label-relevant information through an encoder. We introduce Information Bottleneck (IB) learning to obtain non-redundant representations of image pairs and filter out noise information from the representations. Our model segregates image features into shared and specific attributes by employing feature constraints to boost mutual information across images and guide meaningful information within individual images. This process filters out extraneous data within individual images, thus significantly enhancing the representation of similar image pairs. Extensive experiments on the Phototriage dataset show that our model can accentuate subtle disparities and achieve superior results when compared to alternative methods.

Abstract:
Point cloud semantic segmentation is a fundamental task in 3D scene understanding and has recently achieved remarkable progress. The success of existing approaches is attributed to recent advanced deep networks for point clouds and the availability of a large amount of labeled training data. However, creating such fully annotated training datasets for supervised point cloud semantic segmentation methods is a time-consuming and labor-intensive process, which increases the difficulty of extending supervised approaches to new application scenarios. To alleviate the data-hungry nature of deep learning, we propose PCL, the point contrast and labeling framework for weakly supervised point cloud semantic segmentation with small percentages of point-level annotations. The core idea of this method is to exploit contrastive learning to help learn a larger number of discriminative feature representations with limited annotations. By introducing two types of contrastive relationships, cross-sample point contrast and low-level similarity-based point contrast, our proposed framework can directly regularize the learned feature space, considering not only the low-level similarity within each point cloud but also the discriminative semantics within and across point clouds on both labeled and unlabeled points via pseudo labels. In addition, we propose a pseudo label refinery module to generate robust and reliable pseudo labels online, reducing the negative impact of incorrect pseudo labels. Our method achieves state-of-the-art performance on a diverse set of label-efficient semantic segmentation tasks.

Abstract:
Images captured in adverse weather conditions, such as haze, fog, smog, or mist, have reduced visibility, contrast, and color fidelity. These impairments challenge various computer vision applications, such as intelligent transportation, video surveillance, weather forecasting, and remote sensing. While many daytime dehazing techniques exist, they are less effective for nighttime images, which have additional issues, such as nonuniform illumination, texture blurring, glow effects, color distortion, noise, and low light. This paper proposes a novel method for improving the quality of nighttime images affected by haze and low-light conditions. Our method, Nighttime Dehazing, Low-Light Enhancement, and Light Suppression (NDELS), integrates three key processes: enhancing visibility, brightening low-light areas, and suppressing glare from bright light sources. We also introduce a novel method for generating training data to help our model learn light suppression better. We evaluate our method against eight state-of-the-art algorithms on four diverse datasets. The simulation results show that the presented method outperforms the state-of-the-art methods quantitatively and qualitatively. For example, our method i) improves the overall image quality, color fidelity, and edges and ii) achieves 8.8% higher PSNR and 4.5% higher SSIM scores, as well as a better subjective rating. Moreover, our method enhances real-world object detection tasks, surpassing other methods in performance.

Abstract:
This paper focuses on the challenge of accurately estimating the subjective quality of multimedia content from noisy opinion scores gathered from end-users. State-of-the-art methods rely on parametric statistical models to capture the subject's scoring behavior and recover quality estimates. However, these approaches have limitations, as they often require restrictive assumptions to achieve numerical stability during parameter estimation, leading to a lack of robustness when the modeling hypotheses do not fit the data. To overcome these limitations, we propose a paradigm shift towards non-parametric statistical methods. Specifically, we introduce a threefold contribution: i) in contrast to the prevailing approach in subjective quality recovery assuming a parametric score distribution, we propose a non parametric approach that guarantees greater accuracy by measuring reliability per subject and per stimulus, overcoming the limits of existing approaches that measure only per subject reliability; ii) we propose ESQR, a non-parametric algorithm for subjective quality recovery, demonstrating experimentally that it has higher robustness to noise compared to numerous state-of-the-art algorithms, thanks to the weaker assumptions made on data compared to parametric approaches; iii) the proposed approach is theoretically grounded, i.e., we define a non-parametric statistic and prove mathematically that it provides a measure of score reliability.

Abstract:
Audio-visual segmentation (AVS) aims to segment the object instances that produce sound at the time of the video frames. Existing related solutions focus on designing cross-modal interaction mechanisms, which try to learn audio-visual correlations and simultaneously segment objects. Despite effectiveness, the close-coupling network structures become increasingly complex and hard to analyze. To address these problems, we propose a simple but effective method, ‘Each Performs Its Functions (PIF),’ which focuses on task decomposition and feature assignment. Inspired by human sensory experiences, PIF decouples AVS into two subtasks, correlation learning, and segmentation refinement, via two branches. Correlation learning aims to learn the correspondence between sound and visible individuals and provide the positional prior. Segmentation refinement focuses on fine segmentation. Then we assign different level features to perform the appropriate duties, i.e., using deep features for cross-modal interaction due to their semantic advantages; using rich textures of shallow features to improve segmentation results. Moreover, we propose the recurrent collaboration block to enhance interbranch communication. Experimental results on AVSBench show that our method outperforms related state-of-the-art methods by a large margin (e.g., +6.0% mIoU and +7.6% F-score on the Multi-Source subset). In addition, by purposely boosting subtasks' performance, our approach can serve as a strong baseline for audio-visual segmentation.

Abstract:
Natural Language Video Localization (NLVL) has recently attracted much attention because of its practical significance. However, the existing methods still face the following challenges: 1) When the models learn intra-modal semantic association, the temporal causal interaction information and contextual semantic discriminative information are ignored, resulting in the lack of intra-modal semantic context connection; 2) When learning fusion representations, existing cross-modal interaction modules lack hierarchical attention function to extract inter-modal similarity information and intra-modal self-correlation information, resulting in insufficient cross-modal information interaction; and 3) When the loss function is optimized, the existing models ignore the correlation of causal inference between the start and end boundaries, resulting in inaccurate start and end boundary calibrations. To conquer the above challenges, we proposed a novel NLVL model, called Discriminative Parallel and Hierarchical Attention Network (DPHANet). Specifically, we emphasized the importance of temporal causal interaction information and contextual semantic discriminative information and correspondingly proposed a Discriminative Parallel Attention Encoder (DPAE) module to infer and encode the above critical information. Besides, to overcome the shortcomings of the existing cross-modal interaction modules, we designed a Video-Query Hierarchical Attention (VQHA) module, which can perform cross-modal interaction and intra-modal self-correlation modeling in a hierarchical manner. Furthermore, a novel deviation loss function was proposed to capture the correlation of causal inference between the start and end boundaries and force the model to focus on the continuity and temporal causality in the video. Finally, extensive experiments on three benchmark datasets demonstrated the superiority of our proposed DPHANet model, which has achieved about 1.5% and 3.5% average performance improvement and about 2.5% and 7.5% maximum performance improvement on the Charades-STA and TACoS datasets respectively.

Abstract:
The existing personalized text-to-image generation models face issues such as repeated training and insufficient generalization capabilities. We present an adaptive Style-Guided Diffusion Model (SGDM). When provided with a set of stylistically consistent images and prompts as inputs, SGDM can generate images that align with the prompts while maintaining style consistency with the input images. SGDM first extracts features from the input style image and then combines style features from different depths. Last, style features are injected into the noise generation process of the original Stable Diffusion (SD) model by the style-guided module we propose. This strategy fully leverages the generative and generalization capabilities of the pre-trained text-to-image model to ensure the accuracy of the generated image's content. We present a dataset construction method suitable for style personalized generation tasks of this kind, enabling the trained model to generate stylized images adaptively instead of re-training for each style. We also present an evaluation metric, StySim, to measure the style similarity between two images, and this metric shows that the style personalization capability of SGDM is the best. And metrics such as FID, KID, and CLIPSIM indicate that SGDM maintains good performance in text-to-image generation.

Abstract:
Running deep visual analytics models for real-time applications is challenging for mobile devices. Offloading the computation to an edge server can mitigate the computation bottleneck at the mobile device, but may decrease the analytics performance because the image data must be compressed. We consider a “split computing” system that offloads a part of the deep learning model's computation, and we introduce a novel learned feature compression approach with lightweight computation. We demonstrate the effectiveness of the split computing pipeline in performing computation offloading for the problems of object detection and image classification. Compared to compressing the raw images at the mobile device and running the analytics model on the decompressed images at the server, the proposed feature-compression approach can achieve significantly higher analytics performance at the same bit rate, while reducing the complexity at the mobile device. We further propose a scalable feature compression approach, which facilitates adaptation to network bandwidth dynamics while having comparable performance to the non-scalable approach.

Abstract:
This study proposes an innovative network to fuse infrared and visible images, called HitFusion, which uses the cross-feature transformer module and is compatible with high-level vision tasks. Firstly, existing image fusion approaches primarily concentrate on optimizing human visual perception and image metrics. To enhance the performance of the fusion network in subsequent high-level vision tasks, a segmentation network and a corresponding loss are introduced into the fusion network training process. Specifically, we devise a three-stage training strategy to render the fusion network more suitable for high-level vision tasks, guided by the segmentation network and broadening the fusion network's training set to boost its generalization capability. Secondly, current transformer-based image fusion methods neglect the interaction between visible texture features and infrared contrast features. To tackle this, the cross-feature transformer module is proposed, allowing the fusion network to learn the cross-feature correlation and long-range dependencies between source images, thus achieving fusion results with good complementarity. Finally, a dual-branch fusion network is proposed, based on the distinct characteristics of different images, that targets the extraction of deep features from source images utilizing contrast residual and texture enhancement modules to achieve improved fusion results. Extensive experimental results reveal that our HitFusion method excels in both qualitative and quantitative assessments, while also demonstrating superior performance in addressing high-level vision tasks.

Abstract:
Diffusion models have a strong ability to generate images in the wild. However, under text guidance alone they generate images that are not accurate enough, which makes it very challenging to apply text-guided generative models directly to virtual try-on scenarios. Taking images as the guiding conditions of the diffusion model, this paper proposes a new personalized virtual try-on model (PE-VITON), which uses two stages (shape control and texture guidance) to decouple the clothing attributes. Specifically, the proposed model adaptively matches the clothing to human body parts through the Shape Control Module (SCM) to mitigate the misalignment between the clothing and the human body parts. The semantic information of the input clothing is parsed by the Texture Guided Module (TGM), and the corresponding texture is generated by directional guidance. Therefore, this model can effectively solve the problems of traditional try-on methods: poor restoration of clothing folds, poor generation under complex human postures, blurred clothing edges, and unclear texture styles. Meanwhile, the model can automatically enhance the generated clothing folds and textures according to the human posture, improving the authenticity of the virtual try-on. Qualitative and quantitative experiments are carried out on high-resolution paired and unpaired datasets, and the results show that the proposed model outperforms the state-of-the-art model.

Abstract:
Supplementing product attribute information is a critical step for E-commerce platforms, which further benefits various downstream tasks, including product recommendation, product search, and product knowledge graph construction. Intuitively, the visual information available on e-commerce platforms can effectively function as a primary source for certain product attributes. However, existing works either extract attribute values solely from textual product descriptions or leverage limited visual information (e.g., image features or optical character recognition tokens) to assist extraction, without mining the fine-grained visual cues linked with the products effectively. In this paper, we propose a novel task - Multimodal Joint Slot Filling (MuJo-SF) - that aims to combine multimodal information from both product descriptions and their corresponding product images to jointly fill values into the pre-defined product attribute set. To this end, we develop MAVP, a new dataset with 79 k instances of product description-image pairs. Specifically, we present a strategy to fulfill visualized saliency ascription, which aims to distinguish between text-dependent and image-dependent attributes. For those image-dependent attributes, we annotate the corresponding values from images using distant supervision. Then, we design a model for MuJo-SF, which combines multimodal representations and fills image-dependent and text-dependent attributes separately. Finally, we conduct extensive experiments on MAVP and provide rich results for MuJo-SF, which can be used as baselines to facilitate future research.

Abstract:
Multi-view speech emotion recognition (SER) based on the pre-trained model has gained attention in the last two years, which shows great potential in improving the model performance in speaker-independent scenarios. However, the existing work either relies on various fine-tuning methods or uses excessive feature views with complex fusion strategies, causing the increase of complexity with limited performance benefit. In this paper, we improve multi-view SER based on the pre-trained model from the perspective of a low-level speech feature. Specifically, we forgo fine-tuning the pre-trained model and instead focus on learning effective features hidden in the low-level speech feature mel-scale frequency cepstral coefficient (MFCC). We propose a two-stream pooling channel attention (TsPCA) module to discriminatively weight the channel dimensions of the features derived from MFCC. This module enables inter-channel interaction and learning of emotion sequence information across channels. Furthermore, we design a simple but effective feature view fusion strategy to learn robust representations. In the comparison experiments, our method achieves the WA and UA of 73.97%/74.69% and 74.61%/75.66% on the IEMOCAP dataset, 97.21% and 97.11% on the Emo-DB dataset, 77.08% and 77.34% on the RAVDESS dataset, and 74.38% and 71.43% on the SAVEE dataset. Extensive experiments on the four datasets demonstrate that our method consistently surpasses existing methods and achieves a new State-of-the-Art result.
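
The WA and UA figures above can be read with the following minimal sketch. It assumes the definitions commonly used in speech emotion recognition, which the abstract itself does not spell out: weighted accuracy (WA) as the fraction of all utterances classified correctly, and unweighted accuracy (UA) as the mean of the per-class recalls. The function name and the toy labels are illustrative only.

```python
from collections import defaultdict


def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) for two equal-length sequences of class labels.

    Assumed definitions (not stated in the abstract):
      WA = correct predictions / all samples
      UA = mean over classes of (correct in class / samples in class)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")

    total_per_class = defaultdict(int)
    correct_per_class = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total_per_class[truth] += 1
        if truth == guess:
            correct_per_class[truth] += 1

    wa = sum(correct_per_class.values()) / len(y_true)
    recalls = [correct_per_class[c] / n for c, n in total_per_class.items()]
    ua = sum(recalls) / len(recalls)
    return wa, ua


# Toy, made-up labels: three "angry", one "happy", two "sad" utterances.
y_true = ["angry", "angry", "angry", "happy", "sad", "sad"]
y_pred = ["angry", "angry", "happy", "happy", "sad", "angry"]
wa, ua = weighted_and_unweighted_accuracy(y_true, y_pred)
print(f"WA={wa:.4f} UA={ua:.4f}")
```

UA gives every emotion class the same weight regardless of how many utterances it has, which is why the two numbers differ on imbalanced datasets.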

Abstract:
Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding as a way to address the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as “specialized images”. This conceptual shift allows PCExpert to leverage knowledge derived from the large-scale image modality in a more direct and deeper manner, by extensively sharing parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with an additional pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the art in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) already closely approximates the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective representation capability.

Abstract:
Dental plaque segmentation is crucial for maintaining oral health. However, accurately segmenting dental plaque in unconstrained environments can be challenging due to its low contrast and high variability in appearance. While existing transformer-based networks rely on attention mechanisms for each pixel, they do not take into account the relationships between neighboring pixels. Consequently, feature extraction is limited, making it difficult to achieve accurate segmentation of low-contrast images. To address this issue, we propose a simple yet efficient cluster center transformer that improves dental plaque segmentation by clustering image pixels based on multiple levels of feature maps' intensity and texture information. By grouping similar pixels into regions, the proposed method enables the transformers to focus on the local contour and edge around the teeth regions, adapting to the low contrast and high variability of plaque appearance, leading to more accurate and efficient segmentation of dental plaque in dental images. Additionally, we designed Multiple Granularity Perceptions using a pyramid fusion mechanism to capture multiple scales of vision features, thereby enhancing the low-contrast vision features. The proposed method can benefit the dental diagnosis and treatment planning process by improving the accuracy and efficiency of dental plaque segmentation. Our proposed method achieved state-of-the-art results on the dental plaque dataset (Li et al., 2020), with intersection over union (IoU) of 60.91% and pixel accuracy (PA) of 76.81%, all of which were the highest among all methods, demonstrating its effectiveness in plaque segmentation in unconstrained environments.
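
As a reading aid for the IoU and PA figures above, here is a minimal sketch of the two metrics for a single binary mask. It assumes the common definitions (IoU as intersection over union of the foreground, pixel accuracy as the fraction of pixels labeled correctly); the paper may use a different PA variant, such as per-class pixel accuracy, so treat this as illustrative only. The function name and the toy masks are made up.

```python
def iou_and_pixel_accuracy(pred, target):
    """Compute (IoU, pixel accuracy) for two binary masks given as 2D lists of 0/1.

    Assumed definitions:
      IoU = TP / (TP + FP + FN)      (foreground class only)
      PA  = (TP + TN) / all pixels
    """
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1

    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("masks must contain at least one pixel")
    union = tp + fp + fn
    iou = tp / union if union else 1.0  # both masks empty: treat as perfect match
    pixel_accuracy = (tp + tn) / total
    return iou, pixel_accuracy


# Toy 2x3 masks (1 = plaque, 0 = background), invented for illustration.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
iou, pa = iou_and_pixel_accuracy(pred_mask, true_mask)
print(f"IoU={iou:.4f} PA={pa:.4f}")
```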

Abstract:
With the advancement of wireless technology, the fifth-generation mobile communication network (5G) can provide exceptionally high bandwidth to support high-quality video streaming services. Nevertheless, this network exhibits substantial fluctuations, posing a significant challenge in ensuring the reliability of video streaming services. This research introduces a novel algorithm, the Multi-type data perception-based Meta-learning-enabled adaptive Video Streaming algorithm (MMVS), designed to adapt to diverse network conditions, encompassing 3G and mmWave 5G networks. The proposed algorithm integrates the proximal policy optimization technique with the meta-learning framework to cope with the gradient estimation noise caused by network fluctuation. To further improve the robustness of the algorithm, MMVS introduces meta advantage normalization. Additionally, MMVS treats network information as multiple types of input data, which enables distinct network structures to be defined precisely for perceiving each type accurately. The experimental results on network trace datasets from real-world scenarios illustrate that MMVS delivers an additional 6% average QoE in mmWave 5G networks and outperforms the representative benchmarks in six pairs of heterogeneous networks and user preferences.

Abstract:
To facilitate human gait recognition, this paper proposes a new frontal-view gait recognition method using gait dynamics and deep learning. Rather than adopting lateral-view parameters as gait features in the literature, we employ improved frontal-view features and classification methods to avoid the recognition rate drops due to the complicated surveillance environment. Specifically, we characterize the binary walking silhouettes with three different kinds of frontal-view gait features, including kinematic features, spatial ratio features and area features. In addition, we capture the gait dynamics underlying the time-varying gait features to reflect temporal dynamics information of human walking. Furthermore, we incorporate the deep feature learning information into the recognition procedure to take advantage of the deep learning technique. To obtain the optimal recognition accuracy and robustness performance against walking condition variations, we calculate the similarity between the appearing test gait dynamics and the trained gait dynamics, and propose an error-based feature fusion scheme for gait recognition. To validate the efficacy of the proposed method, we conduct experiments on published gait databases by comparing with other existing gait recognition methods.

Abstract:
Hazy images captured in adverse scenarios with a scattering medium (i.e., haze, fog, or smoke) suffer from degraded visibility. Inevitably, these images are further degraded by noise introduced in real-world imaging. Most existing hazy image enhancement methods perform image dehazing and denoising stage by stage, with the undesirable result that the estimation error of the former stage is propagated and amplified in the latter stage, e.g., noise amplification after dehazing. To address this inconsistent degradation, we present an Unsupervised Unified Image Dehazing and Denoising Network, U2D2Net, which removes the haze and suppresses the noise simultaneously for a single hazy image. U2D2Net mainly comprises an unsupervised dehazing module, an unsupervised denoising module, and a region-similarity fusion strategy. Specifically, we propose an unsupervised transmission-aware dehazing module to restore visibility and suppress depth-dependent noise propagation in the dehazing module. Besides, we design an unsupervised network with a Mean/Max Sub-Sampler in the denoising module. To exploit the correlation and complementarity between the outputs of these two modules, a region-similarity fusion strategy is developed to compute the final result. Extensive experiments on both synthetic and real-world datasets illustrate that U2D2Net outperforms other state-of-the-art dehazing and denoising methods in terms of PSNR, SSIM, and subjective visual effects.
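
For readers unfamiliar with the PSNR metric cited above, the following minimal sketch computes it from the standard definition, PSNR = 10 log10(peak^2 / MSE). It assumes 8-bit grayscale images stored as 2D lists with a peak value of 255; the function name and the toy pixel values are illustrative and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D images.

    Returns math.inf when the images are identical (zero mean squared error).
    """
    ref_pixels = [value for row in reference for value in row]
    dis_pixels = [value for row in distorted for value in row]
    if len(ref_pixels) != len(dis_pixels) or not ref_pixels:
        raise ValueError("images must be non-empty and have the same size")

    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, dis_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy 2x2 images: every pixel is off by exactly one gray level, so MSE = 1.
clean = [[100, 110],
         [120, 130]]
restored = [[101, 109],
            [121, 129]]
score = psnr(clean, restored)
print(f"PSNR = {score:.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 3 dB corresponds to roughly halving the error.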

Abstract:
Reduced-reference light field image (LFI) quality assessment (RR LFIQA) automatically assesses image quality when only partial information about the reference LFI is available. Existing RR LFIQA methods have difficulty extracting effective RR information and perceptual features to represent the LFI quality. In this article, we propose an RR LFIQA model based on pseudo LFI (PLFI) and four-dimensional (4D) wavelet transform. To extract RR information related to LFI perceptual quality, a PLFI is created as the RR information of the LFI using a view synthesis algorithm. Considering the high-dimensional characteristics of the PLFI, a 4D wavelet transform is used to decompose the original and distorted PLFIs. The 4D wavelet transform essentially performs a continuous 1D wavelet transform for the 4D signal to enable the local 4D structure of the PLFIs to be characterized effectively in the 4D wavelet domain. To further improve the performance of the proposed method, a novel spatial-angular weighting strategy is proposed to describe the importance of each location for quality evaluation. Experimental results on four benchmark datasets show that the proposed model performs better than the representative 2D IQA and LFIQA models.
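
The idea of building a 4D wavelet transform from 1D transforms can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it reads the description above as applying a 1D transform successively along each of the four dimensions (typically two angular and two spatial for a light field), uses a single-level orthonormal Haar filter as a stand-in for whatever wavelet the authors chose, and stores the 4D signal as a flat row-major list. All names are hypothetical.

```python
import math
from itertools import product


def haar_along_axis(data, shape, axis):
    """One level of orthonormal Haar analysis along one axis of a flat row-major array.

    In the output, the first half along `axis` holds low-pass coefficients and
    the second half holds high-pass coefficients.
    """
    n = shape[axis]
    if n % 2:
        raise ValueError("axis length must be even")
    # Row-major strides: the last axis varies fastest.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]

    out = [0.0] * len(data)
    scale = 1.0 / math.sqrt(2.0)
    half = n // 2
    # Visit every 1D line that runs along `axis`.
    other = [range(1) if i == axis else range(s) for i, s in enumerate(shape)]
    for index in product(*other):
        base = sum(i * st for i, st in zip(index, strides))
        for k in range(half):
            a = data[base + (2 * k) * strides[axis]]
            b = data[base + (2 * k + 1) * strides[axis]]
            out[base + k * strides[axis]] = (a + b) * scale
            out[base + (half + k) * strides[axis]] = (a - b) * scale
    return out


def haar_4d(data, shape):
    """Apply the 1D Haar step successively along all four axes."""
    if len(shape) != 4:
        raise ValueError("expected a 4D shape")
    for axis in range(4):
        data = haar_along_axis(data, shape, axis)
    return data


# Toy 2x2x2x2 signal with values 0..15 (made up for illustration).
shape = (2, 2, 2, 2)
signal = [float(i) for i in range(16)]
coeffs = haar_4d(signal, shape)
print(coeffs[0])  # all-low-pass coefficient
```

Because each 1D step is orthonormal, the separable 4D transform preserves signal energy, which makes coefficient magnitudes comparable across sub-bands.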

Abstract:
Recently, image enhancement approaches have made impressive progress. However, most methods are still based on supervised learning, which requires plenty of paired data. Meanwhile, owing to the complex illumination conditions in real-world scenarios, methods trained on synthetic images cannot restore details in extremely dark or bright areas and produce exposure errors. Traditional losses that treat all pixels equally during training also produce blurry edges in the result. To handle these problems, in this article, we present an effective semi-supervised framework for severely underexposed image enhancement. Our network consists of a supervised branch and an unsupervised branch, which share weights and can make full use of paired data and plenty of unpaired data. Meanwhile, a multi-exposure fusion module is designed to adaptively fuse the corrected images to address the low contrast and color bias issues that occur in some extreme situations. Moreover, we propose a supervised context attention module that uses edge information as supervision to better recover fine image details. Extensive experiments show that the proposed method outperforms state-of-the-art approaches in enhancing underexposed images.

Abstract:
Blind image quality assessment (BIQA) has received increasing attention in the past decades. However, BIQA for night-time images suffering from diverse authentic degradations remains inadequately researched. Since the intrinsic content degradations of night-time images are highly related to the illumination, how to use the connection between content and illumination to enhance the feature representation ability is the key issue in designing BIQA methods for night-time images. In this article, we first construct an ultra-high-definition night-time image dataset (UHD-NID) with high image resolution and abundant parameter settings. UHD-NID contains 1600 images with a high resolution of 5616 × 3744, and each group of images contains ten exposure levels. Then, we conduct a subjective assessment and analyze the subjective data to obtain a mean opinion score for each image in UHD-NID. To enhance the feature representation ability in content and illumination, we propose a Progressive Bidirectional Feature Extraction and Enhancement Network (PBFEE-Net). In addition, we use a decomposition network to decompose the input image into reflectance and illumination, which facilitates feature extraction to some extent. The experimental results show that our proposed method achieves superior performance in evaluating the quality of night-time images.

Abstract:
In this study, we address the challenge of estimating 3D body pose, shape, and depth relationships from single RGB images in crowded scenes. The difficulty lies in the limited availability of in-the-wild training samples, which feature densely populated scenes. To mitigate this issue, we introduce a synthesis-based approach that fuses multiple human samples into a single composite scene. Our innovative scene-aware blending technique maintains human-scene consistency by positioning individuals within plausible locations and adjusting their scales to conform to 3D settings. Furthermore, our method enables flexible per-subject occlusion management during the blending process, bolstering the robustness of 3D human body representations through a novel de-occlusion training scheme. We present a one-stage model, CBD, designed to learn monocular regression of 3D people in crowds by leveraging blending and de-occlusion techniques. Our quantitative and qualitative evaluations on four benchmark datasets reveal that CBD surpasses existing state-of-the-art approaches in terms of 3D human pose and mesh regression accuracy, thereby establishing it as a promising solution for monocular 3D human mesh recovery in densely populated scenes.

Abstract:
With the rapid development of wearable cameras, it is now feasible to considerably increase the collection of egocentric video for first-person visual perception. However, the development is hindered by a shortage of multi-modal egocentric activity datasets. Furthermore, the catastrophic forgetting problem of multimodal continual activity learning, as a branch of continual learning, has not been thoroughly explored, which makes accumulating a larger collection of multi-modal activity data more urgent. To address this shortage, we propose a multi-modal egocentric activity dataset for continual activity learning named UESTC-MMEA-CL in this paper. The dataset is collected using our self-developed glasses with a first-person camera and wearable sensors, and it contains synchronized data of video, accelerometers, and gyroscopes for 32 types of daily activities performed by 10 participants who wore our glasses. Statistical analysis of the sensor data is given to show the auxiliary effects of activity recognition. We report the results of egocentric activity recognition of three modalities (RGB, acceleration, and gyroscope) separately and jointly on a base network architecture. We thoroughly evaluated four baseline methods with different multimodal combinations to explore the catastrophic forgetting in continual learning on UESTC-MMEA-CL. We hope that the UESTC-MMEA-CL dataset can act as a facilitator for future studies on continual learning for first-person activity recognition in wearable applications. You can download preliminary data from https://ivipclab.github.io/publication_uestc-mmea-cl/mmea-cl. The data is currently used to solve the problems of multimodal continual learning of activities.

Abstract:
Illustrating a multi-sentence story with visual content is a significant challenge in multimedia research. While previous works have focused on sequential story-to-visual representations at the image level or representing a single sentence with a video clip, illustrating a long multi-sentence story with coherent videos remains an under-explored area. In this paper, we propose the task of video-based story illustration that focuses on the goal of visually illustrating a story with retrieved video clips. To support this task, we first create a large-scale dataset of coherent video stories in each sample, consisting of 85 K narrative stories with 60 pairs of consistent clips and texts. We then propose the Story Context-Enhanced Model, which leverages local and global contextual information within the story, inspired by sequence modeling in language understanding. Through comprehensive quantitative experiments, we demonstrate the effectiveness of our baseline model. In addition, qualitative results and detailed user studies reveal that our method can retrieve coherent video sequences from stories.

Abstract:
To distribute the storage and computation load caused by the growing capacity of deep neural networks (DNNs), the collaborative intelligence (CI) framework has been proposed, in which a deep model is split and executed on two distributed devices. Intermediate features must be transferred from the front end to the back end to perform distributed inference, so the transmission process is the bottleneck that influences inference efficiency in terms of accuracy and delay. Specifically, for a bandwidth-limited human-in-the-loop visual analysis task, feature compression approaches need to be explored to reduce the volume of data to be transmitted, so as to achieve low transmission delay while maintaining analysis performance and human perception ability. In this article, the redundancy of intermediate features at both the spatial and statistical levels is first analyzed. A mathematical expression for the goal of feature compression is formulated, based on which a low-rate feature compression approach based on two-level redundancy removal is proposed. For the front-end device, an information squeezing (IS) module is developed to squeeze the key information of the input image and inject it into a low-resolution image. Then a backbone network is split into two parts with respect to the application demands of CI, and the parts can be deployed at the front and back ends correspondingly. With a specifically designed objective function, the IS module and the partitioned backbone network are optimized collaboratively to reduce the two-level redundancy, thus compressing the intermediate features. A generative adversarial network (GAN)-based restoration module is proposed to recover an image with the original resolution from the compressed features, to satisfy human perception. Comprehensive experiments are conducted to validate the efficiency of the proposed method.

Abstract:
Voxels are one of the common structural representations of 3D point clouds. Because point clouds generated by light detection and ranging (LiDAR) are sparse, foreground and background voxels are extremely imbalanced. This imbalance decreases the accuracy of 3D object detection and has a negative effect on intelligent driving safety. To overcome this problem, we present a saliency prediction based 3D object detector, SP-Det, in this article. Although foreground voxels carry sufficient object features, it is difficult to localize the foreground region in a voxel space dominated by the much larger background region. We therefore design an auxiliary learning task, saliency prediction (SP), which helps the 3D detector identify the foreground region. The SP task uses label diffusion to alleviate the label imbalance, which reduces the difficulty of learning saliency in the voxel and bird's eye view (BEV) spaces. Then, to strengthen feature interaction in the sparse foreground region, we design a saliency fusion (SF) module that fuses the results learned in the SP task. It uses the voxel and BEV saliency maps as progressive attention to suppress redundant features from the background region. To aggregate more foreground features inside the 3D and BEV regions of interest (RoIs), we design hybrid grid map based RoI pooling (Hybrid-RoI pooling). Experiments are conducted on the STF dataset, where adverse weather increases the sparsity of LiDAR point clouds and thus the difficulty of object detection. SP-Det identifies and leverages the foreground region and achieves better performance than current methods. Hence, we believe that SP-Det benefits LiDAR-based 3D scene understanding in adverse weather.

Abstract:
Varifocal multiview (VFMV) images are dense views that focus on variable focal planes. Thus, VFMV images are highly redundant in the angular, spatial and focal dimensions. In this article, the redundancies of VFMV images are analyzed and represented by full parallaxes and focal inconsistency. To exploit these distinctive redundancies, we propose a hierarchical independent coding scheme based on angular-focal joint prediction. The scheme is constructed by hierarchical independent prediction structure (HIPS) and angular-focal joint prediction (AFJP). The HIPS separates all views into several independent subdivisions and assigns different hierarchies inside each subdivision, which enhances random access capability and scalability. The AFJP conducts motion estimation and focal approximation simultaneously to predict parallaxes and focal inconsistency. Therefore, the redundancies in the angular and focal dimensions can be exploited by the proposed coding scheme. We construct a VFMV dataset with 10 test sequences for different acquisition methods. The experimental results on these test sequences demonstrate that the proposed scheme outperforms all comparison schemes in objective quality, subjective quality and random access capability. Specifically, the proposed coding scheme achieves up to 2.661 dB PSNR gains and 52.817% bitrate savings compared with the HEVC random access benchmark scheme.

Abstract:
Viewers of 360-degree videos are provided with both visual modality to characterize their surrounding views and audio modality to indicate the sound direction. Though both modalities are important for saliency prediction, little work has been done by jointly exploiting them, which is mainly due to the lack of audio-visual saliency datasets and insufficient exploitation of the multi-modality. In this article, we first construct an audio-visual saliency dataset with 57 360-degree videos watched by 63 viewers. Through a deep analysis of the constructed dataset, we find that the human gaze can be attracted by the auditory cues, resulting in a more concentrated saliency map if the sound source's location is further provided. To jointly exploit the visual and audio features and their correlation, we further design a saliency prediction network for 360-degree videos (SVGC-AVA) based on spherical vector-based graph convolution and audio-visual attention. The proposed spherical vector-based graph convolution can process visual and audio features directly in the sphere domain, thus avoiding projection distortion incurred by traditional CNN-based predictors. In addition, the audio-visual attention scheme explores self-modal and cross-modal correlation for both modalities, which are further hierarchically processed with the U-Net's multi-scale structure of SVGC-AVA. Evaluations on both our and public datasets validate that SVGC-AVA can achieve higher prediction accuracy, both qualitatively and subjectively.

Abstract:
Dense video captioning aims to detect all events of an uncropped video and generate corresponding textual captions for each event. Multi-modal information is essential to improve the performance of this task, but the existing methods mainly rely on the single visual or dual audio-visual modal input, while completely ignoring the text modal input (subtitle). Since the text data has a similar data representation as video caption words, it is conducive to the performance improvement of video captioning. In this article, we propose a novel framework, called the multi-stage fusion transformer network (MS-FTN), to realize multi-modal dense video captioning by fusing the text, the audio, and the visual features in stages. We present a multi-stage feature fusion encoder that first fuses audio and visual modalities at a lower level and then fuses them with a global-shared text representation at a higher level to generate a set of multi-modal complementary context features. In addition, an anchor-free event proposal module is proposed to efficiently generate a set of event proposals without the complex anchor calculation. Extensive experiments on the subsets of the ActivityNet Captions dataset show that our proposed MS-FTN achieves superior performance and efficient computation. Moreover, the ablation studies demonstrate that the global-shared text representation is more suitable for multi-modal dense video captioning.

Abstract:
Near-lossless compression of point clouds is suitable for the application scenarios with low distortion tolerance and certain requirements on the rate. Near-lossless attribute compression usually adopts a level-of-detail structure, where the dependencies between the layers make it possible to improve the rate-distortion (R-D) performance by using different quantization parameters for different layers. In this work, a theoretical analysis of the dependencies between adjacent layers is carried out, based on which the dependent Distortion-Quantization and Rate-Quantization models are established for point cloud attribute compression. Then an algorithm for quantization parameter cascading based on R-D optimization is proposed and implemented for near-lossless compression of point cloud attributes. The experimental results show that the proposed method has a superior performance gain compared to state-of-the-art for the Hausdorff R-D performance. At the same time, the proposed method improves subjective quality and is well adapted to various categories of point clouds.

Affiliations: School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China; School of Artificial Intelligence, Xidian University, Xi'an, China; Faculty of Information Technology, the Engineering Research Center of Intelligent Perception and Autonomous Control, Ministry of Education, the Beijing Laboratory of Smart Environmental Protection, the Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing, China

Abstract:
Quality assessment for stitched panoramic images (SPIQA) is of great significance for optimizing stitching algorithms. However, this task is much more challenging than the traditional IQA task because of the high resolution of stitched panoramic images and the particularity and complexity of stitching distortions. For this task, we propose an effective method based on patch registration and bidimensional feature aggregation (PRBFA). First, inspired by the attention mechanism of the human visual system and the limited range of human vision, a soft patch segmentation and selection method is presented to determine the key patches in panoramic images that participate in the following patch matching and feature alignment stages, achieving patch registration between the panoramic image and the corresponding constituent images. Further, to fully simulate the human visual perception process from local viewport to panorama, feature exploration is performed successively from local to global, which also adapts to the complexity of the distortions in stitched panoramic images. For performance verification, extensive experiments are conducted on the publicly released SPIQA database, and the results demonstrate the superior performance of the PRBFA method.

Abstract:
Due to the increased demand for 360° images (e.g., virtual reality), estimating 360° depths via deep learning has drawn attention recently. However, all previous studies share the same fundamental limitation: a lack of data. To address the issue of data insufficiency, self-supervised learning and uncertainty-aware learning based on a mixture density network (MDN) have been actively studied and have achieved great success on various tasks. Unfortunately, under the harsh training environment of 360° depth estimation tasks (e.g., a large field-of-view, distortion), we observe that the practical difficulties of self-supervised learning and MDN-based uncertainty-aware learning become a critical factor degrading the depth results. In this article, we propose an adversarial mixture density network (AMDN) and uncertainty-based joint learning to improve the depth quality by properly addressing the data insufficiency issue. For the AMDN, the architectures and objective functions of the MDN are redesigned in an adversarial manner. For uncertainty-based joint supervised and self-supervised learning, the negative effects of the self-supervised learning of 360° depths are filtered out based on the epistemic uncertainties of the AMDN. Therefore, only the positive effects of self-supervised learning can be realized. Through extensive experiments, we demonstrate that the proposed approaches achieve much more accurate depths than very recent studies on various datasets. Moreover, the proposed approaches also yield sophisticated uncertainties in a single forward pass, which previous studies could not.

Abstract:
Reversible data hiding-based contrast enhancement (RDHCE) can be used in contrast enhancement for medical images, and it has been a popular research topic in recent years. However, the existing RDHCE methods suffer from the problem of inaccurate segmentation of the region of interest (ROI) in medical images, which can impact the contrast enhancement effect of the images. Moreover, some methods face limitations in their universality for ROI histograms with few empty bins on both sides, which results in unsatisfactory embedding capacity and contrast enhancement effect. To solve these problems, this study proposes an improved RDHCE method for medical images. The proposed method uses the UNet3+ network model, which makes the segmented ROI histograms more consistent with the subjective judgment of doctors compared to those obtained by traditional segmentation approaches. In addition, a multi-group stretching method is proposed to address the limitation of histogram expansion caused by the empty bins on both histogram sides, enabling adaptation to different ROI histograms with varying gray distributions. Compared to state-of-the-art RDHCE methods, the proposed method offers better generalizability, superior contrast enhancement performance and a larger ROI embedding capacity. It can greatly improve the visual quality of medical images in the field of medical imaging and aid doctors in making more accurate diagnoses.

Abstract:
Image quality assessment is a fundamental problem in the field of image processing, and due to the lack of reference images in most practical scenarios, no-reference image quality assessment (NR-IQA) has gained increasing attention recently. With the development of deep learning technology, many deep neural network-based NR-IQA methods have been developed, which try to learn the image quality based on the understanding of database information. Currently, Transformer has achieved remarkable progress in various vision tasks. Since the characteristics of the attention mechanism in Transformer fit the global perceptual impact of artifacts perceived by a human, Transformer is well suited for image quality assessment tasks. In this paper, we propose a Transformer-based NR-IQA model using a predicted objective error map and a perceptual quality token. Specifically, we first generate the predicted error map by pre-training a model consisting of a Transformer encoder and decoder, in which the objective difference between the distorted and the reference images is used as supervision. Then, we freeze the parameters of the pre-trained model and design another branch using the vision Transformer to extract the perceptual quality token for feature fusion with the predicted error map. Finally, the fused features are regressed to the final image quality score. Extensive experiments have shown that our proposed method outperforms the state-of-the-art methods on both authentic and synthetic image datasets. Moreover, the attention map extracted by the perceptual quality token conforms to the characteristics of the human visual system.

Abstract:
Blind Image Quality Assessment (BIQA) is susceptible to poor transferability when a distribution shift occurs, e.g., from synthetic degradation to authentic degradation. To mitigate this, some studies have attempted to design unsupervised domain adaptation (UDA) based schemes for BIQA, which intend to eliminate the domain shift through adversarial-based feature alignment. However, because of the global average pooling operation, the feature alignment usually takes place in the low-frequency space of features. This ignores the transferable perception knowledge in other frequency components and leads to a sub-optimal solution for the UDA of BIQA. To overcome this, from a novel frequency perspective, we propose an effective alignment strategy, i.e., Frequency Alignment (dubbed FreqAlign), to excavate the perception-oriented transferability of BIQA in the frequency space. Concretely, we study which frequency components of features are more suitable for perception-oriented alignment. Based on this, we propose to improve the perception-oriented transferability of BIQA by performing feature frequency decomposition and selecting the frequency components that contain the most transferable perception knowledge for alignment. To achieve a stable and effective frequency selection, we further propose frequency movement with a sliding window to find the optimal frequencies for alignment, which is composed of three strategies, i.e., warm-up with pre-training, frequency movement-based selection, and perturbation-based finetuning. Extensive experiments under different domain adaptation settings of BIQA have validated the effectiveness of our proposed method.

Abstract:
Reversible data hiding in encrypted images (RDHEI) has gained significant popularity among security and privacy researchers as well as users, because of its features such as reversibility, embedding capacity (EC), and security. To enlarge the EC while ensuring the complete reversibility and security, we propose a bit-plane based RDHEI method based on multi-level blocking with quad-tree. The proposed method uses median edge detector (MED) as well as difference predictor to transform the original input image into a low-magnitude difference matrix. The difference matrix is then encoded by first employing a novel quad-tree based bit-plane representation strategy to exploit the intra-bit plane correlation and subsequently by inter bit-plane redundancy mitigation strategy to exploit inter bit-plane level correlation, for significantly condensing their size. Thus, a bigger room is reserved inside the cover image for embedding, so that a large amount of secret data can be hidden while ensuring the complete reversibility of the image. Experimental results validate the superiority of the proposed method over the state-of-the-art methods.
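
As a rough illustration of the prediction step mentioned above, the following sketch applies the standard median edge detector (MED) rule, as used in JPEG-LS, to turn an image into a low-magnitude residual matrix and back. It is only a sketch: the border handling, the function names, and the plain nested-list image format are assumptions made here, and the abstract's additional difference predictor, quad-tree coding, and encryption steps are not modelled.

```python
def med_predict(left, above, upper_left):
    """Standard MED prediction from the three causal neighbours of a pixel."""
    if upper_left >= max(left, above):
        return min(left, above)
    if upper_left <= min(left, above):
        return max(left, above)
    return left + above - upper_left


def _predict(img, y, x):
    # Border handling below is an assumption of this sketch, not taken from the paper.
    if y == 0 and x == 0:
        return 0                      # first pixel is kept as-is
    if y == 0:
        return img[y][x - 1]          # first row: predict from the left pixel
    if x == 0:
        return img[y - 1][x]          # first column: predict from the pixel above
    return med_predict(img[y][x - 1], img[y - 1][x], img[y - 1][x - 1])


def med_residuals(img):
    """Return the prediction-error matrix (original minus MED prediction)."""
    return [[img[y][x] - _predict(img, y, x) for x in range(len(img[0]))]
            for y in range(len(img))]


def med_reconstruct(res):
    """Invert med_residuals in raster order, so every needed neighbour is already known."""
    h, w = len(res), len(res[0])
    img = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            img[y][x] = res[y][x] + _predict(img, y, x)
    return img
```

Because reconstruction visits pixels in the same raster order as prediction, the transform is exactly invertible, which is the property a reversible scheme relies on.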

Abstract:
Video memorability measures the degree to which a video is remembered by different viewers and has shown great potential in various contexts, including advertising, education, and health care. While extensive research has been conducted on image memorability, the study of video memorability is still in its early stages. Existing methods in this field primarily focus on coarse-grained spatial feature representation and decision fusion strategies, overlooking the crucial interactions between spatial and temporal domains. Therefore, we propose an end-to-end collaborative spatial-temporal network called VMemNet, which incorporates targeted attention mechanisms and intermediate fusion strategies. This enables VMemNet to capture the intricate relationships between spatial and temporal information and uncover more elements of memorability within video visual features. VMemNet integrates spatially and semantically guided attention modules into a dual-stream network architecture, allowing it to simultaneously capture static local cues and dynamic global cues in videos. Specifically, the spatial attention module is used to aggregate more memorable elements from spatial locations, and the semantically guided attention module is used to achieve semantic alignment and intermediate fusion of the local and global cues. In addition, two types of loss functions with complementary decision rules are associated with the corresponding attention modules to guide the training process of the proposed network. Experimental results obtained on a publicly available dataset verify that the proposed VMemNet approach outperforms all current single- and multi-modal methods in terms of video memorability prediction.

Abstract:
The exceptionally high bandwidth requirement for delivering high-quality live 360° video poses a significant challenge to current network capacity. Mitigating such bandwidth starvation necessitates accurate field-of-view (FoV) prediction to focus limited resources on the viewer's area of interest. However, FoV prediction for live 360° streaming can be complex due to the time-sensitive nature of live content and the limited knowledge available for model training. Our paper introduces a novel framework, CoLive, for predicting the FoV in 360° live streaming. CoLive accelerates FoV prediction by offloading model training from viewers to the edge and migrating saliency feature detection to the server side. Observations on user clustering of viewing behaviors further motivate us to propose a novel dynamic clustered learning algorithm. The algorithm dynamically groups users according to their model update gradients and enables them to train a shared model that better suits their viewing preferences. We conduct extensive experiments on the public 360° video datasets and demonstrate that CoLive outperforms state-of-the-art solutions in terms of prediction performance and bandwidth savings.

Abstract:
Graph convolutional network-based methods have recently shown promising performance in skeleton-based data processing. However, these methods have two critical issues in skeleton-based motion prediction tasks. First, graph modeling of motion poses is based on a fixed graph that follows the physical connections of human joints and ignores the exploration of deep implicit information based on human dynamic kinetics. Second, existing methods usually use motion information in a single semantic space to model whole motion sequences, underestimating the value of diverse semantic patterns for improving the modeling ability. To address the first issue, we propose the Attention-based Dynamic Graph Convolution method, which tries to capture implicit semantic information dynamically. To address the second issue, we propose the Kinematic-based Semantics Aggregation Block (KSAB), which combines various semantic features from four semantic perspectives to enrich the motion representation. Integrating the above two designs, we propose a novel Multi-Semantics Aggregation Network (MANet), resulting in more comprehensive feature extraction in dynamic implicit semantics learning to enhance motion prediction. Extensive experiments are conducted to validate the effectiveness of MANet, which outperforms state-of-the-art methods by 10.9%, 6.6%, and 19.6% in terms of MPJPE for motion prediction on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively.
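
For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is commonly computed as the average Euclidean distance between predicted and ground-truth joint positions. The sketch below assumes poses are given as lists of frames, each a list of `(x, y, z)` joint coordinates; the function name and data layout are illustrative assumptions, and any alignment or unit conventions used in the paper are not reproduced.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over all joints and frames.

    pred, target: list of frames; each frame is a list of (x, y, z) tuples.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)   # per-joint position error
            count += 1
    return total / count


# One frame with two joints whose errors are 3 and 4 units.
pred = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]
error = mpjpe(pred, target)
```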

Abstract:
In adaptive video streaming, the design of an adaptive bitrate (ABR) strategy is critical for the quality-of-experience (QoE) perceived by users. Although current learning-based ABR algorithms achieve state-of-the-art performance for users whose QoE metric setting matches the one used for training, they may generalize poorly to other users with different QoE preferences. Besides, how to quantitatively characterize the distinct QoE preference of a user has not yet been extensively studied. In this paper, we propose STEER, a successor feature-based transfer reinforcement learning framework for quickly learning ABR strategies under heterogeneous QoE preferences. Specifically, we first develop a QoE preference analysis scheme to infer the personal QoE preference of a single user based on the user's actual viewing history. We then formulate the personalized QoE maximization problem as a reinforcement learning (RL) task, which optimizes the ABR strategy to maximize the overall QoE perceived by the user. Further, we model the QoE maximization problem for multiple users with heterogeneous QoE preferences as a multi-task RL problem, with each task distinguished by the user-specific QoE preference. To efficiently address this problem, the proposed STEER solves each RL-based ABR task by learning its optimal successor feature (SF) function, which can be exploited as shared knowledge across tasks to facilitate the transfer between tasks. With SF functions, STEER can quickly evaluate the optimal policies of previously learned tasks on a new task, and further use the generalized policy improvement operation to obtain a jumpstart policy. Both theoretically and empirically, we show that this jumpstart policy is a good initial policy with a performance guarantee for better generalization in the new task, and can also lead to faster convergence to the optimal policy of the new task.
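
The evaluation-and-improvement step described above can be sketched with the usual successor-feature formulation, in which the value of a previously learned policy on a new task is the dot product of its successor features with the new task's preference weights, and generalized policy improvement (GPI) picks the action with the highest value across all stored policies. The sketch below is a minimal illustration under that assumption; the two-dimensional features, the numbers, and the function name are hypothetical and are not taken from STEER.

```python
def gpi_action(successor_features, w):
    """Generalized policy improvement at a single state.

    successor_features[i][a] is the successor-feature vector of stored
    policy i for action a at the current state; w holds the preference
    weights of the new task. Returns (best_action, best_value).
    """
    best_action, best_value = None, float("-inf")
    for per_policy in successor_features:
        for action, psi in enumerate(per_policy):
            q = sum(p * wi for p, wi in zip(psi, w))   # Q = psi . w
            if q > best_value:
                best_action, best_value = action, q
    return best_action, best_value


# Two stored policies, two actions, two hypothetical QoE feature dimensions.
sfs = [
    [[0.0, 1.0], [1.0, 0.0]],   # policy 0
    [[0.5, 0.5], [2.0, 0.0]],   # policy 1
]
choice_a = gpi_action(sfs, [1.0, 0.0])   # user who weights feature 0 only
choice_b = gpi_action(sfs, [0.0, 1.0])   # user who weights feature 1 only
```

No retraining is needed to switch users in this sketch: only the weight vector changes, which is what makes the stored successor features reusable across tasks.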

Abstract:
Existing point-of-interest (POI) recommendation methods only show the direct recommendation results and lack the proper reasons for recommendation. In recent years, explainable recommendation has become an increasingly important subfield in recommendation systems. The aim of explainable recommendation is to provide a reason why an item is recommended to a user. In this way, it helps to improve the transparency, persuasiveness and user satisfaction of recommendation systems. The explainable recommendation should indicate users' preferences for POIs, such as the category and the price. In addition, to increase the diversity of the results, we take emotional intensity into account in our model to generate more vivid reasons. To this end, we propose a hierarchical attention-based transformer model to generate reasons with specific topics and different emotions. With a hierarchical attention mechanism, we can capture the word-level and attribute-level preferences of users. In addition, we also learn the latent representation of the emotion score to generate diverse recommendation reasons. We evaluate the proposed model on a new real-world dataset collected from three travel service websites. The experimental results demonstrate that our method outperforms the related approaches for reason generation.

Abstract:
Multi-focus image fusion is a technique that fuses images focused on different depth ranges to generate an all-in-focus image. Existing deep learning approaches to multi-focus image fusion can be categorized as end-to-end methods and decision map based methods. End-to-end methods can generate natural fusion near the focus-defocus boundaries (FDB), but the output is often inconsistent with the input in the areas far from the boundaries (FFB). In contrast, decision map based methods can preserve the original images in the FFB areas, but often generate artifacts near the FDB. In this article, we propose a dual-branch network for multi-focus image fusion (DB-MFIF) to exploit the best of both worlds, achieving better results in both FDB and FFB areas, i.e., naturally sharper FDB areas and FFB areas that are more consistent with the inputs. In our DB-MFIF, an end-to-end branch and a decision map based branch are proposed to mutually assist each other. To this end, two map-based loss functions are also proposed. Experiments show that our method surpasses existing algorithms on multiple datasets, both qualitatively and quantitatively, and achieves state-of-the-art performance.
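
To make the decision map idea concrete, the sketch below shows the generic way such a map is applied: each output pixel is a weighted blend of the two source images, with the map value selecting the in-focus source. This is only the textbook composition step, written with plain nested lists; it is not the DB-MFIF network, and the function name and toy values are assumptions.

```python
def fuse_with_decision_map(img_a, img_b, decision):
    """Blend two images pixel-wise: decision=1 keeps img_a, decision=0 keeps img_b."""
    return [
        [m * a + (1.0 - m) * b for a, b, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(img_a, img_b, decision)
    ]


img_a = [[1.0, 2.0]]
img_b = [[5.0, 6.0]]
hard = fuse_with_decision_map(img_a, img_b, [[1.0, 0.0]])   # binary map
soft = fuse_with_decision_map(img_a, img_b, [[0.5, 0.5]])   # soft map near a boundary
```

With a binary map, pixels far from the boundary are copied unchanged from one source, which is why such methods stay consistent with the inputs there; errors in the map near the boundary show up directly as artifacts.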

Abstract:
Virtual-to-real registration is a crucial aspect of 3D registration, and it presents a more challenging multimodal 3D registration problem because virtual and real models have different data structures. In this paper, we use a point cloud registration algorithm to align virtual and real models, transforming the multimodal 3D registration problem into a cross-source point cloud registration problem. We propose a method for extracting macro and micro structures to represent the shared features of virtual and real models, combined with a multi-constraint registration algorithm, to achieve high-accuracy virtual-to-real registration. This method can register unseen 3D objects using virtual prior knowledge, and allows partial point cloud registration without the need for a 360-degree scan of the model. Our approach can effectively resist interference from typical cross-source point cloud registration problems such as varying densities, missing data, and distribution changes. Furthermore, by processing only 0.2% of the original number of points through downsampling, we can effectively diminish the effects of noise and outliers, as well as significantly decrease time consumption. Experimental results show that our algorithm outperforms other advanced point cloud registration algorithms in cross-source point cloud registration for virtual-to-real registration.

Abstract:
Images captured at night are susceptible to inadequate illumination and motion blur. Given the typical coupling of these two forms of degradation, a pioneering work takes a compact approach of brightening followed by deblurring. However, this sequential approach may compromise informative features and elevate the likelihood of generating unintended artifacts. In this paper, we observe that co-existing low light and blur impair multiple perceptions, making it difficult to produce visually appealing results. To meet these challenges, we propose perceptual decoupling with heterogeneous auxiliary tasks (PDHAT) for joint low-light image enhancement and deblurring. Based on the crucial perceptual properties of the two degradations, we construct two individual auxiliary tasks: coarse preview prediction (CPP) and high-frequency reconstruction (HFR), so that the perception of color, brightness, edges, and details is decoupled into heterogeneous auxiliary tasks to obtain task-specific representations that assist the main task in parallel: joint low-light enhancement and deblurring (LLE-Deblur). Furthermore, we develop dedicated modules to build the network blocks in each branch based on the exclusive properties of each task. Comprehensive experiments are conducted on the LOL-Blur and Real-LOL-Blur datasets, showing that our method outperforms existing methods in both quantitative metrics and qualitative results.

Abstract:
Knowledge-based visual question answering (KBVQA) aims to retrieve external knowledge beyond the image to answer questions. However, current methods often introduce irrelevant knowledge due to two drawbacks: (1) Synonymy issue. Existing methods rely heavily on words from questions or object labels in images to match knowledge from databases, disregarding that the same word may hold multiple meanings in different contexts. (2) Knowledge uncertainty issue. Due to the absence of supervisory signals, recent methods cannot determine which knowledge is applicable for answer inference, which can mislead the model into admitting useless knowledge. To address these two problems, we propose to supervise the process of knowledge retrieval over a tree structure for the KBVQA task. For the synonymy issue, we construct a hierarchical knowledge tree to capture the subordination information between knowledge facts, mitigating the impact of synonyms on knowledge retrieval. For the knowledge uncertainty issue, we use the retrieval history as the ground truth to supervise the knowledge retrieval, which helps the QA model form an explicit path of knowledge facts for answer understanding. Finally, we integrate the image, question, and retrieved knowledge into a variant of the transformer to predict answers. Experimental results validate the effectiveness of the proposed method on the KR-VQA, OK-VQA and VQA v2 datasets.

Abstract:
To drive upgrades of Depth-Image-Based Rendering (DIBR) algorithms, depth image refinement, etc., quality assessment models for DIBR-synthesized images in 3D video systems are developed. However, most of these models could not effectively evaluate distortion due to irregular stretching (e.g., crumbling), which is more complex and common than black holes and regular stretching (e.g., horizontal stretching) in synthesized images. To make an attempt at this issue, a new quality assessment method is proposed for DIBR views. First, feature point matching and affine transformation are adopted to remove and compensate for the global object shift between reference and synthesized view images. Second, multi-scale discrete wavelet transform is utilized to extract multi-scale structure distortion; gradient magnitude similarity is further integrated to highlight the distortion features; morphological open operation and median filtering are adopted to exclude perceptually unimportant features. Third, scores are obtained by standard deviation pooling on distortion feature maps for each wavelet scale and sub-band. Experimental results demonstrate that our proposed model outperforms the state-of-the-art handcrafted feature-based DIBR-synthesized image quality assessment models on IETR database, and performs the best on average on IETR and IRCCyN/IVC databases.
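
The last two stages above (a gradient magnitude similarity map followed by standard deviation pooling) can be illustrated with the commonly used similarity formula GMS = (2·g_r·g_d + c) / (g_r² + g_d² + c). The sketch below assumes the gradient magnitude maps have already been computed for one wavelet sub-band; the constant `c`, the function names, and the toy values are assumptions, and the wavelet decomposition, registration, and morphological filtering steps are omitted.

```python
from statistics import pstdev


def gms_map(grad_ref, grad_dist, c=1e-3):
    """Pixel-wise gradient magnitude similarity between two gradient maps."""
    return [
        [(2 * r * d + c) / (r * r + d * d + c) for r, d in zip(row_r, row_d)]
        for row_r, row_d in zip(grad_ref, grad_dist)
    ]


def std_pool(feature_map):
    """Standard deviation pooling: larger spread means more uneven distortion."""
    return pstdev([v for row in feature_map for v in row])


ref = [[0.1, 0.5], [0.9, 0.3]]
dist = [[0.1, 0.5], [0.0, 0.3]]          # one location lost its edge
score_same = std_pool(gms_map(ref, ref))
score_dist = std_pool(gms_map(ref, dist))
```

Identical maps give a similarity of 1 everywhere and therefore a pooled score of zero, while a localized structural change raises the score; in a multi-scale setting the same pooling would be repeated per scale and sub-band.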

Abstract:
Heavy rain significantly reduces image visibility, hindering tasks like autonomous driving and video surveillance. Many existing rain removal methods, while effective in light rain, falter under heavy rain due to their reliance on purely spatial features. Recognizing this challenge, we introduce the Wavelet-Spatial Dual Attention Transformer Framework (WSDformer). This innovative architecture adeptly captures both frequency and spatial characteristics, anchored by the wavelet-spatial dual attention (WSDA) mechanism. While the spatial attention zeroes in on intricate local details, the wavelet attention leverages wavelet decomposition to encompass diverse frequency information, augmenting the spatial representations. Furthermore, addressing the persistent issue of incomplete structural detail restoration, we integrate the PriorFormer Block (PFB). This unique module, underpinned by the Prior Fusion Attention (PFA), synergizes residual channel prior features with input features, thereby enhancing background structures and guiding precise rain feature extraction. To navigate the intrinsic constraints of U-shaped transformers, such as semantic discontinuities and subdued multi-scale interactions from skip connections, our Cross Interaction U-Shaped Transformer Network is introduced. This design empowers superior semantic layers to streamline the extraction of their lower-tier counterparts, optimizing network learning. Empirical analysis reveals our method's leading prowess across rainy image datasets and achieves state-of-the-art performance, with notable supremacy in heavy rainfall conditions. This superiority extends to diverse visual challenges and real-world rainy scenarios, affirming its broad applicability and robustness.

Abstract:
High-quality underwater imaging is crucial for underwater exploration. However, particle scattering and light absorption by seawater significantly degrade image clarity. To address these issues, we propose a novel underwater image enhancement (UIE) method that combines pixel distribution remapping (PDR) with a multi-priority Retinex variational model. We design a pre-compensation method for severely attenuated channels that effectively prevents new color artifacts during color correction. By combining the inter-channel coupling relationships, we compute a limiting factor to remap pixel distribution curves to improve image contrast. In addition, considering the significant noise interference, we introduce the prior knowledge, including underwater noise and texture priors, while constructing the variational model, and design penalty terms that match the underwater characteristics to remove excessive noise in the reflectance component. Our approach efficiently decouples the illumination and reflectance components using a rapid solver. Subsequently, gamma correction adjusts the illumination component, and the corrected illumination and reflectance components are fused to reconstruct the final natural output image. Comprehensive evaluations across various datasets reveal that our approach significantly surpasses current state-of-the-art (SOTA) methods. These results demonstrate the effectiveness of our method in correcting color bias and compensating for luminance losses in underwater imagery.
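
The final recomposition step described above can be sketched under the usual Retinex assumption that an image is the pixel-wise product of reflectance and illumination. In the sketch below the illumination is brightened with a gamma curve and multiplied back into the reflectance; values are assumed to lie in [0, 1], and the exponent convention `1/gamma`, the default `gamma=2.2`, and the function name are assumptions rather than details from the paper. The variational decomposition itself is not shown.

```python
def recompose(reflectance, illumination, gamma=2.2):
    """Gamma-correct the illumination and fuse it with the reflectance."""
    return [
        [r * (l ** (1.0 / gamma)) for r, l in zip(row_r, row_l)]
        for row_r, row_l in zip(reflectance, illumination)
    ]


# A dark pixel: illumination 0.25 is lifted to 0.5 when gamma = 2.
out = recompose([[0.5]], [[0.25]], gamma=2.0)
```

Because only the illumination is adjusted, the reflectance (where color and texture live in this model) is left untouched, which is the usual motivation for correcting brightness after decomposition.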

Abstract:
In recent years, there have been advancements in developing Depth-Image-Based Rendering (DIBR) views. However, the quality of these synthesized views is often degraded by inefficient in-painting techniques and synthesis procedures, leading to geometric and structural distortions. This paper introduces two novel approaches to evaluate the quality of DIBR synthesized views, using full-reference (FR) and no-reference (NR) metrics. The proposed FR quality assessment (QA) metric is based on the observation that the deep features of the Non-Subsampled Contourlet Transform (NSCT) maps capture the perceptually important characteristics of the images. By calculating the difference between these deep feature vectors of the reference and distorted views, we determine the quality of the image. Moreover, many existing NR metrics divide an image into blocks and assign the same subjective quality score to each block for training a deep learning model. However, this approach is not suitable for DIBR synthesized views, as distortions are often localized in specific areas rather than affecting the entire view. Consequently, the performance of existing block-based deep-learning algorithms suffers due to the absence of accurate ground truth scores for each image block. To address this limitation, this work proposes an innovative method for calculating ground truth scores for individual image blocks. This process is similar to the proposed FR metric. First, we obtain the deep features of the NSCT map of an image block, and the quality score for each block is calculated from its feature vector and that of the reference block. These block-wise ground truth scores are used to train a deep learning model, which serves as an NR metric for estimating the quality of a given test block. Finally, the predicted block-level quality values are aggregated to determine the overall quality of the entire image. Experimental results demonstrate that both proposed algorithms perform better than the existing objective metrics for DIBR synthesized views.

Abstract:
Recently, three-dimensional (3D) point-cloud analysis has been extensively utilized in the domain of machine vision, encompassing tasks such as shape classification and segmentation. However, the inherent disorder in point clouds poses a challenge in capturing relationships among points, particularly when dealing with mutilated and occluded data. To this end, we propose the Point Geometry Transformation (PointGT) method for 3D point-cloud classification and part segmentation, which explores the underlying local and global geometric structure of points. Specifically, the efficacy of PointGT arises from the integration of a local abstraction (LA) module and an optimization strategy. The LA module is tailored to address the localized features inherent to point clouds. This module encapsulates the multidimensional attributes of local edge and inside points. The bi-directional cross-attention mechanism amalgamates these two constituents into the native channel with the primary objective of optimizing the exploitation of edge and inside delineations, thereby judiciously mitigating noise artifacts. Ultimately, the channel residual connections disseminate the post-downsampling point attributes, thereby inheriting the edge and inside delineations gleaned via bi-directional attention. The effectiveness of the proposed method was verified on point-cloud classification and segmentation datasets. The empirical findings confirmed the efficacy of PointGT; accuracies of 93.2% and 87.8% were achieved for the ModelNet40 and ScanObjectNN datasets, respectively.

Abstract:
In real-world applications, image degeneration caused by adverse weather is always complex and changes with weather conditions across days and seasons. Systems in real-world environments constantly encounter adverse weather conditions that have not been previously observed. Therefore, adverse weather removal models are required in practice to continually learn from incrementally collected data reflecting various degeneration types. Existing adverse weather removal approaches, for either single or multiple adverse weathers, are mainly designed for a static learning paradigm, which assumes that the data for all types of degeneration to be handled can be collected at one time before a single-phase learning process. They thus cannot directly handle the incremental learning requirements. To address this issue, we make the earliest effort to investigate the continual all-in-one adverse weather removal task, in a setting closer to real-world applications. Specifically, we develop a novel continual learning framework with effective knowledge replay (KR) on a unified network structure. Equipped with a principal component projection and an effective knowledge distillation mechanism, the proposed KR techniques are tailored for the all-in-one weather removal task. They consider the characteristics of the image restoration task with multiple degenerations in continual learning, and the knowledge for different degenerations can be shared and accumulated in the unified network structure. Extensive experimental results demonstrate the effectiveness of the proposed method in dealing with this challenging task, performing competitively with existing dedicated or joint training image restoration methods.

Abstract:
Self-supervised learning has achieved great success in both natural language processing and 2D vision, where masked modeling is a quite popular pre-training scheme. However, extending masking to 3D point cloud understanding that combines local and global features poses a new challenge. In our work, we present Point-LGMask, a novel method to embed both local and global contexts with multi-ratio masking, which is quite effective for self-supervised feature learning of point clouds but is unfortunately ignored by existing pre-training works. Specifically, to avoid fitting to a fixed masking ratio, we first propose multi-ratio masking, which prompts the encoder to fully explore representative features thanks to tasks of different difficulties. Next, to encourage the embedding of both local and global features, we formulate a compound loss, which consists of (i) a global representation contrastive loss to encourage the cluster assignments of the masked point clouds to be consistent to that of the completed input, and (ii) a local point cloud prediction loss to encourage accurate prediction of masked points. Equipped with our Point-LGMask, we show that our learned representations transfer well to various downstream tasks, including few-shot classification, shape classification, object part segmentation, as well as real-world scene-based 3D object detection and 3D semantic segmentation. Particularly, our model largely advances existing pre-training methods on the difficult few-shot classification task using the real-captured ScanObjectNN dataset by surpassing over 4% to the second-best method. Also, our Point-LGMask achieves 0.4% AP_25 and 0.8% AP_50 gains on 3D object detection task over the second-best method. For semantic segmentation, our Point-LGMask surpasses the second-best method by 0.4% mAcc and 0.5% mIoU.
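
The multi-ratio masking idea in the abstract above can be illustrated with a minimal sketch. It assumes point-level random masking and a hypothetical set of ratios (0.25, 0.5, 0.75); the abstract does not state the masking granularity or the ratios actually used, so the function name `multi_ratio_masks` and its parameters are illustrative only.

```python
import random


def multi_ratio_masks(num_points, ratios, seed=0):
    """Return one boolean mask per ratio; True marks a masked (hidden) point.

    Assumption: points are masked independently at random. The ratios and
    the point-level granularity are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    masks = {}
    for ratio in ratios:
        n_masked = int(round(ratio * num_points))
        masked_idx = set(rng.sample(range(num_points), n_masked))
        masks[ratio] = [i in masked_idx for i in range(num_points)]
    return masks


# Each ratio yields a reconstruction task of different difficulty
# for the same input cloud.
masks = multi_ratio_masks(num_points=100, ratios=[0.25, 0.5, 0.75])
```

Under this sketch, an encoder would see the same cloud under several masks per training step, rather than being fitted to a single fixed ratio.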

Abstract:
Recent years have witnessed an exponential increase in the demand for face video compression, and the success of artificial intelligence has expanded the boundaries beyond traditional hybrid video coding. Generative coding approaches have been identified as promising alternatives with reasonable perceptual rate-distortion trade-offs, leveraging the statistical priors of face videos. However, the great diversity of distortion types in spatial and temporal domains, ranging from the traditional hybrid coding frameworks to generative models, present grand challenges in compressed face video quality assessment (VQA) that plays a crucial role in the whole delivery chain for quality monitoring and optimization. In this paper, we introduce the large-scale Compressed Face Video Quality Assessment (CFVQA) database, which is the first attempt to systematically understand the perceptual quality and diversified compression distortions in face videos. The database contains 3,240 compressed face video clips in multiple compression levels, which are derived from 135 source videos with diversified content using six representative video codecs, including two traditional methods based on hybrid coding frameworks, two end-to-end methods, and two generative methods. The unique characteristics of CFVQA, including large-scale, fine-grained, great content diversity, and cross-compression distortion types, make the benchmarking for existing image quality assessment (IQA) and VQA feasible and practical. The results reveal the weakness of existing IQA and VQA models, which challenge real-world face video applications. In addition, a FAce VideO IntegeRity (FAVOR) index for face video compression was developed to measure the perceptual quality, considering the distinct content characteristics and temporal priors of the face videos. Experimental results exhibit its superior performance on the proposed CFVQA dataset.

Abstract:
Several approaches have been proposed to estimate quality in subjective experiments while highlighting peculiar subject behaviors. However, there is some room for improvement in existing approaches, both in terms of robustness to noise and the ability to accurately indicate several peculiar subject behaviors in subjective experiments. This work advances the state-of-the-art in three main directions: i) A new approach to estimate the subjective quality from noisy ratings is proposed and is shown to be more robust to noise than are four state-of-the-art approaches; ii) a novel subject scoring model is proposed that makes it possible to highlight several peculiar behaviors typically observed in subjective experiments; and iii) our proposed probabilistic subject scoring model results from the proof of a theorem, whereas in previous approaches a probabilistic scoring model is assumed a priori. This represents an important first step toward models supported by a stronger theoretical foundation. Numerical experiments conducted on several datasets highlight the effectiveness of our proposal.

Abstract:
In real-world environments, outdoor imaging systems are often affected by disturbances such as rain degradation. In nighttime driving scenes especially, insufficient and uneven lighting shrouds the scenes in darkness, resulting in degradation of both image quality and visibility. In the field of autonomous driving in particular, the visual perception ability of RGB sensors declines sharply in such harsh scenarios. Additionally, driving assistance systems suffer from reduced capabilities in capturing and discerning the surrounding environment, posing a threat to driving safety. Single-view information captured by single-modal sensors cannot comprehensively depict the entire scene. To address these challenges, we developed an image de-raining framework tailored for rainy nighttime driving scenes. It aims to remove rain artifacts, enrich scene representation, and restore useful information. Specifically, we introduce cooperative learning between visible and infrared images captured by different sensors. By cross-view fusion of these multi-source data, the scene within the images gains richer texture details and enhanced contrast. We constructed an information cleaning module called CleanNet as the first stage of our framework. Moreover, we designed an information fusion module called FusionNet as the second stage to fuse the clean visible images with infrared images. Using this stage-by-stage learning strategy, we obtain de-rained fusion images with higher quality and better visual perception. Extensive experiments demonstrate the effectiveness of our proposed Cross-View Cooperative Learning (CVCL) in adverse driving scenarios in low-light rainy environments. The proposed approach addresses the gap in the utilization of existing rain removal algorithms in specific low-light conditions. It also holds promise for extending the application of image de-raining and image fusion methods to computer vision tasks.

Abstract:
Correctly labeled data significantly impacts the success of deep learning for image classification and other computer vision tasks. However, the accuracy of labels annotated by humans often decreases as the amount of data increases, and a rise in the noisy label rate deteriorates a neural network's performance. Thus, deep learning with noisy labels has attracted extensive attention. In this paper, we propose a novel multi-network method that achieves robust performances in image classification with noisy labels. In this method, two models are trained at each epoch using supervised learning and semi-supervised learning. As the selection of the labeled data has a great impact on the semi-supervised learning performance, we improve the sample selection method to obtain a better division of the dataset. Specifically, we divide the semi-supervised learning dataset using a combination of the per-sample losses and model memory. To enhance the semi-supervised learning performance, we propose a novel method to train the unlabeled data by combining negative learning and feature space renormalization. Finally, we verify the performance of our method for learning with noisy labels on four benchmark datasets, which include Cifar10, Cifar100 and Clothing1M. The experimental results show the effectiveness and robustness of our method.

Abstract:
Night-time image enhancement (NIE) aims at boosting the intensity of low-light regions while suppressing noise or light effects in night-time images, and numerous efforts have been made for this task. However, few explorations focus on the quality evaluation of enhanced night-time images (ENTIs), and how to fairly compare the performance of different NIE algorithms remains a challenging problem. In this paper, we first construct a new Real-world Night-Time Image Enhancement Quality Assessment (i.e., RNTIEQA) dataset that includes two typical types of night-time scenes (i.e., extremely low light and uneven light scenes), and carry out human subjective studies to compare the quality of ENTIs obtained by a set of representative NIE algorithms. Afterwards, a new objective ranking method that comprehensively considers image intrinsic and impairment attributes is proposed for automatically predicting the quality of ENTIs. Experimental results on our RNTIEQA dataset demonstrate that the proposed method outperforms the off-the-shelf competitors. Our dataset and code will be released at https://github.com/Leilei-Huang-work/RNTIEQA-dataset.

Abstract:
As a cost-effective alternative to standard multi-label learning, the multi-label image recognition with partial positive labels (MLR-PPL) task attracts increasing attention, in which merely a portion of positive labels are given while the rest of positive labels and all negative labels are missing. To facilitate this task, we propose a novel framework that leverages semantic correlation among different images in a category-adaptive manner to complement unknown labels accurately. Specifically, the proposed framework consists of two complementary modules. 1) A category-adaptive label discovery (CALD) module is designed to measure the semantic similarity between positive samples and then complement unknown labels with high similarities. 2) A category-adaptive noise rejection (CANR) module is designed to compute the sample weights based on semantic similarities from different samples and discard noisy labels with low weights. Due to the various degrees of confidence calibration among different categories, searching appropriate thresholds for each category in the proposed framework is highly time-consuming. To avoid such a resource-intensive manual tuning, we introduce a category-adaptive threshold updating algorithm that introduces the category-specific positive and negative similarity to adjust the threshold adaptively. Extensive experiments on various benchmarks show that the proposed framework performs better than current state-of-the-art algorithms.

Abstract:
Owing to its capacity for performing full-time target searches, cross-modality vehicle re-identification based on unmanned aerial vehicles (UAV) is gaining more attention in both video surveillance and public security. However, this promising and innovative research has not been studied sufficiently due to the issue of data inadequacy. Meanwhile, the cross-modality discrepancy and orientation discrepancy challenges further aggravate the difficulty of this task. To this end, we pioneer a cross-modality vehicle Re-ID benchmark named UAV Cross-Modality Vehicle Re-ID (UCM-VeID), containing 753 identities with 16015 RGB and 13913 infrared images. Moreover, to meet the cross-modality discrepancy and orientation discrepancy challenges, we present a hybrid weights decoupling network (HWDNet) to learn shared, discriminative, orientation-invariant features. For the first challenge, we propose a hybrid weights siamese network with a well-designed weight restrainer and its corresponding objective function to learn both modality-specific and modality-shared information. For the second challenge, three effective decoupling structures with two pretext tasks are investigated to flexibly conduct the orientation-invariant feature separation task. Comprehensive experiments are carried out to validate the effectiveness of the proposed method.

Affiliations: Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; National University of Singapore, Singapore; Center for Future Media, and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; Center for Future Multimedia and the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract:
We study the task of image retrieval with text feedback, where a reference image and modification text are composed to retrieve the desired target image. To accomplish this goal, existing methods always get the multimodal representations through different feature encoders and then adopt different strategies to model the correlation between the composed inputs and the target image. However, the multimodal query brings more challenges as it requires not only the synergistic understanding of the semantics from the heterogeneous multimodal inputs but also the ability to accurately build the underlying semantic correlation existing in each inputs-target triplet, i.e., reference image, modification text, and target image. In this paper, we tackle these issues with a novel Align and Retrieve (AlRet) framework. First, our proposed methods employ the contrastive loss in the feature encoders to learn meaningful multimodal representation while making the subsequent correlation modeling process in a more harmonious space. Then we propose to learn the accurate correlation between the composed inputs and target image in a novel composition-and-decomposition paradigm. Specifically, the composition network couples the reference image and modification text into a joint representation to learn the correlation between the joint representation and target image. The decomposition network conversely decouples the target image into visual and text subspaces to exploit the underlying correlation between the target image with each query element. The composition-and-decomposition paradigm forms a closed loop, which can be optimized simultaneously to promote each other in the performance. Massive comparison experiments on three real-world datasets confirm the effectiveness of the proposed method.

Abstract:
Reversible data hiding in encrypted images (RDHEI) is an effective technology of protecting private data. In this paper, a high-capacity RDHEI method with asymmetric coding and bit-plane block compression is proposed. Our major contributions are twofold. (1) We propose an asymmetric coding technique for processing prediction error (PE) blocks before encryption. The proposed asymmetric coding technique does not generate the sign bit-plane and facilitates massive zeros converging on the high bit-planes. This is beneficial to reserve the embedding room. (2) We present a bit-plane block compression technique for improving the embedding capacity. This technique divides the PE codes in a block into two parts which are both compressed and thus contribute a large embedding room. Experimental results demonstrate that the average embedding rates of the proposed method are 4.174, 4.08 and 3.467 bpp on the BOSSBase, BOWS-2 and UCID datasets, respectively. Comparisons show that our average embedding rates on the three datasets are all bigger than those of some state-of-the-art methods.

Abstract:
(k,n) Threshold secret image sharing (SIS) hides a secret image within n shadows, and at least k shadows are needed for recovery. Due to the popularity and frequency of JPEG recompression, there is a need for robust secret image sharing (ROSIS) designed for JPEG images that is resilient to recompression for practical SIS applications. The current state-of-the-art ROSIS, which relies on error-correcting codes (ECC), is effective only for JPEG compression with quality factors (QFs) of 99 and 100. However, it generates noise-like shadow images that are confined to the spatial domain. In this paper, we present SBC-ROSIS (Robust Secret Image Sharing Scheme Resistant to JPEG Recompression Based on Stable Block Condition), a novel ROSIS scheme that utilizes a stable block condition to guarantee the invariance of discrete cosine transform (DCT) coefficients during JPEG recompression, significantly enhancing the robustness of the scheme. By employing a polynomial-based secret sharing (SS) algorithm, we construct DCT blocks that adhere to stable block condition either directly or through strategic global regulation. Additionally, we carefully consider the similarity between the generated DCT blocks and the original cover DCT blocks. Furthermore, we devised a tailored evaluation methodology specifically for ROSIS. Extensive experimental results indicate that SBC-ROSIS can effectively process JPEG images, achieving a balance among security, robustness, concealment, and adherence to the (k, n) threshold, without relying on steganography, ECC, or pixel expansion, and demonstrating robust performance in realistic recompression scenarios.
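
The (k, n) threshold property that the abstract above relies on can be illustrated with a generic polynomial-based (Shamir-style) sharing of a single value. This is only a sketch of the textbook construction, not the SBC-ROSIS scheme: the field modulus 257 and the helper names `make_shares` and `recover` are assumptions chosen for illustration, and nothing here touches DCT coefficients or the stable block condition.

```python
import random

PRIME = 257  # assumed modulus: smallest prime above the 8-bit value range


def make_shares(secret, k, n, rng=random):
    """Split one value in [0, PRIME) into n shares; any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = make_shares(secret=200, k=3, n=5)
restored = recover(shares[:3])  # any 3 of the 5 shares suffice
```

With fewer than k shares the interpolation is underdetermined, which is what gives the threshold guarantee in the generic scheme.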

Abstract:
The dense captioning task aims at detecting multiple salient regions of an image and describing them separately in natural language. Although significant advancements in the field of dense captioning have been made, there are still some limitations to existing methods in recent years. On the one hand, most dense captioning methods lack strong target detection capabilities and struggle to cover all relevant content when dealing with target-intensive images. On the other hand, current transformer-based methods are powerful but neglect the acquisition and utilization of contextual information, hindering the visual understanding of local areas. To address these issues, we propose a common and distinct knowledge-mining network with content interaction for the task of dense captioning. Our network has a knowledge mining mechanism that improves the detection of salient targets by capturing common and distinct knowledge from multi-scale features. We further propose a content interaction module that combines region features into a unique context based on their correlation. Our experiments on various benchmarks have shown that the proposed method outperforms the current state-of-the-art methods.

Abstract:
Photography, like painting, allows artists to express themselves through their unique style. In digital photography, this is achieved not only with the choice of the subject and the composition but also by means of post-processing operations. The automatic identification of a photographer from the style of a photo is a challenging task, for many reasons, including the lack of suitable datasets including photos taken by a diverse panel of photographers with a clear photographic style. In this paper we present PhotoStyle60, a new dataset including 5708 photographs from 60 professional and semi-professional photographers. Additionally, we selected a reduced version of the dataset, called PhotoStyle10 containing images from 10 clearly distinguishable experts. We designed the dataset to address two tasks in particular: photo authorship attribution and photographic style transfer. In the former, we conducted an extensive analysis of the dataset through several classification experiments. In the latter, we explored the potential of our dataset to transfer a photographer's style to images from the Five-K dataset. Additionally, we propose also a simple but effective multi-image style transfer method that uses multiple samples of the target style. A user study demonstrated that such a method was able to reach accurate results, preserving the semantic content of the source photograph with very few artifacts.

Abstract:
RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image pairs could significantly degrade their performance. In this paper, we make the first attempt to address RGBT SOD for initially captured RGB and thermal image pairs without manual alignment. Specifically, we propose a Semantics-guided Asymmetric Correlation Network (SACNet) that consists of two novel components: 1) an asymmetric correlation module utilizing semantics-guided attention to model cross-modal correlations specific to unaligned salient regions; 2) an associated feature sampling module to sample relevant thermal features according to the corresponding RGB features for multi-modal feature integration. In addition, we construct a unified benchmark dataset called UVT2000, containing 2000 RGB and thermal image pairs directly captured from various real-world scenes without any alignment, to facilitate research on alignment-free RGBT SOD. Extensive experiments on both aligned and unaligned datasets demonstrate the effectiveness and superior performance of our method.

Abstract:
The effective receptive field (ERF) plays an important role in transform coding, which determines how much redundancy can be removed during transform and how many spatial priors can be utilized to synthesize textures during inverse transform. Existing methods rely on stacks of small kernels, whose ERFs remain insufficiently large, or heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large kernel-based depth-wise convolutions to reduce more redundancy while maintaining modest complexity. Due to the wide range of image diversity, we further propose a mechanism to augment convolution adaptability through the self-conditioned generation of weights. The large kernels cooperate with non-linear embedding and gate mechanisms for better expressiveness and lighter point-wise interactions. Our investigation extends to refined training methods that unlock the full potential of these large kernels. Moreover, to promote more dynamic inter-channel interactions, we introduce an adaptive channel-wise bit allocation strategy that autonomously generates channel importance factors in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model to compare with existing transform methods and obtain models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our proposed LLIC models have significant improvements over the corresponding baselines and reduce the BD-Rate by 9.49%, 9.47%, and 10.94% on Kodak over VTM-17.0 Intra, respectively. Our LLIC models achieve state-of-the-art performances and better trade-offs between performance and complexity.

Abstract:
Point Cloud Quality Assessment (PCQA) plays an essential role in optimizing point cloud acquisition, encoding, transmission, and rendering for human-centric visual media applications. In this paper, we propose an objective PCQA model using Complementary Features from 3D and 2D spaces, called CF-PCQA, to measure the visual quality of colored point clouds. First, we develop four effective features in 3D space to represent the perceptual properties of colored point clouds, which include curvature, kurtosis, luminance distance and hue features of points in 3D space. Second, we project the 3D point cloud onto 2D planes using patch projection and extract a structural similarity feature of the projected 2D images in the spatial domain, as well as a sub-band similarity feature in the wavelet domain. Finally, we propose a feature selection and a learning model to fuse high dimensional features and predict the visual quality of the colored point clouds. Extensive experimental results show that the Pearson Linear Correlation Coefficients (PLCCs) of the proposed CF-PCQA were 0.9117, 0.9005, 0.9340 and 0.9826 on the SIAT-PCQD, SJTU-PCQA, WPC2.0 and ICIP2020 datasets, respectively. Moreover, statistical significance tests demonstrate that the CF-PCQA significantly outperforms the state-of-the-art PCQA benchmark schemes on the four datasets.
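
For reference, the Pearson Linear Correlation Coefficient reported in the abstract above measures the linear agreement between predicted quality scores and subjective scores. The sketch below is a plain implementation of the standard formula; the function name `plcc` and the sample values are illustrative and are not data from the paper.

```python
import math


def plcc(predicted, subjective):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    return cov / math.sqrt(var_p * var_s)


# Hypothetical scores: a perfectly linear relation gives a PLCC of 1.
value = plcc([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

In quality-assessment studies a nonlinear (often logistic) mapping is commonly fitted to the predictions before PLCC is computed; whether the paper does so is not stated in the abstract.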

Abstract:
Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs are capable of extracting both long-distance contextual information and inter-modal interactive information. Unfortunately, since existing methods such as MMGCN directly fuse multiple modalities, redundant information may be generated and diverse information may be lost. In this work, we present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the problem of heterogeneity gap in multimodal fusion by utilizing multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, thus enabling GNNs to extract crucial contextual and interactive information more accurately when performing message passing. Furthermore, we design a GNN structure called GAT-MLP, which can provide a new unified network framework for multimodal learning. The experimental results on two benchmark datasets show that our GraphCFC outperforms the state-of-the-art (SOTA) approaches.

Abstract:
With the increasing popularity of digital video, video steganography has become a hot research topic in the field of covert communication and privacy protection. Existing prediction unit (PU) based video steganography often results in a large bit rate increase, which is easily noticed by a steganalyst. To solve this problem, an adaptive steganography for HEVC video based on attention-net and PU partition modes is proposed. First, the distortion of modified PUs is analyzed from the perspective of rate distortion optimization at the group of pictures (GOP) level, and we find that modifying PUs leads to distortion accumulation and an abnormal bitrate increase. Therefore, an adaptive distortion function based on the improved rate distortion cost is designed, and the embedding distortion is minimized by using Syndrome-Trellis Code (STC) steganography coding. Meanwhile, a super-resolution convolutional neural network with a non-local sparse attention-net filter is proposed to replace the in-loop filter in HEVC to reconstruct the reference frame, thereby reducing the bitrate cost and improving the visual quality of the stego-video. Experimental results show that the proposed algorithm achieves superior perceptual quality and bitrate performance compared with state-of-the-art works.

Abstract:
In recent years, a number of image encryption schemes based on DNA coding and nonlinear dynamics have been proposed. Generally, these DNA-based schemes first encode plaintext images into DNA sequences and then encrypt them with pseudorandom elements produced by chaotic systems or other nonlinear dynamics. Although ciphertexts can pass some security tests, many image encryption schemes are being shown to have intrinsic flaws and that they cannot guarantee a high level of security. In this article, we cryptanalyze a family of image encryption schemes for which the encryption kernel is DNA coding or its variant. The complex DNA operation can be simplified as a substitution box (S-box). The whole cryptosystem's security level is thus significantly decreased and is vulnerable to the chosen-plaintext attack. Applications of this concept to break five ciphers are theoretically presented and experimentally verified. In addition, some suggestions for resisting similar attacks are also given in this article.
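
The claim that a DNA operation can be simplified to an S-box can be illustrated with a toy pipeline. The sketch assumes one particular, hypothetical construction (encode a byte into four bases, XOR with four key bases, decode under a different rule); the five ciphers analyzed in the paper are not reproduced here, and the rule tables and the key `"GATC"` are illustrative choices.

```python
# Two of the valid DNA coding rules (A/T and C/G carry complementary bits).
ENCODE = {0: "A", 1: "C", 2: "G", 3: "T"}
INDEX = {base: bits for bits, base in ENCODE.items()}
DECODE = {"A": 3, "C": 2, "G": 1, "T": 0}  # a different rule for decoding


def dna_xor(a, b):
    """XOR of two bases, defined through the encoding rule."""
    return ENCODE[INDEX[a] ^ INDEX[b]]


def dna_encrypt_byte(value, key_bases):
    """Toy DNA pipeline: encode a byte, XOR with key bases, decode."""
    bases = [ENCODE[(value >> shift) & 3] for shift in (6, 4, 2, 0)]
    mixed = [dna_xor(b, k) for b, k in zip(bases, key_bases)]
    out = 0
    for base in mixed:
        out = (out << 2) | DECODE[base]
    return out


# For a fixed key, the whole pipeline collapses to a 256-entry lookup table.
SBOX = [dna_encrypt_byte(v, "GATC") for v in range(256)]
```

Because the table is a fixed permutation of byte values, an attacker who can choose plaintexts could in principle tabulate it directly, which is the intuition behind the chosen-plaintext attack described above.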

Abstract:
Global context information is particularly important for comprehensive scene understanding. It helps clarify local confusions and smooth predictions to achieve fine-grained and coherent results. However, most existing light field (LF) processing methods leverage convolution layers to model spatial and angular information. The limited receptive field restricts them from learning long-range dependencies in the LF structure. In this article, we propose a novel network based on deep efficient transformers (i.e., LF-DET) for LF spatial super-resolution. It develops a spatial-angular separable transformer encoder with two modeling strategies, termed sub-sampling spatial modeling and multi-scale angular modeling, for global context interaction. Specifically, the former utilizes a sub-sampling convolution layer to alleviate the huge computational cost of capturing spatial information within each sub-aperture image. In this way, our model can cascade more transformers to continuously enhance feature representation with limited resources. The latter processes multi-scale macro-pixel regions to extract and aggregate angular features focusing on different disparity ranges, so as to adapt well to disparity variations. Besides, we capture strong similarities among surrounding pixels through dynamic positional encodings to compensate for the lack of local information interaction in transformers. The experimental results on both real-world and synthetic LF datasets confirm that our LF-DET achieves a significant performance improvement compared with state-of-the-art methods. Furthermore, our LF-DET shows high robustness to disparity variations through the proposed multi-scale angular modeling.

Abstract:
Convolutional neural networks (CNNs) are commonly employed for image emotion recognition owing to their ability to extract local features; however, they have difficulty capturing the global representations of images. In contrast, self-attention modules in a visual transformer network can capture long-range dependencies as global features. Some studies have shown that an image's local and global features determine the emotions of the image and that some local regions can generate an emotional prioritization effect. Therefore, we propose combining global self-attention features and a local multiscale features network (CGLF-Net) to recognize an image's emotion, extracting image features from global and local perspectives. Specifically, the cross-scale transformer network is employed instead of convolution operations in the global feature branch to enhance its model feature representation. In the local feature branch, the improved feature pyramid module is applied to extract features from different sensory fields, thereby combining semantic information with different scales. Furthermore, the local attention module based on class activation maps guides the network to focus on locally salient regions. In addition, using multibranch loss functions, local and global feature branches are combined to enhance the ability to capture a comprehensive set of features. Consequently, the proposed network achieves recognition accuracies of 75.61% and 65.01% on the FI-8 benchmark dataset and Emotion-6 benchmark dataset, respectively. These results show that the proposed CGLF-Net reliably addresses the difficulty of extracting global features using CNNs and achieves state-of-the-art classification performance.

Abstract:
Recently, deep learning techniques have been widely employed to deal with the face super-resolution (FSR) problem. They aim to predict the nonlinear relationship between low-resolution (LR) face images and the corresponding high-resolution (HR) ones, so that high-frequency details can be recovered from the degraded LR textures. However, both CNN-based and Transformer-based approaches mostly enhance details by exploiting the relationship of local pixels or patches on LR features; the nonlocal features are not fully taken into account for producing high-frequency textures. To address this problem, we design a novel dual-branch module that consists of a Transformer branch and a CNN branch. The Transformer branch extracts multiple scale feature embeddings and explores local and nonlocal self-attention simultaneously. Thus, the parallel self-attention mechanism has superior capabilities to capture the local and nonlocal dependencies on the face image in face reconstruction. Furthermore, traditional CNNs usually extract features by combining pixels in a local convolutional kernel. This may not be effective for recovering lost high-frequency details, because the variations of local pixels, which are important in recovering vivid edges and contours, are not well measured. To this end, we propose the local variation based attention block on the CNN branch, which enhances these capabilities by directly extracting features from the variation of neighboring pixels. Finally, the Transformer branch and the CNN branch are combined by the modulation block to fuse the nonlocal and local advantages of the two branches. Experimental results demonstrate the effectiveness of the proposed method when compared with state-of-the-art approaches.

Abstract:
An accurate computational model for image quality assessment (IQA) benefits many vision applications, such as image filtering, image processing, and image generation. Although the study of face images is an important subfield in computer vision research, the lack of face IQA data and models limits the precision of current IQA metrics on face image processing tasks such as face superresolution, face enhancement, and face editing. To narrow this gap, in this article, we first introduce the largest annotated IQA database developed to date, which contains 20,000 human faces – an order of magnitude larger than all existing rated datasets of faces – of diverse individuals in highly varied circumstances. Based on the database, we further propose a novel deep learning model to accurately predict face image quality, which, for the first time, explores the use of generative priors for IQA. By taking advantage of rich statistics encoded in well pretrained off-the-shelf generative models, we obtain generative prior information and use it as latent references to facilitate blind IQA. The experimental results demonstrate both the value of the proposed dataset for face IQA and the superior performance of the proposed model.

Abstract:
Traditional full-reference image quality assessment (FR-IQA) methods predict the perceptual quality of a distorted image with a given pristine-quality image as the reference. However, near-threshold visual perception suggests that there could be numerous pristine-quality representations of a scene that are indistinguishable from one another, and the so-called pristine image used for reference in FR-IQA is just one of them. While numerous approaches have been proposed for FR-IQA by evaluating perceptual similarity, much less work has been dedicated to locating the best reference for the deterministic perceptual similarity measure. This article aims to answer the question of whether enabling freedom in reference image selection could lead to better performance by designing a new FR-IQA paradigm, FLexible REference (FLRE). The FLRE paradigm is developed in the feature space by attempting to obtain the feature-level reference of the distorted image via the selection of its corresponding best explanation within an equal-quality space. To this end, we devise the Perceptually Near-Threshold Estimation (PNTE) and the Pseudo-Reference Search (PRS) strategies. In particular, the PNTE module predicts the equal-quality map of a given pristine-quality feature, forming an equal-quality space. Subsequently, the PRS strategy is employed to locate the reference of the distorted feature within the equal-quality space in an element-wise minimum distance search manner. Due to the lack of a ground-truth reference (i.e., best explanation) for each distorted image, we optimize the pseudo-reference feature learning under three constraints, i.e., the quality regression loss, the disturbance maximization loss, and the content loss. We implement FLRE as a plug-in module before the deterministic FR-IQA process, and experimental results demonstrate that combining FLRE with existing deep feature-based FR-IQA models can significantly improve the quality prediction performance, largely surpassing the state-of-the-art methods.
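
The element-wise minimum distance search described above can be illustrated with a small sketch. The abstract does not specify the shape of the equal-quality space, so this example assumes, purely for illustration, that it is a symmetric per-element interval around the pristine feature whose half-width is the predicted equal-quality map. Under that assumption, the closest point to the distorted feature is obtained by clamping. The function name `pseudo_reference` and its arguments are hypothetical and not taken from the paper.

```python
def pseudo_reference(pristine, threshold, distorted):
    """Pick, per element, the point of the assumed equal-quality interval
    [p - t, p + t] that lies closest to the distorted feature value d.

    pristine, threshold, distorted: flat lists of floats of equal length.
    The interval form is an assumption made for this sketch only.
    """
    reference = []
    for p, t, d in zip(pristine, threshold, distorted):
        lower, upper = p - t, p + t
        # Clamping d into [lower, upper] minimizes |d - r| over the interval.
        reference.append(min(max(d, lower), upper))
    return reference


if __name__ == "__main__":
    print(pseudo_reference([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [1.2, 3.0, -1.0]))
```

When the distorted value already lies inside the interval, the reference equals the distorted value for that element, so distortions below the assumed visibility threshold contribute no distance.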

Abstract:
Recent studies have shown that joint depth and pose estimation using convolutional neural networks (CNNs) can learn from unlabelled monocular frames. However, three problems remain: 1) CNNs can only extract local features due to the limited receptive field, 2) scale ambiguity is inherent in the monocular task, and 3) illness regions violate the photometric consistency assumption and produce large errors. We propose a novel framework, ADPDepth, with corresponding effective strategies to ameliorate the above problems. First, a PCAtt module is designed to capture the correlation between channels and efficiently extract multiscale spatial information using a multibranch parallel strategy. Second, a depth-pose consistency loss is proposed based on the geometric consistency in depth and pose to constrain the scale between samples, eliminate scale ambiguity and obtain a globally consistent scale. To further improve performance, a cover mask is derived from depth-pose consistency for filtering dynamic objects and outliers to reduce the adverse effects of these illness regions. Extensive experiments are conducted on the KITTI, NYU-Depth and Make3D datasets. Based on public benchmarks, the experimental results confirm that the proposed ADPDepth framework achieves state-of-the-art performance. The effectiveness of each strategy is also verified in subsequent ablation experiments.

Abstract:
Owing to the rapid development of emerging 360° panoramic imaging techniques, indoor 360° depth estimation has attracted extensive attention in the community. Due to the lack of available ground truth depth data, there is an urgent need to model indoor 360° depth estimation in a self-supervised mode. However, self-supervised 360° depth estimation suffers from two major limitations. One is the distortion and network training problems caused by equirectangular projection (ERP), and the other is that texture-less regions provide weak signals for back-propagation in a self-supervised mode. Hence, to address the above issues, we introduce spherical view synthesis for learning self-supervised 360° depth estimation. Specifically, to alleviate the ERP-related problems, we first propose a dual-branch distortion-aware network, which includes a distortion-aware module and a hybrid projection fusion module, to produce the coarse depth map. Subsequently, the coarse depth map is utilized for spherical view synthesis, in which a spherically weighted loss function for view reconstruction and depth smoothing is investigated to optimize the projection distribution problem of 360° images. In addition, two structural regularities of indoor 360° scenes, namely the principal-direction normal constraint and the co-planar depth constraint, are devised as two additional supervisory signals to efficiently optimize our self-supervised 360° depth estimation model. The principal-direction normal constraint is designed to align the normal of the 360° image with the direction of the vanishing points. Meanwhile, we employ the co-planar depth constraint to fit the estimated depth of each pixel through its 3D plane. Finally, a depth map is obtained for the 360° image. Experimental results illustrate that our proposed method outperforms current advanced depth estimation methods on four publicly available datasets.

Abstract:
Image fusion aims to integrate the complementary information of source images and synthesize a single fused image. Existing image fusion algorithms apply hand-crafted fusion rules to merge deep features, which causes information loss and limits fusion performance owing to the uninterpretability of deep learning. To overcome the above shortcomings, we propose a learnable fusion rule for infrared and visible image fusion based on class activation mapping. Our proposed fusion rule can selectively preserve meaningful information and reduce distortion. More specifically, we first train an encoder-decoder network and an auxiliary classifier based on the shared encoder. Then, the class activation weights, which indicate the importance of each channel, are taken from the auxiliary classifier. Finally, the deep features extracted by the encoder are adaptively fused according to the class activation weights, and the fused image is reconstructed from the fused features via the pre-trained decoder. Note that our learnable fusion rule can automatically measure the importance of each deep feature without human participation. Moreover, it fully preserves the significant features of source images such as salient targets and texture details. Extensive experiments demonstrate the superiority of our method over state-of-the-art algorithms. Visualization of feature maps and their corresponding weights reveals the high interpretability of our method.
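
As a rough illustration of fusing deep features according to per-channel importance, the sketch below mixes two feature stacks channel by channel. It assumes, for illustration only, that each source comes with one scalar score per channel (standing in for the class activation weights) and that the two scores are turned into mixing coefficients with a softmax. The abstract does not state the exact fusion formula, so both the normalization and the name `fuse_channels` are hypothetical.

```python
import math


def fuse_channels(feat_ir, feat_vis, w_ir, w_vis):
    """Fuse two feature stacks channel by channel.

    feat_ir, feat_vis: lists of channels; each channel is a flat list of floats.
    w_ir, w_vis: one importance score per channel (hypothetical stand-ins for
    class activation weights read from an auxiliary classifier).
    """
    fused = []
    for ch_ir, ch_vis, a, b in zip(feat_ir, feat_vis, w_ir, w_vis):
        # Softmax over the two scores (assumed normalization) yields
        # coefficients alpha and 1 - alpha that sum to one.
        m = max(a, b)
        ea, eb = math.exp(a - m), math.exp(b - m)
        alpha = ea / (ea + eb)
        fused.append([alpha * x + (1.0 - alpha) * y for x, y in zip(ch_ir, ch_vis)])
    return fused


if __name__ == "__main__":
    # Equal scores reduce to a plain average of the two channels.
    print(fuse_channels([[2.0, 4.0]], [[0.0, 0.0]], [0.0], [0.0]))
```

With equal scores the rule reduces to averaging; as one score dominates, the fused channel approaches the corresponding source channel.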

Abstract:
Multimodal understanding aims at constructing semantic correlations among modalities of data while performing various downstream tasks. As one of the primary multimodal downstream tasks, image-text retrieval imposes a high demand on semantic alignment because of the independent expression paradigms of images and text. Existing methods mainly construct a joint embedding space at a single granularity level (either global or local). However, such single reasoning paradigms lack granularity interaction, resulting in semantic inconsistency and cross-domain catastrophes. To address these issues, we design a novel Joint Intra and Inter-grained Network (JIIGNet), focusing on not only intra- but also inter-grained interaction between modalities by combining scene information (global) with region-level (local) instances. Specifically, we simultaneously initiate three specific alignment modules, i.e., global-grained, local-grained, and cross-grained alignment modules, followed by Triplet Attention Refinement to better refine the fused embedding at the alignment level with proper self and cross attention. For different scenarios, a Style Adaptation Head is further designed to smartly accommodate different samples. We validate JIIGNet through extensive experiments conducted on two widely used datasets, Flickr30K and MS-COCO, demonstrating the effectiveness of our proposed method.

Abstract:
Reversible data hiding in encrypted 3D models (RDHEM) is an emerging steganography technique, capable of both encrypting the cover model to ensure confidentiality and embedding additional messages for covert communication. However, the embedding capacity provided by recent RDHEM methods is still at a low level. In this paper, an adaptive vertex grouping strategy is proposed, which can divide the vertices in the cover 3D model into groups. Then, the multi-MSB prediction and Huffman coding are exploited to compress the data volume of vertices. Through proper vertex grouping and efficient data compression of the model vertices, the embedding capacity of the RDHEM can be effectively improved. Additionally, two schemes for 3D model encryption are provided. One is based on a secret sharing method over the Galois field and the other leverages the stream cipher technique. Experimental results show that the embedding capacity of the two proposed schemes significantly outperforms state-of-the-art schemes.
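
To make the multi-MSB prediction step more concrete, the sketch below counts how many most significant bits an integer value shares with a predicted reference value. In multi-MSB schemes this count is typically recorded as a label, since the shared bits can later be restored from the prediction and the vacated positions can carry payload. How the paper quantizes vertex coordinates to integers and how it chooses reference vertices within a group is not stated in the abstract, so the 8-bit width and the name `shared_msb_count` are assumptions for illustration.

```python
def shared_msb_count(value, reference, bits=8):
    """Return how many leading bits (out of `bits`) value and reference share.

    value, reference: non-negative integers, assumed to fit in `bits` bits.
    """
    mask = (1 << bits) - 1
    diff = (value ^ reference) & mask
    # The highest set bit of the XOR marks the first position, counted from
    # the MSB, where the two values disagree.
    return bits - diff.bit_length()


if __name__ == "__main__":
    # 10110100 and 10111100 agree on the top four bits only.
    print(shared_msb_count(0b10110100, 0b10111100))
```

A larger count means more bits of the value are predictable from its reference, which is why grouping vertices so that references are close to their group members would be expected to raise capacity.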

Abstract:
Today, fashion design can be readily performed by most people due to the rapid development of design tools. However, not everyone possesses the professional skills to produce an aesthetically pleasing design. To assist inexperienced users during the design process, this research explores a new fashion-related disentanglement task, with the goal of creating novel fashion items with controllable attributes. The key idea is to develop a unified framework, called CTS-GAN, by disentangling the colors, textures, and shapes of fashion items simultaneously using a generative adversarial network (GAN). Specifically, we first introduce a fashion attribute encoder to decompose input fashion items into three latent spaces, i.e., color, texture, and shape. A fashion item pattern-making module (FIPM)-based generator is then proposed to control the corresponding parameters of color and texture in FIPMs independently and combine them with the shape features to accomplish the final generation of new fashion items. Furthermore, three independent pathways are introduced to extract the representations of color, texture, and shape in fashion items to optimize our CTS-GAN in an unsupervised manner. Extensive experimental results demonstrate the effectiveness of our CTS-GAN and suggest that it can generate diverse, novel fashion images by taking full advantage of the controllability of the colors, textures, and shapes of different fashion items.

Abstract:
Despite the various privacy protection methods that are available through medical services platforms, it is still challenging for patients to achieve a desirable level of privacy protection during image sharing. Therefore, this article proposes a privacy protection mechanism, called PPM-SEM, for the secure sharing of electronic patient records (EPRs) and medical images in telemedicine; it includes two stages: Privacy preparation and privacy protection and reconstruction. In the first stage, a dual watermark (i.e., an image watermark) is generated by combining the patient's EPRs with an image, which can be utilized to ensure the security of patient identity data (i.e., a text watermark). In the second stage, a modal transformation network is constructed by training the dual watermark together as an additional channel. This network is called watermark-CycleGAN (W-CycleGAN), which can address the privacy and security issues concerning medical images and provide a double protection mechanism for EPRs. Experimental results demonstrate that only the recovery network with the correct key can restore high-quality medical images. In addition, the patient's EPRs can be fully extracted; i.e., 100% accuracy can be maintained. It is noted that the nonpaired recovery network can also recover visually meaningful medical images, thereby realizing privacy protection for patients in telemedicine scenarios.

Abstract:
Referring segmentation aims to segment a target object related to a natural language expression. Key challenges of this task are understanding the meaning of complex and ambiguous language expressions and determining the relevant regions in the image with multiple objects by referring to the expression. Recent models have focused on the early fusion with the language features at the intermediate stage of the vision encoder, but these approaches have a limitation that the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both language and vision encoders to perform the early fusion for improving the ability of the cross-modal context modeling. Unlike previous methods, our method enables the vision and language features to refer to each other's information at each stage to mutually enhance the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on the high-level features for the cross-modal alignment, we introduce a feature-based alignment scheme that enables the low-level to high-level features of the vision and language encoders to engage in the cross-modal alignment. By aligning the intermediate cross-modal features in all encoder stages, this scheme leads to effective cross-modal fusion. In this way, the proposed approach is simple but effective for referring image segmentation, and it outperforms the previous state-of-the-art methods on three public benchmarks.

Abstract:
Domain generalization (DG) aims to train a model with access to a limited number of source domains for generalizing it across various unseen target domains. The key to solving the DG problem is disentangling domain-invariant features (i.e., semantic factors) from domain-specific features (i.e., variation factors) to facilitate generalizable representation learning. Previous studies either implicitly model the semantic and variation factors or ineffectively constrain the disentangling process, thus rendering the disentanglement incomplete and ineffective. In this study, we propose a novel approach, named DualVAE, to explicitly model and disentangle both the semantic and variation factors. DualVAE is based on the variational autoencoder (VAE) architecture. However, it differs from the conventional VAE in that it consists of two paths, which explicitly model the semantic and variation factors. In addition to the reconstruction loss of VAE and the classification loss, three types of regularizations, namely statistical independence regularization, factorized prior regularization, and prediction consistency regularization, are proposed to further facilitate the disentanglement of factors. Experimental results on representative DG benchmarks show that our method performs favourably against previous state-of-the-art methods. Ablation and visualization results demonstrate that semantic and variation factors can be effectively disentangled.

Abstract:
Digital Twin (DT) technologies create digital models of physical entities, frequently in multimedia forms, which are crucial for concurrent simulation and analysis of real-world systems. In displaying DTs, Holographic-Type Communication (HTC) provides immersive multimedia access for users to interact with Holographic DTs (HDTs) by transmitting holographic data such as Light Field (LF) and other multisensory information. HDT has applications in remote education, work, and social interactions. However, the effective matching of demand and supply between HDT users and providers remains a challenge. To address this issue, we propose a hierarchical architecture that integrates the DT and HTC paradigms. This architecture incorporates a marketplace for HDT services, leveraging a formulated Double Dutch Auction (DDA) mechanism to optimize matching and pricing based on user and provider valuation. Furthermore, we employ an actor-critic-based Deep Reinforcement Learning (DRL) algorithm to train a DDA auctioneer that dynamically adjusts auction clocks during the auction process. As an alternative to the Multi-layer Perceptron (MLP), we experiment with a Deep Simplistic Variational Quantum Circuit (DSVQC) to reduce the number of parameters and enhance performance stability. Our simulations reveal that the proposed learning-based auctioneer achieves 92% optimal social welfare at a 37% auction information exchange cost for an MLP-based actor and 99% optimal social welfare at a 77% auction information exchange cost for a DSVQC-based actor.

Abstract:
Depth image-based rendering (DIBR) view synthesis is the most widely employed method in real-time free-viewpoint video (FVV) research. Despite recent progress, most DIBR-based FVV synthesis approaches are not sufficiently simple and effective in filling holes and removing artifacts. Additionally, they either use RGB-D cameras, which are difficult to adopt widely, or take considerable time to estimate high-quality depth images. This article introduces a real-time FVV synthesis system based on DIBR and a depth estimation network. This system includes a 12-view synchronous camera system, a new multistage depth estimation network, a new GPU-accelerated DIBR algorithm, and a virtual view parameter generation method. This system provides the first real-time FVV solution for background-fixed fields based on DIBR and a depth estimation network. It can infer depth images for all camera views and synthesize any virtual view along the horizontal circular arc of the camera rig in real time. To our knowledge, we are the first to introduce background models, foreground masks, and a refined multistage structure to address real-time high-quality depth estimation and DIBR FVV synthesis. We also build a high-quality multiview RGB-D synchronous dataset that has promising DIBR FVV synthesis performance to train and evaluate our system. The experimental results demonstrate that the proposed system runs in real time and achieves better performance.

Abstract:
3D scene completion (SC) has made progress in the last three years. In mobile robot applications, SC should support downstream tasks (i.e., mapping or perception) instead of only predicting the completed scenes. However, as low-cost few-beam LiDAR is widely used in mobile robots, the gap between SC and downstream tasks is large. The bottleneck in generating high-quality completion results lies in the triple sparsity of the input, the ground truth (GT) occupancy, and the GT foreground. To deal with this triple sparsity, we present an extreme sparse scene completion network (ESC-Net). First, input sparsity hides most of the spatial information of the scene, so a feature completion (FC) decoder is designed to mine spatial features using feature-level completion. Second, GT occupancy sparsity hinders representation learning of real scenes with continuous surfaces, so a multi-view multi-task attention (MMA) loss is presented to recover high-quality object boundaries by correcting the occupancy and semantic labels of regions in 3D and bird's eye view (BEV) spaces. Third, GT foreground sparsity, i.e., the imbalance between foreground and background GT labels, causes inaccurate local 3D object completion, so a combination network (ESC-Net-D) is presented to recover the 3D structural details of both foreground and background. Experiments are conducted on the KITTI and SemanticPOSS datasets. The results show that ESC-Net outperforms current methods not only on the completion task but also on downstream tasks (i.e., 3D registration and 3D object detection). Hence, we believe that ESC-Net benefits the mobile robot community. The source code will be released soon.

Abstract:
With the emergence of high-quality multimedia processing applications such as video streaming, digital editing, and archiving, intra coding rate control (RC) is becoming an indispensable technology. In this article, a frame-level intra RC scheme for Versatile Video Coding (VVC) using a Lagrange multiplier adjustment (LMA) is proposed. The VVC test model (VTM) uses an R-λ model-based rate control. However, the estimation performance of target bits based on the R-λ-QP relation is degraded because the distortion dependencies among consecutive frames are not considered, especially for intra RC. Thus, in rate-distortion optimization (RDO) based encoding, the λ values determined for given quantization parameter (QP) values should be carefully controlled to improve the target bit estimation performance. In our work, we focus on the intra RC scheme by taking advantage of particle-filtering-based prediction (PFP) for distortion estimates, so that precise per-frame λ values can be derived for an appropriate RDO process that leads to small bit fluctuations. Our extensive experimental results demonstrate that our RC scheme using the per-frame LMA is superior to the default RC (VTM-16.0rc1) method and the state-of-the-art RC methods with significant margins of average 15.57%, 15.31% and 31.13% improvements in terms of the normalized root mean square error (NRMSE) for All Intra (AI) configuration of VVC, respectively.

Abstract:
In this paper, we propose a novel optical flow guided network with progressive frequencies learning, achieving promising dynamic multi-exposure image fusion. Specifically, the proposed method consists of the optical flow alignment block and the progressive frequencies fusion block, where the former is to alleviate the ghost caused by the camera and object motions, and the latter is dedicated to synthesizing the desired color and details. First, in the optical flow alignment block, we estimate the optical flow between two source images and utilize the deformable convolutional network to achieve spatial alignment guided by the estimated optical flow. Second, in the progressive frequencies fusion block, color correction and details preservation are implemented in two gradual phases, i.e., low and high frequencies. For the low frequency fusion phase, we combine the convolutional neural network and swin transformer to capture local and global features, so as to consider the color correction from a complete perspective. For the high frequency fusion phase, an attention gate is designed to evaluate the important details from source images, bringing fewer artifacts and halos. Finally, the low and high frequency fusion phases are connected through a residual mapping strategy to generate a desired image with reasonable colors and rich details. Extensive experiments on publicly available datasets reveal that our method outperforms the state-of-the-art for both static and dynamic scenarios. Moreover, our method is superior in running efficiency over most of the state-of-the-art methods.

Abstract:
To increase the generalization capability of VQA systems, many recent studies have tried to de-bias spurious language or vision associations that shortcut the question or image to the answer. Despite these efforts, the literature fails to address the confounding effect of vision and language simultaneously. As a result, when existing methods reduce the bias learned from one modality, they usually increase the bias from the other. In this paper, we first model a confounding effect that causes language and vision bias simultaneously, then propose a counterfactual inference to remove the influence of this effect. A model trained with this strategy can concurrently and efficiently reduce vision and language bias. To the best of our knowledge, this is the first work to reduce biases resulting from confounding effects of vision and language in VQA, leveraging causal explain-away relations. We accompany our method with an explain-away strategy that improves the accuracy on questions with numerical answers, which has been an open problem for existing methods. The proposed method outperforms the state-of-the-art methods on the VQA-CP v2 dataset.

Abstract:
Current cross-modal retrieval methods rely heavily on accurate semantic labels or sample similarity measurements and need to search for the nearest samples among all samples in a huge search space, which severely limits their application to large-scale and high-dimensional multimodal data. To tackle these issues, this paper proposes an unsupervised cross-modal retrieval method, named unsupervised dual hashing coding (UDC), that bypasses semanticwise supervision and samplewise similarity from the standpoint of featurewise matching. It jointly learns dual hashing codes on semantic tagging and sample content by factorizing a feature matching potential, which bridges the semantic and heterogeneous gaps among different modalities simultaneously by maintaining the inter-modality-consistent semantic information and cross-modality-correlated sample content. In this way, each sample is uniquely coded by a head code on semanticwise tags and tail codes on samplewise content. The dual coding design makes sample retrieval very efficient, because the query sample only needs to be compared with samples that have the same semantic tag, which greatly narrows down the search space. The proposed model avoids the calculation of massive samplewise similarity and works with a dual hashing coding scheme, achieving a twofold efficiency enhancement for analyzing large-scale and high-dimensional multimodal data. Extensive experiments demonstrate its superiority in computational time and retrieval performance.
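
The way a head code narrows the search space before tail codes are compared can be sketched as a two-stage lookup. The example below assumes that codes are plain integers used as bit strings, that the head code acts as a bucket key, and that tail codes are ranked by Hamming distance; none of these details are given in the abstract, and the names `build_index` and `retrieve` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def build_index(items):
    """Group samples by head code.

    items: iterable of (sample_id, head_code, tail_code) tuples.
    """
    index = defaultdict(list)
    for sample_id, head, tail in items:
        index[head].append((sample_id, tail))
    return index


def retrieve(index, head, tail, k=3):
    """Return up to k sample ids from the bucket that matches the query head
    code, ranked by Hamming distance between tail codes (ties by id)."""
    bucket = index.get(head, [])
    ranked = sorted(bucket, key=lambda entry: (hamming(entry[1], tail), entry[0]))
    return [sample_id for sample_id, _ in ranked[:k]]


if __name__ == "__main__":
    idx = build_index([("a", 0b01, 0b1100), ("b", 0b01, 0b1111), ("c", 0b10, 0b1100)])
    print(retrieve(idx, 0b01, 0b1111, k=2))
```

Only the samples in the matching bucket are scored, so the cost of a query depends on the bucket size instead of the size of the whole database.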

Abstract:
The degradation of printed photographs due to inadequate preservation is a major problem that can be addressed through deep learning-based restoration methods. However, these methods are often limited by their reliance on annotated data, making them less effective for new domains with limited training samples. In this paper, we propose a semi-supervised old photo restoration network that employs a continuous important sample mining strategy. Specifically, we explore the learning potential of limited data from three aspects: correcting imbalanced data distribution, assigning significant pseudo labels, and learning from unlabeled data. First, we coordinate a random mask augmented strategy with the Double-consistency Alignment method to address the unbalanced damaged category (scratched damage is more prevalent than other artifact types). Second, we develop a novel Perceptual-aware Pseudo-label Propagation method that selects initial recovered results as reliable pseudo-labels to continuously expand the sample pool. Lastly, we propose a Damage-augmented Contrastive Learning method that constructs positive, anchor, and negative samples within a semi-supervised framework to mine correlations of unlabeled data more effectively. To evaluate our approach, we introduce the Old Photo Detection Dataset (OPDD) and the Old Photo Restoration Dataset (OPRD), both of which consist of 563 (6,179 augmented) photo pairs recovered by professional artists. Our extensive experiments show that our approach significantly outperforms existing methods. Furthermore, we demonstrate the effectiveness of our approach by training an external old photographic plate restoration network using the deuterogenic old photographic film dataset and obtaining promising results.

Abstract:
Road segmentation is an essential component of navigation systems. Despite recent advancements in road segmentation, segmentation failures remain inevitable. For safety-critical tasks, e.g., navigation, knowing when and where road segmentation fails is crucial. In this paper, we propose a novel trusted road segmentation architecture, namely Multimodal Multitask Collaborative Revision Network (M2CRN), to improve the trustworthiness of road segmentation. Our approach incorporates two strategies to predict and rectify segmentation errors. First, a joint learning framework is devised to generate road segmentation results while estimating failure segmentation masks. Second, the road segmentation branch is equipped with an Uncertainty-Aware Revision Module (UARM), which eliminates errors in road segmentation. Additionally, we suppress the response of error regions in the road segmentation branch with an innovative design called Adaptive Soft Error Suppression (ASES). To validate our methods, extensive experiments are conducted on three benchmark road segmentation datasets. The results demonstrate significant performance improvements with a real-time inference speed of 33.3 FPS, reaffirming the soundness of our revision model.

Abstract:
We introduce GS-SFS, a method that utilizes a camera array with wide baselines for high-quality multiple human mesh reconstruction in large-scale sports scenes. Traditional human reconstruction methods in sports scenes, such as Shape-from-Silhouette (SFS), struggle with sparse camera setups and small human targets, making it challenging to obtain complete and accurate human representations. Despite advances in differentiable rendering, including 3D Gaussian Splatting (3DGS), which can produce photorealistic novel-view renderings with dense inputs, accurate depiction of surfaces and generation of detailed meshes is still challenging. Our approach uniquely combines 3DGS's view synthesis with an optimized SFS method, thereby significantly enhancing the quality of multiperson mesh reconstruction in large-scale sports scenes. Specifically, we introduce body shape priors, including the human surface point clouds extracted through SFS and human silhouettes, to constrain 3DGS to a more accurate representation of the human body only. Then, we develop an improved mesh reconstruction method based on SFS, mainly by adding additional viewpoints through 3DGS and obtaining a more accurate surface to achieve higher-quality reconstruction models. We implement a high-density scene resampling strategy based on spherical sampling of human bounding boxes and render new perspectives using 3D Gaussian Splatting to create precise and dense multi-view human silhouettes. During mesh reconstruction, we integrate the human body's 2D Signed Distance Function (SDF) into the computation of the SFS's implicit surface field, resulting in smoother and more accurate surfaces. Moreover, we enhance mesh texture mapping by blending original and rendered images with different weights, preserving high-quality textures while compensating for missing details. The experimental results from real basketball game scenarios demonstrate the significant improvements of our approach for multiple human body model reconstruction in complex sports settings.

Affiliations: Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Wangxuan Institute of Computer Technology, Peking University, Beijing, China; Energy Research Institute @ NTU, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore; College of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China; Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China

Abstract:
Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, they often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features using Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack spatio-temporal modeling to explore the semantics of the core persons/objects. To solve the problem of modeling the core target's behavior, in this paper, we propose a new perspective for addressing the VSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises a cross-modal targets generator for producing multi-modal templates and search space, a memory-based tracker for dynamically tracking multi-modal targets using a memory network to record targets' behaviors, and a masked attention localizer which learns local shared features between frames and eliminates interference from long-term dependencies, resulting in improved accuracy when localizing the moment. To evaluate the performance of our VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks, including Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, our VSTMM achieves leading performance with a considerable real-time speed.

Abstract:
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world. In the computer vision area, many researchers focus on co-segmentation (CoS), co-saliency detection (CoSD) and video salient object detection (VSOD) to discover the co-occurrent objects. However, previous approaches design different networks for these similar tasks separately, and they are difficult to apply to each other. Besides, they fail to take full advantage of the cues among inter- and intra-feature within a group of images. In this paper, we introduce a unified framework to tackle these issues from a unified view, termed UFGS (Unified Framework for Group-based Segmentation). Specifically, we first introduce a transformer block, which views the image feature as a patch token and then captures their long-range dependencies through the self-attention mechanism. This can help the network to excavate the patch-structured similarities among the relevant objects. Furthermore, we propose an intra-MLP learning module to produce a self-mask that helps the network avoid partial activation. Extensive experiments on four CoS benchmarks (PASCAL, iCoseg, Internet, and MSRC), three CoSD benchmarks (Cosal2015, CoSOD3k, and CocA) and five VSOD benchmarks (DAVIS_16, FBMS, ViSal, SegV2, and DAVSOD) show that our method outperforms other state-of-the-art methods on three different tasks in both accuracy and speed by using the same network architecture, which can reach 140 FPS in real-time.

Abstract:
With the emergence of multi-view cameras and commercial Light Field (LF) cameras, a high-performance LF quality evaluator is of great significance for guiding LF acquisition, processing, and application, and for further improving the perceived visual quality of LF visualizations. However, LF Images (LFIs), as high-dimensional data, suffer from various quality degradations not only in the spatial domain but also in the angular domain. Therefore, it is very challenging to predict LF quality accurately. An effective LF evaluator should be able to represent these heterogeneous artifacts. In this paper, we provide a novel No-Reference LF Quality Assessment Evaluator (NR LF-QAE) to tackle this problem. Firstly, to measure angular consistency among viewports, we utilize group-based representations to characterize the information similarity of aligned view stacks. Secondly, to better describe the texture information of LFIs, unifying spatial-angular texture statistic measurement is performed via Local Binary Patterns from Three Orthogonal Planes (LBP-TOP). Thirdly, we design 3D Log-Gabor filters to extract LF global structure information in Sub-Aperture Images (SAIs) as spatial feature characterizations and 2D Log-Gabor filters are adopted to characterize ray direction/depth information in Epipolar Plane Images (EPIs) as angular feature characterizations. By comprehensive LF information analyses in angular consistency and spatial-angular feature extraction with texture and structure descriptors, experimental results demonstrate the superiority of the proposed NR LF-QAE over the state-of-the-art comparative models in predicting the quality of LFIs on three available benchmark databases.

Abstract:
Sand-dust videos obtained in a low-light environment are characterized by low contrast, nonuniform illumination, color cast, and considerable noise. To realize sand-dust removal and brightness enhancement simultaneously, this article proposes an online low-light sand-dust video enhancement method using adaptive dynamic brightness correction and a rolling guidance filter. The proposed dual-threshold interframe detection strategy involves two methods to treat low-light sand-dust video frames. The first method involves two components: an adaptive dynamic brightness correction algorithm to correct the color deviation of the low-light video frame and improve its brightness and a rolling guidance filter combined with guided image filtering to enhance the frame details. The second method enhances the quality of the incoming frame by reducing the amount of calculation. The first frame of the video is processed using the first method. The processing method of each subsequent frame is determined according to its interframe detection value with the buffer frame. Through qualitative and quantitative comprehensive experiments on low-light sand-dust images and videos, the performance of the proposed method is compared with those of state-of-the-art methods. The proposed method for frame quality improvement achieves the best visual effect in enhancing the quality of low-light sand-dust images, as indicated by the best objective evaluation indicators. Moreover, compared with the framewise enhancement method, the video processing efficiency associated with the dual-threshold interframe detection strategy is 2.77 times higher.

Abstract:
Quality of experience (QoE) has been widely recognized as the primary metric to evaluate user experience in multimedia applications. However, the QoE assessment of tactile virtual environments is still highly dependent on subjective measures. Inspired by the fact that physiological signals can characterize the user's emotional state, we propose a QoE measurement method for virtual reality (VR) with vibrotactile feedback based on frontal lobe power asymmetry (FLPA). The subjective score of vibrotactile experience in VR is used as the ground truth of QoE. The selection of QoE measurement indicators consists of two steps. First, the relationship between FLPA phenomenon and scores of QoE is preliminarily established by statistical methods and Spearman Correlation Coefficient. Then, the most important FLPA feature is selected by random forest, which is the best indicator for QoE measurement. The brain neural images show that vibrotactile feedback in VR can evoke FLPA phenomenon. Correlation analysis shows that there is a significant correlation between subjective scores of QoE and FLPA features. The classification results show that the selected best FLPA feature can be used as a physiological indicator to measure and predict QoE. We achieve mutual interpretation of EEG-based physiological measurements and subjective cognitive outcomes of QoE.
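The first selection step above relies on the Spearman correlation coefficient between FLPA features and subjective QoE scores. As a minimal sketch of that statistic only (the function names `ranks`, `pearson`, and `spearman` are illustrative, and the abstract does not describe the authors' actual pipeline), Spearman correlation can be computed as the Pearson correlation of average ranks:

```python
def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))
```

A hypothetical call such as `spearman(flpa_feature, qoe_scores)` would then give one coefficient per candidate feature; the random-forest importance step mentioned in the abstract is not reproduced here.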

Abstract:
Inspired by certain optimization solvers, the deep unfolding network (DUN) has attracted much attention in recent years for image compressed sensing (CS). However, there still exist the following two issues: 1) In existing DUNs, most hyperparameters are usually content independent, which greatly limits their adaptability for different input contents. 2) In each iteration, a plain convolutional neural network is usually adopted, which weakens the perception of wider context prior and therefore depresses the expressive ability. In this article, inspired by the traditional Proximal Gradient Descent (PGD) algorithm, a novel DUN for image compressed sensing (dubbed DUN-CSNet) is proposed to solve the above two issues. Specifically, for the first issue, a novel content adaptive gradient descent network is proposed, in which a well-designed step size generation sub-network is developed to dynamically allocate the corresponding step sizes for different textures of input image by generating a content-aware step size map, realizing a content-adaptive gradient updating. For the second issue, considering the fact that many similar patches exist in an image but have undergone a deformation, a novel deformation-invariant non-local proximal mapping network is developed, which can adaptively build the long-range dependencies between the nonlocal patches by deformation-invariant non-local modeling, leading to a wider perception on context priors. Extensive experiments manifest that the proposed DUN-CSNet outperforms existing state-of-the-art CS methods by large margins.
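For orientation, the classical Proximal Gradient Descent iteration that such unfolding networks start from can be sketched as a gradient step on the data term followed by a proximal step. The sketch below is not DUN-CSNet: it assumes a scalar step size and a soft-thresholding proximal operator (the ISTA special case), whereas the abstract replaces these with a learned content-aware step size map and a learned non-local proximal network. All names (`pgd_step`, `Phi`, `lam`) are illustrative.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied element-wise."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]


def pgd_step(x, y, Phi, step, lam):
    """One PGD iteration for min_x 0.5 * ||Phi x - y||^2 + lam * ||x||_1."""
    residual = [p - q for p, q in zip(matvec(Phi, x), y)]
    grad = matvec(transpose(Phi), residual)          # gradient of the data term
    r = [xi - step * g for xi, g in zip(x, grad)]    # gradient descent step
    return soft_threshold(r, step * lam)             # proximal mapping step
```

In a deep unfolding network, a fixed number of such iterations becomes the layers of the model, and the hand-chosen `step` and proximal operator are replaced by trainable modules.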

Abstract:
Due to the limitation of commercial light field camera hardware devices, the imaging field of view is quite narrow. Numerous Light Field Image (LFI) stitching algorithms have been developed to expand the field of view. However, it is highly challenging to compare the performance of LFI stitching algorithms in a fair manner. Currently, due to the absence of a comprehensive benchmark database for subjective rating and a reliable objective quality metric, it is fairly difficult to comprehensively and accurately compare the actual performance of existing LFI stitching algorithms. In this study, we dedicate our efforts to the development of quality metrics for stitched Wide field of view LFI (WLFI) from subjective and objective assessment aspects. Specifically, we build the first stitched WLFI database, which provides the stitched WLFIs generated by eight representative LFI stitching algorithms, along with their corresponding subjective rating scores. Secondly, an effective blind stitched WLFI quality metric is developed to accurately assess the visual quality degradation. Extensive experiments conducted over our established WLFI database demonstrate that the proposed metric achieves higher consistency with subjective ratings than the competing quality metrics.

Abstract:
In moment-based watermarking schemes, the accuracy of the moments is crucial for constructing robust watermarking schemes. The robustness of the watermarking scheme relies heavily on the proper representation of the moments. Despite the importance, current theoretical research on accuracy is very limited in watermarking techniques. To this end, we propose a novel robust image watermarking scheme based on accurate polar harmonic Fourier moments (PHFMs). Specifically, the accurate PHFMs computation based on polar pixel tiling with nearest neighbor interpolation (PPTN) is designed. This computation is general and used for embedder and extractor. This ingenious design eliminates geometric and numerical integration errors and also avoids the distortion interaction caused by watermarks. Also, an improved quantization strategy is applied to the embedding process, and satisfactory imperceptibility is obtained. The watermark is extracted without the host image. The experimental results show the excellent robustness of the proposed watermarking scheme to common image processing attacks, geometric attacks, and some kinds of compound attacks. The proposed scheme is superior to the state-of-the-art image watermarking schemes.

Abstract:
With the development of data security and privacy requirements in the field of cloud computing, Reversible Data Hiding in Encrypted Images (RDHEI) in the encryption domain has received increasing attention. In order to take full advantage of the spatial and textural features of the original image, reversible data hiding in encrypted images with adaptive Huffman code based on Dynamic Prediction Axes (RDHEI-HDA) is proposed. First, the prediction errors of the original plaintext image are calculated according to the multidirectional median edge detector (M-MED) combined with the Dynamic Prediction Axes, which are generated by the spatial correlation of the original image. After encryption with the stream cipher, the adaptive Huffman coding labeling rule is created for pixel labeling and classification according to the Dynamic Prediction Axes and the distribution of prediction errors. Finally, bit substitution is employed to insert secret data and side information into the image. Compared with most of the state-of-the-art RDHEI methods, the experimental results show that the RDHEI-HDA method provides a higher pure payload while ensuring safety.
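As background for the prediction step, the classic single-direction median edge detector (MED, the predictor used in JPEG-LS) can be sketched as follows. This is only the standard building block: the multidirectional variant and the Dynamic Prediction Axes described in the abstract are not specified there and are not reproduced. The function names and the choice to skip the first row and column are assumptions made for illustration.

```python
def med_predict(left, above, upper_left):
    """Classic MED prediction from three causal neighbours of a pixel."""
    if upper_left >= max(left, above):
        return min(left, above)
    if upper_left <= min(left, above):
        return max(left, above)
    return left + above - upper_left


def prediction_errors(image):
    """Prediction errors for all pixels that have the three neighbours.

    `image` is a list of rows of integer pixel values; the first row and
    first column are skipped because they serve as reference pixels here.
    """
    errors = {}
    for r in range(1, len(image)):
        for c in range(1, len(image[r])):
            pred = med_predict(image[r][c - 1], image[r - 1][c], image[r - 1][c - 1])
            errors[(r, c)] = image[r][c] - pred
    return errors
```

Small prediction errors are concentrated around zero in smooth regions, which is what makes a Huffman-coded labeling of pixels by error magnitude compact.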

Abstract:
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this article, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs once the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.

Abstract:
Exemplar-based anime line art colorization of the same character has been a challenging problem in digital art production because of the sparse representation of line images and the significantly different anime appearance between line and color images. Therefore, it is a fundamental problem to find semantic correspondence between two kinds of images. In this paper, we propose a correspondence learning Transformer network for exemplar-based line art colorization, called ArtFormer, which utilizes a Transformer-based architecture to learn both spatial and visual relationships between line art and color images. ArtFormer mainly consists of two parts: correspondence learning and high-quality image generation. In particular, the correspondence learning module is composed of several Transformer blocks, each of which formulates the deep line image features and color image features as queries and keys, and learns the dense correspondence between two image domains. Then, the network synthesizes high-quality images with a newly proposed Spatial Attention Adaptive Normalization (SAAN) that uses warped deep exemplar features to modulate the shallow features for better adaptive normalization parameters generation. Both qualitative and quantitative experiments show that our method achieves the best performance on exemplar-based line art colorization compared with state-of-the-art methods and other baselines.

Abstract:
Hyperspectral images (HSIs) and light detection and ranging (LiDAR) are two critical and frequently used types of remote sensing data, each containing rich spectral and elevation information. Fusing HSI and LiDAR can exploit the complementary properties of the two modalities for ground object classification. The performance of existing fusion classification methods is often limited by the difficulty of adapting feature extraction operators to complex spatial distributions, and the correlation and specificity between different modalities are not reasonably exploited. Therefore, the reinforcement learning-based Markov edge decoupled fusion network (MEDFN) is proposed. This network can intelligently compose graphs based on different modal characteristics and tasks to adapt to complex spatial distributions; it can also suppress noise to complete fusion classification while fully utilizing complementary information of different modalities. First, a reinforcement learning-based graph construction subnetwork (RLGN) is proposed to learn a two-modal graph construction strategy suitable for classification tasks by transforming regular multimodal data into irregular graph data. Second, a multimodal edge attention module (MEAM) is proposed to extract edge features between spatial neighboring nodes and model the importance of each node, thereby capturing the spatial topology information encompassed in the multimodal data. Finally, the decoupled multimodal fusion module (DMFM) is proposed to decouple multimodal features into shared and unshared parts and enhance the model's ability to distinguish features by targeting the modal-shared feature between modalities and modal-specific feature. The experimental results based on three well-known HSI and LiDAR datasets demonstrate the effectiveness of the proposed MEDFN in fusion classification tasks.

Abstract:
With the continuous evolution of networking technologies, multi-modal services that involve video, audio, and haptic contents are expected to become the dominant multimedia service in the near future. Edge caching is a key technology that can significantly reduce network load and content transmission latency, which is critical for the delivery of multi-modal contents. However, existing caching approaches only rely on a limited number of factors, e.g., popularity, to evaluate their importance for caching, which is inefficient for caching multi-modal contents, especially in dynamic network environments. To overcome this issue, we propose a content importance-based caching scheme which consists of a content importance evaluation model and a caching model. By leveraging dueling double deep Q networks (D3QN) model, the content importance evaluation model can adaptively evaluate contents' importance in dynamic networks. Based on the evaluated contents' importance, the caching model can easily cache and evict proper contents to improve caching efficiency. The simulation results show that the proposed content importance-based caching scheme outperforms existing caching schemes in terms of caching hit ratio (at least 15% higher), reduced network load (up to 22% reduction), average number of hops (up to 27% lower), and unsatisfied requests ratio (more than 47% reduction).
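The D3QN model named above combines two well-known ideas: a dueling head that splits Q-values into a state value and per-action advantages, and a double-DQN target that selects the next action with the online network but evaluates it with the target network. A minimal sketch of those two formulas follows; how states, actions (for example, cache or evict decisions), and rewards are defined for caching is not given in the abstract, so the arguments below are plain placeholder numbers and the function names are illustrative.

```python
def dueling_q(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) (mean-subtracted dueling form)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double-DQN bootstrap target for one transition.

    The online network's Q-values pick the greedy next action; the target
    network's Q-value for that action is used in the bootstrap term.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

Decoupling action selection from action evaluation in this way is what reduces the overestimation bias of plain Q-learning targets.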

Abstract:
Weakly Supervised Temporal Action Localization (WTAL) aims to identify the temporal duration of actions and classify the action categories with only video-level labels in the training stage. Motivated by the intuition that the attention maps generated from various views will assist in enhancing the foreground action temporal segments, in this paper we propose a WTAL pipeline based on a novel attention mechanism that effectively integrates global and local knowledge. Our attention mechanism is mainly composed of a global attention branch and a local attention branch. Specifically, the global attention branch is built on the inter-segment similarity to sparsely mine out the correlation knowledge within the entire video, while the local attention branch is built on the convolutional structure to densely aggregate the information within the fixed local receptive field. Experiments on THUMOS14 and ActivityNet v1.3 datasets demonstrate the effectiveness of our proposed WTAL pipeline compared to state-of-the-art methods.

Abstract:
Image understanding and analysis primarily rely on object appearances. However, when faced with challenges such as occlusion, camouflage, and small targets in surveillance scenarios, the discriminative features of objects cannot be effectively extracted. This paper proposes the utilization of hyperspectral imaging techniques as a solution to these issues, aiming to unlock new potential in the field. Hyperspectral images have the unique capability to identify a variety of materials, thereby offering a distinct advantage in surveillance. However, existing hyperspectral image datasets are not specifically tailored for image classification tasks within surveillance scenarios. To address this issue, we introduce an innovative hyperspectral image dataset designed explicitly for real-world surveillance, with the goal of setting a new benchmark for material classification. Our aspiration extends beyond merely deploying deep learning methods for hyperspectral material classification, aiming to contribute insightful understanding of spectral patterns inherent in natural surveillance scenes. The proposed dataset is currently the largest of its kind and the first one designed specifically for surveillance scenarios. It encompasses 128 spectral bands and provides annotations for 28 common material categories. Furthermore, we introduce a novel texture-metric-based spatial and spectral fusion network, meticulously crafted to accommodate our unique scenario and dataset. This model significantly outperforms existing networks in enhancing the fusion of spatial and spectral features, achieving state-of-the-art results on both our proposed dataset and existing public hyperspectral image classification datasets.

Abstract:
Hashing-based fine-grained image retrieval pursues learning diverse local features to generate inter-class discriminative hash codes. However, existing fine-grained hash methods with attention mechanisms usually tend to just focus on a few obvious areas, which misguides the network to over-fit some salient features. Such a problem raises two main limitations. 1) It overlooks some subtle local features, degrading the generalization capability of learned embedding. 2) It causes the over-activation of some hash bits correlated to salient features, which breaks the binary code balance and further weakens the discrimination abilities of hash codes. To address these limitations of the over-fitting problem, we propose a novel hash framework from Causal Feature learning to Binary-injected Hash learning (CFBH), which captures various local information and suppresses over-activated hash bits simultaneously. For causal feature learning, we adopt causal inference theory to alleviate the bias towards the salient regions in fine-grained images. In detail, we obtain local features from the feature map and combine this local information with original image information following this theory. Theoretically, these fused embeddings help the network to re-weight the retrieval effort of each local feature and exploit more subtle variations without observational bias. For binary-injected hash learning, we propose a Binary Noise Injection (BNI) module inspired by Dropout. The BNI module not only mitigates over-activation to particular bits, but also makes hash codes uncorrelated and balanced in the Hamming space. Extensive experimental results on six popular fine-grained image datasets demonstrate the superiority of CFBH over several state-of-the-art methods.

Abstract:
Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.

Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China; Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui Provincial Key Laboratory of Security Artificial Intelligence, School of Artificial Intelligence, Anhui University, Hefei, China; Video Investigation Detachment of Hefei Public Security Bureau, Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, School of Computer Science and Technology, Anhui University, Hefei, China

Abstract:
Since the standard license plate of large vehicles is easily affected by occlusion and stain, the traffic management department introduces the enlarged license plate at the rear of large vehicles to assist license plate recognition. However, current research regards standard license plate recognition and enlarged license plate recognition as independent tasks, and does not take advantage of the complementary benefits from the two types of license plates. In this work, we propose a new computer vision task called collaborative license plate recognition, aiming to leverage the complementary advantages of standard and enlarged license plates for achieving more accurate license plate recognition. To achieve this goal, we propose an Association Enhancement Network (AENet), which achieves robust collaborative license plate recognition by capturing the correlations between characters within a single license plate and enhancing the associations between two license plates. In particular, we design an association enhancement branch, which supervises the fusion of two license plate information using the complete license plate number to mine the association between them. To enhance the representation ability of each type of license plates, we design an auxiliary learning branch in the training stage, which supervises the learning of individual license plates in the association enhancement between two license plates. In addition, we contribute a comprehensive benchmark dataset called CLPR, which consists of a total of 19,782 standard and enlarged license plates from 24 provinces in China and covers most of the challenges in real scenarios, for collaborative license plate recognition. Extensive experiments on the proposed CLPR dataset demonstrate the effectiveness of the proposed AENet against several state-of-the-art methods.

Abstract:
Despite many advances in Human Activity Recognition (HAR), most existing works are conducted with supervision. Supervised methods rely on labeled training data. However, obtaining labeled data is difficult, costly, and time-consuming. In this paper, we introduce an automatic multi-objective particle swarm optimization clustering based on Gaussian mutation and game theory (MOPGMGT) to provide fully unsupervised human activity discovery. Furthermore, we map the multi-objective clustering problem to game theory to get the best solution. The proposed algorithm can accurately find the number of activities without any prior knowledge. Multi-objective optimization problems typically cannot have a single optimal solution. We solve this problem by applying Nash Equilibrium (NE) to the Pareto front as the decision-making for choosing the best solution. NE does not just look for the best solution but tries to optimize the final solution by considering the effect of choosing each of the solutions as the best solution on the other solutions and one with the best impact is chosen. Moreover, a Gaussian mutation is applied on the Pareto front to avoid premature convergence. As far as we know, this is the first time that human activity discovery is performed fully unsupervised, and a multi-objective PSO is mapped to the game theory space for finding the best solution. Experiments on six challenging human activity datasets demonstrate the capability of the proposed approach in achieving the best accuracy in human activity discovery and determining the optimal number of clusters. In comparison to well-known multi-objective algorithms, the MOPGMGT significantly improves the clustering outcomes on six benchmark clustering datasets.
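The Pareto front mentioned above is the set of candidate solutions that no other candidate dominates. A minimal sketch of extracting it is given below, assuming every objective is to be minimized and each candidate is a tuple of objective values (both assumptions, since the abstract does not state the clustering objectives). The Nash Equilibrium selection and the Gaussian mutation are not reproduced.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]
```

For example, among the candidates `(1, 5)`, `(2, 2)`, `(3, 3)`, and `(5, 1)`, the point `(3, 3)` is dominated by `(2, 2)`, so the front consists of the other three; a decision rule such as the NE-based one in the abstract is then needed to pick a single solution from that set.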

Abstract:
Reversible data hiding in encrypted domain (RDH-ED) can perform data encryption to fulfill the privacy protection of original media and embed additional data for covert communication or access control. However, current research focuses on encrypted images, and little attention is paid to encrypted three-dimensional (3D) models. In this article, a high capacity separable RDH-ED method for encrypted 3D models is proposed based on octree spatial subdivision and multiple most significant bit (multi-MSB) prediction. Firstly, a 3D model is adaptively subdivided into non-overlapping subblocks by octree spatial subdivision, and the vertices in a subblock are classified into embedding set and reference set. To better utilize the spatial correlation of the two sets, the multi-MSB prediction error of the embedding set is used to embed the additional data, and the reference set is used to losslessly recover the embedding set. Then, the model is encrypted by a specified encryption algorithm. At last, additional data is embedded into the reserved embedding room by multi-MSB substitution. Experimental results show that the proposed method can achieve a higher embedding capacity compared with the state-of-the-art methods, and guarantee the lossless recovery of the 3D model.
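The general idea behind multi-MSB prediction can be illustrated on a single coordinate. The sketch below assumes coordinates have already been quantized to fixed-length unsigned integers and that a prediction is available from reference vertices; both are assumptions, and the octree subdivision and encryption stages are omitted. If the top `t` bits of a value agree with its prediction, those bits can be overwritten with payload and later restored from the prediction. All function names are hypothetical.

```python
def matching_msb_count(value, prediction, bits=8):
    """Number of leading bits (from the MSB) on which value and prediction agree."""
    assert 0 <= value < (1 << bits) and 0 <= prediction < (1 << bits)
    return bits - (value ^ prediction).bit_length()


def embed_msbs(value, payload, t, bits=8):
    """Overwrite the top t bits of value with a t-bit payload."""
    low_mask = (1 << (bits - t)) - 1
    return (payload << (bits - t)) | (value & low_mask)


def extract_and_recover(marked, prediction, t, bits=8):
    """Read the payload back and restore the original value from the prediction."""
    low_mask = (1 << (bits - t)) - 1
    payload = marked >> (bits - t)
    original = (prediction & ~low_mask & ((1 << bits) - 1)) | (marked & low_mask)
    return payload, original
```

Recovery is lossless only when `t` does not exceed `matching_msb_count(value, prediction)`, which is why the per-vertex count has to be recorded or bounded as side information in schemes of this kind.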

Abstract:
One-stage Referring Expression Comprehension (REC) is a task that requires accurate alignment between text descriptions and visual content. In recent years, numerous efforts have been devoted to cross-modal learning for REC, while the influence of other factors in this task still lacks a systematic study. To fill this gap, we conduct an empirical study in this article. Concretely, we ablate 42 candidate designs/settings based on a common REC framework, and these candidates cover the entire process of one-stage REC from network design to model training. Afterwards, we conduct over 100 experimental trials on three REC benchmark datasets. The extensive experimental results reveal the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation. Based on these findings, we further propose a simple yet strong model called SimREC, which achieves new state-of-the-art performance on these benchmarks. Beyond this progress, we also find that with much less training overhead and parameters, SimREC can achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, portraying the special role of REC in existing V&L research.

Abstract:
Recently, reversible data hiding in encrypted images (RDHEI) has received widespread attention from researchers. To embed a high payload into encrypted images while maintaining sufficient security, this article proposes a novel RDHEI algorithm that combines compression of consecutive zero-valued high bit-planes, bit-plane swapping, and block rearrangement. The proposed method is the first to compress global zero-valued high bit-planes in a block-wise manner and to adaptively allocate different Huffman indicators based on the occurrence frequency of zero-valued bit-planes, which yields a higher embedding payload. Unlike existing RDHEI methods, which embed unencrypted auxiliary information and therefore offer low security, the proposed bit-plane swapping and block rearrangement cluster all embeddable bit-planes together. This enables most auxiliary information to be encrypted, which enhances security and facilitates data embedding and extraction. The experimental results demonstrate that the proposed method outperforms some state-of-the-art RDHEI methods in terms of security and payload. The average payloads of the proposed method on two public datasets, BOSSbase and BOWS-2, are 3.793 bpp and 3.705 bpp, respectively.
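
To make the notion of consecutive zero-valued high bit-planes concrete, here is a minimal sketch that counts them per block of an 8-bit grayscale image. It shows only the counting step; the Huffman indicators, bit-plane swapping, and block rearrangement are not modeled. The function names, the list-of-lists image layout, and the assumption that the image dimensions are multiples of the block size are all hypothetical choices for this example.

```python
def zero_high_bitplanes(block, bits=8):
    """Number of consecutive all-zero bit-planes, counted from the MSB down.

    `block` is a non-empty list of rows of non-negative integers below 2**bits.
    """
    peak = max(max(row) for row in block)
    # Every plane above the highest set bit of the largest pixel is all zero.
    return bits - peak.bit_length()


def blockwise_counts(image, size, bits=8):
    """Apply zero_high_bitplanes to each non-overlapping size x size block.

    Assumes the image height and width are multiples of `size`.
    """
    counts = []
    for top in range(0, len(image), size):
        row_counts = []
        for left in range(0, len(image[0]), size):
            block = [row[left:left + size] for row in image[top:top + size]]
            row_counts.append(zero_high_bitplanes(block, bits))
        counts.append(row_counts)
    return counts


image = [
    [0, 1, 200, 3],
    [2, 3, 4, 5],
    [0, 0, 16, 17],
    [0, 0, 18, 19],
]
print(blockwise_counts(image, size=2))
```

Blocks with small pixel values report many empty high planes, which is the redundancy a block-wise compression scheme of this kind can exploit.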

Abstract:
Current video-based scene graph generation (VidSGG) methods have been found to perform poorly in predicting predicates that are less represented due to the inherently biased distribution of the training data. In this paper, we take a closer look at the inherent characteristics of predicates and identify that most visual relations (e.g. sit_above) involve both actional pattern (sit) and spatial pattern (above), while the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm to address the intractable visual relation prediction from the pattern-level perspective. Specifically, DLL decouples the predicate labels and adopts separate classifiers to learn actional and spatial patterns respectively. The patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method to transfer non-target knowledge from head predicates to tail predicates within the same pattern to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark, i.e. VidVRD. Extensive experiments demonstrate that the DLL offers a remarkably simple but highly effective solution to the long-tailed problem, achieving the state-of-the-art VidSGG performance.
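
The decoupling idea can be sketched with a toy example: each predicate is mapped to an actional pattern and a spatial pattern, two separate classifiers score the patterns, and the scores are recombined per predicate. The vocabulary below and the choice of multiplying the two probabilities are assumptions for illustration only; the abstract does not specify how the patterns are combined.

```python
# Hypothetical toy vocabulary: predicate -> (actional pattern, spatial pattern).
PREDICATE_PATTERNS = {
    "sit_above": ("sit", "above"),
    "sit_behind": ("sit", "behind"),
    "walk_above": ("walk", "above"),
    "walk_behind": ("walk", "behind"),
}


def combine(action_probs, spatial_probs):
    """Score each predicate as the product of its two pattern probabilities."""
    return {
        predicate: action_probs[action] * spatial_probs[spatial]
        for predicate, (action, spatial) in PREDICATE_PATTERNS.items()
    }


def predict(action_probs, spatial_probs):
    """Return the predicate with the highest combined score."""
    scores = combine(action_probs, spatial_probs)
    return max(scores, key=scores.get)


# Outputs of two separate (hypothetical) pattern classifiers.
action_probs = {"sit": 0.7, "walk": 0.3}
spatial_probs = {"above": 0.2, "behind": 0.8}
print(predict(action_probs, spatial_probs))
```

Because a rare predicate shares its patterns with frequent ones, each pattern classifier sees far more training examples than a classifier over whole predicates would.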

Abstract:
In current multimodal sentiment analysis, aligned and complete multimodal sequences are often crucial. Obtaining complete multimodal data in the real world presents various challenges, and aligning multimodal sequences often requires a significant amount of effort. Unfortunately, most multimodal sentiment analysis methods fail when dealing with missing modalities or unaligned multimodal sequences. To tackle these two challenges simultaneously in a simple and lightweight manner, we present the Unified Multimodal Framework (UniMF). UniMF comprises two modules. The first, the Translation Module, translates missing modalities using information from existing modalities. The second, the Prediction Module, uses the attention mechanism to fuse the multimodal information and generate predictions. To enhance the translation performance of the Translation Module, we introduce the Multimodal Generation Mask (MGM) and utilize it to construct the Multimodal Generation Transformer (MGT). The MGT can generate the missing modality while focusing on information from existing modalities. Furthermore, we introduce the Multimodal Understanding Transformer (MUT) in the Prediction Module, which includes the Multimodal Understanding Mask (MUM) and a unique sequence, MultiModalSequence (MMSeq), representing a unified multimodality. To assess the performance of UniMF, we perform experiments on four multimodal sentiment datasets, and UniMF attains competitive or state-of-the-art results with fewer learnable parameters. Furthermore, the experimental results indicate that UniMF, supported by MGT and MUT, two transformers with special attention mechanisms, can efficiently handle both the generation of missing modalities and the understanding of unaligned multimodal sequences.

Abstract:
We propose a combined generative and contrastive neural architecture for learning latent representations of 3D volumetric shapes. The architecture uses two encoder branches for voxel grids and multi-view images from the same underlying shape. The main idea is to combine a contrastive loss between the resulting latent representations with an additional reconstruction loss. That helps to avoid collapsing the latent representations as a trivial solution for minimizing the contrastive loss. A novel dynamic switching approach is used to cross-train two encoders with a shared decoder. The switching approach also enables the stop gradient operation on a random branch. Further classification experiments show that the latent representations learned with our self-supervised method integrate more useful information from the additional input data implicitly, thus leading to better reconstruction and classification performance.
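
The combination of a contrastive term with a reconstruction term can be sketched as follows. This is a generic InfoNCE-style loss plus mean squared error, written in plain Python for readability; the specific loss forms, the temperature, and the weighting are assumptions, and the dynamic switching and stop-gradient steps are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(za, zb, temperature=0.5):
    """InfoNCE-style loss: za[i] should match zb[i] and no other zb[j]."""
    total = 0.0
    for i, anchor in enumerate(za):
        logits = [cosine(anchor, other) / temperature for other in zb]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(za)


def mse(reconstruction, target):
    """Mean squared error between two flat lists of equal length."""
    return sum((a - b) ** 2 for a, b in zip(reconstruction, target)) / len(target)


def combined_loss(za, zb, reconstruction, target, weight=1.0):
    """Contrastive term plus weighted reconstruction term."""
    return info_nce(za, zb) + weight * mse(reconstruction, target)


# Latent codes from the two (hypothetical) encoder branches for two shapes.
voxel_codes = [[1.0, 0.0], [0.0, 1.0]]
image_codes = [[1.0, 0.0], [0.0, 1.0]]
print(combined_loss(voxel_codes, image_codes, [0.0, 1.0, 1.0], [0.0, 1.0, 0.0]))
```

The reconstruction term penalizes latent codes that discard shape information, which is why adding it discourages the collapsed solution that a contrastive loss alone can admit.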

Abstract:
Salient object detection in natural scene images (NSI-SOD) has undergone remarkable advancements in recent years. However, compared to those of natural images, the properties of optical remote sensing images (ORSIs), such as diverse spatial resolutions, complex background structures, and varying visual attributes of objects, are more complicated. Hence, exploring the multiscale structural perceptual information of ORSIs to accurately detect salient objects is more challenging. In this paper, inspired by the superiority of contrastive learning, we propose a novel training paradigm for ORSI-SOD, named Deeply Hybrid Contrastive Learning Based on Semantic Pseudo-Label (DHCont), to force the network to extract rich structural perceptual information and learn better-structured feature embedding spaces. Specifically, DHCont first splits the ORSI into several local subregions composed of color- and texture-similar pixels, which act as semantic pseudo-labels. This strategy can effectively explore the underdeveloped semantic categories in ORSI-SOD. To delve deeper into multiscale structure-aware optimization, DHCont incorporates a hybrid contrast strategy that integrates “pixel-to-pixel”, “region-to-region”, “pixel-to-region”, and “region-to-pixel” contrasts at multiple scales. Additionally, to enhance the edge details of salient regions, we develop a hard edge contrast strategy that focuses on improving the detection accuracy of hard pixels near the object boundary. Moreover, we introduce a deep contrast algorithm that adds deep-level constraints to the feature spaces of multiple stages. Extensive experiments on two popular ORSI-SOD datasets demonstrate that simply integrating our DHCont into existing ORSI-SOD models can significantly improve their performance.

Abstract:
Image retrieval with fine-grained categories is an extremely challenging task due to the high intraclass variance and low interclass variance. Most previous works have focused on localizing discriminative image regions in isolation, but have rarely exploited correlations across the different discriminative regions to alleviate intraclass differences. In addition, the intraclass compactness of embedding features is ensured by extra regularization terms that only exist during the training phase, which appear to generalize less well in the inference phase. Finally, the information granularity of the distance measure should distinguish subtle visual differences and the correlation between the embedding features and the quantized features should be maximized sufficiently. To address the above issues, we propose a logit variated product quantization method based on part interaction and metric learning with knowledge distillation for fine-grained image retrieval. Specifically, we introduce a causal context module into the deep navigator to generate discriminative regions and utilize a channelwise cross-part fusion transformer to model the part correlations while alleviating intraclass differences. Subsequently, we design a logit variation module based on a weighted sum scheme to further reduce the intraclass variance of the embedding features directly and enhance the learning power of the quantization model. Finally, we propose a novel product quantization loss based on metric learning and knowledge distillation to enhance the correlation between the embedding features and the quantized features and allow the quantization features to learn more knowledge from the embedding features. The experimental results on several fine-grained datasets demonstrate that the proposed method is superior to state-of-the-art fine-grained image retrieval methods.

Abstract:
The fusion of infrared (IR) and visible (VIS) images aims to capture complementary information from diverse sensors, resulting in a fused image that enhances the overall human perception of the scene. However, existing fusion methods struggle to preserve diverse feature information, leading to cross-modal interference, feature degradation, and detail loss in the fused image. To solve these problems, this paper proposes an image fusion method based on an infrared target mask and a bimodal feature extraction strategy, termed IBFusion. First, we define an infrared target mask and employ it to retain crucial information from the source images in the fused result. We also devise a mixed loss function, encompassing content loss, gradient loss, and structure loss, to ensure the coherence of the fused image with the IR and VIS images. The mask is then introduced into the mixed loss function to guide feature extraction and unsupervised network optimization. Second, we create a bimodal feature extraction strategy and construct a Dual-channel Multi-scale Feature Extraction Module (DMFEM) to extract thermal target information from the IR image and background texture information from the VIS image. This module retains the complementary information of the two source images. Finally, we use the Feature Fusion Module (FFM) to fuse the features effectively, generating the fusion result. Experiments on three public datasets demonstrate that the fusion results of our method have prominent infrared targets and clear texture details. In both subjective and objective assessments, our method outperforms twelve other advanced algorithms, demonstrating its effectiveness.

Abstract:
Most musical compositions utilize repetition as a fundamental element to create captivating aesthetic experiences. However, the potential of repetition in machine-learning-based algorithmic composition has not been thoroughly investigated. This article aims to make an initial attempt at repetition modeling by generating motif-level repetitions and integrating them into music through a combination of example-based and domain knowledge–based learning techniques. The article presents a new Motif-to-music Generation Model (MGM) that combines a motif-level repetition generator (MRG) and an outline-to-music generator (O2MG). To train this model, a new music repetition dataset (MRD) has been created, which includes 584,329 samples from various categories of motif repetition and 3,545 outline-music sequences from pop piano music. The MRG uses a Transformer encoder to learn the representation of music notes from MRD, while the repetition-aware learner in MRG takes advantage of the unique characteristics of repetitions based on music theory. The O2MG applies a novel outline-to-music learning strategy to learn the relationships among motif-level repetitions in the music and generate music based on these repetitions. The experiments show that MGM can generate a variety of beautiful repetitions with any given motif, improving the music quality and structure of machine-composed music.