TMM2026

Abstract:
Multi-view clustering, which typically integrates multi-view information to learn discriminative common representations or a unified clustering distribution, has made considerable progress over the years. However, existing methods either simply regard each view as equally important or assign fixed weights to each view, which is insufficient to dynamically assess the sample quality variations caused by noise in multi-view data. To address this issue, by effectively modeling the uncertainty of different samples across different views, this paper proposes a novel uncertainty-aware multi-view graph clustering network, termed UMGC-Net, which achieves trusted multi-view clustering in an unsupervised manner. Specifically, by measuring the clustering distribution entropy, an uncertainty-guided common feature learning mechanism is proposed to estimate the uncertainty for each sample of each view, thus learning multi-view features friendly to clustering. Besides, a cross-view trusted distribution fusion module is designed to obtain a robust clustering distribution by exploring the trusted consistency among multi-view clustering distributions based on uncertainty. Finally, experimental results on four popular multi-view datasets validate the superior performance of the proposed UMGC-Net.
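The entropy-based uncertainty idea can be illustrated with a small sketch. The abstract does not give UMGC-Net's actual formulas, so the weighting rule below (one minus normalized entropy) and the names `entropy` and `fuse_views` are assumptions chosen for illustration only.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse_views(view_probs):
    """Fuse the per-view soft cluster assignments of one sample.

    Assumed rule: a view's weight is 1 - H(p) / log(K), so a confident
    (low-entropy) view contributes more to the fused distribution.
    """
    k = len(view_probs[0])
    max_h = math.log(k)
    weights = [1.0 - entropy(p) / max_h for p in view_probs]
    total = sum(weights)
    if total == 0:  # every view is uniform: fall back to equal weights
        weights = [1.0] * len(view_probs)
        total = float(len(view_probs))
    fused = [sum(w * p[j] for w, p in zip(weights, view_probs)) / total
             for j in range(k)]
    return weights, fused

# One sample seen through a clean view and a noisy, near-uniform view.
clean_view = [0.90, 0.05, 0.05]
noisy_view = [0.34, 0.33, 0.33]
weights, fused = fuse_views([clean_view, noisy_view])
```

Here the noisy view receives a weight close to zero, so the fused distribution stays close to the clean view's assignment.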

Abstract:
Locality Preserving Projections (LPP) is a classical unsupervised subspace learning method that finds a projection matrix mapping high-dimensional data into a low-dimensional subspace while preserving the local manifold structure. However, the lack of label guidance prevents LPP from fully exploiting the discriminative information in the data. To solve this problem, we propose a Self-Guided Discriminative LPP algorithm that employs pseudo labels learned by K-Means to guide the subspace learning. In this way, it facilitates the discovery of discriminative cluster information while preserving the inherent manifold structure. Besides, considering K-Means’ sensitivity to the selection of cluster centroids, we introduce a centerless K-Means method that improves robustness by eliminating the need for centroid initialization. We also discuss the internal relationship between K-Means and LPP, and prove that K-Means can be written in the form of LPP under certain conditions. Experiments on seven benchmark datasets demonstrate that our method greatly improves the clustering performance.

Abstract:
Sketches, as a new solution in multimedia systems that can replace natural language, are characterized by sparse visual cues such as simple strokes that differ significantly from natural images containing complex elements such as background, foreground, and texture. This misalignment poses substantial challenges for zero-shot sketch-based image retrieval (ZS-SBIR). Prior approaches match sketches to full images and tend to overlook redundant elements in natural images, leading to model distraction and semantic ambiguity. To address this issue, we introduce a distraction-agnostic framework, purified cross-domain matching (PuXIM), which operates on a straightforward principle: masking and matching. We devise a visual-cross-linguistic (VxL) sampler that generates linguistic masks based on semantic labels to obscure semantically irrelevant image features. Our novel contribution is the concept of purified masked matching (PMM), which comprises two processes: (1) reconstruction, which compels the image encoder to reconstruct the masked image feature, and (2) interaction, which involves a transformer decoder that processes both sketch and masked image features to investigate cross-domain relationships for effective matching. Evaluated on the TU-Berlin, Sketchy, and QuickDraw datasets, PuXIM sets new benchmarks in terms of performance. Importantly, the distraction-agnostic nature of the matching process renders PuXIM more conducive to training, enabling efficient adaptation to zero-shot scenarios with reduced data requirements and low data quality.

Abstract:
Continuous sign language recognition (CSLR) uses visual cues (e.g., hands, face, mouth, and body) to automatically recognize the sign language of the hearing-impaired, helping them to actively communicate with hearing people. The effects of these visual cues change dynamically with the demonstration of sign language. However, previous CSLR methods usually model visual information from the entire frame or from simply fused visual cues, and thus do not adequately describe such dynamic changes among visual cues. Therefore, we propose the Trustworthy Fusion Network (TFN) of visual cues for CSLR, which comprises two fundamental modules: the Intra-cue Cross-modality Feature Fusion module (IntraCFF) and the Inter-cue Trustworthy Fusion module (InterTF). IntraCFF uses the calibrated joint-belief method to dynamically fuse cross-modality features of RGB and keypoint information, to obtain a robust visual cue feature. InterTF innovatively employs the Dempster-Shafer Theory (DST) to evaluate the uncertainty of different cues in expressing sign movements. Then, the trustworthy fusion via DST is used to adaptively weigh and credibly fuse the visual cues based on uncertainty. In addition, to address the semantic gap when fusing different cues, we design consistency fusion constraints during the training stage. These constraints enhance the semantic consistency of different cues with global sign movements. Experiments on public CSLR datasets validate the effectiveness of our TFN.
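As a rough illustration of uncertainty-aware fusion with Dempster-Shafer Theory, the sketch below combines two opinions, each made of per-class belief masses plus one uncertainty mass, using the reduced form of Dempster's rule that is common in evidential deep learning. Whether TFN uses exactly this form is not stated in the abstract; the function name `combine_opinions` and the example numbers are hypothetical.

```python
def combine_opinions(b1, u1, b2, u2):
    """Combine two opinions with the reduced Dempster's rule.

    Each opinion is a list of belief masses b plus an uncertainty mass u,
    and must satisfy sum(b) + u == 1.
    """
    # Conflict: mass that the two sources assign to different classes.
    conflict = sum(b1[i] * b2[j]
                   for i in range(len(b1))
                   for j in range(len(b2)) if i != j)
    norm = 1.0 - conflict
    if norm <= 0:
        raise ValueError("the two sources are in total conflict")
    fused_b = [(x * y + x * u2 + y * u1) / norm for x, y in zip(b1, b2)]
    fused_u = (u1 * u2) / norm
    return fused_b, fused_u

# Hypothetical cues: a confident hand cue and an uncertain mouth cue.
hand_cue = ([0.7, 0.1], 0.2)
mouth_cue = ([0.2, 0.2], 0.6)
fused_b, fused_u = combine_opinions(*hand_cue, *mouth_cue)
```

The confident cue dominates the fused beliefs, and the fused uncertainty is lower than that of either input cue.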

Abstract:
Beyond behavioral interaction records, multimedia recommendation scenarios possess abundant semantic signals, which provide excellent data support for user interest mining. Recently, the multimodal enhanced interaction graph has been actively explored and has achieved great progress. However, these methods overlook the capability disparity of various modalities in learning users' interests and lack the ability to explore the hierarchical relationships of interests in modality, resulting in suboptimal recommendation performance. Therefore, this work investigates intra-modality hierarchical learning and inter-modality guidance, proposing a hyperbolic self-distillation (HSD) model for multimedia recommendation. In each modality space, HSD introduces a hyperbolic propagation to filter users' hierarchical interests from the interaction graph effectively. Inter-modality interests are aligned further by a two-level self-distillation strategy to designate multimodal interactions to teach single-modal learning, aiming at teaching and learning to promote each other. Extensive experiments on four public datasets demonstrate that the proposed HSD outperforms leading baselines for multimedia recommendation, verifying the effectiveness of hierarchical propagation and two-level self-distillation in mining users' hierarchical interests.

Abstract:
In steganography, the cover medium takes on various types, including images and videos. Due to the advent of deep learning, neural network models are increasingly employed as the cover medium. In existing approaches, the embedding of secret data results in an obvious degradation of the model's original task performance, such as image classification. This paper proposes a low-distortion steganography method that embeds secret data into neural network models without degrading the model's original performance. The method selectively modifies parameters with minimal correlation to the loss function, ensuring that the model's performance remains unaffected. Furthermore, a suitable modification amplitude is defined to minimize the impact on task performance. Experimental results demonstrate that the proposed method enables low-distortion steganography while significantly improving embedding capacity and security. Notably, a maximum of 3.25 M bits of secret data is successfully embedded into a VGG model on the MNIST dataset, resulting in less than 0.1% accuracy degradation, a significant improvement over previous methods.
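A toy sketch of the general idea (hide bits in the parameters that matter least to the loss, with a bounded modification amplitude) is given below. The abstract does not describe the paper's selection criterion or encoding, so both are assumptions here: parameters are ranked by absolute gradient, and a bit is stored in the parity of the parameter quantized with step `delta`. The receiver is assumed to know the chosen indices.

```python
def embed_bits(params, grads, bits, delta=1e-3):
    """Hide bits in the parameters with the smallest absolute gradient.

    The parity of round(p / delta) carries one bit, so each chosen
    parameter moves by at most 1.5 * delta. Illustrative assumption only.
    """
    if len(bits) > len(params):
        raise ValueError("not enough parameters for the payload")
    order = sorted(range(len(params)), key=lambda i: abs(grads[i]))
    chosen = order[:len(bits)]
    stego = list(params)
    for idx, bit in zip(chosen, bits):
        q = round(params[idx] / delta)
        if q % 2 != bit:  # wrong parity: step to the next grid point
            q += 1
        stego[idx] = q * delta
    return stego, chosen

def extract_bits(stego, chosen, delta=1e-3):
    """Read the payload back from the parity of the quantized parameters."""
    return [round(stego[i] / delta) % 2 for i in chosen]

params = [0.5012, -0.2500, 0.1234, 0.9001, -0.7777]
grads = [0.90, 0.01, 0.50, 0.02, 0.03]
secret = [1, 0, 1]
stego, chosen = embed_bits(params, grads, secret)
```

Parameters with large gradients (indices 0 and 2 here) are left untouched, which is the intuition behind keeping the task loss nearly unchanged.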

Abstract:
We address the problem of speech denoising, where the goal is to extract a clean speech signal from a noisy signal. Traditionally, denoising has been performed using the audio modality only. However, human speech perception is inherently multimodal: cues from the visual modality are used to understand speech better in a noisy environment. A similar observation has been made for computational denoising methods, i.e., the performance of an audio-only model improves after adding the visual modality. Inspired by these findings, we propose a novel audio-visual network that adaptively combines both modalities for speech audio denoising. We show that extracting noise from the mixed audio and using it as a conditioning signal improves speech denoising performance. To estimate the noise, we use both the audio and visual modalities, i.e., the lip region of the speaker, to extract the non-speech/silent regions. The silent regions enable us to estimate a better noise profile to eliminate from the signal. Our proposed network uses a self- and cross-attention framework between audio and video features, along the temporal dimension, to model correlations between the two modalities. We evaluate the proposed approach on the large-scale audio-visual dataset VoxCeleb2 and obtain state-of-the-art results. We also demonstrate generalization to unseen speakers at test time.

Abstract:
Graph contrastive learning (GCL), which captures essential features from augmented graphs to address data sparsity issues, has recently demonstrated promising potential in improving recommendation performance. Most GCL-based recommendation methods learn consistent entity representations from user-item bipartite graphs through structural perturbations. However, these approaches impose an additional computational cost and have been shown to be insensitive to various graph augmentations, resulting in limited improvements in long-tail recommendation scenarios. To address this issue, we propose a novel framework for recommendation, Knowledge-Enhanced graph Contrastive Learning (KECL), which adopts knowledge graph-based embedding augmentation instead of graph enhancement to construct views for GCL. Specifically, we introduce a knowledge aggregation module with a heterogeneous attentive aggregator to capture relation heterogeneity in the knowledge graph. Furthermore, we propose a knowledge-based augmentation GCL model that adds knowledge-aware embeddings to the learned representations for more efficient representation-level augmentation. Extensive experiments on real-world datasets demonstrate that the knowledge-based augmentation approach effectively enhances recommendation performance and shows superiority over state-of-the-art methods.

Abstract:
The goal of gaze object prediction (GOP) is to predict human gaze objects and categories. However, existing methods require additional head priors or filter the results before evaluation, which is an obstacle for real-world applications. To this end, this paper proposes a Transformer-based Gaze Object Prediction under Real-world setting (TransGOP-R), which does not rely on any head prior input and evaluates end-to-end. We first design a head location module to generate human head location information from a head query. Then, an error analysis demonstrates that the primary error source of the existing GOP model is in gaze estimation, which is caused by the difficulty in predicting gaze points by directly regressing heatmaps. Therefore, we introduce cone prediction into the model training stage, allowing the middle-layer features of the gaze regressor to build the relationship between the target human and objects before regressing the gaze point. An oriented gradient mechanism is proposed in this process to ensure the object detection performance is not affected by cone information. Finally, we conduct extensive experiments to verify the superiority of our method on the GOO-Synth and GOO-Real datasets. We also achieve advantages over human-target gaze estimation methods on the GazeFollowing, VideoAttentionTarget, and ChildPlay datasets.

Abstract:
Soft prompt learning methods are effective for adapting vision-language models (VLMs) to downstream tasks. Nevertheless, empirical evidence reveals that existing methods tend to overfit seen classes and exhibit degraded performance on unseen classes. This limitation is due to the inherent bias in the training data towards the seen classes. To address this issue, we propose a novel soft prompt learning method, named Mixture-of-Prompts Distillation (MoPD), which can effectively transfer useful knowledge from manually crafted hard prompts (a.k.a. teacher prompts) to the learnable soft prompt (a.k.a. student prompt), thereby enhancing the generalization ability of soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a gating network that learns to select the hard prompts used for prompt distillation. Extensive experiments demonstrate that the proposed MoPD method outperforms state-of-the-art baselines, especially on unseen classes.

Abstract:
Current knowledge distillation methods typically require significant computational resources and time to train task-specific teacher candidates from scratch and identify the optimal teacher. Although self-distillation methods eliminate the dependency on the teacher by allowing the student model to learn independently, they face two challenges: the student learns correct and incorrect knowledge indiscriminately, and the student's learning scope is limited due to the lack of external teacher supervision. Spurred by these deficiencies, this work proposes a CLIP-enhanced Self-Distillation (CLIP-SD) method to overcome these problems with almost no increase in training time. CLIP-SD comprises two main components: Prediction-oriented Self-Distillation (PSD) and Two-stage Task-guided CLIP Distillation (TTCD). PSD tackles the first challenge by assigning higher and lower weights to correct and incorrect prediction samples, respectively, during self-distillation. This component forces the student to focus on correct knowledge and minimize the impact of incorrect knowledge. Regarding the second challenge, the robust CLIP model is directly introduced into self-distillation. However, CLIP lacks task-specific knowledge and its output is overly smooth during the distillation process, preventing the student from learning more effectively. Therefore, TTCD refines CLIP's output through a two-stage process, endowing it with task-specific knowledge to enhance student learning. Experimental results indicate that CLIP-SD significantly improves distillation performance while maintaining training efficiency comparable to self-distillation. Specifically, on the CIFAR-100 dataset, the performance of CLIP-SD reaches 72.48% when trained with ResNet20 as the student model, which is an average improvement of 2.54% and 1.12% over the knowledge distillation and self-distillation methods, respectively.
Regarding training time, CLIP-SD takes 3.91 hours, an average decrease of 2.73 hours compared to knowledge distillation and an average increase of 0.45 hours compared to self-distillation. Despite the slight increase in training time compared to self-distillation, the overhead is worthwhile and negligible considering its performance improvement.
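The prediction-oriented weighting described for PSD can be sketched as a weighted distillation loss. This is only a minimal illustration under stated assumptions: the soft targets are taken to be the model's own predictions, a sample counts as "correct" when the argmax of its soft target matches the label, and the weights `w_correct` and `w_incorrect` are hypothetical values, not taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def weighted_self_distillation_loss(target_probs, student_probs, labels,
                                    w_correct=1.0, w_incorrect=0.2):
    """Weighted mean of per-sample KL terms.

    Samples whose soft target predicts the right class get w_correct,
    the others get w_incorrect (assumed values for illustration).
    """
    total, weight_sum = 0.0, 0.0
    for t, s, y in zip(target_probs, student_probs, labels):
        predicted = max(range(len(t)), key=t.__getitem__)
        w = w_correct if predicted == y else w_incorrect
        total += w * kl_divergence(t, s)
        weight_sum += w
    return total / weight_sum

# Two samples with true label 0: the first target is right, the second wrong.
targets = [[0.8, 0.2], [0.3, 0.7]]
students = [[0.6, 0.4], [0.5, 0.5]]
labels = [0, 0]
loss = weighted_self_distillation_loss(targets, students, labels)
loss_ignore_wrong = weighted_self_distillation_loss(
    targets, students, labels, w_incorrect=0.0)
```

Setting `w_incorrect` to zero reduces the loss to the KL term of the correctly predicted sample alone, which shows how the weighting limits the influence of incorrect knowledge.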

Abstract:
The advancement of generative artificial intelligence has led to the creation of more diverse and realistic fake facial images. This poses serious threats to personal privacy and can contribute to the spread of misinformation. Existing deepfake detection methods usually utilize prior knowledge about forged clues to design complex modules, achieving excellent performance in the intra-domain settings. However, their performance usually suffers from a significant decline in unseen forgery patterns. It is thus desirable to develop a generalized deepfake detection method using a neat network structure. In this paper, we propose a simple yet efficient framework to transfer a powerful large-scale vision model like ViT to the downstream deepfake detection task, namely the generalized deepfake detection framework (GenDF). Concretely, we first propose a deepfake-specific representation learning (DSRL) scheme to learn different discontinuity patterns across patches inside a fake facial image and continuity between patches within a real counterpart in a low-dimensional space. To further alleviate the distribution mismatch between generic real images and human facial images consisting of both real and fake, we introduce a feature space redistribution (FSR) scheme to separately optimize the distributions of real and fake feature space, enabling the model to learn more distinctive representations. Furthermore, to enhance the generalization performance on unseen forgery patterns produced by constantly evolving facial manipulation techniques and diverse variations on real faces, we propose a classification-invariant feature augmentation (CIFAug) function without trainable parameters. CIFAug expands the scopes of real and fake feature space along directions orthogonal to the classification direction, enabling the model to learn more generalizable features while preserving discrimination. 
Extensive experiments demonstrate that our method achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings with only 0.28 M trainable parameters.
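The idea of expanding features along directions orthogonal to the classification direction can be illustrated for a linear classifier. In the sketch below, which is an assumption-laden toy and not the paper's CIFAug definition, the "classification direction" is taken to be the classifier weight vector `w`, and random noise is projected onto its orthogonal complement before being added to a feature, so the logit is unchanged.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def augment_orthogonal(feature, w, scale=0.5, rng=random):
    """Perturb a feature without changing its logit dot(feature, w).

    Gaussian noise is projected onto the subspace orthogonal to w,
    so the augmentation has no trainable parameters.
    """
    noise = [rng.gauss(0.0, scale) for _ in feature]
    coef = dot(noise, w) / dot(w, w)
    ortho = [n - coef * wi for n, wi in zip(noise, w)]
    return [f + o for f, o in zip(feature, ortho)]

feature = [1.0, 2.0, -0.5]
w = [0.5, -1.0, 2.0]  # hypothetical classifier weights
augmented = augment_orthogonal(feature, w, rng=random.Random(0))
```

The augmented feature differs from the original but yields the same classifier score, so the real/fake decision is preserved while the feature space is widened.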

Abstract:
Cross-modal hashing models face significant challenges in handling continuous data growth, particularly in balancing the plasticity to learn new knowledge and the stability to retain prior cross-modal knowledge. Existing studies partially address this by maintaining previous mappings or extending hash codes, but struggle to reconcile plasticity and stability while requiring heavy parameter optimization. To tackle this, we propose an efficient Prompt-Infused Continual Cross-Modal Hashing (PIC-CMH) approach designed for hash learning with the continuous growth of multi-modal data and emerging knowledge. Specifically, PIC-CMH introduces a finite set of learnable multi-modal prompts, including global and task-specific expert prompts, which work in synergy with multi-modal representations. All prompts are optimized with the hash functions via backpropagation after Gaussian initialization. Global prompts stay learnable throughout, linking tasks, while expert prompts are updated only within their tasks, facilitating knowledge acquisition and mitigating catastrophic forgetting in continual learning. By freezing the pre-trained models used for multi-modal representations, continual learning is confined to the lightweight multi-modal prompts and hash functions, significantly reducing computational overhead. Extensive experiments demonstrate that PIC-CMH effectively addresses the stability-plasticity trade-off in cross-modal hash learning, delivering high retrieval accuracy with low computational cost and a simple yet efficient architecture.

Abstract:
Multi-scale geometric analysis is a powerful representation tool. It can be used to improve the feature representation and learning process of deep networks. In addition to extracting features, multi-scale geometric prior knowledge can also be used to improve the structure of deep networks. In this paper, we propose a multi-scale scattering representation learning network, abbreviated as MSRLN, for image classification tasks. Structure improvement is explored with multi-scale scattering operations. In this way, a better singularity representation learning process for networks can be achieved. Firstly, the filter banks and multi-scale scattering operator are introduced for non-linear and singularity representation. Secondly, the novel multi-scale scattering representation learning network structure is designed. The scaling-wise scattering process is deployed in the shallow layer as a non-linear layer. This structure essentially supplements deep networks with geometric prior knowledge. It can further improve the non-linear activation and singularity representation process. Thirdly, we put forward the multi-stage scattering representation strategy and the prior knowledge weakening mechanism. With flexible scaling factors and learning rates, the stepwise approximation and learning process of networks can be achieved. In sum, MSRLN is a structural innovation, and the scattering singularity representation structure can be extended to other backbones or tasks. Extensive experimental results show that MSRLN can achieve better image classification accuracy. Finally, necessary convergence, insight, and adaptability analyses are provided in evaluation experiments.

Abstract:
Deep video coding techniques have achieved significant advancements, leading to enhanced compression performance. However, existing approaches are primarily optimized for 8-bit content, thereby limiting their effectiveness in scenarios with different bit-depths. In this paper, we propose a deep bit-depth scalable video codec (DB-SVC) that supports two-layer scalability for different bit-depths. First, we design a base layer (BL) for low bit-depth (LBD) videos, incorporating a dual-stage multi-scale feature extraction module (DFEM) to enhance compression efficiency while providing reference features for subsequent coding. Second, we introduce an inter-layer bit-depth enhancement module (IBEM) that refines the bit-depth of BL reconstructed frames by leveraging inter-layer information, thus enhancing the reference quality without increasing coding overhead. Third, we design an enhancement layer (EL) tailored for high bit-depth (HBD) videos, employing a bit-depth residual compression (BRC) method to achieve a more accurate reconstruction of HBD videos. DB-SVC supports progressive decoding of LBD and HBD videos, accommodating diverse display requirements. Experimental results demonstrate that DB-SVC outperforms state-of-the-art codecs in LBD and HBD scenarios. At the same PSNR/MS-SSIM levels, DB-SVC achieves average bit-rate savings of 11.94%/53.35% for 8-bit videos compared with VTM13.2 and 64.20%/76.79% for 10-bit videos compared with SHM12.4, showcasing its superior compression performance.

Abstract:
Action recognition aims to identify an action from video frames. The action/background information is diverse in different frames, which hinders learning the implicit action patterns. In this work, we propose a Mask-aware Kernel Model (MKM), which ensures implicit action pattern learning by integrating kernel learning with proper cluster relations. The MKM provides novel cluster-aware kernels to enhance the action representation for frame patches. The MKM introduces a kernel clustering learner, kernel masking filter, and a kernel attention selector. First, to learn temporal features, the temporal Vision Transformer uses temporal correlation to ensure the action features for kernel learning. Second, to analyze the action kernels for frame patches, we design a kernel clustering learner module. This module learns cluster relations with patch-wise convolutions to describe the common action among patches. The cluster relations are learned in each frame, which ensures cluster-aware kernel learning with input frame adaptivity. Third, to analyze the action kernels with spatial adaptivity, we design a kernel masking filter module. This module introduces a location mask by analyzing the region patterns with spatial convolution. The patch-level mask ensures the kernel learning with region-aware selection. Fourth, after learning multiple channel features by convolution with multiple kernels, we design a kernel attention selector module. This module excites kernel-aware features by learning channel-wise attention with channel-wise convolutions, which ensures the kernel learning with channel-wise selection for effective action representation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on Something-Something V1 & V2, Kinetics-400, UAV-human, and Diving 48 datasets.

Abstract:
Distance metric learning is a key field of machine learning, which aims at improving the performance of pattern classification and data discrimination by optimizing features to make samples of the same category closer in the feature space and samples of different classes farther apart. Most existing metric learning methods work in Euclidean space with zero curvature due to its simple and convenient characteristics. The latest studies show that non-zero curvature geometric spaces can better capture discriminative information. In this paper, to explore and form a more generalized feature space that is capable of matching the complex data structure of samples, we look into the metric learning problem in the mixed curvature space and present a new method called mixed-curvature metric learning (MCML). By simulating dimensionality reduction operations in different curvature spaces and conducting sample mining in mixed curvature space, our metric learning method is extended to feature spaces with a mixture of positive curvature, zero curvature, and negative curvature. Extensive experimental results show that our MCML approach achieves superior performance in image retrieval tasks on multiple benchmark datasets, demonstrating the effectiveness of the proposed MCML method.
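One common way to build a space with mixed curvature is a product of a sphere (positive curvature), a Euclidean space (zero curvature), and a hyperbolic space (negative curvature), with the squared product distance equal to the sum of the squared component distances. The sketch below follows that construction with unit curvature magnitudes; whether MCML combines the spaces in this way is an assumption, since the abstract does not say.

```python
import math

def spherical_distance(x, y):
    """Geodesic distance between unit vectors on the sphere (curvature +1)."""
    cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, y))))
    return math.acos(cos)

def euclidean_distance(x, y):
    """Ordinary distance in flat space (curvature 0)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def poincare_distance(x, y):
    """Distance in the Poincare ball (curvature -1); points need norm < 1."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq / ((1.0 - nx) * (1.0 - ny)))

def mixed_curvature_distance(p, q):
    """Product-space distance; p and q are (sphere, flat, hyperbolic) parts."""
    d_s = spherical_distance(p[0], q[0])
    d_e = euclidean_distance(p[1], q[1])
    d_h = poincare_distance(p[2], q[2])
    return math.sqrt(d_s ** 2 + d_e ** 2 + d_h ** 2)

a = ([1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
b = ([0.0, 1.0], [3.0, 4.0], [0.5, 0.0])
d_ab = mixed_curvature_distance(a, b)
```

In this example the three component distances are pi/2 on the sphere, 5 in the flat part, and ln 3 in the Poincaré ball.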

Abstract:
Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing the number of sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly in texture and edge definition. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM), as well as visual quality. To assess edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.
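For readers unfamiliar with sigmoid-shaped noise schedules in diffusion models, the sketch below generates per-step noise variances that follow a sigmoid curve between two bounds. It illustrates the general technique only; the abstract does not specify PixelBoost's sigmoidal noise sequencing, and the bounds `beta_min`, `beta_max`, and the `span` parameter are assumed values.

```python
import math

def sigmoid_beta_schedule(num_steps, beta_min=1e-4, beta_max=2e-2, span=6.0):
    """Noise variances that rise along a sigmoid from beta_min to beta_max.

    Requires num_steps >= 2. The sigmoid input sweeps [-span, span].
    """
    betas = []
    for t in range(num_steps):
        x = -span + 2.0 * span * t / (num_steps - 1)
        betas.append(beta_min + (beta_max - beta_min) / (1.0 + math.exp(-x)))
    return betas

def cumulative_signal(betas):
    """Running product of (1 - beta_t): the fraction of signal kept so far."""
    alpha_bar, out = 1.0, []
    for b in betas:
        alpha_bar *= 1.0 - b
        out.append(alpha_bar)
    return out

betas = sigmoid_beta_schedule(10)
alpha_bars = cumulative_signal(betas)
```

The schedule adds little noise in the first and last steps and changes fastest in the middle, which is the usual motivation for a sigmoid shape over a linear one.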

Abstract:
Single domain generalization aims to train a model on a single source domain that generalizes to unseen target domains, which is critical in multimedia applications. Current methods typically use adversarial data augmentation to enrich the source domain distribution with novel samples. However, these methods typically rely on labeled data and require adversarial training between generators and classifiers, which may limit sample diversity and introduce spurious correlations. To tackle these problems, we propose a method that integrates Contrastive clustering regularization with an Unsupervised Diversity Augmentation (UDA), termed C-UDA. Specifically, UDA is a flexible and general framework in which two customized models iteratively optimize a novel adversarial loss to enable fully unsupervised data augmentation. Within UDA, we design a lightweight generator that diversifies each input image along three distinct visual attributes. Based on both original and augmented images, we further introduce contrastive clustering regularization to encourage the model to learn domain-invariant representations, resulting in robust decision boundaries. Extensive experiments on four challenging benchmarks demonstrate that C-UDA significantly outperforms 22 state-of-the-art methods.

Abstract:
Multimodal recommender systems try to integrate multimedia data (images, texts, etc.) with user-item historical records to better model user preference. However, most previous methods largely ignored the underlying fine-grained attribute features of items, which makes it difficult to fully explore users’ nuanced attention across individual and combined attributes, resulting in low recommendation performance. To address these issues, this paper proposes a novel and effective self-harmonized representation learning network for multimodal recommendation, named LETTER. LETTER has the ability to effectively optimize the user and item representations for multimodal recommendation. Specifically, we design a factorized attribute interaction module that captures diverse combinations of item latent attributes using a bilinear pooling strategy. Then a dual graph convolution module is established to learn the modality-specific representations from user-item interactive and item semantic relations. Finally, we design a preference self-harmonization module that adaptively identifies the salient influencing factors of user preference, thus refining user and item representations to improve recommendation accuracy. We conduct extensive experiments on three real-world datasets, demonstrating that LETTER outperforms state-of-the-art multimodal recommendation methods.

Affiliations: College of Electronic and Information Engineering, Tongji University, Shanghai, China; College of Electronic and Information Engineering, Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China; Faculty of Data Science, City University of Macau, Taipa, Macau, China; Faculty of Computer Science, University of Vienna, Vienna, Austria; Singapore Institute of Technology, Singapore

Abstract:
Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering (affinity matrix construction, spectral embedding, and k-means clustering) using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset.

Abstract:
Deep learning-based food recognition has made significant progress in predicting food types from eating occasion images. However, two key challenges hinder real-world deployment: (1) continuously learning new food classes without forgetting previously learned ones, and (2) handling the long-tailed distribution of food images, where there are a few common classes and many more rare classes. To address these, food recognition methods should focus on long-tailed continual learning. In this work, we introduce a dataset that encompasses 186 American foods along with comprehensive annotations. We also introduce three new benchmark datasets, VFN186-LT, VFN186-INSULIN and VFN186-T2D, which reflect real-world food consumption for healthy populations, insulin takers and individuals with type 2 diabetes who do not take insulin. We propose a novel end-to-end framework that improves the generalization ability for instance-rare food classes using a knowledge distillation-based predictor to avoid misalignment of representation during continual learning. Additionally, we introduce an augmentation technique that integrates class-activation-map (CAM) and CutMix to improve generalization on instance-rare food classes. Our method, evaluated on Food101-LT, VFN-LT, VFN186-LT, VFN186-INSULIN, and VFN186-T2DM, shows significant improvements over existing methods. An ablation study highlights further performance enhancements, demonstrating its potential for real-world food recognition applications.

Abstract:
Recent studies have shown that disentangling the classification and localization tasks has great potential to improve the performance of general object detection. However, such disentanglement strategies remain underexplored in oriented object detection. In particular, task disentanglement for oriented object detection faces two challenges: (1) existing task-decoupled methods ignore the orientation of objects and therefore cope poorly with arbitrarily oriented objects; (2) the targets in oriented object detection (e.g., in high-resolution remote sensing images) are generally small and fine-grained, making classification more difficult. To handle the above issues, we rethink the task-decoupled policy in oriented object detection and propose an effective Orientation-aware Task-Decoupled Learning (OTDL) method. Specifically, our OTDL first presents a light-weight Task-specific Proposal Offset Learning (TPOL) module to generate eligible proposals for arbitrarily oriented objects, where the TPOL module equips the classification and localization tasks with individual proposals by learning task-specific and orientation-aware offsets in a local coordinate system. Furthermore, we empirically study the effect of various double-head strategies on the performance of oriented object detection, and propose a novel Pyramid Covariance Attention (PCA)-based classification head to cope with small and fine-grained targets. Based on the proposed TPOL module and PCA-based classification head, our OTDL explores the potential of task disentanglement for improving the performance of oriented object detection. Experiments are conducted on five oriented object detection benchmarks (i.e., DOTA-v1.0, DOTA-v1.5, HRSC2016, DIOR-R and SODA-A), and the results show that our OTDL method significantly outperforms its counterparts while achieving state-of-the-art performance.

Abstract:
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages the use of scene-text evidence rather than other short-cuts for answer prediction. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA and scene-text recognition. This enables diagnosis of the root cause of a failed prediction, e.g., wrong QA or wrong scene-text recognition. To achieve Grounded TextVideoQA, we propose the T2S-QA model, which highlights a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset, ViTXT-GQA, which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large performance gap to human performance leaves ample room for improvement. Our further analysis with oracle scene-text inputs suggests that the major challenge is scene-text recognition.

Abstract:
The raw depth images captured by RGB-D cameras using Time-of-Flight (TOF) or structured light often suffer from incomplete depth values due to weak reflections, boundary shadows, and artifacts, which limit their applications in downstream vision tasks. Existing methods address this problem through depth completion in the image domain, but they overlook the physical characteristics of raw depth images. It has been observed that the presence of invalid depth areas alters the frequency distribution pattern. In this work, we propose a Spatio-Spectral Mutual Learning framework (S2ML) to harmonize the advantages of both spatial and frequency domains for depth completion. Specifically, we consider the distinct properties of amplitude and phase spectra and devise a dedicated spectral fusion module. Meanwhile, the local and global correlations between spatial-domain and frequency-domain features are calculated in a unified embedding space. The gradual mutual representation and refinement encourage the network to fully explore complementary physical characteristics and priors for more accurate depth completion. Extensive experiments demonstrate the effectiveness of our proposed S2ML method, outperforming the state-of-the-art method CFormer by 0.828 dB and 0.834 dB on the NYU-Depth V2 and SUN RGB-D datasets, respectively.

Abstract:
To alleviate the expensive human labeling problem, semi-supervised semantic segmentation utilizes a few labeled images along with an abundance of unlabeled images to predict the pixel-level label maps with the same size. Previous methods often rely on co-training with two convolutional networks with the same architecture but different initialization, which fails to capture sufficiently diverse features. This limitation motivates us to employ tri-training and design a triple-view encoder to utilize encoders with different architectures to derive diverse features, while leveraging knowledge distillation to capture complementary semantics among these encoders. Moreover, existing approaches simply concatenate features from both encoder and decoder, and the simple concatenation requires a large memory cost. This inspires us to present a dual-frequency decoder that selects those important features by projecting the spatial-domain features into the frequency domain, where a dual-frequency channel attention mechanism is applied to evaluate the feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation. It comprises the triple-view encoder and the dual-frequency decoder. Extensive experiments conducted on two benchmarks, i.e., Pascal VOC 2012 and Cityscapes, validate the superiority of our method, achieving a satisfying tradeoff between precision and inference speed.

Abstract:
Pedestrian detection is a crucial task in computer vision. In particular, utilizing multispectral knowledge is essential for detecting pedestrians effectively. Existing multispectral pedestrian detection methods, however, operate only in fully-supervised settings. Although studies on semi-supervised object detection have been conducted, they focus only on single-modality environments. Therefore, we propose a novel semi-supervised multispectral pedestrian detector (SSMPD) that effectively utilizes multispectral knowledge. Our SSMPD consists of three methods that effectively handle pseudo-labels in the multispectral domain and a novel data selection method. First, we introduce a Pedestrian Appearance-Aware (PAA) weight to account for the quality of the pseudo-label by adjusting the multispectral knowledge transfer from the teacher model to the student model. Second, we propose Unified Modal-Aware Simultaneous (UMAS) learning to consider both the single modality (visible or thermal) and the multispectral modalities when learning with the pseudo-label. Finally, we introduce a Similarity-based Contrastive (SC) loss to guide the teacher model in enhancing the quality of pseudo-labels. In addition, we provide diverse data selection for more effective semi-supervised learning. Extensive experimental results on the KAIST and LLVIP datasets demonstrate the effectiveness of our method.

Abstract:
Concept Bottleneck Models (CBMs) enhance the interpretability of deep neural networks by mapping images to human-understandable concepts and then using the concepts to make predictions. While they improve transparency, existing CBMs primarily explain only the final layer’s features, limiting the interpretability of intermediate layers. Additionally, constructing a comprehensive concept set remains a challenging task, further constraining model performance. In this paper, we investigate the assignment of concept granularity across model layers and propose the Hierarchical Concept Bottleneck Model (HCBM) to enhance interpretability. HCBM introduces a Hybrid Concept Bottleneck Layer (HCBL) at each layer, consisting of a Predefined Concept Bottleneck (PCB) that maps visual features to concepts of corresponding granularity and a Compensation Concept Bottleneck (CCB) which incorporates the concept frequency loss and the concept semantic loss to capture compensation concepts for improving performance. Extensive experiments demonstrate that HCBM outperforms state-of-the-art methods. It is worth noting that the HCBM with CLIP RN50 as the backbone outperforms the opaque model.

Abstract:
Model quantization is an effective approach to reduce the complexity of neural networks, enabling them to be deployed on resource-constrained edge devices. Recently, data-free quantization has been widely investigated, since it does not access the original datasets and can address the widely-held data privacy and security concerns. Its idea is to generate fake data based on the prior information in the full-precision (FP) model, and then fine-tune the quantized model with the generated data under the supervision of the FP model. The quantization performance relies heavily on the validity of the generated data. However, existing methods suffer from two severe issues, mode collapse and (catastrophic) example forgetting, which lead to non-trivial accuracy degradation. In this work, we propose Contrastive Learning Quantization (CoLeQ), which achieves data diversity enhancement and old knowledge restoration via contrastive learning to address the above issues. Specifically, we introduce the MoCo paradigm, which maintains a dynamic momentum queue of the encoded features, into data-free quantization. The contrastive learning objective is used to improve data diversity by facilitating the separation of generated samples from those already generated in previous mini-batches, thus mitigating the mode collapse problem. Moreover, we design a tied-weight decoder to restore the previous samples from the encoded features in the queue without additional parameters and training, hence cost-effectively preventing the example forgetting problem. Extensive experiments are conducted to evaluate the effectiveness of CoLeQ, and the results demonstrate a consistent superiority compared to state-of-the-art methods.
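
The queue-based contrastive objective described above can be illustrated with a minimal sketch. This is not CoLeQ's implementation: the class name `FeatureQueue`, the function `info_nce`, the temperature value, and the choice of positive sample are all assumptions made for illustration. It only shows how a fixed-length queue of features from earlier mini-batches can serve as negatives that push a new sample away from previously generated ones.

```python
import math
from collections import deque


class FeatureQueue:
    """Fixed-length FIFO of encoded features from earlier mini-batches (hypothetical helper)."""

    def __init__(self, max_size):
        # deque with maxlen silently drops the oldest entries when full
        self.items = deque(maxlen=max_size)

    def enqueue(self, batch):
        self.items.extend(batch)


def info_nce(query, positive, negatives, temperature=0.2):
    """InfoNCE loss for one query against one positive and queued negatives."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    pos = math.exp(dot(query, positive) / temperature)
    neg = sum(math.exp(dot(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


queue = FeatureQueue(max_size=2)
queue.enqueue([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # the oldest feature is evicted

# A new sample that resembles a queued feature is penalised more heavily
loss = info_nce([0.0, 1.0], [0.0, 1.0], queue.items)
```

In this sketch the loss grows when the query is close to features already in the queue, which is the sense in which such an objective encourages diversity among generated samples.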

Abstract:
Audio-visual speech synthesis (AVSS) aims to produce an audio-visual stream that conveys a target speaker’s speech. In this study, the AVSS system takes the input speech of a source speaker and generates the audio-visual stream of the target speaker while preserving the linguistic content of the source speech. The process involves two main components: voice conversion (VC), which adapts the vocal features from the source to the target speaker, and audio-visual synthesis (AVS), which generates the synchronized audio-visual stream from the transformed speech. This paper presents a novel generative framework based on multi-discriminative learning to enhance the realism and quality of AVSS outputs. The proposed approach integrates multiple discriminators, including capsule networks, co-occurrence neural networks, and vision transformers (ViTs), within the VC model to leverage their unique strengths in capturing diverse speech features. Additionally, the AVS model incorporates a co-occurrence neural network to improve video quality and achieve better temporal alignment between audio and visual data. Experimental evaluations on standard benchmarks demonstrate that the proposed method achieves significant improvements in both audio and video quality, offering a substantial advancement in AVSS technology.

Affiliations: School of Information Science and Technology, The Engineering Research Center of Intelligent Perception and Autonomous Control of Ministry of Education, Beijing University of Technology, Beijing, China; School of Automotive Intelligent Manufacturing, Hubei University of Automotive Technology, Shiyan, China; Faculty of Computing and Informatics, Multimedia University, Malacca, Malaysia; College of Computing and Data Science, Nanyang Technological University, Singapore; Institute of Image Communication and Information Processing, Shanghai Jiao Tong University, Shanghai, China; EPFL, Lausanne, Switzerland

Abstract:
Unlike commonly seen visible light images, which can be effectively characterized within a Euclidean space, infrared images have non-Euclidean characteristics since their pixels contain rich thermal radiation information, such as heat distribution, surface temperature and thermal radiation. Considering the advantages of Graph Convolutional Networks (GCNs) in processing non-Euclidean data, this study proposes to use GCNs to estimate the quality of infrared images by developing the Node-to-Graph Regression (NGR) model. Specifically, the proposed NGR model is composed of two main steps, namely network establishment and network training. In the first step, following classical research on image quality estimation, which performs local distortion measurement followed by pooling to infer the image quality score, this study captures the local distortion of the input infrared images by stacking a set of Vision Graph (VSG) blocks to generate one node map, and then applies weighted pooling to the node map to yield the graph output as the estimated quality score. In the second step, to enhance the model's performance and generalization ability during network training, this study implements node regression with big data pre-training to improve local distortion extraction across a broad range of image scenarios and distortion intensities, and then performs graph regression using knowledge distillation to reduce the risk of over-fitting. Using the largest infrared image quality evaluation database (I2QED), this study compared the proposed NGR model with three dozen mainstream and state-of-the-art competitors, and the results showed that the proposed NGR model achieved the best performance.

Abstract:
Multimedia data have rich semantic knowledge, and cross-modal retrieval (CMR) methods are able to explore their correlations. Graph neural networks (GNNs) can represent complex connection information, so some CMR methods apply GNNs as semantic comprehenders to improve matching accuracy. However, although fine-grained classifiers can accurately obtain object-centric semantics, these semantics may conflict, potentially leading to inexplicable responses that are difficult to ground. Meanwhile, the credibility of GNNs is a concern, mainly because of their sensitivity to out-of-distribution changes and their lack of interpretability. Therefore, we attempt to integrate causal learning into GNNs and capture potential causal relationships rather than surface object-centric classification. Firstly, we analyze semantic causality and build a cross-modal structural causal model, then achieve cross-modal interventional-causal learning with a causality-inspired graph neural network (CIGNN). Secondly, we propose modality contrastive learning to characterize the intra-modal and inter-modal correlations and project them into the common representation space. Thirdly, a new soft rank loss is designed to go beyond binary similarity and achieve fine-grained similarity sorting. Comprehensive experiments on three widely used benchmark datasets show the superiority of our proposed method, while ablation experiments demonstrate the effectiveness of each component.

Abstract:
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations.

Abstract:
Implementing fine-grained emotion control is crucial for emotion generation tasks because it enhances the expressive capability of the generative model, allowing it to accurately and comprehensively capture and express various nuanced emotional states, thereby improving the emotional quality and personalization of generated content. Generating fine-grained facial animations that accurately portray emotional expressions using only a portrait and an audio recording presents a challenge. To address this challenge, we propose a visual attribute-guided audio decoupler. This yields content vectors related solely to the audio content, enhancing the stability of subsequent lip movement coefficient predictions. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. Additionally, we propose an emotion intensity control method using a fine-grained emotion matrix. Together, these achieve effective control over emotional expression in the generated videos and a finer classification of emotion intensity. Subsequently, a series of 3DMM coefficient generation networks are designed to predict 3D coefficients, and a rendering network then generates the final video. Our experimental results demonstrate that our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.

Abstract:
Adverse weather and imaging environments may degrade image quality and pose a significant challenge to the visual perception systems of multimedia. Various image restoration tasks necessitate the modeling of multiscale features, which is highly demanding on networks. To date, Vision Transformer has exhibited impressive image restoration performance. However, in this model, global self-attention is computationally expensive and local self-attention typically limits the interaction domain of each token. To solve this problem, we propose a novel reconfigurable self-attention transformer called RcFormer, which is designed to adequately model multiscale image features. This is achieved through a cross-grouped transformer (CGTransformer) block that uses convolution, area self-attention, and row-column self-attention for different head groups. CGTransformer is combined with an intragroup operation interaction structure. Moreover, an intergroup reconfigurable mechanism is implemented based on CGTransformer and channel circulation. The combination of multiple operations effectively enhances the modeling capability in the spatial and channel dimensions for various image recovery tasks. The performance of the proposed RcFormer is compared with low-level vision modules in a unified framework. Extensive experiments demonstrated that RcFormer exhibited a superior performance for the following image restoration tasks: image dehazing, rain streak removal, raindrop removal, snow removal, and single image deblurring.

Abstract:
Change captioning is a task that describes changes in image pairs using natural language. This task is more complex than single-image captioning as it requires a comprehensive understanding of each image and the ability to recognize and describe the semantic changes in image pairs. The key challenge lies in making the network generate an accurate and stable change representation under the interference of viewpoint shift. In this paper, we propose a cross-view and multi-step interaction network to generate robust change representation to resist pseudo-change. Specifically, in the intra-image representation learning stage, a cross-view interaction encoder is designed to enhance internal relationships by cross-referencing in image pairs. In the change feature learning stage, a multi-step change perceptron is employed to capture the change semantics from coarse to fine progressively. Then, a fusion module dynamically combines them as a fine-grained change representation. Besides, we propose a backward representation reconstruction module that facilitates the capture of semantic changes, thus improving the quality of captions in a self-supervised manner. Extensive experiments have shown that the method effectively captures real semantic changes under the interference of viewpoint shift and achieves state-of-the-art performance on five public datasets.

Abstract:
Semantic labels are inherently tied to geometry and luminance reconstruction, as entities with similar shapes and appearances often share categories. Traditional methods use synthesis-analysis, NeRF, or 3D Gaussian representations to encode semantics and geometry separately. However, 2D methods lack view consistency, NeRF extensions are slow, and faster 3D Gaussian methods risk spatial and channel inconsistencies between semantic and RGB. Moreover, these methods require costly manual dense semantic labels. To alleviate resource demands and achieve effective semantic reconstruction with sparse inputs while enhancing RGB rendering quality, we build upon 3D Gaussian by integrating semantic features from pre-trained models—requiring no additional ground truth input—into Gaussian features, and construct a hypergraph neural network to capture higher-order correlations across RGB and semantic information as well as between different frames. Hypergraphs use hyperedges to link multiple vertices, capturing complex relationships essential for cross-modal tasks. This higher-order structure addresses the limitations of NeRF and Gaussian methods, which lack the capacity for such advanced associations. This framework enables precise novel view synthesis and 2D semantic reconstruction without manual annotations, achieving state-of-the-art results for RGB and semantic tasks on room-scale scenes in the ScanNet and Replica datasets, while supporting real-time rendering speeds of 34 FPS.

Abstract:
Realistic image super-resolution (SR) focuses on transforming real-world low-resolution (LR) images into high-resolution (HR) ones, handling more complex degradation patterns than synthetic SR tasks. This is critical for applications like surveillance, medical imaging, and consumer electronics. However, current methods struggle with limited real-world LR-HR data, impacting the learning of basic image features. Pre-trained SR models from large-scale synthetic datasets offer valuable prior knowledge, which can improve generalization, speed up training, and reduce the need for extensive real-world data in realistic SR tasks. In this paper, we introduce a novel approach, Dual-domain Adaptation Networks, which is able to efficiently adapt pre-trained image SR models from simulated to real-world datasets. To achieve this target, we first set up a spatial-domain adaptation strategy through selectively updating parameters of pre-trained models and employing the low-rank adaptation technique to adjust frozen parameters. Recognizing that image super-resolution involves recovering high-frequency components, we further integrate a frequency domain adaptation branch into the adapted model, which combines the spectral data of the input and the spatial-domain backbone’s intermediate features to infer HR frequency maps, enhancing the SR result. Experimental evaluations on public realistic image SR benchmarks, including RealSR, D2CRealSR, and DRealSR, demonstrate the superiority of our proposed method over existing state-of-the-art models.

Abstract:
Learned image compression (LIC) methods have shown promising results and achieved superior performance compared to traditional image compression methods. However, because cross-component correlations are largely neglected, there is still room for further performance improvement. In this paper, we first explore the inter-channel correlations of different color spaces and transform the image compression problem in RGB color space into that in YUV color space, which has cross-component prior information. We propose a novel image compression method that leverages local-to-global cross-component prior modeling, utilizing a cross-component attention mechanism to improve coding performance. First, we design the cross-component prior gate (CPG) to model the cross-component prior information based on an attention mechanism. It is common knowledge in data compression that the luma component (Y) contains more detail and textural/structural information than the chroma components (UV). The proposed method can make full use of the cross-component guidance information from luma to chroma components to achieve effective image compression. Experimental results demonstrate that the proposed method can achieve superior performance compared to existing learned image compression methods. The proposed method can achieve 9.20% rate savings compared to the image compression standard Versatile Video Coding (VVC) Test Model (VTM-11.0) on the Kodak dataset.
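
As a small illustration of the RGB-to-YUV step mentioned above, the sketch below converts one pixel using the analog BT.601 coefficients. The abstract does not state which conversion matrix the method uses, so the coefficients and the function name `rgb_to_yuv` are assumptions made for this example only.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to YUV.

    Uses the analog BT.601 weights; this choice is an assumption of the
    sketch, not something stated in the abstract.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b        # luma: weighted brightness
    u = -0.14713 * r - 0.28886 * g + 0.436 * b   # chroma: blue difference
    v = 0.615 * r - 0.51499 * g - 0.10001 * b    # chroma: red difference
    return y, u, v


# A neutral gray pixel carries all of its information in the luma channel
y, u, v = rgb_to_yuv(0.5, 0.5, 0.5)
```

For a gray pixel both chroma values are numerically close to zero, which illustrates why luma tends to carry most of the structural information and can serve as guidance for the chroma components.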

Abstract:
Transformer-based trackers tend to favor fixed attention patterns for dynamic scenarios, thereby restricting the adaptive model learning capabilities across diverse situations. Furthermore, satellite targets exhibit a continuity characteristic in their motion, implying that their movement amplitude remains relatively subdued across consecutive frames. In this paper, we introduce a regularized-aware discriminative Transformer tracker (RDTracker) for satellite videos, which incorporates a cascaded discriminative Transformer (CDT) for dynamic target learning, along with a regularized-aware (RA) filter for maintaining stable tracking. Drawing inspiration from group convolution, the CDT employs a feature grading strategy for decoder inputs, which boosts accuracy while maintaining time efficiency. Besides, multiple cross-attention mechanisms are integrated with adaptive learning parameters to facilitate the dynamic enhancement of target features across diverse scenarios. This module establishes a favorable foundation for the subsequent filter learning. However, filters that depend exclusively on spatial optimization may struggle to manage model drift arising from interference from similar objects and occlusion. The RA filter builds upon this foundation with a temporal regularization term. It utilizes the Gauss-Newton method to iterate the weights, enhancing the precision of target positioning. To substantiate the efficacy of the proposed RDTracker, we test it on three public satellite video datasets. It secures top AUC scores of 50.6%, 46.2%, and 27.4% on the SatSOT, SV248S, and VISO datasets, respectively, underscoring its performance.

Abstract:
The need for more realistic 3D scene representations has fomented the development of models for a wide range of applications. In this context, solutions that attempt to model the light’s behavior through the plenoptic function have provided considerable advancements using neural-based approaches, often presenting a trade-off between rendering time and model sizes. In this work, we propose a pruning framework to reduce the sizes of these models by computing the visibility over the training data, applicable to different 3D scene representations. In particular, we implement first a solution suitable for the 3D Gaussian Splatting, and then we exemplify the solution for the Neural Radiance Fields (NeRF)-style of rendering using PlenOctrees. We show that our pruning solution produces smaller models in terms of the number of elements – be they voxels, points, or Gaussians – with minimal losses in terms of rendering novel views. We further assess our solution by combining it with state-of-the-art (SOTA) compression solutions for both rendering schemes. Results over the NeRF-Synthetic dataset show comparable metrics to the SOTA for PlenOctrees, achieving marginal gains for lower bitrates. For 3DGS, the combination of our pruning method and compression solutions achieves a compression ratio of up to 37.5 times over the uncompressed 3DGS models, with only a 0.5 dB decrease in rendering quality. When compared against other SOTA compression methods, our solution produces models 1.4 times smaller, with less than a 0.1 dB loss over novel views for synthetic data, and models 1.9 times smaller with less than 0.2 dB loss when synthesizing novel views on real-world, outdoor content.

Abstract:
Convolutional Neural Networks (CNNs) have significantly advanced Image Super-Resolution (SR), yet most CNN-based methods rely solely on pixel-based transformations, often leading to artifacts and blurring, particularly under severe downsampling rates (e.g., 8× or 16×). The recently developed text-guided SR approaches leverage textual descriptions to enhance their detail restoration capabilities but frequently struggle with effectively performing alignment, resulting in semantic inconsistencies. To address these challenges, we propose a multi-modal semantic enhancement framework that integrates textual semantics with visual features, effectively mitigating semantic mismatches and detail losses in highly degraded low-resolution (LR) images. Our method enables realistic, high-quality SR to be performed at large upscaling factors, with a maximum scaling ratio of 16×. The framework integrates both text and image inputs using the prompt predictor, the Text-Image Fusion Block (TIFBlock), and the Iterative Refinement Module, leveraging Contrastive Language-Image Pretraining (CLIP) features to guide a progressive enhancement process with fine-grained alignment. This synergy produces high-resolution outputs with sharp textures and strong semantic coherence, even at substantial scaling factors. Extensive comparative experiments and ablation studies validate the effectiveness of our approach. Furthermore, by leveraging textual semantics, our method offers a degree of super-resolution editability, allowing for controlled enhancements while preserving semantic consistency.

Abstract:
Recent advances in text-to-video generation have demonstrated the substantial superiority of diffusion models. Nevertheless, generating high-resolution videos from a text description still faces a great challenge due to the enormous computation overhead of video diffusion model training. In this paper, we present a tuning-free video diffusion approach with Spatial-Temporal LAtent Grouping (ST-LAG) for high-resolution video generation. ST-LAG exploits the prior knowledge of a pre-trained low-resolution video diffusion model for region-wise video latent denoising, and then combines all the denoised regions of the video latent into a whole to achieve global spatial-temporal coherence. Specifically, ST-LAG denoises the whole video latents via two deliberately designed modules, i.e., Spatial Latent Grouping (SLG) and Temporal Latent Grouping (TLG), at the spatial and temporal levels, respectively. SLG spatially slices the latent of each frame into different local patches, and then feeds them into the low-resolution video diffusion model for local-region latent denoising. A text re-weighting scheme is devised in SLG to strengthen the cross-attention between features of text tokens and spatial regions to facilitate the generation of spatial-level fine-grained details. TLG capitalizes on segment-level latent grouping to match the length of each denoised local segment with the frame number in the training stage. The well-aligned temporal receptive field facilitates better preservation of motion patterns. In each denoising step, all groups of video latent at the spatial and temporal levels are fused together for high-resolution video generation. Extensive experiments conducted on the ECTV-Prompt dataset demonstrate the effectiveness of our approach quantitatively and qualitatively.
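
The spatial slicing step described for SLG can be sketched as computing window coordinates over a frame's latent. The function name `slice_latent`, the stride parameter, and the decision to add a final window flush with the border are assumptions of this sketch; the abstract does not say whether patches overlap or how edges are handled.

```python
def slice_latent(height, width, patch, stride):
    """Return (top, left, bottom, right) windows covering an H x W latent.

    Windows have side `patch` and are placed every `stride` cells; a final
    window is aligned to the border so that every cell is covered.
    """

    def starts(size):
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch + 1, stride))
        if positions[-1] != size - patch:
            positions.append(size - patch)  # make the last window touch the edge
        return positions

    return [
        (top, left, min(top + patch, height), min(left + patch, width))
        for top in starts(height)
        for left in starts(width)
    ]


# A 64 x 96 latent cut into non-overlapping 32 x 32 patches gives 2 x 3 windows
windows = slice_latent(64, 96, patch=32, stride=32)
```

Each window could then be denoised by a low-resolution model and the results merged back by position; with a stride smaller than the patch size the windows overlap and the merge would need some form of blending, which this sketch leaves out.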

Abstract:
Point-supervised Temporal Action Detection (PS-TAD) is an emerging research direction for label-efficient learning. Current pseudo-label-based methods have achieved satisfactory detection performance. However, the performance gap between PS-TAD and fully-supervised methods remains significant. In this paper, we attribute such a large performance gap to the poor quality of pseudo-labels. Moreover, we propose a Pseudo-label Refinement (PseR) framework to obtain higher-quality pseudo-labels, consisting of three stages: seed proposal generation, proposal propagation, and refinement network. At the seed proposal generation stage, we use point annotations and the existing PS-TAD method to generate a pseudo-label for each point. The temporal boundaries of this pseudo-label cover the corresponding point annotation and achieve the highest confidence in the existing PS-TAD method, referred to as the seed proposal. Then, proposal propagation generates proposals with varying durations and center positions around the seed proposal through scale and center perturbations. These proposals, along with the seed proposal, form the proposal bag corresponding to the point annotation. Subsequently, within the refinement network, a selection module selects proposals within each bag close to the action instance. To further refine the selection process, a ranking module is proposed to obtain temporal confidence to assist in selecting the best proposals. Ultimately, the refinement network can generate higher-quality pseudo-labels. We conduct extensive experiments on four challenging benchmarks and demonstrate that our PseR significantly enhances the state-of-the-art PS-TAD methods, resulting in average mAP improvements of 3.7%, 3.3%, 9.3%, and 1.5% on THUMOS’14, GTEA, BEOID, and ActivityNet-1.3, respectively.
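
The proposal propagation stage can be sketched as follows. This is a minimal illustration that assumes multiplicative scale factors and center shifts expressed as fractions of the seed duration; the abstract does not specify the perturbation values, so `scales`, `shifts`, and the function name `propagate_proposals` are hypothetical.

```python
def propagate_proposals(seed, scales, shifts, video_len):
    """Build a proposal bag around a seed proposal given as (start, end).

    Each proposal rescales the seed duration by a factor in `scales` and
    moves its center by a fraction of the seed duration in `shifts`.
    Proposals are clipped to the range [0, video_len].
    """
    start, end = seed
    center = (start + end) / 2.0
    duration = end - start
    bag = [seed]  # the seed proposal is always part of the bag
    for scale in scales:
        for shift in shifts:
            new_duration = duration * scale
            new_center = center + shift * duration
            p_start = max(0.0, new_center - new_duration / 2.0)
            p_end = min(video_len, new_center + new_duration / 2.0)
            if p_end > p_start and (p_start, p_end) != seed:
                bag.append((p_start, p_end))
    return bag


seed = (10.0, 20.0)
bag = propagate_proposals(
    seed, scales=(0.5, 1.0, 1.5), shifts=(-0.25, 0.0, 0.25), video_len=60.0
)
```

The refinement network described in the abstract would then select from `bag`; that selection and ranking step is not sketched here.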

Abstract:
Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks.

Abstract:
Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task (Li et al., 2023), (Clark et al., 2023) to the more complex object detection task, by “inverting” a pretrained layout-to-image diffusion model. To this end, we propose a gradient-based discrete optimization approach to replace the heavy prediction enumeration process, and a prior distribution model to make more accurate use of Bayes' rule. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method (Li et al., 2023), (Clark et al., 2023) for classification without sacrificing accuracy. Code and models are available at https://github.com/LiYinqi/DIVE.

Abstract:
Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra pose challenges to efficiently utilizing the complementary and discrepant information across spectra. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information (e.g., text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP to unify different spectral visual features through text semantics. Specifically, we first propose online prompt learning, which uses a learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in an online manner. Then, in the absence of concrete text descriptions, we propose a multi-spectral identity-condition module that uses the identity prototype as a spectral identity condition to constrain prompt learning. Meanwhile, we construct an alignment loop that mutually optimizes the learnable text prompt and the spectral visual encoder to prevent online prompt learning from disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose a multi-spectral adapter that employs a low-rank adaptation method to learn spectra-specific features. Comprehensive experiments on 5 benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms the state-of-the-art methods.

Abstract:
Alzheimer’s Disease (AD) is a prevalent and severe neurodegenerative disorder, and early diagnosis is essential for managing disease progression. Recently, multimodal graph learning has demonstrated significant potential in integrating both medical imaging and non-imaging data, as well as uncovering relationships between patients. However, the high-dimensional nature of multimodal medical data poses significant challenges for constructing and learning modality graph structures. Moreover, existing methods are often imprecise in modeling graph structures for continuous data. To address these issues, this paper introduces a novel multimodal multi-graph fusion learning method for Alzheimer’s disease diagnosis. Specifically, multimodal state space networks (multimodal SSNs) are proposed to capture the dependencies between multimodal and high-dimensional features. Furthermore, a novel graph structure learning (KGSL) based on an initial K-nearest neighbors graph is proposed to separately construct graph structures for each modality. This method is particularly suitable for modeling the graph structures of Euclidean data. Finally, multimodal graph fusion integrates various modal graph structures into a single graph, leading to enhanced multimodal integration. In addition, this paper uses a learnable Chebyshev Graph Convolutional Network for the classification network, which enables end-to-end optimization. Experimental results demonstrate that our approach achieves excellent performance on public datasets.

Abstract:
Large-scale fine-grained image retrieval (FGIR) aims to retrieve images belonging to the same subcategory as a given query by capturing subtle differences in a large-scale setting. Recently, Vision Transformers (ViT) have been employed in FGIR due to their powerful self-attention mechanism for modeling long-range dependencies. However, most Transformer-based methods focus primarily on leveraging self-attention to distinguish fine-grained details, while overlooking the high computational complexity and redundant dependencies inherent to these models, limiting their scalability and effectiveness in large-scale FGIR. In this paper, we propose an Efficient and Effective ViT-based framework, termed EET, which integrates a token pruning module with a discriminative transfer strategy to address these limitations. Specifically, we introduce a content-based token pruning scheme to enhance the efficiency of the vanilla ViT, progressively removing background or low-discriminative tokens at different stages by exploiting feature responses and the self-attention mechanism. To ensure the resulting efficient ViT retains strong discriminative power, we further present a discriminative transfer strategy comprising both discriminative knowledge transfer and discriminative region guidance. Using a distillation paradigm, these components transfer knowledge from a larger “teacher” ViT to a more efficient “student” model, guiding the latter to focus on subtle yet crucial regions in a cost-free manner. Extensive experiments on two widely-used fine-grained datasets and four large-scale fine-grained datasets demonstrate the effectiveness of our method. Specifically, EET reduces the inference latency of ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes by 5.15% on the challenging NABirds dataset.
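
Progressive token pruning of the kind described above can be sketched as a top-k selection applied at successive stages. This is a minimal illustration, not the EET scoring rule: the per-token importance scores, the keep ratio, and the function name `prune_tokens` are hypothetical, and a real model would recompute the scores at each stage from feature responses and attention.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens while preserving their original order.

    `scores` holds one importance value per token (assumed to be given).
    Returns the kept tokens and their indices into the input list.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.05, 0.30, 0.10, 0.25, 0.02, 0.15, 0.08, 0.05]

# Stage 1: drop the lower-scoring half of the tokens.
stage1_tokens, stage1_idx = prune_tokens(tokens, scores, keep_ratio=0.5)

# Stage 2: prune again; the stage-1 scores are reused here for simplicity.
stage1_scores = [scores[i] for i in stage1_idx]
stage2_tokens, _ = prune_tokens(stage1_tokens, stage1_scores, keep_ratio=0.5)
```

Keeping the surviving tokens in their original order is a design choice of this sketch, so that positional information attached to each token remains meaningful after pruning.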

Abstract:
The human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of ⟨human, action, object⟩, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.

Abstract:
RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite the significant advancements in the performance of traditional RGB trackers, they still face challenges in maintaining accuracy and robustness in the presence of complex backgrounds, occlusions, and rapid movements. To tackle these challenges, combining visual auxiliary modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the Adaptive Multi-modal Visual Tracking with Dynamic Semantic Prompts (AMVTrack) tracker, which efficiently incorporates image descriptions and avoids text dependency during tracking to improve flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head and only optimizing a few learnable prompt parameters. Additionally, we introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts based on visual features, and the Visual-Language Fusion Adaptation (V-L FA) method, which integrates multi-modal features to ensure consistency and complementarity of information. Additionally, we partition the Image Encoder to conduct an in-depth investigation into the relationship between the importance of features across different depth and width regions. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.

Abstract:
The aim of camouflaged object detection (COD) is to discern concealed objects within the background. Due to issues such as high similarity to the surrounding environment, small size, and occlusions, COD is considered a highly challenging task. In this paper, we propose a novel COD framework, named multi-clue sliding window attention network (MCSWA-Net), which emphasizes utilizing prior knowledge at different semantic levels to guide the detection of camouflaged objects via multi-scale sliding window attention (MSWA). To this end, we first devise the dynamic local detail capture (DLC) module and the global interactive decoder (GID) module to generate both local and global guidance clues. Particularly, each block of the DLC module produces a local prior clue by processing the corresponding image features at each stage of the encoder, while the GID module fuses all adjacent encoder features and generates a global prior clue by combining fused features of multiple semantic levels. Further, to make full use of the prior clues in guiding the detection of camouflaged objects at multiple semantic levels, we design the multi-scale guidance attention fusion (MAF) module and use the two prior clues to refine the image features via group fusion and MSWA, respectively. Experiments conducted on four COD benchmark datasets demonstrate that our MCSWA-Net is superior to state-of-the-art (SOTA) COD methods. In addition, we explore the detection capabilities of our MCSWA-Net for the downstream vision tasks related to COD, such as polyp segmentation, COVID-19 lung infection segmentation, and industrial defect detection. Experimental results show that the proposed method has a high degree of generality.

Abstract:
3D Gaussian Splatting holds significant potential for high-quality visual scene rendering. However, the large number of Gaussian primitives it requires poses challenges in memory consumption and practical deployment. Existing methods often rely on empirical criteria to prune Gaussians, which inevitably compromises visual quality. To address this, we propose Adversarial Pruning Networks (APNet), a framework that employs adversarial learning to balance the reduction of redundant Gaussians with the preservation of visual fidelity. APNet comprises a Gaussian Learning and Pruning Network (GLPN) and a Discriminative Network. GLPN incorporates geometric information into the learning of Gaussians and prunes these Gaussians through a data-driven mask. Meanwhile, the Discriminative Network is trained to distinguish between synthesized and real images, acting as an adversary. Through adversarial pruning, APNet significantly reduces the number of Gaussians while rendering high-quality images. Extensive experiments on the Mip-NeRF360, Tanks & Temples, and Deep Blending datasets demonstrate that APNet achieves up to a 90% reduction in Gaussians compared with the original 3DGS while maintaining high rendering quality.

Abstract:
Anomaly detection is a key technology in quality control for automated production lines. Currently, 2D-based anomaly detection methods fail to identify geometric structure anomalies in products. To address this limitation, this paper proposes a multimodal anomaly detection model using 3D point clouds and RGB images. To ensure the single-domain inference capability of each modality, we design an attention-enhanced dual memory bank to separately store local point cloud features and RGB features. The attention mechanism enhances the informativeness and discriminability of the feature descriptors, significantly improving the data quality in the memory bank. During the inference phase, the local point cloud features in the dual memory bank guide the RGB features in calculating anomaly scores in the 2D modality. This memory-guided approach strengthens the correlation between information across different modalities. Moreover, to improve the overall segmentation precision of the model, we propose an anomaly scoring scheme based on a weight map of signed distance values. The final anomaly detection results are obtained by integrating the advantages of point cloud data in geometric structure anomaly detection and RGB data in color anomaly detection. Extensive experiments demonstrate that the proposed method achieves superior segmentation precision compared to other advanced methods on the MVTec 3D-AD and Eyecandies datasets.

Abstract:
Images suffer from color shift and detail distortion owing to limitations of contrast and visibility in hazy scenes, affecting their subjective perception. However, the performance of existing algorithms on real-world hazy images remains limited as scenes can be complex and haze degradation varies in outdoor visual systems. This study proposes a meta-knowledge single image dehazing algorithm based on hierarchical decoupling, combining the advantages of convolutional neural networks (CNNs) and transformers. We propose a novel dual-branch decoupling network that decouples low-level features from high-level semantic information in images, leveraging the hierarchical properties of the network. It combines a CNN and cross dual-branch transformer network (CDual transformer) in the encoder network to fully extract local and global features of images. To disentangle high-level semantic features, a style transfer module is designed to transform the style of hazy images while retaining the remaining semantic information. Afterward, the low-level features of images and the transformed high-level semantic features are used to reconstruct the dehazed images, fully capitalizing on the multilevel features. Furthermore, we built a meta-semi-supervised training strategy to improve the decoupling performance of the model and accumulated style knowledge of clear images from both synthetic and real-world hazy data, improving the generalizability of the model. Extensive experiments on both synthetic and real datasets show that the proposed algorithm effectively removes haze and offers better generalization abilities than similar methods.

Abstract:
Multi-view Multi-label Learning (MVML) leverages multi-view information to accurately predict multiple labels. Unfortunately, most existing MVML methods assume data completeness, making them ineffective in scenarios involving missing views or uncertain labels. Recent methods address incomplete data, yet few handle simultaneous view and label absence. To address this, we propose the Dual-view Feature-guided Fusion Learning (DFFL) framework. DFFL considers both view-specific unique features and inter-view consistent features. Specifically, DFFL constructs view uniqueness contrastive learning to ensure features within the same view maintain high semantic relevance despite missing views, while distinguishing inter-view semantics. Unlike previous methods, DFFL assumes label relevance can be reversely mapped to high-dimensional features. By establishing View-consistency learning, mutual information in the shared embedding space is maximized to achieve consistent feature alignment. In particular, DFFL minimizes the conditional entropy of the marginal distribution of multi-view features via dual prediction, deriving the maximum joint distribution for feature fusion combined with the missing view index matrix. This process effectively alleviates fusion feature suppression. Finally, the missing label index matrix is combined with fusion features to complete classification. We validate the framework on five datasets, where results demonstrate superior performance compared to state-of-the-art methods. Ablation studies further validate the effectiveness of each component.

Abstract:
Although several watermarking techniques have been proposed for spherical panoramic content, most have focused on simple leakage situations and have not addressed the various copyright leakage scenarios specific to spherical panoramic content. Such leakage scenarios are yet to be thoroughly analyzed in the literature. Diverse scenarios can occur in the case of spherical panoramic content depending on the rendering process and stage of image leakage. A distinct watermarking method is required for each scenario. In this study, six leakage scenarios for spherical panoramic content were identified, and the requirements for effective watermarking methods were examined. Without the original source information, existing watermarking techniques generally fail to protect copyrights. To this end, we propose two supplementary methods to enhance blind watermarking techniques. In the first method, a deep learning model designed for steganalysis was used to detect vertical viewpoints from perspective images without using the original source image. In the second method, a template was used to increase the robustness against spherical angle translation attacks. Using these two supplementary methods, we achieved comprehensive coverage across all scenarios that utilize existing watermarking techniques.

Abstract:
Knowledge graphs (KGs) play a key role in promoting various multimedia and AI applications. However, with the explosive growth of multi-modal information, traditional knowledge graph completion (KGC) models cannot be directly applied. This has attracted a large number of researchers to study multi-modal knowledge graph completion (MMKGC). Since MMKG extends KG to the visual and textual domains, MMKGC faces two main challenges: (1) how to deal with fine-grained modality information interaction and awareness; (2) how to ensure the dominant role of graph structure in multi-modal knowledge fusion and deal with the noise generated by other modalities during modality fusion. To address these challenges, this paper proposes a novel MMKGC model named TSAM, which integrates fine-grained modality interaction and dominant graph structure to form a high-performance MMKGC framework. Specifically, TSAM proposes the Fine-grained Modality Awareness Fusion method (FgMAF), which uses pre-trained language models to better capture the fine-grained semantic information interaction of different modalities and employs an attention mechanism to achieve fine-grained modality awareness and fusion. Additionally, TSAM presents the Structure-aware Contrastive Learning method (SaCL), which utilizes two contrastive learning approaches to align other modalities more closely with the structured modality. Extensive experiments show that the proposed TSAM model significantly outperforms existing MMKGC models on widely used multi-modal datasets.

Abstract:
Multi-modal visual signals are prevalent in emergency communications. To ensure high reliability of signal transmission under bandwidth constraints, it is crucial to compress redundant information both within and between modalities as much as possible, and ensure the fidelity of the reconstructed signals. Most existing studies depend exclusively on single-modal coding schemes and fail to effectively leverage the semantic correlations between modalities. In this paper, we introduce an end-to-end general cross-modal visual coding scheme, namely CMVC, which aims to jointly compress multi-modal visual signals (such as visible and infrared signals). First, we propose a cross-modal asynchronous entropy module that extracts common features using a cross-attention mechanism. Additionally, we enhance the accuracy of common features extraction by maximizing mutual information loss. This module further compresses multi-modal visual signals by compressing only the residual features between modalities. Second, we propose a cascaded enhancement module based on cross-modal Mamba that fuses complementary information to enhance the reconstruction quality of multi-modal visual signals. Finally, extensive experimental results demonstrate that our scheme significantly outperforms other advanced methods on visible-infrared datasets. Even at low bitrates, multi-modal visual signals can still achieve excellent reconstruction quality. Additionally, our scheme exhibits outstanding compression and reconstruction performance when applied to visible-depth signals, effectively demonstrating its robustness and generalizability.

Abstract:
Transformer-based models have recently adopted increasingly complex structures (e.g., deeper or wider stacked networks) to promote the representation learning capabilities of vision recognition. However, progressively deeper or wider stacked networks incur expensive computation costs, which hinders their effective deployment in resource-constrained edge clouds or end devices. In this paper, we propose DTSNet, a dynamic transformer slimming model, which scales vision transformers (ViTs) down across layers in terms of both model depth and input width. This is the first work to explore the joint reduction of input tokens and model parameters for ViTs while maintaining performance. Specifically, DTSNet adopts a diversity-enhanced weight sharing module to reduce network parameters, where the weight knowledge of multiple adjacent blocks is effectively integrated into one block. Furthermore, DTSNet designs a unified and massively scalable token pruning mechanism that dynamically discards less important tokens in a model-driven manner by introducing a series of discriminant parameters, which is a simple change to the common architecture of vision transformers. Extensive experiments verify that DTSNet yields high efficacy in compressing the parameter space and accelerating model inference. DTSNet-T/-S/-B on ImageNet achieves 3.0 M/11.1 M/42.9 M parameters and 0.8/2.9/13.7 GFLOPs, where the number of parameters is reduced by 48% to 51% and inference speed is improved by 1.3× to 1.5×. Experimental results on semantic segmentation and object detection datasets further demonstrate the potential of DTSNet on complex dense prediction tasks.

Abstract:
Audio-visual event localization (AVEL) refers to the identification of the category and the corresponding temporal boundaries of an event that is both visually and audibly discernible in unconstrained videos. However, the event-irrelevant background (e.g., ambient noise or visual occlusion) and event-specific modal biases often lead to audio-visual semantic inconsistency. Existing methods utilize modality-guided attention to suppress background interference, but they neglect that this attention inevitably introduces redundant or irrelevant information from the other modality. To alleviate this problem, we propose a novel Modality-Aware Gated Attention Network (MAGAN) that focuses on event-relevant visual regions, consolidates informative audio frequencies, and captures event-specific modality biases. Specifically, a cross-modal gated co-attention (CMGCA) scheme is presented for modeling the correspondence between the potential (self-guided) localization maps and the modality-guided localization maps through two gated components, i.e., audio-to-visual attention and visual-to-audio attention. Furthermore, a cross-modal gated co-interaction (CMGCI) mechanism that incorporates both unimodal gated interaction and multimodal gated interaction is introduced to capture event-specific modality biases by considering unimodal independence and multimodal synergy simultaneously. Extensive experiments on the AVE dataset demonstrate the superiority and effectiveness of our model over state-of-the-art approaches in both fully- and weakly-supervised AVE settings.

Abstract:
Deep learning has achieved significant success in stereo matching, with its training process often supervised by LiDAR measurements. However, the sparsity of real-world LiDAR data limits the ability of deep models to extract effective features from stereo images. To address this issue, a novel deep learning-based framework called sparse LiDAR point cloud supervised stereo matching (SLSM-Net) is proposed. Specifically, dense reconstruction of sparse single-frame point clouds is first designed to avoid the errors introduced by merging multi-frame point clouds. To effectively densify point clouds of objects in local areas, stereo images are utilized as supervision information to train the deep models. Furthermore, a coarse-to-fine structure of the deep model is designed for stereo matching. Second, a self-supervised learning strategy, which employs a photometric consistency constraint, is proposed along with fully supervised learning to obtain dense and precise supervision information. This stage generates coarse disparity maps from stereo images. Finally, to fully leverage the complementary characteristics of LiDAR and stereo cameras, multi-scale feature fusion of point clouds and stereo images is performed by a residual block, where the feature maps of point clouds are derived from the densification reconstruction. This stage refines the results. Experimental results indicate that SLSM-Net outperforms current state-of-the-art methods, demonstrating superior performance in stereo matching.

Abstract:
Existing Blind Super-Resolution (BSR) methods are mostly trained on artificially synthesized degradation data pairs or rely on specific degradation priors, which leads to poor performance due to the mismatch between the trained degradations and the unknown complex degradations in real-world scenarios. To tackle this problem, we propose a novel Diffusion-based Disentangled Degradation representation method for BSR, dubbed D3BSR, which disentangles arbitrary unknown degradation into structure and texture degradations to enhance perception and fidelity quality individually. Specifically, the structure degradation is optimized by degradation distribution transition with a self-supervised collaborative learning strategy to recursively minimize the perception error. The texture degradation is restored through posterior sampling controlled by a fidelity coefficient to leverage rich texture priors encapsulated in a pre-trained diffusion model for preserving fidelity. The degraded image is super-resolved using an analytical solution with the pseudo-inverse of the structural and texture degradation, which achieves a controllable trade-off between perception and fidelity and does not rely on any degradation priors or extra-supervised training. Extensive experiments on nine heavily degraded synthetic and real-world natural and face datasets demonstrate that our D3BSR outperforms SOTA methods on diverse metrics of reconstruction faithfulness and perceptual quality.

Abstract:
A neural Multimodal Machine Translation (MMT) system utilizes multimodal information, particularly images, to enhance traditional text-only models and achieve superior performance. However, the effectiveness of MMT heavily depends on the availability of extensive collections of bilingual parallel sentence pairs and manually annotated images, which poses a challenge due to the scarcity of such pairs. To address this issue, we propose incorporating the Multimodal Knowledge Graph (MMKG) for data augmentation in MMT. By utilizing MMKG as an additional source of knowledge, we can overcome the limitations of existing sentence-image pairings. This allows us to expand the original parallel corpus and generate corresponding images, creating new synthetic data pairs that facilitate effective data augmentation. Experiments conducted on two translation datasets, Multi30k and IKEA, demonstrate that the proposed MMKG enhancement method significantly improves performance across multiple baseline methods, ultimately outperforming all baseline approaches. Additionally, experiments under low-resource conditions reveal that our method achieves exceptional enhancement effects in low-resource corpora, surpassing other data augmentation baseline methods. These results indicate the efficacy and potential of the proposed method for enhancing the performance of multimodal models across diverse datasets.

Abstract:
Digital cameras consume approximately 0.1 microjoule per pixel to capture and encode video, resulting in a power usage of approximately 20 W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100×. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS), where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose an Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames. However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
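
The USS sampling rule in this abstract (exactly one sub-frame set to 1 at each spatial location) can be illustrated with a small sketch. This is not the authors' implementation: the function names `uss_mask`, `snapshot`, and `decompose`, the uniform random choice of the active sub-frame, and the toy video are all assumptions made for illustration. It covers the ideal case only; the DMD-CCD mismatch that motivates BSTFormer is not modeled.

```python
import random


def uss_mask(num_frames, height, width, seed=0):
    """Build mask[t][y][x] in {0, 1} with exactly one 1 per spatial location.

    Assumption: the active sub-frame is drawn uniformly at random.
    """
    rng = random.Random(seed)
    mask = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    for y in range(height):
        for x in range(width):
            mask[rng.randrange(num_frames)][y][x] = 1
    return mask


def snapshot(video, mask):
    """Sum the masked sub-frames into a single 2D snapshot measurement."""
    num_frames, height, width = len(video), len(video[0]), len(video[0][0])
    return [
        [sum(mask[t][y][x] * video[t][y][x] for t in range(num_frames))
         for x in range(width)]
        for y in range(height)
    ]


def decompose(measurement, mask):
    """Split the snapshot into per-frame sub-measurements.

    A pixel that a sub-frame did not sample is marked None; these are the
    holes an inpainting algorithm would have to fill.
    """
    return [
        [[measurement[y][x] if frame_mask[y][x] else None
          for x in range(len(measurement[0]))]
         for y in range(len(measurement))]
        for frame_mask in mask
    ]


# Toy usage: 4 sub-frames of a 3x3 video with distinct pixel values.
video = [[[t * 100 + y * 10 + x for x in range(3)] for y in range(3)]
         for t in range(4)]
mask = uss_mask(4, 3, 3, seed=1)
sub_measurements = decompose(snapshot(video, mask), mask)
```

Because each location is sampled by a single sub-frame, every snapshot pixel is an unmixed sample of one frame, which is why the ideal measurement splits cleanly into sparse per-frame images.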

Abstract:
In Open-Set Single Domain Generalization (OS-SDG), one only has access to a single labeled source domain for training. It assumes that the learned model generalizes well to target samples belonging to the source label space whilst classifying target samples outside the source label space into a single “unknown” class. The current method synthesizes new samples that are semantically unrelated to known classes to simulate target unknown classes. This ignores that unknown classes may actually be semantically correlated with known classes, making it difficult to classify samples at the margins of class decision boundaries as “unknown”. In this work, we introduce a Class-Aware Diversified Augmentation (CADA) method to overcome this problem. Our key idea is to explicitly synthesize multiple new unknown target classes with diversified semantics and to learn the inherent correlation among the known and unknown classes, so as both to increase the coverage of multiple target unknown classes and to optimize class margin separation. CADA is optimized by enhanced diversity maximization and class-aware minimization. The former synthesizes more novel classes by considering both semantic relationships to known classes and domain shift between the source and target domains. The latter employs class-agnostic clustering with synthesized samples to simulate class correlations among target classes, maximizing class margin separation. Theoretical analysis and experiments on five benchmarks show the efficacy of our CADA.

Abstract:
Satellite video tracking presents significant challenges due to unpredictable target variations, environmental disturbances, and occlusions. Existing approaches either rely on auxiliary modalities or require full fine-tuning of foundation models, resulting in excessive parameter sensitivity and poor generalization. Meanwhile, conventional prompt-based tuning only updates parameters at a single location, limiting its ability to adapt to complex appearance changes. To address these limitations, we propose Adaptive Visual Prompting for Effective Satellite Video Tracking (AVPTrack). Unlike conventional prompts, the introduced Super Prompts dynamically refine the original template at multiple distinct positions. This multi-location adaptation allows for fine-grained representation learning, enabling the tracker to better capture target variations and resist environmental disturbances. Additionally, Dynamic Templates are introduced to mitigate tracking failures in highly challenging scenarios, such as occlusions and background clutter, ensuring robust target localization. Furthermore, the Template Selection Adapter (TSA) selects the most relevant templates in real time, enhancing tracking efficiency. These components are optimized during training while keeping other parameters frozen, ensuring parameter efficiency. We also investigate the relationship between fine-tuning proportions and learning rates to optimize model performance. Extensive evaluations on the SV248S, SatSOT, and VISO datasets demonstrate the superior adaptability and robustness of AVPTrack compared to existing methods.

Abstract:
Kinesthetic and tactile information can represent the physical states of objects, encompassing roughness, stiffness, motion, force, and other attributes. The introduction of these can enhance imprecise recognition that relies solely on visual information in cases of light disturbance, occlusion and camouflage. Nevertheless, this task is still challenging due to the heterogeneity among visual, tactile and kinesthetic data. To address this issue, this paper delves into the alignment of heterogeneous data dimensions, the fusion of heterogeneous data features, and the optimization of learning rates for multi-source heterogeneous sensor learning models. Consequently, an effective Visual-Kinesthetic-Tactile Information Fusion (VikitaFusion) network is proposed, which comprises: 1) heterogeneous data extractors that align visual images with tactile and kinesthetic data through image-to-sequence projection; 2) a visual-kinesthetic-tactile Transformer-based domain fusion that mimics human multi-sensory fusion perception through a feature-level fusion block and dynamic fusion blocks; 3) a Periodic Triangulation Learning Rate (PTLR) method aimed at optimizing the learning rate for performance enhancement in multi-source heterogeneous sensor learning models. Extensive experiments demonstrate that VikitaFusion outperforms current state-of-the-art methods with higher recognition accuracy and a lower parameter size.

Abstract:
Cross-domain recommendation (CDR) aims to enhance recommendation accuracy in data-sparse domains by transferring knowledge from data-rich domains. Most existing CDR methods conduct knowledge transfer based on overlapping users or items to address the user cold-start problems, including few-shot (i.e., users with sparse interactions) and zero-shot (i.e., users with no interactions) scenarios. However, in real-world scenarios, such overlap is often sparse or non-existent, limiting the effectiveness of these approaches. To overcome this challenge, we propose a novel Multi-modal Prompt-tuning Framework for Non-overlapping Multi-Domain Recommendation (MPF-NMDR). MPF-NMDR transfers knowledge across non-overlapping domains, enhancing recommendation performance in both few-shot and zero-shot scenarios. Specifically, we first pre-train the MPF-NMDR framework on data from all domains to capture users’ generalized cross-domain preferences, which are learned through the generalized multi-modal interest mining module. We then conduct prompt-tuning with domain, user, and item prompts in the target domain to capture distinctions among various domains, users, and items. In this process, only the prompt parameters are fine-tuned, while all other parameters remain frozen, enabling the model to capture the distinctions among domains, users, and items while preserving the cross-domain knowledge. Extensive experiments on Amazon and Douban review datasets validate the superior performance of MPF-NMDR compared to SOTA baselines.

Abstract:
Convolutional Neural Networks (CNNs) have been widely used in semantic segmentation and can effectively extract local hierarchical information, but they are unsatisfactory in extracting global information. By contrast, Transformers are good at extracting long-distance semantic dependencies but are time-consuming. In this work, we propose a Light CNN-Transformer Dual-Branch Network (LCTDBNet) for real-time semantic segmentation. It consists of a longer CNN branch to extract local hierarchical information and a shorter Transformer branch to extract global contextual information. The CNN branch uses a lightweight encoder-decoder structure to further extract more local hierarchical information. We propose a Deep Strip Aggregation Pyramid Pooling Module (DSAPPM) to extract contextual and strip information. We further propose a Feature Pooling Refinement Module (FPRM) to optimise the feature representation at different stages. Finally, we propose a CNN-Transformer Fusion Module (CTFM) to fuse the features of two branches. Experimental results demonstrate that our proposed LCTDBNet is effective and achieves satisfactory results. Specifically, the base version of LCTDBNet achieves 80.3% mean intersection over union (mIoU) at 78.6 frames per second (FPS) on Cityscapes, 80.0% mIoU at 137.5 FPS on CamVid and 40.9% mIoU at 253.7 FPS on ADE20K.

Abstract:
Efficient model distribution is becoming increasingly critical in bandwidth-constrained environments. In this paper, we propose a simple yet effective approach called Progressive Precision Update (P^2U) to address this problem. Instead of directly transmitting the original high-precision model, P^2U transmits a lower-bit precision model, coupled with a model update representing the difference between the original high-precision model and the transmitted low precision version. With extensive experiments on various model architectures, ranging from small models (1 - 6 million parameters) to a large model (more than 100 million parameters) and using three different data sets, e.g., chest X-Ray, PASCAL-VOC, and CIFAR-100, we demonstrate that P^2U consistently achieves a better tradeoff between accuracy, bandwidth usage, and startup latency, i.e., the time it takes for the receiver to start inference. Moreover, we show that when bandwidth or startup time is the priority, aggressive quantization (e.g., 4-bit) can be used without severely compromising performance. These results establish P^2U as a practical solution for scalable and efficient model distribution across distributed environments such as federated learning, edge computing, and IoT deployments. Given that P^2U complements existing compression techniques and can be implemented alongside many compression methods, e.g., sparsification, quantization, pruning, etc., the potential for improvement is even greater.
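
The two-stage transmission described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' P^2U implementation: the uniform min-max quantizer, the function names (`quantize`, `dequantize`, `p2u_sender`, `p2u_receiver`), and the choice to keep the update as full-precision floats are all hypothetical, since the abstract does not specify how either stage is encoded.

```python
def quantize(weights, bits):
    """Uniformly quantize floats to 2**bits levels over [min, max].

    Assumption: a plain min-max uniform quantizer; the abstract does not
    name the quantization scheme.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map integer codes back to approximate float weights."""
    return [lo + c * scale for c in codes]


def p2u_sender(weights, bits):
    """Return (low-precision payload, update), where the update is the
    difference between the original weights and the dequantized version."""
    codes, lo, scale = quantize(weights, bits)
    low_precision = dequantize(codes, lo, scale)
    update = [w - l for w, l in zip(weights, low_precision)]
    return (codes, lo, scale), update


def p2u_receiver(payload, update=None):
    """Start from the low-precision model; apply the update once it arrives."""
    low_precision = dequantize(*payload)
    if update is None:
        return low_precision  # usable immediately for early inference
    return [l + u for l, u in zip(low_precision, update)]


# Toy usage with a 5-parameter "model" and 4-bit first-stage precision.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
payload, update = p2u_sender(weights, bits=4)
early_model = p2u_receiver(payload)          # available after stage one
full_model = p2u_receiver(payload, update)   # matches the original weights
```

The receiver can begin inference as soon as the small first payload arrives, which is the startup-latency benefit the abstract refers to, and recovers the original precision after the update is applied.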

Abstract:
Attention models, particularly Transformers, have significantly advanced deep learning in fields like natural language processing and computer vision by capturing contextual relationships in both sequential and spatial data. This ability is valuable for Point Clouds (PC), which are unstructured sets of points in 3D space. Transformers can effectively identify correlations between distant points, allowing them to focus on the most critical regions of the data. To demonstrate this capability, this paper proposes a novel, scalable Graph-Guided Transformer model, labeled 2GFormer, for static PC geometry. This model is built using a scalable architecture that leverages Graph Convolutions to enhance a Relational Neighborhood Self-Attention (RNSA) base layer model. Both models are integrated into the JPEG Pleno Learning-based Point Cloud Coding (JPEG PCC) standard, resulting in the creation of two attention-enabled codecs for static PC geometry coding: JPEG RNSA and JPEG 2GFormer. While JPEG RNSA codec delivers significant compression improvements for solid and dense PCs compared to the baseline JPEG PCC standard, JPEG 2GFormer extends these gains to solid, dense, and sparse PCs with only a marginal increase in model parameters. Additionally, JPEG 2GFormer outperforms both conventional and learning-based state-of-the-art PC codecs. These results position JPEG 2GFormer as a highly efficient solution for versatile PC coding.

Abstract:
Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudo-labels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.

Abstract:
Vision GNNs (ViGs) divide an image into multiple patches, treating these image patches as graph nodes. The image is represented by extracting explicit features from these patches as node features and constructing edge connections based on explicit dependencies. However, this explicit graph structure struggles to accurately capture deeper implicit dependencies. For example, at the node level, implicit relationships include the intra-group consistency of local and global features belonging to the same semantic group and the inter-group distinction of features belonging to different semantic groups. At the graph level, implicit relationships manifest in whether global consistency of edge connections can be established in the absence of direct edge connection supervision. These aspects are crucial for improving the accuracy of downstream tasks. Therefore, more effective learning of implicit dependencies in vision graph structures remains an area requiring further research. We design the Discriminative Feature Reorganization (DFR) module to address implicit dependencies at the node level. This module constructs a loss function using similarity measures between positive and negative sample feature pairs from adjacent layers of the neural network. By adjusting this loss function, the intra-group consistency and inter-group distinction of node-level local and global features can be enhanced. We also design the Graph Structure Refinement (GSR) module. This module refines the consistency of graph-level implicit relationships of edge connections through interactive supervision of two graphs learned from adjacent layers of the neural network. Experimental results show that ViDR-GNN achieves significant performance improvements in image classification, object detection, and instance segmentation tasks.

Abstract:
In the field of human-centric multimedia, text-driven human motion generation is a significant pursuit with wide-ranging applications across diverse scenarios. Despite substantial advancements, existing methods often suffer from a trade-off between inference latency and generation quality. To bridge this gap, we propose the Motion Latent Flow Matching model (MotionFlow), a novel and powerful framework for motion generation. It introduces a flow matching algorithm in the latent space, which can achieve superior performance with just one-step inference. In addition to the text-driven task, we further extend our method to controllable motion generation. Specifically, we integrate a control encoder into the latent space and further decode the predicted latent code into motion space to support explicit supervision, ensuring the synthesized motion can tightly align with the input signals. Extensive experiments demonstrate that our MotionFlow not only outperforms current leading approaches for the text-driven task, but also delivers remarkable capabilities in controllable motion generation.

Abstract:
Recently, text-driven video generation has achieved tremendous progress. However, existing methods neglect the long- and short-term frame contexts in the video, thereby compromising temporal consistency. They also encounter challenges of heavy memory costs due to the use of the standard temporal attention mechanism and misalignment between training videos and captions. Additionally, previous approaches for long video generation are flawed because they struggle to ensure content diversity and consistency. To alleviate these issues, we propose a novel Long Short-term Temporal Diffusion (LSTD) model to generate videos with superior temporal consistency. We introduce two novel temporal modules, i.e., the Short-term Temporal Convolution and the Long-term Temporal Attention. The former can learn short-term features with a shallow structure, and the latter concentrates on long-term information of complex motion with a new memory-efficient attention mechanism. The combination of the two modules can ensure the temporal consistency of the generated videos. Furthermore, a novel inference method for long video generation is also proposed, which can iteratively generate hundreds of video frames. Experimental results on UCF-101, MSR-VTT, and two long video benchmarks show that our method achieves superior zero-shot inference performance even when the size of the training data is reduced by 26.5 times.

Abstract:
Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as a generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view with object movements. Stages I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively.

Abstract:
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.

Abstract:
Cross-modal retrieval techniques, with their efficient search capabilities, have garnered considerable attention in both industry and academia. Among them, image-text cross-modal retrieval, a classic task in the cross-modal domain, has been widely applied in internet applications. However, image data in cross-modal retrieval applications often contains rich personal sensitive information, which is typically stored in plaintext format, posing a risk of privacy leakage. Due to the unique semantic and heterogeneous gap in the cross-modal space, existing single-modality image privacy protection techniques cannot be directly applied to cross-modal scenarios. To address this issue, we propose a privacy-preserving cross-modal retrieval method based on disentangled soft-label alignment (PPDSA). Firstly, to protect the privacy of image data, thumbnail-preserving encryption is used. This method aims to encrypt the image while ensuring the feasibility of image-text cross-modal feature representation learning, so as to realize cross-modal encrypted retrieval. Secondly, to further improve the retrieval effect of encrypted images and plaintext, we introduce a disentangled soft-label alignment technique. Specifically, a teacher model is used to obtain the soft-label matrix, and cross-modal and single-modal soft-label alignment techniques are designed to capture more fine-grained semantic representation information and reduce the interference of false positives on similarity recognition. Experiments on Flickr30k and MSCOCO datasets show that the proposed method not only effectively protects image privacy in cross-modal retrieval, but also improves the retrieval performance by 6.7% and 3.7% respectively compared with the CLIP baseline.

Abstract:
Dynamic point clouds, widely used in virtual reality and autonomous driving systems, often suffer from distortions due to quantization in the process of compression. These distortions significantly degrade the visual quality of dynamic point clouds, especially through temporal inconsistency. To address this issue, a temporal consistency-aware color attribute enhancement method for dynamic point clouds is proposed in this work. Specifically, a 3D Spatial-Temporal Search (STS) module is designed to adaptively search point cloud patches in the temporal domain for feature alignment. These matched patches are then individually fed into a Single Frame Feature Extraction (SFFE) module that comprises multi-head attention and graph convolution to exploit latent features of the point cloud color attribute. In addition, to further capture both the spatial and temporal dependencies, a Convolutional Point cloud Long Short-Term Memory (Conv-PointLSTM) network is applied, which integrates convolution and max pooling with the LSTM mechanism to facilitate color attribute correspondences across the spatial-temporal latent features. Experimental results demonstrate that the proposed method can achieve 0.44 dB gains on average in terms of Peak Signal-to-Noise Ratio (PSNR) and 1.50%/5.31% bit rate reductions at the low/high bit rate, which outperforms the state-of-the-art works.

Abstract:
Blind Face Restoration (BFR) involves restoring high-quality images from various unknown and severely degraded counterparts, which is a challenging task due to the conflicting objectives of content reconstruction and detail generation. In this paper, we propose a closely-coupled approach to address this problem and achieve photorealistic and faithful reproductions. Specifically, we propose a two-step image restoration model that consists of the following steps: Firstly, we train a BaseNet that incorporates a filtered feature fusion module (F^3M) to purify degraded feature maps. Secondly, while keeping the BaseNet fixed, we train a DetailNet that utilizes a feature probabilistic model to generate high-frequency detail information. The proposed framework not only separates the reconstruction and generation processes but also deeply analyzes their interactions, leading to an optimized balance between perceptual quality and fidelity. Our approach is validated through extensive experiments on both synthetic datasets and real-world facial photographs, demonstrating significant improvements in Fréchet Inception Distance (FID) scores while maintaining identity consistency. The experimental results highlight our method’s state-of-the-art performance, achieving superior visual quality and processing efficiency compared to existing methods.

Abstract:
Transformer-based Single Image Deraining (SID) methods have achieved remarkable success, primarily attributed to their robust capability in capturing long-range interactions. However, we observe that current methods handle rain-affected and unaffected regions concurrently, overlooking the disparities between these areas. This causes confusion between rain streaks and background parts and an inability to obtain effective interactions, ultimately resulting in suboptimal deraining outcomes. To address the above issue, we introduce the Region Transformer (Regformer), a novel SID method that underlines the importance of independently processing rain-affected and unaffected regions while considering their combined impact for high-quality image reconstruction. The crux of our method is the innovative Region Transformer Block (RTB), which integrates a Region Masked Attention (RMA) mechanism and a Mixed Gate Forward Block (MGFB). Our RTB is used for attention selection of rain-affected and unaffected regions and local modeling of mixed scales. The RMA generates attention maps tailored to these two regions and their interactions, enabling our model to capture comprehensive features essential for rain removal. To better recover high-frequency textures and capture more local details, we develop the MGFB as a compensation module to complete local mixed scale modeling. Extensive experiments demonstrate that our model reaches state-of-the-art performance, significantly improving the image deraining quality.

Affiliations: State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China; Department of Information, University of Pisa, Pisa, Italy; School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; College of Computing and Data Science, Nanyang Technological University, Singapore; Department of Information Engineering and Computer Science, University of Trento, Trento, Italy; Department of Computer Science, ETH Zurich, Zurich, Switzerland

Abstract:
Although data-driven methods have achieved success in 3D human pose estimation, they often suffer from domain gaps and exhibit limited generalization. In contrast, optimization-based methods excel in fine-tuning for specific cases but are generally inferior to data-driven methods in overall performance. We observe that previous optimization-based methods commonly rely on a projection constraint, which only ensures alignment in 2D space, potentially leading to the overfitting problem. To address this, we propose an Uncertainty-Aware testing-time Optimization (UAO) framework, which keeps the prior information of the pre-trained model and alleviates the overfitting problem using the uncertainty of joints. Specifically, during the training phase, we design an effective 2D-to-3D network for estimating the corresponding 3D pose while quantifying the uncertainty of each 3D joint. For optimization during testing, the proposed optimization framework freezes the pre-trained model and optimizes only a latent state. Projection loss is then employed to ensure the generated poses are well aligned in 2D space for high-quality optimization. Furthermore, we utilize the uncertainty of each joint to determine how much each joint may be adjusted during optimization. The effectiveness and superiority of the proposed framework are validated through extensive experiments on challenging datasets: Human3.6M, MPI-INF-3DHP, and 3DPW. Notably, our approach outperforms the previous best result by a large margin of 5.5% on Human3.6M.

Abstract:
Online Class-Incremental Continual Learning (OCIL) addresses the challenge of continuously learning from a single-channel data stream, adapting to new tasks while mitigating catastrophic forgetting. Recently, Mutual Information (MI)-based methods have shown promising performance in OCIL. However, existing MI-based methods treat various knowledge components in isolation, ignoring the knowledge confusion across tasks. This narrow focus on simple MI knowledge alignment may lead to old tasks being easily forgotten with the introduction of new tasks, risking the loss of common parts between past and present knowledge. To address this, we analyze the MI relationships from the perspectives of diversity, representativeness, and separability, and propose an Enhanced Mutual Information (EMI) method based on knowledge decoupling. EMI consists of Diversity Mutual Information (DMI), Representativeness Mutual Information (RMI) and Separability Mutual Information (SMI). DMI diversifies intra-class sample features by considering the similarity relationships among inter-class sample features to enable the network to learn more general knowledge. RMI summarizes representative features for each category and aligns sample features with these representative features, making the intra-class sample distribution more compact. SMI establishes MI relationships for inter-class representative features, enhancing the stability of representative features while increasing the distinction between inter-class representative features, thus creating clear boundaries between classes. Extensive experimental results on widely used benchmark datasets demonstrate the superior performance of EMI over state-of-the-art baseline methods.

Abstract:
As audio adversarial attacks continue to evolve, Automatic Speech Recognition (ASR) models have emerged as a significant target. Traditional audio attack methods often focus on minimizing perturbation magnitude and frequency, overlooking the importance of perturbation location. However, certain audio regions hold lower importance for ASR models, making attacks on these regions less effective and more perceptible as noise. Additionally, the human ear perceives noise differently depending on its placement within the audio sequence, with noise in silent segments being more noticeable. To address these challenges, this paper proposes Pitch Sparse Audio Attack (PASK), an innovative framework designed to enhance adversarial imperceptibility through sparse perturbations. PASK introduces two key techniques: Pitch Mapping, which provides a strategic starting point for perturbation, and an adaptive grouped selective mask that achieves targeted sparsity, focusing perturbations on high-impact audio regions. Experimental results demonstrate that PASK outperforms existing methods in both effectiveness and imperceptibility. Furthermore, a human study confirms that silent-segment perturbations are more easily detected, underscoring the perceptual advantages of our approach.

Abstract:
With the wide spread of video, video watermarking has become increasingly crucial for copyright protection and content authentication. However, video watermarking still faces numerous challenges. For example, existing methods typically have shortcomings in terms of watermarking capacity and robustness, and there is a lack of a specialized noise layer for High Efficiency Video Coding (HEVC) compression. To address these issues, this paper introduces a Deep Invertible Network for Video watermarking (DINVMark) and designs a noise layer to simulate HEVC compression. This approach not only increases watermarking capacity but also enhances robustness. DINVMark employs an Invertible Neural Network (INN), where the encoder and decoder share the same network structure for both watermark embedding and extraction. This shared architecture ensures close coupling between the encoder and decoder, thereby improving the accuracy of the watermark extraction process. Experimental results demonstrate that the proposed scheme significantly enhances watermark robustness, preserves video quality, and substantially increases watermark embedding capacity.

Abstract:
Recent progress in recognizing emotions through gait analysis has attracted substantial interest. Spatial temporal graph convolutional networks (ST-GCN) have been applied to extract gait features effectively, enabling enhanced emotion recognition. However, existing methods do not account for subtle movement cues that are intricately linked to human emotions, resulting in a lack of representation of emotion intensity. Additionally, these methods fail to consider the cyclic nature of gait, focusing only on local dependencies in the temporal domain. To tackle these limitations, we propose an innovative three-stream graph neural network model, PMF-GCN (Posture-Movement-Frequency-enhanced Graph Convolutional Network). First, for the first time in gait emotion recognition, we integrate movement features from translational and rotational perspectives with frequency-domain data to capture emotion intensity and global dependencies. Then, we devise a novel adaptive feature fusion mechanism (TR-AFM), which extracts spatial-temporal-specific emotion features from gaits through temporal and spatial attention mechanisms, as well as gated units. Comprehensive experiments on two public datasets show that PMF-GCN achieves state-of-the-art performance in gait emotion recognition.

Affiliations: Beijing Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; School of Information Science and Technology, Beijing University of Technology, Beijing, China; School of Artificial Intelligence, Xidian University, Xi’an, China; Laboratory for Image and Video Engineering (LIVE), Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, Boulder, CO, USA

Abstract:
Omnidirectional Video Quality Assessment (OVQA) is a challenging task due to the limited availability of adequate numbers of training samples for learning representations of distortions on omnidirectional videos. The recent masked autoencoder (MAE) has shown promising performance in learning local and global representations in a self-supervised way, and can be used to attempt to mitigate the difficulty of having insufficient annotated samples to adequately train omnidirectional video quality prediction models. However, the reconstruction tasks that MAE models are designed for do not pertain to predicting diverse perceptual distortions, especially those relevant to the task of OVQA. We have attempted to overcome these limitations to harness and apply the power of the MAE concept to the OVQA problem. Towards this purpose, we create a Distortion-Sensitive Masked AutoEncoder (DS-MAE) that is able to represent perceptual distortions on omnidirectional videos. DS-MAE extracts viewports from omnidirectional videos and employs a masked autoencoding module (MAM) and a knowledge replay module (KRM) to learn representations on each viewport. In the MAM, distorted patches from omnidirectional videos are masked by replacing them with undistorted counterparts. The autoencoder is trained to reconstruct the masked distortions, imbuing it with the ability to represent diverse video degradations. The KRM extracts and stores content representations, which are then “replayed” to mitigate potential catastrophic forgetting of content during training of the DS-MAE. Finally, a simple OVQA model is constructed using the pre-trained DS-MAE across all viewports. The new model, called OmniVQA, was tested on three public OVQA datasets. The experimental results show that OmniVQA delivers competitive performance against all compared models.

Abstract:
Image restoration aims to restore high-quality images from degraded inputs caused by factors such as motion blur, defocus blur, and rain, where the primary difference between degraded and high-quality images lies in their high-frequency components. Despite the critical role of high frequencies in restoration, few methods explicitly prioritize computational resources for high frequencies over low frequencies. To address this issue, we propose a High-Frequency Prioritized Sparse Attention Network (HFP-SAN), a novel architecture for image restoration tasks. We explicitly prioritize high-frequency components by designing a symmetric encoder-decoder framework integrated with High-Frequency Selective Sparse Attention (HFSSA) modules while handling low-frequency components with a smaller residual network, thereby proportionally allocating computational resources based on their relative importance. HFSSA incorporates a Frequency-Selective Matching (FSM) algorithm to focus attention on strongly correlated high-frequency regions, mitigating computation on areas with weak correlations and irrelevant areas. Additionally, we introduce a dynamically adjustable high-frequency mask that guides the network to focus on the severely degraded regions, further refining restoration quality. The above designs ensure the final reconstructed image is a high-quality product. Experiments demonstrate that our HFP-SAN achieves state-of-the-art performance across multiple image restoration tasks, both quantitatively and qualitatively.

Abstract:
Deep neural networks (DNNs) are well-known to be susceptible to many universal adversarial perturbations (UAPs), where each UAP can successfully attack many images when added to the input. In this paper, we explore the existence of diversified UAPs, each of which successfully attacks a large but substantially different set of images. Since the sets of images successfully attacked by different UAPs are often complementary to each other, strategically selecting the most effective UAP to attack each new image could maximize the overall coverage of successful attacks. Following this insight, we propose a novel attack framework named boosting universal adversarial attack. The key idea is to simultaneously train a set of diversified UAPs and a selective neural network, such that the selective neural network can choose the most effective UAP when attacking a new target image. Due to the simplicity and effectiveness of the proposed boosting attack framework, it can be generally used to significantly boost the attack effectiveness of many classic single-UAP methods that only use a single UAP to attack all target images. Meanwhile, the boosting attack framework is also able to perform real-time attacks as it does not require any additional training or fine-tuning when attacking new target images. Extensive experiments demonstrate the outstanding performance of the proposed boosting attack framework.
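
The coverage argument in this abstract can be illustrated with a small toy sketch. Everything below is a hypothetical illustration rather than the paper's method: the `success` matrix is made-up data, and `selected_uap_coverage` assumes an ideal selector that always picks a working UAP when one exists, whereas the paper trains a selective neural network for this choice.

```python
def single_uap_coverage(success):
    """Best fraction of images attacked when one UAP is used for every image."""
    n_images = len(success[0])
    return max(sum(row) for row in success) / n_images


def selected_uap_coverage(success):
    """Fraction of images attacked when an ideal selector picks a UAP per image."""
    n_images = len(success[0])
    covered = sum(1 for j in range(n_images) if any(row[j] for row in success))
    return covered / n_images


# Hypothetical toy data: success[i][j] is True if UAP i fools the model on image j.
success = [
    [True, True, False, False, True, False],
    [False, False, True, True, True, False],
]

single = single_uap_coverage(success)      # one UAP for all images
selected = selected_uap_coverage(success)  # per-image choice among both UAPs
```

In this toy case each UAP alone attacks half of the images, while per-image selection covers five of the six, because the two success sets are largely complementary.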

Abstract:
Generalized category discovery (GCD) is an important and challenging task in open-world learning. Specifically, given some labeled data of known classes, GCD aims to cluster unlabeled data that contain both known and unknown classes. Current GCD methods based on parametric classification adopt the DINO-like pseudo-labeling strategy, where the sharpened probability output of one view is used as supervision information for the other view. However, large pre-trained models have a preference for some specific visual patterns, resulting in encoding spurious correlation for unlabeled data and generating noisy pseudo-labels. To address this issue, we propose a novel method, which contains two modules: Loss Sharpness Penalty (LSP) and Dynamic Anchor Selection (DAS). LSP enhances the robustness of model parameters to small perturbations by minimizing the worst-case loss sharpness of the model, which suppresses the encoding of trivial features, thereby reducing overfitting to noisy samples and improving the quality of pseudo-labels. Meanwhile, DAS selects representative samples for the unknown classes based on KNN density and class probability during model training and assigns hard pseudo-labels to them, which not only alleviates the confidence difference between known and unknown classes but also enables the model to quickly learn a more accurate feature distribution for the unknown classes, thus further improving the clustering accuracy. Extensive experiments demonstrate that the proposed method can effectively mitigate the noise of pseudo-labels and achieve state-of-the-art results on multiple GCD benchmarks.

Abstract:
Event cameras hold great potential for motion deblurring because they capture motion information with microsecond precision, offering robustness to motion blur. However, the limited interaction between RGB frames and event streams presents a significant challenge, preventing the full utilization of the event cameras’ unique advantages. To address this, we propose Dual frame-event Interaction and introduce a multi-scale Network structure, DuInt-Net. DuInt-Net aims to tackle two key challenges: (1) enhancing the representational and interaction capabilities between RGB frames and event streams, and (2) adaptively selecting richer visual features for improved motion deblurring. We introduce an event-frame joint interaction module that consists of three branches: a base branch, a global awareness attention branch, and a local enhancement attention branch. The base branch processes essential pixel-level features that retain the original structural information. The global branch integrates event data to improve large-scale motion understanding, while the local branch uses large-kernel convolutions to refine fine-grained details in RGB frames. For superior reconstruction performance, we also propose the event-guided multi-scale fusion attention module, which effectively combines local visual information and global frame-event relationships. Extensive experiments demonstrate that DuInt-Net achieves superior performance, both quantitatively and qualitatively, showcasing its superior motion deblurring capabilities.

Abstract:
We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion models. This initial foray into the application of untrained diffusion models in virtual try-on technology potentially paves the way for further exploration and refinement in this industrially and academically valuable field.

Abstract:
Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6 M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., 21 M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets.

Abstract:
In light of their capability to capture structural information while reducing computing complexity, anchor graph-based multi-view clustering (AGMC) methods have attracted considerable attention in large-scale clustering problems. Nevertheless, existing AGMC methods still face the following two issues: 1) They directly embed diverse anchor graphs into a consensus anchor graph (CAG) and hence ignore the redundant information and substantial noise contained in these anchor graphs, leading to a decrease in clustering effectiveness; 2) They lose effectiveness and efficiency because clustering indicators are acquired through independent post-processing. To overcome these issues, we propose a novel one-step multi-view clustering method with adaptive low-rank anchor-graph learning (OMCAL). To construct a high-quality CAG, OMCAL provides a nuclear norm-based adaptive CAG learning model against information redundancy and noise interference. Then, to boost clustering effectiveness and efficiency substantially, we incorporate category indicator acquisition and CAG learning into a unified framework. Extensive experiments conducted on ordinary and large-scale datasets indicate that OMCAL outperforms existing state-of-the-art methods in terms of clustering effectiveness and efficiency.

Abstract:
Weakly supervised text-based person re-identification (Text-ReID) confronts the challenge of matching target person images with textual descriptions, hindered by the absence of identity annotations during training. Traditional approaches, which rely solely on global features, overlook the rich, fine-grained information within both text and image modalities. Besides, merely aligning features at the semantic level is insufficient due to the significant differences in feature representation spaces between the two modalities. Existing methods also neglect the information inequality caused by person-irrelevant factors in images. In this paper, we introduce a novel framework called Attribute-Centric Cross-modal Alignment (ACCA), specifically designed to overcome these issues. Our approach concentrates on two main aspects: visual-text attribute alignment and prediction distribution alignment. To effectively capture fine-grained information without identity labels, we implement a visual-text attribute alignment method based on momentum contrastive learning to synchronize visual and textual attribute features within a unified embedding space. We also propose a unique strategy for negative sample filtering and enrichment, creating robust and comprehensive negative attribute sample spaces to support the attribute alignment. Additionally, we establish two methods of label-free prediction distribution alignment to encourage the learning of invariant feature representations across modalities. The first method, bias-reduction distribution alignment, aligns features and predictions within each text-image pair by utilizing semantic information from the text and reduces the impact of person-irrelevant factors in images. The second method, global-attribute distribution alignment, enhances the interaction between global and local prediction distributions across visual and textual modalities. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets validate the superior performance of our approach across all standard benchmarks.

Abstract:
The active speaker detection task is to determine whether a person is speaking or not across a series of video frames. Existing methods heavily rely on facial information within the annotated face bounding boxes for cross-modal learning with audio. This leads to a substantial decline in detection performance when facial cues are unclear, such as in cases of face occlusion or low-resolution facial appearances. In this paper, we extend the perception scale using only face bounding box annotations to model both facial and gestural cues, addressing the over-reliance on facial cues in active speaker detection. We propose a novel graph neural network that models inter-speaker interactions and integrates various cues from individual speakers. The final detection results are obtained through a binary graph node classification task. Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker dataset (mAP: 95.6%) and the ASW dataset (mAP: 99.4%), with a model size only 21% that of the second-best method. Additionally, when facial cues are of poor quality, our method demonstrates a significant performance advantage over existing approaches.

Affiliations: School of Software, Dalian University of Technology, Dalian, China; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang, China; China Telecommunication Technology Laboratory, China Academy of Information and Communications Technology, Beijing, China; Computer Science and Engineering Department, National Institute of Technology Patna, Bihar, India; School of Computer Science and Technology, Xidian University, Xi’an, China

Abstract:
Nowadays, massive numbers of facial images have been tampered with and then widely spread through social networks. Many studies have developed algorithms for frame-level DeepFake detection. However, they have low robustness due to their focus on tamper-independent features during training. To this end, we propose a framework, namely MIF-Net, based on multi-information fusion for robust frame-level DeepFake detection. Specifically, key landmarks and the facial area are first detected in the original frame. Then, the graph convolutional network constructs biometric information from these landmarks. Meanwhile, the facial region is processed into multi-view inputs by noise and edge enhancement algorithms. Finally, these products are encoded as high-level features and classified as real or fake. Five benchmark datasets are utilized for testing our model through within-dataset and cross-dataset validations. Extensive experimental results demonstrate that our proposed MIF-Net is robust and has advantages over peer algorithms.

Abstract:
Missing data is a common issue in real-world applications, posing significant challenges for incomplete data processing. Traditional incomplete multi-view clustering methods rely on manually designed optimization problems based on prior interpretable knowledge, considering the full utilization of available data. However, their limited feature extraction capability may become a bottleneck. In contrast, deep optimization methods leverage learning-based nonlinear transformations for clustering. They primarily achieve data imputation through the generalization ability of deep models, but their interpretability may be limited by their black-box nature. Moreover, most existing methods only explore the structure of each view independently, where these structures are fixed and cannot form a complete unified structure. To address these issues, we propose a Structural Optimization-inspired Interpretable Network (SOI-Net) for incomplete multi-view clustering. Specifically, we project the features of all views into a unified representation space with the un-missing information of the views as constraints. By optimizing consistent structural information, we preserve the structures of missing modalities in the unified representation space, thereby mitigating the impact of missing data. Meanwhile, we derive network components based on the optimization problem to guide the learning of structure and representation. The practical significance of these network components provides model design-level interpretability. Extensive experiments on six datasets validate the effectiveness of SOI-Net in handling incomplete multi-view clustering tasks.

Abstract:
The rapidly increasing popularity of immersive multimedia services such as live holographic communication represents the future trend of extended reality (XR) applications. However, the realization of such immersive and interactive experiences is limited by the lack of fundamental understanding of how different user behaviours and environmental factors jointly affect the overall quality of experience (QoE). In particular, compared with the media adaptation mechanisms applied in conventional video applications, considerably more independent factors may influence user QoE in these applications, including both human- and network-related factors. In this paper, we investigate the fundamental design principles of dynamic media adaptation methods for live holographic communication by holistically considering these factors. Specifically, a machine learning-based scheme is introduced to facilitate intelligent adaptation of both the frame quality (resolution level) and frame rate according to specific contexts, such as the user intent/behaviour (including object motion patterns and user movements) and real-time network conditions. Extensive real-world experiments are conducted to assess the feasibility and performance of the proposed method, and comparisons with state-of-the-art methods are performed. The results indicate that the proposed approach can effectively satisfy user intent, with increased user QoE.

Abstract:
Referring Image Segmentation (RIS) aims to generate specified target masks in the image using natural language. While existing methods have made progress in modeling the relationship between words and pixels, they often overlook sentence-level semantic information. This limits the model's ability to fully comprehend the deeper meaning of language, affecting target localization and segmentation. To address this problem, we propose a Context-aware Mutual Attention Network (CMANet), which integrates both word-level and sentence-level semantic information to guide visual features in generating precise object masks. Specifically, during the feature encoding stage, we design a Shallow Mutual Attention (SMA) module to reduce the discrepancy between visual and linguistic representations, enhancing pixel-word alignment. In the global representation stage, we introduce a Context-aware Mutual Attention (CMA) module that utilizes sentence-level target semantics to guide the contextual representation of multi-modal features. Experiments conducted on several commonly used RIS datasets, including the natural image referring segmentation dataset, Robust RIS dataset, and referring remote sensing image dataset, show that CMANet outperforms current state-of-the-art methods on all these datasets, demonstrating superior segmentation accuracy.

Abstract:
Low-light image enhancement is critical for improving image visibility and perceptual quality in real-world nighttime conditions. Existing approaches that rely solely on single RGB images often fail to recover details in extremely dark regions, thereby limiting their practical effectiveness. Recent multimodal techniques that utilize both RGB and infrared (IR) modalities have shown potential, yet they still struggle to achieve optimal exposure, natural color restoration, and efficient cross-modal feature fusion. To address these limitations, we propose a Mamba-Based Progressive-Recovery Framework for multimodal low-light image enhancement. The proposed framework consists of three stages. (1) Illumination Estimation: A lightweight network estimates the global brightness of the low-light RGB and IR inputs. (2) Illumination Corruption Restoration: A Mamba-UNet architecture refines the illumination and restores detailed spatial information. (3) Multimodal Fusion Enhancement: A CNN-Mamba encoder-decoder and a global-local cross-modal feature fusion network are designed to effectively integrate complementary RGB-IR information, producing enhanced images with balanced exposure and vivid color representation. Extensive experiments on the LLVIP and MSRS datasets demonstrate that the proposed method outperforms ten state-of-the-art single- and multimodal enhancement techniques. Our framework achieves consistent improvements across multiple evaluation metrics, including Edge Intensity, Entropy, Average Gradient, NIQE, and BRISQUE. Specifically, it yields an average 1.94% gain in Edge Intensity on both datasets and ranks second in Spatial Frequency, verifying its effectiveness in restoring fine structural details and improving overall image perceptual quality.

Abstract:
Copy-move forgery detection (CMFD) is a technique tailored to detect the existence of copy-move regions in a query image. In this paper, a dual-view CMFD network named DV-Net is proposed. It combines the similarity information and tampered features from shallow features, both conducive to copy-move region localization, by using dual-view self-correlation calculation (DV-SCC) and the shallow similarity attention module (SSAM), and strengthens the ability to distinguish source/target regions by passing deep features through three serial adaptive receptive field selection modules (ARFSMs). The SCC plays an irreplaceable role in identifying copy-move regions. However, single-view SCC, such as the cosine similarity or the Euclidean distance, can only capture the similarity information from a single perspective. DV-SCC, a combination of Euclidean distance and cosine similarity, provides more comprehensive similarity information from numerical and directional perspectives. In addition, previous CMFD networks only utilize the similarity information to locate similar regions while neglecting the tampered features contained in the shallow features, which are of vital importance to CMFD. In contrast, we convert the similarity information into the SSAM and apply the SSAM to the shallow features to emphasize the similarity information while preserving tampered features, significantly enhancing the localization accuracy of source/target regions. The serial ARFSMs, each containing two parallel branches controlled by a soft attention, can adaptively select appropriate receptive fields according to the scales of tampered regions, improving the classification accuracy of source/target regions. The experimental results show that DV-Net outperforms several advanced algorithms in source/target region localization and discrimination on three publicly available datasets.
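
The difference between the two similarity views named in this abstract can be seen in a minimal sketch. The function names and the toy feature vectors below are hypothetical, and the sketch only computes the two pairwise matrices; how DV-Net combines them inside DV-SCC is not specified here and is not reproduced.

```python
import math


def cosine_similarity(a, b):
    """Directional view: cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def euclidean_distance(a, b):
    """Numerical view: straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dual_view_self_correlation(features):
    """Compare every feature vector with every other one under both views."""
    n = len(features)
    cos = [[cosine_similarity(features[i], features[j]) for j in range(n)]
           for i in range(n)]
    dist = [[euclidean_distance(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    return cos, dist


# Hypothetical per-location features: 0 and 1 share a direction but not a magnitude.
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
cos, dist = dual_view_self_correlation(features)
```

Here locations 0 and 1 are indistinguishable under cosine similarity but separated under Euclidean distance, which is the kind of complementary information a single-view self-correlation would miss.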

Abstract:
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework, tailoring adaptive objectives to individual points based on their ambiguity levels. As a result, our method promotes model training that ensures the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to make the ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method.

Abstract:
Existing multimedia recommender systems suggest media, such as games and movies, to users by evaluating similarities. To enhance the semantics and explainability of embeddings, it is common practice to incorporate additional information (e.g., interactions, contexts, popularity). However, without systematic consideration of representativeness and value, the utility and explainability of embeddings drop drastically. Hence, we introduce RVRec, a plug-and-play model-agnostic embedding enhancement approach that can improve both the personalization and explainability of existing systems. Specifically, we propose a probability-based embedding optimization method that uses a contrastive loss based on negative 2-Wasserstein distance to enhance the representativeness of the embeddings. In addition, we introduce a reweighing method based on a multivariate Shapley value strategy to evaluate and explore the value of interactions and embeddings. Extensive experiments on multiple backbone recommenders and real-world datasets show that RVRec can improve the personalization and explainability of existing recommenders, outperforming state-of-the-art baselines.
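
To make the "negative 2-Wasserstein distance" concrete, the sketch below uses the standard closed form for two Gaussians with diagonal covariances, where the squared distance is the squared difference of the means plus the squared difference of the standard deviations. Treating embeddings as diagonal Gaussians is an assumption made for this illustration, the function names are hypothetical, and the contrastive loss that RVRec builds on top of this score is not reproduced.

```python
import math


def w2_distance_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, diag(sigma1^2)) and N(mu2, diag(sigma2^2))."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma1, sigma2))
    return math.sqrt(mean_term + std_term)


def negative_w2_similarity(mu1, sigma1, mu2, sigma2):
    """Similarity score: a larger (less negative) value means closer distributions."""
    return -w2_distance_diag_gaussians(mu1, sigma1, mu2, sigma2)


# Hypothetical probabilistic embeddings given as (mean, standard deviation) pairs.
user = ([0.0, 0.0], [1.0, 1.0])
item = ([3.0, 0.0], [1.0, 5.0])

distance = w2_distance_diag_gaussians(*user, *item)
score = negative_w2_similarity(*user, *item)
```

Unlike a distance between point embeddings, this score also penalizes a mismatch in spread, so two embeddings with the same mean but different uncertainty are not treated as identical.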

Abstract:
Multi-modal learning has become a transformative approach in recommendation systems, leveraging diverse data types—such as visual, textual, and audio signals—to construct rich and comprehensive user preference profiles. Despite significant progress, existing methods often struggle with key challenges, including imbalanced data utilization, diverse inter-modal correlations, and task-specific variability, which limit their ability to fully exploit inter-modal relationships. To address these issues, we propose KANM^2L (KAN Enhanced Multi-modal Learning for Recommendation), a novel framework that integrates the strengths of the Kolmogorov-Arnold Network (KAN) with multi-modal learning. Specifically, KANM^2L: (1) introduces a KAN-enhanced dilated attention mechanism to effectively capture high-dimensional, complex visual dependencies, enabling scalable and efficient processing of intricate datasets; (2) employs a multi-modal adversarial network to align and fuse features across modalities, ensuring seamless integration and improved recommendation accuracy; and (3) incorporates a rotational loss function to stabilize and refine visual feature embeddings, leveraging historical interaction data for more consistent performance. Extensive experiments on real-world datasets demonstrate that KANM^2L achieves state-of-the-art performance, with improvements of up to 11.7% over existing methods. These findings underscore the potential of KANM^2L to advance the field of multi-modal recommendation systems by overcoming critical limitations and delivering robust, scalable performance across diverse recommendation tasks.

Abstract:
Singing, a common facial movement second only to talking, can be regarded as a universal language across ethnicities and cultures and plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven 3D facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a large-scale high-quality multi-modal singing head dataset, SingingHead, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we benchmark existing audio-driven 3D facial animation methods and 2D talking head methods on the singing task. Existing 3D facial animation methods and 2D talking head methods fail to produce satisfactory singing results. Focusing on the 3D singing head animation, we first utilize the proposed singing-specific dataset to retrain the 3D facial animation methods, resulting in substantial performance improvements. Besides, considering the absence of background music and the slow generation speed of existing methods, we propose a simple but efficient non-autoregressive VAE-based framework with background music as an input signal to generate diverse and accurate 3D singing facial motions in real time. Extensive experiments demonstrate the significance of the SingingHead dataset in promoting the development of singing head animation.

Abstract:
Multi-Object Tracking (MOT) in Uncrewed Aerial Vehicles (UAV) aims to continuously and stably detect and track objects in videos captured by UAVs. In existing MOT tracking-by-detection schemes, a tracker with a fixed step size is always employed, and a fixed length of past tracking information is input to the tracker to guide position prediction. However, the limited prediction range of a single-scale tracker leads to frequent tracking losses, and limited historical information also reduces tracking accuracy. To address these limitations, we propose a novel Long-Short Match (LSMTrack) tracking method. The key idea is to use long and short trackers and maintain a long-term motion state to improve tracking performance, thus reducing the likelihood of entering the lost status. To this end, a new Mamba-based tracker and a long-short match strategy are proposed. The long and short trackers share the same Mamba-based architecture. Unlike the previous Mamba-based approach, the proposed tracker maintains a long-term state while updating the state and making position predictions in each time step, so we call it a step Mamba tracker. Meanwhile, we devise a long-short match strategy at the inference stage to integrate long and short trackers, and design a lost control operation which updates the long-term states using historical state values. In this way, the matching probability and the inference efficiency are guaranteed. Experimental results on two UAV MOT datasets confirm the state-of-the-art performance. Specifically, the best results are achieved on two popular tracking evaluation metrics, MOTA and IDF1.

Abstract:
Deep neural networks suffer from catastrophic forgetting when continually learning new concepts. In this paper, we analyze this problem from a data imbalance point of view. We argue that the imbalance between old task and new task data contributes to forgetting of the old tasks. Moreover, the increasing imbalance ratio during incremental learning further aggravates the problem. To address the dynamic imbalance issue, we propose Uniform Prototype Contrastive Learning (UPCL), where uniform and compact features are learned. Specifically, we generate a set of non-learnable uniform prototypes before each task starts. Then we assign these uniform prototypes to each class and guide the feature learning through prototype contrastive learning. We also dynamically adjust the relative margin between old and new classes so that the feature distribution remains balanced and compact. Finally, we demonstrate through extensive experiments that the proposed method achieves state-of-the-art performance on several benchmarks, including CIFAR-100, ImageNet-100, TinyImageNet, Food-101, and CUB-200. Experimental results show that our approach not only effectively addresses the issue of imbalanced old data in memory but also tackles the problem of imbalanced new data distributions.

Abstract:
Recent diffusion models have demonstrated exceptional efficacy across various image restoration tasks, but still suffer from time-consuming inference and substantial computational resource consumption. To address these challenges, we present LPCDiff, a novel Laplacian Pyramid-based Conditional Diffusion model designed for real-scene image dehazing. LPCDiff leverages the Laplacian pyramid decomposition to decouple the input image into two components: the low-resolution low-pass image and the high-frequency residuals. These components are subsequently reconstructed through a diffusion model and a well-designed high-frequency residual recovery module. With such a strategy, LPCDiff can substantially accelerate inference speed and reduce computational costs without sacrificing image fidelity. In addition, the framework empowers the model to capture intrinsic high-frequency details and low-frequency structural information within the image, resulting in sharper and more realistic haze-free outputs. Moreover, to extract more valuable information from the limited training data, we introduce a low-frequency refinement module to further enhance the intricate details of the final dehazed images. In extensive experiments, our method significantly outperforms 12 state-of-the-art approaches on three real-world and one synthetic image dehazing benchmarks.
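
The Laplacian pyramid decomposition mentioned above can be sketched as follows. This is a simplified illustration, not the LPCDiff implementation: it uses 2x2 average pooling and nearest-neighbor upsampling in place of the Gaussian filtering of a classical Laplacian pyramid, assumes image sides divisible by 2 at every level, and all function names are hypothetical.

```python
def downsample(img):
    """Halve resolution by averaging each 2x2 block (box filter, an assumption)."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w // 2)
        ]
        for i in range(h // 2)
    ]


def upsample(img):
    """Double resolution by nearest-neighbor replication."""
    return [
        [img[i // 2][j // 2] for j in range(2 * len(img[0]))]
        for i in range(2 * len(img))
    ]


def decompose(img, levels):
    """Return (low-resolution low-pass image, list of high-frequency residuals)."""
    residuals = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        up = upsample(low)
        # The residual stores what the low-pass image cannot represent.
        residuals.append(
            [[c - u for c, u in zip(row_c, row_u)] for row_c, row_u in zip(current, up)]
        )
        current = low
    return current, residuals


def reconstruct(low, residuals):
    """Invert decompose(): upsample and add residuals from coarse to fine."""
    current = low
    for res in reversed(residuals):
        up = upsample(current)
        current = [[u + r for u, r in zip(row_u, row_r)] for row_u, row_r in zip(up, res)]
    return current
```

Because the low-pass image has a quarter of the pixels per level, an expensive model applied only to that component processes far fewer pixels, while the residuals carry the detail needed to restore full resolution.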

Affiliations: School of Computer Science, China University of Geosciences, Wuhan, China; National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan, China; Guangzhou Urban Planning and Design Survey Research Institute and Guangdong Enterprise Key Laboratory for Urban Sensing, Monitoring and Early Warning, Guangzhou, China; School of Mathematical Sciences, University of Science and Technology of China, Hefei, China

Abstract:
Point cloud reconstruction is an ingredient in geometry modeling, computer graphics, and 3D vision. In this paper, we propose a novel unsupervised learning method called the Recurrent Multi-Step Moving Strategy, which progressively moves query points toward the underlying surface to accurately learn unsigned distance fields (UDFs) for point cloud reconstruction. Specifically, we design a recurrent network for UDF estimation that integrates a multi-step strategy for query movement. This model treats query movement as a trajectory prediction process, establishing dependencies between the current query move decision and the previous path, thus utilizing temporal information to improve UDF estimation accuracy. Further, we design distance and gradient regularization losses to ensure the precision, consistency, and continuity of the estimated UDFs. Extensive evaluations, comparisons, and ablation studies are conducted to show the superiority of our method over the competing approaches in terms of reconstruction accuracy and generality. Our unsupervised reconstruction method outperforms many supervised techniques and demonstrates efficacy across diverse scenarios, including single-object, indoor, and outdoor benchmarks.

Abstract:
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.
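
A dual-threshold decision of the kind described above can be sketched generically. The sketch assumes the network outputs a split probability per block; the threshold values, the label strings, and the function name are hypothetical and are not taken from the paper.

```python
def partition_decision(split_prob, low_thr=0.2, high_thr=0.8):
    """Map a predicted split probability to an encoder action.

    Confident predictions bypass the recursive search; uncertain ones fall
    back to the regular rate-distortion search (assumed fallback behavior).
    """
    if not 0.0 <= low_thr <= high_thr <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low_thr <= high_thr <= 1")
    if split_prob >= high_thr:
        return "split"        # confident: split without testing alternatives
    if split_prob <= low_thr:
        return "skip"         # confident: terminate partitioning early
    return "rdo_search"       # uncertain: run the full partition search
```

Moving the two thresholds closer together skips more of the search and saves more encoding time, while moving them apart sends more blocks to the full search and reduces rate-distortion loss.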

Abstract:
Reconstructing 3D poses from 2D poses lacking depth information is particularly challenging due to the complexity and diversity of human motion. The key is to effectively model the spatial constraints between joints to leverage their inherent dependencies. Thus, we propose a novel model, called Double-chain Graph Convolution Transformer (DC-GCT), to constrain the pose through a double-chain design consisting of local-to-global and global-to-local chains to obtain a complex representation more suitable for the current human pose. Specifically, we combine the advantages of GCN and Transformer and design a Local Constraint Module (LCM) based on GCN and a Global Constraint Module (GCM) based on the self-attention mechanism, as well as a Feature Interaction Module (FIM). The proposed method fully captures the multi-level dependencies between human body joints to optimize the modeling capability of the model. Moreover, we propose a method to incorporate temporal information into the single-frame model by guiding the video sequence embedding through the joint embedding of the target frame, with a negligible increase in computational cost. Experimental results demonstrate that DC-GCT achieves state-of-the-art performance on two challenging datasets (Human3.6M and MPI-INF-3DHP).

Abstract:
In recent years, deep learning has revolutionized hyperspectral image (HSI) classification. However, it remains a significant challenge to achieve high-precision classification with limited image quality and labeled samples. Most existing methods fail to effectively leverage unlabeled samples and neglect the impact of image quality degradation on classification performance. To address these issues, this paper proposes a Fusion-Driven Task Mutual-Guidance Network (FTMNet), which enhances image quality and improves classification performance through mutual-guidance between fusion and classification tasks. Specifically, we propose an image fusion subnet integrating contrastive learning to jointly optimize input quality enhancement and discriminative feature representation through multi-objective constraints. To mitigate sample scarcity, a multi-task interactive multimodal contrastive architecture is developed, leveraging cross-modal complementarity and cross-task feature sharing mechanisms to strengthen discriminative power. Furthermore, we introduce a cross-task collaborative mutual-guidance strategy that synchronizes inter-task information exchange via learnable parametric constraints, forming unified optimization directions for coordinated performance enhancement. The experimental results demonstrate that the proposed method outperforms the existing state-of-the-art methods in both quantitative and qualitative aspects.

Abstract:
Enhancing video quality assessment (VQA) through semantic information integration is a critical research focus. Recent research has employed the Contrastive Language-Image Pre-training (CLIP) model as a foundation to improve semantic perception. However, the image-text alignment inherent in these pre-trained Vision-Language (VL) models frequently results in suboptimal VQA performance. While prompt engineering has recently targeted the language component to address this alignment issue, the unique insights residing in visual analysis are still overlooked for further advancing VQA tasks. Additionally, seeking a trade-off between quality separability and domain invariance in VQA remains largely unresolved within the VL paradigm. In this paper, we introduce a novel cross-modal prompt-based approach to tackle these challenges. Specifically, we propose learnable prompts within the vision branch to foster synergy between visual and language modalities through a language-to-vision coupling function. The multi-view backbone is then carefully crafted with content enhancement and distortion-aware temporal modulation to ensure quality separability. The language prompts, derived from visual representations, are further supported by adaptive weighting mechanisms to optimize the balance between quality separability and domain invariance. Experimental results demonstrate the effectiveness of our proposed method over leading VQA models, showing significant improvements in generalization across diverse datasets.

Abstract:
Few-shot classification is a challenging task that recognizes novel classes by learning from few training instances. Metric-based models are currently the most effective solutions for few-shot classification. In these models, patch feature distances between query instances and support classes are calculated to achieve classification. However, it is difficult for patch-based methods to mine semantic information of support and query instances, leading to inaccurate feature similarity measures. To address these problems, we propose to construct CrossHypergraph based on hypergraph modeling. Specifically, we first align the local prototype vertices of support and query instances to model consistent hypergraph structures. Then a vertex-hyperedge-vertex-based interactive feature updating mechanism is designed to generate CrossHypergraph representation with consistent high-order semantic information for support and query instances. Based on the CrossHypergraph, we propose a consistent high-order semantic network, in which the high-order semantic-based weighted metric strategy is designed to achieve accurate classification. The proposed method is evaluated on general, fine-grained, and cross-domain few-shot benchmarks, including miniImageNet, tieredImageNet, CIFAR-FS, FC100, and miniImageNet → CUB datasets. Experimental results show that our CrossHypergraph-based few-shot classifier generates consistent high-order semantic features, and achieves state-of-the-art performance on both 1-shot and 5-shot tasks.
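
For readers unfamiliar with metric-based few-shot classification, the baseline idea that the abstract builds on can be sketched as a nearest-prototype classifier. This is the generic baseline with one feature vector per instance, not the CrossHypergraph method; the squared Euclidean metric, the data layout, and the function names are assumptions made for illustration.

```python
def class_prototypes(support):
    """Average the support features of each class.

    support: dict mapping a class label to a list of equal-length feature vectors.
    """
    prototypes = {}
    for label, features in support.items():
        count = len(features)
        prototypes[label] = [sum(column) / count for column in zip(*features)]
    return prototypes


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))
```

In a 2-way 2-shot episode, `support` would hold two feature vectors for each of two classes, and every query is labeled by a single distance comparison per class.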

Abstract:
Remote photoplethysmography (rPPG) for heart rate (HR) measurement based on facial videos has recently attracted increasing attention. However, most existing methods focus on average heart rate (AHR) over a period rather than instantaneous heart rate (IHR), which better reflects physical and mental states. To address this issue, we propose a novel rPPG-based method for measuring IHR values from facial videos. Our method employs the wavelet synchrosqueezed transform (WSST) to generate time-frequency representations (TFRs) of chrominance (CHROM) signals from multiple facial regions of interest (ROIs), synchronously reflecting the IHR during a video segment. Furthermore, the TransUNet is introduced to refine these TFR images, enhancing the ridge line information related to IHRs. Comprehensive comparisons and ablation studies on four public datasets (UBFC-rPPG, PURE, UBFC-Phys, and MMPD) reveal that our WSST-UNet method achieves superior performance over several typical rPPG methods, achieving mean absolute errors (MAE) of 2.34 beats per minute (bpm), 1.29 bpm, 5.03 bpm, and 6.58 bpm, respectively. The proposed method offers a promising solution for practical application in video-based IHR measurements.

Abstract:
Diffusion Probabilistic Models (DPMs) have recently demonstrated considerable potential for single image super-resolution (SISR) by utilizing a conditional generation process that transforms Gaussian noise into high-resolution (HR) images based on low-resolution (LR) inputs. Current Image-Conditional DPMs (icDPMs) have demonstrated promising results by leveraging LR images as a condition to guide the generation of HR images. However, icDPMs fail to effectively integrate LR images and other conditional information to generate accurate and natural output. To address this issue, we propose an Integrated Conditional Diffusion Model for Single Image Super-Resolution (ICDSR). Our approach encodes the LR image as a condition to generate the prior feature, simultaneously integrating it with timestep information to establish intermediate constraints. To further enhance these constraints, we design a multi-scale guidance structure for the U-shaped concatenation of the diffusion model during the integration of conditions. Specifically, multi-scale integrated information is injected into the diffusion model basic block, informing about the coarse structure of the sharp image at the intermediate layers with spatially adaptive conditions. Additionally, ICDSR employs a lightweight U-Net to provide initial guidance and leverages the diffusion model to learn residual guidance for faster convergence. Extensive experiments on facial and general benchmarks, including the CelebA and DIV2K datasets, demonstrate that ICDSR surpasses existing methods, achieving state-of-the-art perceptual quality while maintaining competitive distortion metrics.

Abstract:
360° video streaming emerges as an innovative video presentation form that offers users an immersive and interactive experience, where the quality of experience (QoE) is a vital indicator to measure user viewing perception. In multicategory 360° video streaming, existing QoE-driven approaches typically assume a fixed request distribution to enhance users’ average QoE, prioritizing the optimization of edge caching and bitrate selection decisions for the video category with a larger request number. Inevitably, these unfair approaches would lead to average QoE reduction in real-world scenarios, in which the request distribution exhibits significant variations and is challenging to predict accurately. To this end, we propose a fairness-aware 360° video streaming strategy in cloud-edge collaboration networks for improving users’ average QoE. Specifically, we first formulate the joint edge caching and bitrate selection problem as a multi-agent cooperative input-driven Markov decision process to maximize users’ average QoE and guarantee QoE fairness for users. Subsequently, we devise an adaptive learning-based multi-agent deep reinforcement learning (MADRL) approach, which can adaptively adjust the learning rate of each agent according to the dynamic user request distribution, thus helping agents make optimal decisions. Finally, experimental results on real-world datasets show that the proposed algorithm significantly improves users’ average QoE while ensuring QoE fairness for users.

Abstract:
Multi-modality image fusion aims at fusing modality-specific (complementarity) and modality-shared (correlation) information from multiple source images. To tackle the overlooking of inter-feature relationships, high-frequency information loss, and the limited attention to downstream tasks, this paper focuses on efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. Firstly, shallow features from individual modalities are extracted by a depthwise convolution layer combined with the transformer block. In the three parallel branches of the encoder, Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details. Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information. Graph Reasoning Module (GR) is introduced to reason high-level cross-modality relations and simultaneously extract low-level detail features as CAI's modality-specific complementary information. Experiments demonstrate competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses the state-of-the-art methods on downstream tasks, scoring on average 8.27% higher mAP@0.5 in object detection and 5.85% higher mIoU in semantic segmentation.

Abstract:
Prompt learning is an effective way to adapt pre-trained models to downstream tasks by training a small number of additional learnable prompts. Recent studies address several early challenges by combining generalized knowledge from frozen pre-trained VL models with task-specific knowledge from training data as guidance for prompt learning. However, existing methods still struggle with the generalization-adaptation (GA) trade-off dilemma: excessive reliance on generalized knowledge hinders adaptation to downstream tasks, while overemphasis on task-specific knowledge undermines the inherent generalization capabilities of pre-trained models. To address this issue, we propose a novel prompt learning method called Prompt Learning with Knowledge Regularization (PLKR). PLKR effectively mitigates the GA trade-off dilemma by offering greater flexibility in adapting to task-specific knowledge while minimizing the disruption of pre-trained knowledge. Specifically, we propose category-invariant and topology-invariant knowledge regularization to preserve generalized knowledge: the former enhances category-level discriminative capabilities while allowing flexible task-specific learning, and the latter maintains global topological stability during adaptation to new tasks. Through the proposed regularization, PLKR improves the performance on both base and new tasks. We evaluate the effectiveness of our approach on four representative tasks over 11 datasets. Experimental results show our method outperforms existing SOTA methods by a large margin.

Affiliations: Ministry of Education Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, School of Information Engineering, Minzu University of China, Beijing, China; Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China; Institute for Infocomm Research, A*STAR, Singapore; School of Computer Science and Information Technology, Beijing Jiaotong University, Beijing, China; School of Economics and Management, Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China; Institute of Information Science, Beijing Jiaotong University, Beijing, China

Abstract:
With the rapid development of intelligent media, traditional semantic segmentation has shown excellent potential in application scenarios like autonomous driving. However, due to limited performance, traditional segmentation models usually lead to poor user experiences in applications that require high segmentation precision. Therefore, interactive semantic segmentation (ISS) is gaining attention due to its capability to generate high-precision semantic segmentation results through a few user-provided clicks for experience improvement, which thus has a promising development prospect in fine-grained application scenarios, e.g., virtual reality, smart medical, and data annotation. For good interaction efficiency, most existing interactive methods make efforts to devise suitable click simulation strategies and reasonable click encoding methods, aiming at the robust understanding of diverse user clicks and translating comprehensible user intent, i.e., assigning the correct category to the clicked area, for the neural network. Though proven effective, their designs ignore the uncertainty hidden in the extracted interaction features, which reflects the interaction difficulty and the user clicking intents. This can lead to inappropriate click simulation and click encoding, limiting the interaction efficiency. Hence we focus on exploring a reasonable ISS scheme from an uncertainty mining view. Specifically, we propose an uncertainty-based class-balanced click sampling (UCCS) simulation strategy by considering both the uncertainty of the click simulation region and its semantic imbalance, to form a reasonable click distribution. Furthermore, we propose a semantic uncertainty residual encoding (SURE) method to better embed the user’s intention into the localization maps, by mining semantic confusion between the click and misprediction classes. We prove the effectiveness of our design through extensive experiments and initially analyze the importance of uncertainty mining for the ISS. Our model can achieve state-of-the-art performance on three semantic segmentation benchmarks.

Abstract:
Learned Image Compression (LIC) has achieved superior performance in recent years, of which the context entropy model is an important component. However, in the context entropy model, there is no deterministic correlation between neighboring channels, and it is difficult to capture inter-channel correlation as well as spatial correlation for further improving the performance. To address this issue, a Cubic-Checkerboard conTeXt entropy model (C-CTX) for LIC is proposed in this work, which is able to refer uniformly across the channel domain and maintain the correlations in the spatial domain. To make neighboring channels have more similar distribution, Cubic Checkerboard Mask (CCM) with channel-wise mask convolution is utilized to achieve uniform distribution in different domains and Channel Wise Re-Arrangement (CWRA) is performed in terms of entropy. Based on CCM and CWRA, two Feature Disentangle Modules (FDMs) are designed in C-CTX to project the context information within sub-spaces for catching spatial correlation and channel correlation separately. Extensive experimental evaluations show that our method outperforms the state-of-the-art works on six datasets, i.e., Kodak, Tecnick, CLIC'20, CLIC'21, CLIC'22, and JPEG-AI.

Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; Department of Telecommunications, Xi'an Jiaotong University, Xi'an, China; Department of Computer Vision, Meituan, Beijing, China; College of Computer Science, Zhejiang University, Hangzhou, China; State Key Laboratory of Computer-Aided Design (CAD) and Computer Graphics (CG), Zhejiang University, Hangzhou, China; Key Laboratory for Intelligent Networks and Network Security, Ministry of Education, Xi'an Jiaotong University, Xi'an, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China

Abstract:
Vision Transformer (ViT) on object re-identification (ReID) has attracted significant attention recently. However, ViT-based ReID substantially increases computational complexity, imposing significant burdens during training and inference. This paper presents an efficient ViT-based backbone for ReID tasks, called the Locally Enhanced Vision Transformer (LEViT). ViT models typically emphasize global relationship modeling, yet ReID tasks are more sensitive to local information. To address this gap, we propose a Locally Enhanced (LE) block to enhance local information by performing self-attention within local split windows. Since part-based models dominate ReID, calculating self-attention across all patches is computationally inefficient. We also replace the traditional Query-Key-Value projector with the Group Convolution (G-Conv) projector, enabling the model to capture local details. Furthermore, G-Conv is integrated into the channel MLP to strengthen local feature sensitivity. Using these components, we develop two LEViT variants: LEViT-S and LEViT-L. To our knowledge, LEViT is the first highly adaptable ViT backbone for ReID tasks. Experimental evaluations demonstrate its effectiveness on five ReID datasets and three deep metric learning datasets. Notably, LEViT-S outperforms TransReID while requiring less than 10% of the computational complexity.
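
The efficiency argument for self-attention within local windows can be checked with simple arithmetic. The sketch below counts only the multiply-accumulates of the query-key score computation and assumes equally sized, non-overlapping windows; the token counts, dimensions, and function names are illustrative assumptions, not values from the paper.

```python
def split_into_windows(tokens, window_tokens):
    """Split a flat token sequence into consecutive non-overlapping windows."""
    if len(tokens) % window_tokens != 0:
        raise ValueError("sequence length must be divisible by the window size")
    return [tokens[i:i + window_tokens] for i in range(0, len(tokens), window_tokens)]


def global_attention_cost(num_tokens, dim):
    """Score-matrix cost when every token attends to every token: N^2 * d."""
    return num_tokens * num_tokens * dim


def windowed_attention_cost(num_tokens, window_tokens, dim):
    """Score-matrix cost when attention stays inside each window: N * w * d."""
    if num_tokens % window_tokens != 0:
        raise ValueError("token count must be divisible by the window size")
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens * dim
```

For a hypothetical 16x8 patch grid (128 tokens) split into 4 windows of 32 tokens with 64-dimensional heads, the windowed score computation needs a quarter of the operations of global attention, and the saving grows with the number of windows.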

Abstract:
We present SNH-SLAM, a novel expandable dense neural simultaneous localization and mapping (SLAM) method that constructs a neural field in real-time based on run-time observation. To reach this challenging goal without any scene prior, we utilize instant depth supervision to drive the extension of planar convex hulls, where a single hash table maintains multi-level feature units embedded in the planar convex hulls. This design facilitates high-fidelity, hole-free, and low-memory map reconstruction while adding only a tiny time burden to the training process. Our approach performs mapping by minimizing both RGBD-based re-rendering loss and Truncated Signed Distance Field (TSDF) loss. In addition, for camera tracking, our optimization strategy allows SNH-SLAM to converge faster on the pose estimation and maintain robustness. We evaluate our method on common benchmarks and compare it with existing dense neural RGB-D SLAM methods. The evaluation results show the competitiveness of the SNH-SLAM in tracking accuracy, reconstruction quality, memory usage, and frame processing speed.
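
As a rough illustration of a Truncated Signed Distance Field (TSDF) loss, the sketch below builds a truncated target from a depth observation along a camera ray and compares it with predicted values. The sign convention (positive in front of the surface), the normalization to [-1, 1], the mean-squared form, and the function names are assumptions and may differ from what SNH-SLAM uses.

```python
def truncated_sdf(observed_depth, sample_depth, truncation):
    """Signed distance from a ray sample to the observed surface, truncated.

    Returns a value in [-1, 1]: positive in front of the surface, negative behind.
    """
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    sdf = observed_depth - sample_depth
    return max(-1.0, min(1.0, sdf / truncation))


def tsdf_loss(predicted, targets):
    """Mean squared error between predicted and target TSDF values."""
    if len(predicted) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)
```

Truncation keeps supervision focused on a thin band around the observed surface, since samples far in front of or far behind it all map to the same saturated value.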

Abstract:
Implicit Neural Video Representation (INVR) has emerged as a novel approach for video representation and compression, using learnable grids and neural networks. Existing methods focus on developing new grid structures that are efficient for latent representation and neural network architectures with large representation capability, but do not study their respective roles in video representation. In this paper, the difference between INVR based on neural network and INVR based on grid is first investigated from the perspective of video information composition to specify their own advantages, i.e., neural network for general structure while grid for specific detail. Accordingly, an INVR based on mixed neural network and residual grid framework is proposed, where the neural network is used to represent the regular and structured information and the residual grid is used to represent the remaining irregular information in a video. A Coupled WarpRNN-based multi-scale motion representation and compensation module is specifically designed to explicitly represent the regular and structured information, hence the name CWRNN-INVR. For the irregular information, a mixed residual grid is learned where the irregular appearance and motion information are represented together. The mixed residual grid can be combined with the coupled WarpRNN in a way that allows for network reuse. Experiments show that our method achieves the best reconstruction results compared with the existing methods, with an average PSNR of 33.73 dB on the UVG dataset under the 3M model, and outperforms existing INVR methods in other downstream tasks.

Abstract:
Image inpainting represents a fundamental and challenging problem in computer vision, requiring the synthesis of visually plausible content for missing regions while preserving both textural details and structural coherence. While current approaches employ auxiliary networks and attention mechanisms to capture structural priors and expand receptive fields, they remain constrained by two fundamental limitations: (1) insufficient interaction between textural and structural priors, and (2) the quadratic computational complexity inherent in attention operations. To overcome these challenges, we present DESSM, an innovative Dual Encoder-based State space Model for image inpainting that achieves efficient global context modeling with linear computational complexity. Our DESSM integrates three synergistically designed modules: (1) a dual-branch encoder for complementary learning of textural patterns and structural priors, (2) a Feature Cross Fusion Block (FCFB) enabling dynamic feature interaction while adaptively suppressing redundant information, and (3) a Spatial-Channel joint Selective scan Block (SCSB) for efficient long-range dependency modeling. Comprehensive evaluations across four standard benchmarks (i.e., CelebA, CelebA-HQ, Places2, and Paris StreetView) demonstrate that our DESSM achieves state-of-the-art performance in both visual fidelity and computational efficiency.
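
The contrast between quadratic attention and linear-complexity state space modeling can be illustrated with the simplest possible recurrence. This is a scalar, time-invariant linear state space scan, not the selective scan block of DESSM; the parameters `a`, `b`, `c` and the function name are illustrative assumptions.

```python
def ssm_scan(inputs, a, b, c, initial_state=0.0):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    One pass over the sequence: cost grows linearly with its length, and the
    state h carries information from all earlier positions.
    """
    state = initial_state
    outputs = []
    for x in inputs:
        state = a * state + b * x
        outputs.append(c * state)
    return outputs
```

Feeding a single impulse shows the long-range behavior: the response decays geometrically with `a`, so an early input still influences later outputs without any pairwise comparison between positions.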

Abstract:
Video Question Answering (VideoQA) aims to answer a question based on the content of a given video. Recent methods adapt image-text pre-trained models to the VideoQA task by designing learnable temporal modules within the image encoder. However, these methods struggle to fully comprehend the questions and effectively extract temporal information due to 1) over-reliance on candidate answers and 2) lack of explicit temporal modeling. Specifically, since the question is fixed across different question-answer pairs, existing models tend to focus on the varying candidate answers. Moreover, existing methods merely utilize the classification loss to constrain the confidence of candidate answers, failing to differentiate the effectiveness of temporal information and to explicitly guide temporal modeling. In this paper, we introduce the Question Understanding and Temporality Guiding (QU-TG) method to address the aforementioned limitations. To reduce over-reliance on candidate answers, we propose providing diverse questions through question selection and enhancing the model’s comprehensive understanding of questions through question-video matching. To explicitly guide temporal modeling, we propose negative video prevention and positive video guidance. Negative video prevention incorporates a prevention loss to discourage the model from making predictions based on erroneous temporal cues, whereas positive video guidance utilizes classification loss to encourage the model to derive correct answers from positive videos. Extensive experiments on the NExT-QA, IntentQA, STAR-QA, and Causal-VidQA datasets demonstrate the effectiveness and generalization of our method.

Abstract:
Deep learning has achieved significant success in single hyperspectral image super-resolution (SHSR); however, the high spectral dimensionality leads to a heavy computational burden, thus making it difficult to deploy in real-time scenarios. To address this issue, this paper proposes a novel lightweight SHSR network, i.e., LKCA-Net, that incorporates channel attention to calibrate multi-scale channel features of hyperspectral images. Furthermore, we demonstrate, for the first time, that the low-rank property of the learnable upsampling layer is a key bottleneck in lightweight SHSR methods. To address this, we employ the low-rank approximation strategy to optimize the parameter redundancy of the learnable upsampling layer. Additionally, we introduce a knowledge distillation-based feature alignment technique to ensure the low-rank approximated network retains the same feature representation capacity as the original. We conducted extensive experiments on the Chikusei, Houston 2018, and Pavia Center datasets compared to some SOTAs. The results demonstrate that our method is competitive in performance while achieving speedups of several dozen to even hundreds of times compared to other well-performing SHSR methods.
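
The parameter saving behind a low-rank approximation of a learnable upsampling layer can be illustrated with simple arithmetic. The sketch below is not the LKCA-Net implementation: it assumes the upsampling layer is a single convolution whose weight is viewed as a `c_out x (c_in * k * k)` matrix and factorized into two matrices of rank `rank`. The function names and the layer sizes in the usage lines are hypothetical and chosen only for illustration.

```python
def dense_params(c_in, c_out, k):
    """Weights in a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def low_rank_params(c_in, c_out, k, rank):
    """Weights after factorizing the (c_out x c_in*k*k) matrix as A @ B,
    with A of shape (c_out x rank) and B of shape (rank x c_in*k*k)."""
    return rank * (c_out + c_in * k * k)


# Hypothetical layer: 64 input channels, 4x pixel-shuffle upsampling
# (64 * 16 = 1024 output channels), 3 x 3 kernel, rank 16.
full = dense_params(64, 1024, 3)           # 589824 weights
reduced = low_rank_params(64, 1024, 3, 16)  # 25600 weights
compression_ratio = full / reduced
```

The factorization only pays off when `rank` is well below `min(c_out, c_in * k * k)`; at full rank the factorized form has more parameters than the dense layer.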

Abstract:
Infrared (IR) imaging offers advantages in several fields due to its unique ability to capture content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images, but this causes a loss of fidelity in image details and introduces inconsistencies due to a lack of contextual awareness of the scene. This stems from the standard dynamic range of visible-light images, especially under extreme lighting, combined with a lack of contextual awareness, which can result in pseudo-thermal-crossover artifacts. These artifacts occur when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

Abstract:
As a specialized paradigm of domain adaptation, blended-target domain adaptation (BTDA) transfers knowledge from a source domain to a blended target domain. In this paper, we propose an Energy-Driven Explicit Alignment Network (EDEAN) framework that innovatively applies energy-based models (EBMs) to address BTDA problems. We observe that EBMs display free energy biases when the source domain and the target domain data originate from different distributions. Therefore, we use these biases as a measure of the discrepancies between the source domain and the target domain and align them by minimizing these biases via the free energy alignment (FEA) module. We further propose the balanced weight distribution (BWD) module, which comprehensively considers the complementary information between the linear and semantic pseudo-labels and obtains the corresponding complementary information by mixing both label types. Moreover, we propose the normalized free energy (NFE) module, which assigns higher weights to high free energy samples and dynamically corrects the pseudo-labels by continuously updating these weights. We also conducted experiments on four widely used BTDA databases and achieved substantial improvements over the latest BTDA methods.

Affiliations: Key Laboratory of Specialty Fiber Optics and Optical Access Networks, Joint International Research Laboratory of Specialty Fiber Optics and Advanced Communication, Shanghai Institute for Advanced Communication and Data Science, School of Communication and Information Engineering, Shanghai University, Shanghai, China; Faculty of Engineering and Information Technology, Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, NSW, Australia

Abstract:
Audio-visual segmentation (AVS) aims to achieve precise object segmentation by leveraging multimodal cues. However, effective alignment and fusion of audio and visual features are often hindered by inherent uncertainty within multimodal data, such as data quality inconsistencies, semantic mismatches, and temporal or spatial misalignments. To address these challenges, we propose an Uncertainty-aware Audio-Visual Segmentation (UAVS) that dynamically handles uncertainty to improve segmentation accuracy and robustness. Our method employs CLIP-generated text embeddings to provide semantic cues of categories for audio features, reducing ambiguity in multimodal alignment. We then introduce a Mixture of Experts (MoE) model, mapping multimodal embedding samples to multi-dimensional Gaussian distributions to quantify uncertainty through variance and modeling feature confidence using the Gaussian probability density function, effectively capturing noise and semantic discrepancies across modalities. In addition, we design a dynamic path algorithm based on uncertainty, enabling the model to adaptively route samples to experts with high confidence. This algorithm enhances performance in complex, noisy, and ambiguous scenes. Extensive experiments conducted on three subsets of the AVSBench benchmark dataset demonstrate that our proposed method achieves competitive performance.

Abstract:
This paper delves into the task of arbitrary modality salient object detection (AM SOD), aiming to detect salient objects from images with arbitrary modality types or arbitrary modality numbers by using a single model trained once. Specifically, we develop a novel model, termed modality adaptive network (MAN), for AM SOD, which addresses two fundamental challenges in AM SOD: the diverse modality discrepancies arising from varying modality types and the dynamic fusion dilemma resulting from an unfixed number of modalities in the input data. Technically, MAN first introduces a novel Modality-Adaptive Feature Extractor (MAFE) to adaptively extract features from different input modalities based on their characteristics by utilizing a set of learnable modality prompts. Concurrently, a new modality translation contractive (MTC) loss is devised to facilitate the training of MAFE as well as modality prompts, thereby effectively addressing the inherent modality discrepancies and extracting more discriminative features from each modality image. Subsequently, MAN presents a hybrid dynamic fusion (HDF) strategy to effectively resolve the challenge of dynamic inputs in multi-modal feature fusion as well as enhance the exploitation of complementary information across different modalities. This is specifically achieved by a Channel-wise Dynamic Fusion Module (CDFM) and a Spatial-wise Dynamic Fusion Module (SDFM). Experimental results show that, by virtue of the MAFE, the MTC loss and the HDF strategy, our proposed method achieves significant improvements over existing models on benchmark datasets.

Abstract:
Multi-Object Tracking (MOT) aims to build moving trajectories of objects within video sequences and serves as a critical component in autonomous driving systems. Recently, several studies have revealed the vulnerability of existing MOT methods by investigating adversarial attacks against MOT, raising significant safety concerns for real-world applications. These methods attack trackers by deliberately inserting false alarms, which mislead trajectories to drift from their correct paths. However, current MOT attack methods fail to propose efficient strategies for generating false alarms, as they either rely on computationally intensive optimization to determine the placement of false alarms, or crudely insert a large number of heuristically designed false alarms. In this paper, we propose an explainable and effective false alarm generation module, named Target Generating Module (TGM), that adaptively determines the location and size of false alarms by leveraging historical trajectory information. Based on this module, we design an attack method targeting mainstream MOT approaches, named Trajectory-Aware Attack (TA Attack). TA Attack achieves effective disruption of MOT systems by combining detection erasure and false alarm generation, requiring only a few frames to successfully compromise trajectories. To exhibit the flexibility and effectiveness of our method, we conduct experiments using four multi-object trackers (ByteTrack, SORT, CenterTrack and FairMOT), which are enabled by two representative detectors (YOLOX and CenterNet). The results demonstrate that our method achieves state-of-the-art performance, with attack success rates of 74.87% on BDD100K, 81.7% on MOT17 and 83.87% on MOT20 while attacking only 4 frames on average, revealing the vulnerability of the association mechanism in MOT methods.

Abstract:
The colorization of scenes from multi-view grayscale images plays a crucial role in applications such as augmented reality and virtual exhibitions. Existing methods combine NeRF with an automatic colorization model, averaging multiple colorized patches to reduce inconsistency. However, they still face three key limitations: (1) they cannot produce diverse colorization results due to the lack of multimodal conditional inputs, (2) they struggle to maintain multi-view consistency because of unreliable geometric correspondence and ineffective propagation mechanisms, and (3) they are computationally inefficient because of NeRF's dense ray sampling and numerical integration. In this paper, we propose ColView, a unified framework for text-guided grayscale scene colorization that achieves both automatic and controllable colorization of grayscale scenes from multi-view grayscale images. First, for flexible color control, we leverage text description as the input to guide the colorization process, which allows users to specify desired colors through natural language descriptions. Second, to ensure multi-view consistency, we introduce a multi-view consistent colorization module that explicitly models dependencies between different views. This module follows three key steps: a cross-view attention mechanism for collaborative key-view colorization, feature matching for inter-view correspondence establishment, and correspondence-guided feature propagation. Third, to improve computational efficiency, we adopt 3D Gaussian Splatting as our underlying representation. This explicit point-based representation renders significantly faster than NeRF. Extensive experimental results demonstrate that our method achieves superior visual quality and computational efficiency.

Abstract:
For certain applications like highway surveillance systems, only low-frame-rate videos are recorded, which presents a huge challenge to existing trackers, as objects tend to undergo far more abrupt changes in location, motion, and appearance between successive frames compared to normal frame rates. To handle the above challenges, we propose a novel approach, namely SORT-LFR, for Simple Online and Realtime Tracking in Low-Frame-Rate videos, which consists of the following techniques: 1) A feature-prior association strategy to improve the capability to track new objects with significant displacements; 2) A Kalman filter using acceleration in state space (accel-fused Kalman filter) to improve the motion estimation capability for non-constant velocity moving objects; 3) A detection-guided adaptive exponential moving average (DG-AEMA) feature update mechanism to enhance feature temporal modeling capability for tracked objects; 4) A trajectory-covariance threshold tuning (TCTT) method to filter out incorrect association results. Through these techniques, the proposed SORT-LFR achieves 91.8 HOTA, 92.6 MOTA and 93.9 IDF1, which surpass all state-of-the-art trackers on the public CityFlow and our private HighwayTrack datasets under the low-frame-rate setting.
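
To illustrate why adding acceleration to the Kalman state can help at low frame rates, the sketch below compares the mean-prediction step of a constant-velocity model with that of a constant-acceleration model for one coordinate. It is a minimal illustration, assuming that "acceleration in state space" means a standard constant-acceleration motion model; it omits the covariance update and is not the accel-fused Kalman filter of the abstract. The function names and numbers are hypothetical.

```python
def predict_constant_velocity(state, dt):
    """Mean prediction for a (position, velocity) state."""
    p, v = state
    return (p + v * dt, v)


def predict_constant_acceleration(state, dt):
    """Mean prediction for a (position, velocity, acceleration) state."""
    p, v, a = state
    return (p + v * dt + 0.5 * a * dt * dt, v + a * dt, a)


# Hypothetical accelerating object observed with a long frame interval.
dt = 2.0
cv_position = predict_constant_velocity((0.0, 10.0), dt)[0]           # 20.0
ca_position, ca_velocity, _ = predict_constant_acceleration((0.0, 10.0, 4.0), dt)
# ca_position is 28.0 and ca_velocity is 18.0
```

The gap between the two predictions grows with the square of `dt`, which is why the constant-velocity assumption degrades quickly as the frame interval increases.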

Abstract:
Recently, multi-view learning has witnessed considerable interest in research on trusted decision-making. Previous methods are mainly inspired by an important paper published by Han et al. in 2021, which formulates a Trusted Multi-view Classification (TMC) framework that aggregates evidence from different views based on Dempster's combination rule. All these methods consider only inter-view aggregation and do not exploit intra-view information. In this paper, we propose a generalized trusted multi-view classification framework with hierarchical opinion aggregation. This hierarchical framework includes a two-phase aggregation process: the intra-view and inter-view aggregation hierarchies. In the intra-view aggregation, we assume that each view comprises common information shared with other views, as well as its specific information. We then aggregate both the common and specific information. This aggregation phase helps eliminate the feature noise inherent to the view itself, thereby improving the view quality. In the inter-view aggregation, we design an attention mechanism at the evidence level to facilitate opinion aggregation from different views. To the best of our knowledge, this is one of the pioneering efforts to formulate a hierarchical aggregation framework in the trusted multi-view learning domain. Extensive experiments show that our model outperforms some state-of-the-art trust-related baselines.
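
For readers unfamiliar with the evidence aggregation that this line of work builds on, the sketch below shows the reduced form of Dempster's combination rule commonly used for two subjective-logic opinions, each given as per-class belief masses plus one uncertainty mass that together sum to 1. It illustrates the baseline inter-view rule only, not the hierarchical aggregation or the evidence-level attention proposed in the abstract. The function name and the example numbers are hypothetical.

```python
def combine_opinions(b1, u1, b2, u2):
    """Combine two opinions (belief list, uncertainty) with the reduced
    Dempster rule: agreeing mass is kept, conflicting mass is renormalized."""
    classes = range(len(b1))
    # Conflict: mass that view 1 and view 2 assign to different classes.
    conflict = sum(b1[i] * b2[j] for i in classes for j in classes if i != j)
    if conflict >= 1.0:
        raise ValueError("opinions are in total conflict and cannot be combined")
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) for k in classes]
    uncertainty = scale * u1 * u2
    return belief, uncertainty


# Two hypothetical views over two classes.
belief, uncertainty = combine_opinions([0.6, 0.2], 0.2, [0.5, 0.1], 0.4)
```

In this example both views favor the first class, so the combined uncertainty falls below that of either view while the combined masses still sum to 1.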

Affiliations: College of Information and Intelligence, Hunan Agricultural University, Changsha, China; School of Computer, Hunan University of Technology, Zhuzhou, China; College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; Anhui University, Hefei, China; School of Engineering, University of Warwick, England, U.K.; Guangxi Normal University, Guilin, China; Institute of Artificial Intelligence (TeleAI) of China Telecom, Beijing, China

Abstract:
Knowledge-based Visual Question Answering (KB-VQA) has surfaced as a critical task in advancing AI capabilities. Despite significant progress enabled by large language models (LLMs), there are still three major challenges: (1) flawed image captions cause unreliable reasoning; (2) noisy explicit knowledge can disrupt answering; and (3) robustness still depends on massive LLM scale. To overcome these challenges, we develop a novel approach, Multi-Modal Refined Prompting (MMRP), which generates high-quality prompts tailored for LLMs. To tackle the first challenge, a multi-faceted image captioning strategy is employed to generate detailed, contextually relevant visual descriptions. In addition, we introduce a complementary knowledge retrieval and refinement strategy to deliver concise, contextually relevant knowledge, effectively overcoming the second challenge. These enhanced image captions and explicit knowledge are then integrated into a knowledge-infused in-context prompt, effectively activating the reasoning capabilities of LLMs. Importantly, MMRP eliminates reliance on massive LLMs and avoids the need for model fine-tuning, while achieving significant improvements in answer accuracy. Extensive evaluations on the widely used OK-VQA benchmark against 22 baselines demonstrate the superiority of MMRP, establishing a new state of the art in KB-VQA.

Abstract:
Existing diffusion-based methods have achieved impressive results in human motion editing. However, these methods often exhibit significant ghosting and body distortion in unseen in-the-wild cases. In this paper, we introduce Edit-Your-Motion, a video motion editing method that tackles these challenges through one-shot fine-tuning on unseen cases. Specifically, we first utilize DDIM inversion to initialize the noise, preserving the appearance of the source video, and design a lightweight motion attention adapter module to enhance motion fidelity. DDIM inversion aims to obtain the implicit representations by estimating the prediction noise from the source video, which serves as a starting point for the sampling process, ensuring the appearance consistency between the source and edited videos. The Motion Attention Module (MA) enhances the model’s motion editing ability by resolving the conflict between the skeleton features and the appearance features. Second, to effectively decouple the motion and appearance of the source video, we design a spatio-temporal two-stage learning strategy (STL). In the first stage, we focus on learning temporal features of human motion and propose recurrent causal attention (RCA) to ensure consistency between video frames. In the second stage, we shift focus to learning the appearance features of the source video. With Edit-Your-Motion, users can edit the motion of humans in the source video, creating more engaging and diverse content. Extensive qualitative and quantitative experiments, along with user preference studies, show that Edit-Your-Motion outperforms other methods.

Abstract:
Event-based human action recognition has gained increasing attention due to its efficiency in dynamic scenarios. Contemporary methodologies for event-based action recognition predominantly treat the problem as a one-hot classification task, which limits their ability to leverage the semantic relationships among various actions. To address this limitation, we propose a Spiking Event-Text Feature Fusion (SETFF) framework, which enhances recognition performance by integrating event and text modalities through a dual-stream architecture. SETFF leverages generative large language models to produce action descriptions, serving as semantic prompts that guide event feature learning. Specifically, a contrastive loss function is employed to align the features of both modalities, enriching the model’s capacity to distinguish intricate and subtle actions. Extensive experiments on neuromorphic datasets, including PAF, DailyAction-DVS, DVS128 Gesture, Bullying10K, and UCF101-DVS, demonstrate that SETFF achieves state-of-the-art accuracy, with top-1 accuracy rates of up to 99.65% on the DailyAction-DVS dataset and 98.39% on the PAF dataset. Experimental results underscore the effectiveness of multimodal fusion in spiking neural networks (SNNs), advancing event-based action recognition while preserving the energy efficiency characteristic of SNNs.
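
The abstract does not specify which contrastive loss is used to align event and text features. As one plausible reading, the sketch below implements a symmetric InfoNCE loss of the kind popularized by CLIP-style training, where the i-th event feature and the i-th text feature form the positive pair. The loss form, the temperature value, and all function names are assumptions made for illustration; spiking neurons are not modeled.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(event_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired event/text features."""
    events = [_normalize(v) for v in event_feats]
    texts = [_normalize(v) for v in text_feats]
    n = len(events)
    # Cosine similarity between every event feature and every text feature.
    logits = [[sum(a * b for a, b in zip(events[i], texts[j])) / temperature
               for j in range(n)] for i in range(n)]
    event_to_text = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_event = sum(
        _cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (event_to_text + text_to_event)


# Toy check: matched pairs should give a lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each event feature toward the description of its own action and away from the descriptions of the other actions in the batch.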

Abstract:
Spatiotemporal attention learning has always been a challenging research task in video question answering (VideoQA). It needs to consider not only the modelling of local neighbourhood dependencies between the adjacent frames in a video but also the modelling of long-term dependencies between nonadjacent frames. Although the existing methods are usually good at modelling temporal dependencies in one aspect, they cannot simultaneously and effectively model the temporal dependencies between adjacent and nonadjacent frames. To address this issue, we first derive a novel statistic-driven difference-aware generation function, which can efficiently calculate the difference between a sequence feature value and the whole mean value to identify the significance of the feature. Subsequently, we design a novel parameter-free spatiotemporal attention mechanism (PSAM), which captures the most relevant cues scattered in the context of a spatiotemporal video by generating functions and utilizes a gating mechanism to adaptively integrate and filter relevant and irrelevant information. Finally, we use the PSAM and hierarchical modelling to construct a lightweight multiscale context fusion- and reasoning-based VideoQA model. Extensive experimental research results obtained on five benchmark datasets for the VideoQA task show that our VideoQA model has high Q&A performance and lightweight characteristics. Simultaneously, comprehensive ablation experimental results show that the PSAM can not only improve the performance of the model but also significantly reduce the number of model parameters. In addition, extensive experimental findings obtained on the benchmark dataset of joint tasks (video moment retrieval and video highlight detection) further demonstrate that the PSAM is a general and effective spatiotemporal attention mechanism.
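
The exact generation function of the PSAM is not given in the abstract. The sketch below shows one simple way to turn "the difference between a sequence feature value and the whole mean value" into attention weights without any learnable parameters: the squared deviation from the mean is scaled by the variance and passed through a sigmoid. The scoring formula and the function name are assumptions for illustration only.

```python
import math


def difference_aware_weights(values, eps=1e-6):
    """Parameter-free weights: features far from the sequence mean score higher."""
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    weights = []
    for v in values:
        score = (v - mean) ** 2 / (variance + eps)    # always >= 0
        weights.append(1.0 / (1.0 + math.exp(-score)))  # sigmoid, in [0.5, 1)
    return weights


# Hypothetical per-frame feature values; the last frame stands out.
frame_values = [1.0, 1.0, 1.0, 5.0]
weights = difference_aware_weights(frame_values)
gated = [v * w for v, w in zip(frame_values, weights)]
```

Because the weights are computed from sequence statistics alone, this kind of attention adds no parameters to the model, which is consistent with the lightweight design goal stated in the abstract.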

Abstract:
Recent advancements in autonomous driving, augmented reality, robotics, and embodied intelligence have necessitated 3D perception algorithms. However, current 3D perception methods, especially specialized small models, exhibit poor generalization in open scenarios. On the other hand, multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks, due to weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. To address these challenges, we develop LLMI3D, and propose the following solutions: Spatial-Enhanced Local Feature Mining for better 3D spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations. We are the first to adapt an MLLM for image-based 3D perception. Additionally, we have constructed the IG3D dataset, which provides fine-grained descriptions and question-answer annotations. Extensive experiments demonstrate that our LLMI3D achieves state-of-the-art performance, outperforming other methods by a large margin.

Abstract:
In recent years, AI-Generated Images (AIGIs) have attracted significant attention and shown great potential in various applications, including entertainment, advertisement, education, and product design. Driven by this trend, various Text-to-Image (T2I) models are developed. However, the quality of AIGIs produced by these models varies widely, with many low-quality images failing to meet human aesthetic standards. Consequently, research into both subjective and objective Image Quality Assessment (IQA) methods for AIGIs is crucial. In this paper, we introduce a dataset called AIGI-IQAD, designed to enhance our understanding of human aesthetic preferences for AIGIs. The dataset contains 2,880 AIGIs generated by 8 T2I models using 360 deliberately designed text prompts. Further, we conducted subjective experiments to gather ratings from both aesthetic quality and text-image consistency. Building on this dataset, we propose a model named Question-guided Multimodal Interaction Network (QMI-Net) for evaluating AIGIs. QMI-Net assesses human preferences for AIGIs by focusing on both aesthetic quality and text-image consistency. Specifically, QMI-Net uses a question-answering approach to guide Multimodal Large Language Models (MLLMs) in generating detailed aesthetic and similarity information. The Visual and Aesthetic Feature Fusion Module (VAFFM) then fuses the aesthetic features with the visual features extracted by Contrastive Language-Image Pre-training (CLIP) to obtain more comprehensive aesthetic quality features. Comprehensive experiments demonstrate that state-of-the-art performance is achieved by QMI-Net on our AIGI-IQAD and three other public datasets.

Abstract:
Understanding 3D shapes is crucial across various fields in computer vision. View-based methods have shown remarkable performance in 3D shape recognition by leveraging pre-trained networks on extensive 2D image datasets while preserving stable multi-view structures. However, traditional approaches often extract view features independently before fusion, limiting each view’s perception to its narrow domain and impeding the network’s ability to capture global structures and inter-view dependencies. To address these issues, we propose a novel Cross-view Message Token Interaction Network (CMI-Net) that enhances contextual information exchange between views, effectively extending the receptive field throughout the entire process. Specifically, we introduce a Message (MSG) token for each view to aggregate intra-view contextual information. We establish two interaction mechanisms: the Anonymous Delivery Mechanism (ADM) and the Integrated Broadcast Mechanism (IBM), each optimized for different interaction efficiencies. The ADM promotes view-level interaction by randomly transferring MSG tokens to selected views, while the IBM aggregates and broadcasts tokens across all views, enabling shape-level interaction. This design ensures that each view benefits from a comprehensive global receptive field. Moreover, we have developed a Contextual Information-Guided Part Selection module to direct the network’s attention to significant local features within each view. Finally, through multi-granularity feature fusion, CMI-Net improves the understanding of 3D shapes at the local patch, intra-view, and cross-view levels. Extensive experiments conducted on diverse datasets, including ModelNet40, FG3D, ScanObjectNN, and ShapeNet Core55, demonstrate that CMI-Net achieves state-of-the-art results in 3D shape classification and retrieval tasks. Additionally, visualizing feature variations during cross-view interactions enhances interpretability, providing deeper insights into the network’s decision-making process.

Affiliations: School of Computer Science, Hefei University of Technology, Hefei, China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; School of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan, China; School of Computer Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, China; School of Cyber Science and Engineering, Wuhan University, Wuhan, China

Abstract:
The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) The facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompts explicitly benefit from abundant transferred knowledge from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.

Abstract:
Unsupervised cross-domain 3D model retrieval (UCD3DMR), which enables the transfer of knowledge from existing labeled data to new unlabeled 3D models, has recently emerged as an effective tool for managing 3D models. However, most existing UCD3DMR methods focus on closed-set scenarios, requiring the source and target domains to share identical categories, which is idealized and impractical. Consequently, we explore a more practical yet demanding task known as universal unsupervised cross-domain 3D model retrieval. This task faces significant data distribution discrepancies and uncertain category overlap across domains, posing significant challenges for cross-domain adaptation. To address these challenges, we introduce an innovative universal UCD3DMR method named Separate Domain-Private Classes (SDPC), which integrates a Weighted Alignment Mechanism (WAM) and a Domain-private Separation Mechanism (DSM). Specifically, we formulate a couple of transferability criteria to select domain-common class samples for cross-domain alignment. Additionally, we design a couple of separation losses to mitigate interference from domain-private class samples. The transferability criteria and separation losses can mutually enhance the cross-domain alignment, leading to further improvement in cross-domain retrieval performance. Extensive experiments on two well-established cross-domain 3D model datasets (MI3DOR and NTU/PSB) validate and highlight the superior performance of our SDPC.

Abstract:
Sequential recommendation is a classic task in the field of recommendation, which aims to predict the next user-preferred item based on their historical interactions. However, in practical scenarios, users' needs are dynamically evolving in a short period and exhibit a chain-like structure. Consequently, recommending only the next single item does not fully meet user demands and limits the potential for increasing business traffic on platforms. To overcome this limitation, we propose a new recommendation paradigm, Next Chain Prediction, which requires the model to predict a chain of items. Due to the advantages of generative recommendation models on user preference representation and scalability, we design a generative recommendation model for next chain prediction. The generative model extracts long-term interests and short-term demands within a unified framework. By designing a Sequence-Chain Attention mechanism, the model performs self-attention learning across dual dimensions. Additionally, we design a generative loss function to balance the hit rate and diversity of the recommended items in the chain. We conduct experiments across three datasets and the experimental results show that our method achieves at least 1.22% improvement in HR@10 across three datasets for recommending multi-item chains. Furthermore, our method improves the diversity of recommended items and also offers the flexibility to adjust the size of predicted chains, maintaining state-of-the-art performance even when limited to predicting a single item.
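
For reference, HR@10 is the hit rate at cutoff 10. The sketch below computes the standard single-target version of the metric: the fraction of test cases whose ground-truth item appears among the top-k recommended items. How the paper extends the metric to multi-item chains is not stated in the abstract, so this covers only the conventional definition. The function name and the toy data are hypothetical.

```python
def hit_rate_at_k(ranked_lists, targets, k=10):
    """Fraction of cases whose target item is within the top-k of its ranking."""
    if not targets or len(ranked_lists) != len(targets):
        raise ValueError("need one ranked list per target, and at least one target")
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy rankings for two users; item ids are arbitrary.
rankings = [[3, 1, 2], [5, 4, 6]]
ground_truth = [1, 6]
hr_at_2 = hit_rate_at_k(rankings, ground_truth, k=2)  # 0.5
hr_at_3 = hit_rate_at_k(rankings, ground_truth, k=3)  # 1.0
```

Hit rate ignores the position of the hit inside the top-k, so it is often reported together with a rank-sensitive metric such as NDCG.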

Abstract:
Multi-focus image fusion (MFIF) aims to combine multiple images captured in the same scene by imaging devices with varying focal lengths into one complete clear image. While artificial neural network-based methods have achieved remarkable results in MFIF tasks, their black-box working mechanisms easily lead to information losses, limiting further fusion performance improvements. To solve this issue, we introduce an interpretable neural computation model called the nonlinear spiking neural P (NSNP) system. The NSNP model effectively mitigates the information losses induced by neurons during the information transmission process by controlling the internal spike values of neurons. According to the NSNP model, we propose a novel fusion method, NSNPFuse, that effectively avoids information losses during the challenging MFIF task. On the one hand, NSNPFuse uses nonlinear spiking neurons to construct the network backbone, which yields improved feature extraction performance and reduces the induced feature loss. On the other hand, NSNPFuse embeds a feature fusion module (FFM) based on nonlinear spiking neurons to selectively retain meaningful information and reduce distortions. We conduct experiments on multiple multi-focus image datasets, including Lytro, MFFW, MFI-WHU, and Road-MF, and the subjective and objective performances of the proposed approach surpass those of 15 state-of-the-art MFIF methods. The results demonstrate that our NSNPFuse method offers more competitive performance. Furthermore, we show that NSNPFuse enhances the downstream performance achieved in salient detection and object detection tasks.

Abstract:
With low storage cost and high retrieval efficiency, hashing techniques are widely used for multi-media retrieval, which has already become the present research focus. Currently, cross-modal hashing commonly employs graph-based loss to construct pair-wise semantic relations between training samples for model optimization. However, limited by the graph-based strategy, each edge in the graph only connects two samples, which only represent a bundle of pair-wise relationships. Besides, the edges in the graph are calculated by self-attention or feature distance, only considering pair-wise relations of heterogeneous samples and ignoring the class relations. In this paper, by hypergraph modeling the semantic tuples, a novel Deep Semantic Tuplet-based Hashing by Hypergraph Modeling (DSTH) approach is proposed to leverage the multilateral semantic relations, which could guide the model to learn class-discriminative semantic binary embedding. In more detail, based on the characteristic distribution, semantic tuples are constructed for each class in one mini-batch, which represents the multilateral semantic relationships between multiple samples and multiple classes. By considering semantic tuples as hyperedges to represent multilateral semantic relations, hypergraph modeling is designed, in which HyperGraph Neural Network (HGNN) is introduced to formulate hypergraph node classification goals to fully learn the multilateral semantic information contained in the semantic tuples. Moreover, to utilize the heterogeneity of local structures in embedding, the adaptive neighborhood structure is explored by learning the structure embedding, which provides fine-grained ranking lists. Through extensive experiments on three benchmark datasets, the comprehensive results validate the advancement of our proposed DSTH framework over mainstream cross-modal hashing.

Abstract:
Mirror detection in dynamic scenes plays a crucial role in ensuring safety for various applications, such as drone tracking and robot navigation. However, current mirror detection models often fail in areas with mirrors that have a similar visual and color appearance to their surrounding objects. They also struggle to generalize well in complex cases, primarily due to limited annotated datasets. In this work, we propose a novel temporal prompt learning network with depth memory (TPD-Net) to address these critical challenges. Our approach includes several key components. First, we introduce a Temporal Prompt Generator (TPG) to learn temporal prompt features. Then, we devise Multi-layer Depth-aware Adaptor (MDA) modules to progressively adapt prompt features from the TPG, thereby learning mirror-related features by embedding temporal depth information as guidance. Moreover, we further refine these mirror-related features by constructing a depth memory and a Depth Memory Read module to read the temporal depths stored in the memory, boosting video mirror detection. Experimental results on a benchmark dataset show that our TPD-Net significantly outperforms 22 state-of-the-art methods in video mirror detection tasks.

Abstract:
Catastrophic forgetting, the degradation of knowledge about previously seen classes when learning new concepts from a shifting data stream, is a pitfall faced by neural network learning in open environments. Recent research on continual image classification usually relies on storing samples or prototypes to resist this forgetting. We find that during acquiring knowledge of the new classes, the features of old classes gradually disperse, which leads to confusion of features between classes and makes them difficult to discriminate. Coping with feature dispersion would be a key consideration in resisting catastrophic forgetting, which has been neglected in previous works. To this end, we try to address this issue from two perspectives. First, we propose a dispersing feature generation mechanism, which generates pseudo-features based on the pre-pooling prototypes of the old classes to simulate feature dispersion and remind the classifier to adjust the decision boundary. Second, we design a consistent alignment constraint to alleviate the severity of feature dispersion by maintaining consistency in the hidden states of different depths when aligning the current model with the previous model. Extensive experimental results on various benchmarks show the superiority of our proposed method.

Abstract:
Shadows are dark areas, typically rendering low illumination intensity. Admittedly, the infrared image can provide robust illumination cues that the visible image lacks, but existing methods ignore the collaboration between heterogeneous modalities. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark that makes shadow removal independent of specific scenario constraints possible. Extensive experiments demonstrate that S2-ShadowNet outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

Abstract:
Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress has been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.

Abstract:
Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain; however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number of paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness on the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patients, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.

Abstract:
Recent years have witnessed increasing interest in image aesthetics assessment (IAA), which predicts the aesthetic appeal of images by simulating human perception. The state-of-the-art IAA methods, despite their significant advancements, typically rely heavily on time-consuming and labor-intensive human annotation of aesthetic scores. Furthermore, they are subject to the generalization challenge, which is highly desired in real-world applications. Motivated by this, zero-shot image aesthetics assessment (ZIAA) is investigated to achieve robust model generalization without relying on manual aesthetic annotations, which remains largely underexplored. Specifically, a novel aesthetic prompt learning framework for ZIAA, dubbed AesPrompt, is presented in this paper. The key insight of AesPrompt is to emulate the human aesthetic perception process for learning aesthetic-oriented prompts in a multi-granularity manner. First, we develop a new pseudo aesthetic distribution generation paradigm based on a multi-LLM ensemble. Then, external knowledge of multi-granularity prompts encompassing image themes, emotions, and aesthetics is acquired. Through learning the multi-granularity aesthetic-oriented prompts, the proposed method achieves better generalization and interpretability. Extensive experiments on five IAA benchmarks demonstrate that AesPrompt consistently outperforms the state-of-the-art ZIAA methods across diverse-sourced images, covering natural images, artistic images, and artificial intelligence-generated images.

Abstract:
Underwater image enhancement (UIE) aims to mitigate wavelength-dependent absorption and multi-path scattering effects, enabling the recovery of natural colors and rich details. Despite notable progress, consistently achieving high-quality enhancement in both fidelity and perceptual clarity remains a fundamental challenge. To address this, we propose the Laplacian-domain Dual-Focus Enhancer (DFE), an innovative framework consisting of two stages: adaptive diffusion-accelerated low-frequency enhancement (ADALE) and progressive uncertainty-driven high-frequency enhancement (PUHE). Specifically, DFE applies a Laplacian transform to decouple the frequency-specific degradations in underwater images, supporting fidelity- and clarity-oriented enhancement along separate pathways. To facilitate high-fidelity restoration, ADALE incorporates an HSV-guided optimization mechanism (HSV-OM) to establish a robust color and brightness calibration baseline for the low-frequency diffusion model, adaptively managing basic degradations with minimal sampling steps. Furthermore, to enhance contour and detail perception, PUHE models the uncertainty of reference textures and integrates it with feature modulation to progressively reconstruct multi-scale high-frequency structures. The multi-reference underwater texture enhancement (MUTE) dataset further improves image clarity. Extensive experiments demonstrate that our DFE outperforms state-of-the-art (SOTA) methods in both quantitative metrics and visual quality.
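The frequency decoupling described above can be illustrated with a one-level, one-dimensional Laplacian-style split. This is only a sketch under assumptions: the abstract does not specify the exact transform, so a simple box blur stands in for the low-pass filter, and the names `box_blur`, `laplacian_split`, and `reconstruct` are hypothetical rather than taken from DFE.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: mean over a window clipped at the signal borders."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def laplacian_split(signal, radius=1):
    """Split a signal into a low-frequency band and a high-frequency residual."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def reconstruct(low, high):
    """Recombine the bands; the split is invertible up to rounding error."""
    return [l + h for l, h in zip(low, high)]


if __name__ == "__main__":
    row = [1.0, 2.0, 4.0, 8.0]  # stand-in for one row of pixel intensities
    low, high = laplacian_split(row)
    print(low, high, reconstruct(low, high))
```

Because the two bands sum back to the input, each can in principle be enhanced along its own pathway (color and brightness in the low band, contours and texture in the high band) before recombination.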

Abstract:
Existing defense methods fail to defend against unknown attacks and thus raise a generalization issue for adversarial robustness. To remedy this problem, we attempt to delve into some underlying common characteristics among various attacks for generality. In this work, we reveal the commonly overlooked low entropy (LE) prior implied in various adversarial samples, and shed light on universal robustness against unseen attacks in the inference phase. The LE prior is elaborated as two properties across various attacks as shown in Figs. 1 and 2: 1) low entropy misclassification for adversarial samples and 2) lower entropy prediction for higher attack intensity. This phenomenon stands in stark contrast to the naturally distributed samples. The LE prior can instruct existing test-time defense methods, so we propose a two-stage REAL approach: Rectify Adversarial sample based on LE prior for test-time adversarial rectification. Specifically, to align adversarial samples more closely with clean samples, we propose to first rectify adversarial samples misclassified with low entropy by reverse maximizing prediction entropy, thereby eliminating their adversarial nature. To ensure the rectified samples can be correctly classified with low entropy, we carry out secondary rectification by forward minimizing prediction entropy, thus creating a Max-Min entropy optimization scheme. Further, based on the second property, we propose an attack-aware weighting mechanism to adaptively adjust the strengths of Max-Min entropy objectives. Experiments on several datasets show that REAL can greatly improve the performance of existing sample rectification models.
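The quantity at the center of the LE prior is the Shannon entropy of the classifier's predictive distribution. The sketch below only shows how that entropy is computed and how it drops as a prediction becomes more peaked; it is not the REAL rectification procedure, and the logit values are made-up illustrations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Hypothetical logits: a mildly confident and a highly confident prediction.
    mild = softmax([1.0, 0.0, 0.0, 0.0])
    peaked = softmax([10.0, 0.0, 0.0, 0.0])
    print(prediction_entropy(mild), prediction_entropy(peaked))
```

In a Max-Min scheme of the kind the abstract describes, the first stage would ascend this quantity with respect to the input and the second would descend it.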

Abstract:
Deep neural network (DNN)-based image watermarking models have been widely recognized as an effective way to manage the huge amount of AI-generated images. However, the vulnerability of such models to different forms of adversarial attacks has been a critical concern. Among the existing forms of attacks in the literature, image-dependent attacks cannot launch real-time attacks on a large number of watermarked images, because they need to train a new noise image to attack each new watermarked image; image-regeneration attacks either require a lot of information about the watermarking system or cause too much damage to the attacked image. To fill the gap in the existing forms of attacks, in this paper, we propose a novel form of attack named “fast and effective overwrite attack (FEOA)”, which achieves an extremely fast attack speed and strong attack effectiveness. In particular, we discovered a single noise image, when directly added to many watermarked images, can overwrite their true watermark messages to different ones in milliseconds. We also develop an adaptive version of FEOA, which trains k different noise images and applies the principle of divide and conquer to significantly improve attack effectiveness. Our work opens the door to quickly launching massive overwrite attacks on a large number of watermarked images, revealing a new robustness issue of DNN-based image watermarking models. Extensive experiments demonstrate the outstanding attack time efficiency and effectiveness of our methods.

Abstract:
The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continuously attracts the attention of the research community. Despite improvements in overall performance, classic challenges still exist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models is driving the research on vision-language tasks. Because of the massive scale of training data and model parameters, pre-trained models have exhibited excellent performance in numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research with increasing attention and rapid advances. In this paper, we present a comprehensive overview of how vision-language tasks benefit from pre-trained models. First, we review several main challenges in vision-language tasks and discuss the limitations of previous solutions before the era of pre-training. Next, we summarize the recent advances in incorporating pre-trained models to address the challenges in vision-language tasks. Finally, we analyze the potential risks associated with the inherent limitations of pre-trained models, discuss possible solutions, and attempt to provide future research directions.

Abstract:
To achieve efficient compression and reconstruction of vibrotactile signals in the Tactile Internet, an end-to-end neural vibrotactile codec, PC-NSVC, is proposed. By integrating a residual product quantizer (RPQ) within a deep autoencoder, PC-NSVC effectively reduces coding latency through joint training and inference of the entire framework, while simultaneously enhancing the quality of the reconstructed signals. The RPQ allows for control over transmission bitrates by adjusting quantizer parameters, enabling a scalable codec across various network environments and bandwidths. Additionally, PC-NSVC incorporates a psychohaptic model to account for the influence of human perception, further improving the perceptual fidelity of the reconstructed signals. A remote vibrotactile sharing prototype, TouchShare, was developed to conduct transmission and material classification tests. Simulation and transmission results demonstrate that the PC-NSVC scheme significantly improves the quality of reconstructed signals at different compression ratios and supports accurate material classification, outperforming existing schemes.
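The bitrate control mentioned above follows from how residual quantization works in general: each stage quantizes what the previous stage left over, so adding or dropping stages trades bits for accuracy. The sketch below is a scalar toy with fixed, hand-picked codebooks (an assumption for illustration); the RPQ in PC-NSVC is learned and, as a product quantizer, would presumably apply such stages to sub-vectors of the latent code.

```python
def nearest(codebook, value):
    """Index of the codeword closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def residual_quantize(value, codebooks):
    """Quantize value stage by stage; each stage encodes the remaining residual."""
    indices, residual = [], value
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual -= codebook[idx]
    return indices


def dequantize(indices, codebooks):
    """Sum the selected codewords to reconstruct the value."""
    return sum(codebook[i] for i, codebook in zip(indices, codebooks))


if __name__ == "__main__":
    # Hypothetical two-stage codebooks: coarse first, fine second.
    codebooks = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]
    codes = residual_quantize(0.8, codebooks)
    print(codes, dequantize(codes, codebooks))
```

Transmitting only the first index gives a coarser reconstruction at a lower bitrate; sending both indices refines it.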

Affiliations: School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China; School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, P.R. China; Department of Electrical Engineering and Computer Science, School of Engineering, University of California, Merced, CA, USA; School of Computer Science and Technology, Hainan University, Haikou, China; School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, China

Abstract:
Infrared and visible dual-modality vision tasks such as semantic segmentation, object detection, and salient object detection can achieve robust performance even in extreme scenes by leveraging complementary information. However, most existing image fusion-based methods and task-specific frameworks exhibit limited generalization across multiple tasks. Moreover, summing the general representations obtained from foundation models poses challenges, including insufficient semantic information mining and feature fusion. In this paper, we propose a fusion-enhanced network, which effectively enriches semantic information and integrates features based on the complementary characteristics of infrared and visible modalities. The proposed network can extend to high-level vision tasks, showing strong generalization capabilities. Firstly, we adopt the infrared and visible foundation models to extract the general representations. Then, to enrich the semantic information of these general representations for high-level vision tasks, we design the feature enhancement module and the token enhancement module for feature maps and tokens, respectively. Besides, the attention-guided fusion module is proposed for effective fusion by exploring the complementary information of two modalities. Moreover, we adopt the cutout&mix augmentation strategy to conduct the data augmentation, which further improves the ability of the model to mine the regional complementarity between the two modalities. Extensive experiments show that the proposed method outperforms state-of-the-art dual-modality methods in the semantic segmentation, object detection, and salient object detection tasks.

Abstract:
Existing multi-color space guided techniques for underwater image enhancement (UIE) fail to take advantage of the XYZ color space for preserving underwater image details. Meanwhile, existing UIE datasets, typically containing low-quality reference images with distorted colors and blurred structures, lead to inaccurate enhancement mapping between low-quality and high-quality images. To overcome these limitations, we propose a Multi-Color Space Fusion Network (MCSF-Net) for UIE. The MCSF-Net incorporates a Multi-dimensional Feature Fusion Block (MFFB) and a weighted feature fusion scheme to effectively integrate complementary features from both XYZ and RGB color spaces. Moreover, we establish a Large-Scale Mixed UIE dataset (LSMU) by using nine no-reference metrics to filter out low-quality reference images from eight public UIE datasets, enabling more effective network learning. Extensive experiments on mainstream datasets demonstrate that the proposed method outperforms several leading approaches in both color restoration and detail enhancement of various underwater images.

Abstract:
In the context of multi-view clustering, graph learning is recognized as a crucial technique, which generally involves constructing an adaptive neighbor graph based on probabilistic neighbors, and then learning a consensus graph for clustering. However, it is worth noting that these graph learning methods encounter two significant limitations. Firstly, they often rely on Euclidean distance to measure similarity when constructing the adaptive neighbor graph, which proves inadequate in capturing the intrinsic structure among data points in practice, particularly for high-dimensional data. Secondly, most of these methods focus solely on consensus graph, ignoring unique information from each view. Although a few graph-based studies have considered using specific information as well, the modelling approach employed does not exclude the noise impact from the common or specific components. To this end, we propose a novel tensor-based multi-view graph learning framework that simultaneously considers consistency and specificity, while effectively eliminating the influence of noise. Specifically, we calculate similarity using pseudo-Stiefel manifold distance to preserve the intrinsic properties of data. By making an assumption that the learned neighbor graph of each view comprises a consistent part, a specific part, and a noise part, we formulate a new tensor-based target graph learning paradigm for noise-free graph fusion. Owing to the benefits of tensor singular value decomposition (t-SVD) in uncovering high-order correlations, this model is capable of achieving a comprehensive understanding of the target graph. Furthermore, we derive an algorithm to address the optimization problem. Experiments on six datasets have demonstrated the superiority of our method.

Abstract:
Large-scale pre-trained Vision-Language Models (VLMs) have shown impressive cross-modal alignment capabilities in images and text extraction. Despite their strengths, these models are vulnerable to backdoor attacks due to their heavy reliance on training data. Prevailing backdoor attacks on VLMs involve the injection of subtle patches into the pre-training process, causing the model to exhibit harmful behaviors when these triggers appear in test images. However, existing attack methods typically suffer from the following limitations: (1) Polluting pre-training data to train a poisoned VLM is expensive and time-consuming, and this extra retraining phase could potentially harm the performance of the pre-trained VLM; (2) Backdoor triggers, often visible to humans and requiring elaborate placements, significantly raise the risk of being detected and compromise their feasibility. To overcome the above limitations, we propose a novel invisible backdoor attack with Siamese tuning on pre-trained VLMs. Specifically, we design a Siamese Tuning Attack (SiTA) method to subtly manipulate the behavior of the target VLM by parallelizing a Siamese model with the original image encoder and fine-tuning the Siamese model with a poisoned dataset. Furthermore, an imperceptible frequency-domain trigger is employed in the targeted VLM attack, enhancing its robustness and feasibility without necessitating alterations to the image encoder of the initial model. Extensive experiments conducted on three datasets across multiple downstream tasks demonstrate a remarkable attack performance of our proposed SiTA against VLMs.

Abstract:
Learning-based image coding is showing improved compression efficiency, while also offering a novel advantage in enabling computer vision tasks directly within the compressed domain. The latent representation created by deep learning methods inherently contains all visual features, without a computationally expensive synthesis process at the decoder. This paper is an invited extension of a previous solution for JPEG AI compressed domain face detection that adapts a RetinaFace-based detector to operate directly on the latent tensor. In addition to a former single-scale bridging solution, this work provides a novel multi-scale bridging architecture to enable a more effective multi-scale compressed domain face detection. The results show a significant performance gain, improving accuracy up to 20% for detection of tiny faces on the WIDER FACE dataset compared to single-scale bridging, and further narrowing the gap when compared to detection on uncompressed or JPEG AI decoded images. Furthermore, since the computationally expensive decoding step is bypassed and since the bridges consist of lower-complexity networks, the overall processing cost is significantly reduced. Single and multi-scale bridging, respectively, have about 10% and 32% the complexity of applying pixel domain face detection on decoded images. The proposed architecture is expected to be extended to other multiscale sensitive vision tasks, as JPEG AI is not specifically designed for any single downstream application.

Abstract:
Multi-dataset no-reference image quality assessment (NR-IQA) aims to deliver consistent image quality evaluation across a variety of contexts, empowering platform developers to optimize image processing pipelines while maintaining acceptable visual quality. Human vision, when observing images, tends to prioritize local semantics; for example, a blurry sky is perceived differently than a blurry face. This insight forms the basis of many multi-dataset NR-IQA models, which commonly rely on pre-trained deep networks to extract semantic information that is crucial for assessing perceptual quality. Vision Transformer-based pre-trained models often exhibit persistent noise artifacts, as demonstrated by previous studies such as Denoising Vision Transformers; many existing IQA approaches fail to appropriately address these local semantic artifacts, leading to inconsistent local IQA score maps, even when overall performance appears satisfactory. To tackle this, we introduce DINO-IQA, a novel dual-branch network architecture designed for multi-dataset NR-IQA. The first branch focuses on extracting local distortion features, effectively capturing image degradation, while the second branch utilizes denoised DINOv2 from ViT decomposition to extract refined semantic features, free from local artifacts. By enabling visual interaction between distortion and semantic features, our method generates locally consistent quality maps that align more closely with human perception. This approach achieves remarkable accuracy and sets a new benchmark for state-of-the-art multi-dataset NR-IQA performance. Our findings underscore the critical need to address semantic noise in pre-trained networks for enhancing NR-IQA, demonstrating that our dual-branch framework offers a robust solution to this previously underexplored challenge.

Abstract:
Weakly supervised group activity recognition (WSGAR) aims to identify the joint activity of a group of people without relying on hand-annotated human bounding boxes. Existing WSGAR methods typically acquire coarse human-level features by pooling from detected bounding boxes or applying human queries with cross attentions. These approaches focus on learning human relations from the acquired features. However, discriminative person-specific clues might be confused with irrelevant backgrounds, hindering the effectiveness of downstream human relation learning. To address this limitation, we propose a Human Feature Refinement framework that enhances human-level information with graph convolutional networks and self-attention. We define in-box regions as tokens and learn their spatial correspondence through GCN and self-attention. By explicitly extracting in-box details and suppressing irrelevant regions, our method acquires more discriminative human-level features for relation learning and group activity prediction. We further propose a Graph-based Token Merging algorithm to reduce the computation cost of Human Feature Refinement, while minimizing information loss and overfitting risk. Experiments show that our method outperforms previous WSGAR methods on Volleyball, NBA and JRDB-PAR benchmarks, with reduced computation cost.

Abstract:
Temporal Sentence Grounding (TSG) in videos aims to localize a temporal interval from an untrimmed video that is semantically relevant to a given query sentence. To achieve a balance between tremendous annotation burden and grounding performance, we propose a new Weakly Semi-supervised Temporal Sentence Grounding with Points (WSS-TSG-P) task, where the dataset comprises limited fully-annotated video-sentence pairs by start and end timestamps (full label) and a large amount of weakly-annotated pairs by a single point timestamp (point label). Based on this setting, we first introduce a point-to-moment regressor which converts point annotations to pseudo moment labels. To train a good regressor for reliable pseudo moment labels, we propose a point-guided feature aggregation module to aggregate cross-modal representations based on the prototype feature at the given point position. In addition, we propose to perform regressor self-training and design pseudo label generation strategies to exploit both full annotations and point annotations. All heterogeneous labels (full, pseudo moment, and point labels) are used to train a TSG backbone. Moreover, we propose a novel point-guided group contrastive learning method by constructing reliable positive and negative sets and re-weighting pseudo moment labels to further improve the model performance. Extensive experiments on benchmark datasets verify that our proposed method outperforms other semi-supervised learning methods and bridges the performance gap between weakly-supervised and fully-supervised learning methods in TSG.

Abstract:
Domain generalization (DG) aims to learn a model from source domains and apply it to unseen target domains with out-of-distribution data. Owing to CLIP’s strong ability to encode semantic concepts, it has attracted increasing interest in domain generalization. However, CLIP often struggles to focus on task-relevant regions across domains, i.e., domain-invariant regions, resulting in suboptimal performance on unseen target domains. To address this challenge, we propose an attention-refocusing scheme, called Simulate, Refocus and Ensemble (SRE), which learns to reduce the domain shift by aligning the attention maps in CLIP via attention refocusing. SRE first simulates domain shifts by performing augmentation on the source data to generate simulated target domains. SRE then learns to reduce the domain shifts by refocusing the attention in CLIP between the source and simulated target domains. Finally, SRE utilizes ensemble learning to enhance the ability to capture domain-invariant attention maps between the source data and the simulated target data. Extensive experimental results on several datasets demonstrate that SRE generally achieves better results than state-of-the-art methods. The code is available at: https://github.com/bitPrincy/SRE-DG.

Abstract:
Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data—including images, text, coordinates, and parsing maps—into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM’s competitive and even superior performance across multiple human-centric referring tasks.

Abstract:
Person re-identification (ReID) aims to match person images of the same identity under different camera views. Conventional ReID models mainly consider a closed-world setting where person identities in query and gallery are exactly the same. However, in real-world applications, query identities and gallery identities usually do not exactly contain the same persons. Therefore, open-world ReID has been proposed to match the images of gallery identities (targets) with a large number of non-gallery identities (non-targets). Since some non-targets are quite similar to the targets, the ReID model may make incorrect judgments when verifying these non-targets. To solve this problem, we leverage the impressive cross-modal matching capabilities of the large vision-language model (VLM) to construct Negative Semantic guided identity boundaries for each person to develop the open-world ReID model (NS-ReID). To construct the identity boundary, we propose Virtual Non-target Repulsion that utilizes negative semantics to prompt the ReID model to push virtual non-targets away from the targets. The prompts expressing negative semantics offer a different perspective to guide the training process to avoid contradictory optimization. Moreover, we propose the Dual-Boost Refinement Learning strategy to train learnable identity prompts to capture detailed identity information, which is essential for constructing the identity boundary since the variations among identities are comparatively small. These facilitate the model in constructing the wide identity boundary of each person. Extensive experiments on two benchmark ReID datasets demonstrate that our proposed NS-ReID achieves state-of-the-art performance compared with existing methods.

Abstract:
Videos are generally compressed to save storage and transmission bandwidth. Popular lossy video compression inevitably leads to Perceivable Encoding Artifacts (PEAs) that affect users' visual experience. Thus, Compression Artifact Removal (CAR) methods have emerged to eliminate perceivable encoding artifacts after video coding. However, an efficient artifact discrimination and evaluation method to guide the optimization of CAR methods is still lacking. To solve this problem, we make the first attempt to propose an Artifact Perception and Evaluation Network (APE-Net) that can accurately locate artifacts and evaluate their impact on user experience. First, we propose an Artifact Perception Module (APM) that captures PEAs of various types with a long-tailed distribution through attention learning and data re-weighting, thus greatly improving the perception capability for video compression artifacts. Second, we design an Artifact Evaluation Module (AEM) to fuse all recognized PEAs with visual saliency and random forest regression, which helps the artifact perception model align with human visual characteristics in video quality assessment tasks. Experimental results demonstrate that our proposed APE-Net is superior to the state-of-the-art algorithms on compressed video quality assessment. Our code will be made publicly available after the peer review process.

Abstract:
Occluded person re-identification (ReID) poses substantial challenges in computer vision, primarily due to incomplete information and occlusion interference. Although Transformer architectures have become dominant in ReID due to their strong feature modeling capabilities, their lack of an adaptive weight allocation mechanism for multi-granularity feature processing limits their ability to extract generalizable and robust features. Recently, Masked Image Modeling (MIM) has demonstrated considerable promise in visual tasks, but its integration into ReID models remains underexplored. This paper presents AMFOR (Adaptive Multi-granularity feature Fusion and Occlusion Reconstruction), a novel framework combining MIM and Transformer architectures. AMFOR consists of three key components: AMFF-Encoder, HPR-Decoder, and teacher-student Decoder. The AMFF-Encoder enables adaptive fusion of multi-granularity features through learnable queries, allowing interaction between text-visual features and visual features extracted from multiple Transformer layers. The HPR-Decoder conceptualizes occluded regions in pedestrian images as reconstructable patches, guiding the encoder to extract more discriminative features through reconstruction. Additionally, the self-distillation teacher-student decoder is employed to refine pedestrian part features, further optimized by the proposed AMGDLoss. This paper represents the first successful implementation of the MIM mechanism in person ReID models. Empirical evaluations on five benchmark datasets, covering both occluded (Occluded-DukeMTMC, Occluded-REID, and P-DukeMTMC) and complete (Market-1501 and DukeMTMC-reID) scenarios, demonstrate that AMFOR outperforms existing state-of-the-art methods in person ReID.

Affiliations: School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China; School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou, China; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

Abstract:
In recent years, scene text detection research has increasingly focused on arbitrary-shaped texts, where text representation is a fundamental problem. However, most existing methods still struggle to separate adjacent or overlapping texts due to ambiguous spatial positions of points or segmentation masks. Besides, the time efficiency of the entire pipeline is often neglected, resulting in sub-optimal inference speed. To tackle these problems, we first propose a novel text representation method based on robust subspace recovery, which robustly represents complex text shapes by combining orthogonal basis vectors learned from labeled text contours. These basis vectors capture basis contour patterns with distinct information, enabling clearer boundaries even in densely populated text scenarios. Moreover, we propose a dynamic sparse assignment scheme for positive samples that adaptively adjusts their weights during training, which not only accelerates inference by eliminating redundant predictions but also enhances feature learning by providing sufficient supervision signals. Building on these innovations, we present TextRSR, an accurate and efficient scene text detection network. Extensive experiments on challenging benchmarks demonstrate the superior accuracy and efficiency of TextRSR compared to state-of-the-art methods. In particular, TextRSR achieves an F-measure of 88.5% at 37.8 frames per second (FPS) on the CTW1500 dataset and an F-measure of 89.1% at 23.1 FPS on the Total-Text dataset.

Abstract:
The malicious misuse and widespread dissemination of AI-generated images pose a significant threat to the authenticity of online information. Current detection methods often struggle to generalize to unseen generative models, and the rapid evolution of generative techniques continuously exacerbates this challenge. Without adaptability, detection models risk becoming ineffective in real-world applications. To address this critical issue, we propose a novel three-stage domain continual learning framework designed for continuous adaptation to evolving generative models. In the first stage, we employ a strategic parameter-efficient fine-tuning approach to develop a transferable offline detection model with strong generalization capabilities. Building upon this foundation, the second stage integrates unseen data streams into a continual learning process. To efficiently learn from limited samples of novel generated models and mitigate overfitting, we design a data augmentation chain with progressively increasing complexity. Furthermore, we leverage the Kronecker-Factored Approximate Curvature (K-FAC) method to approximate the Hessian and alleviate catastrophic forgetting. Finally, the third stage utilizes a linear interpolation strategy based on Linear Mode Connectivity, effectively capturing commonalities across diverse generative models and further enhancing overall performance. We establish a comprehensive benchmark of 27 generative models, including GANs, deepfakes, and diffusion models, chronologically structured up to August 2024 to simulate real-world scenarios. Extensive experiments demonstrate that our initial offline detectors surpass the leading baseline by +5.51% in terms of mean average precision. Our continual learning strategy achieves an average accuracy of 92.20%, outperforming state-of-the-art methods.
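
The linear interpolation step of the third stage can be illustrated with a minimal sketch. The abstract does not state how the interpolation coefficient is chosen or which parameters are merged, so the function name `interpolate_weights`, the coefficient `lam`, and the toy checkpoints below are assumptions made only for illustration; the sketch shows plain weight-space interpolation between two checkpoints, not the authors' implementation.

```python
def interpolate_weights(theta_old, theta_new, lam):
    """Return (1 - lam) * theta_old + lam * theta_new, parameter by parameter.

    Checkpoints are represented as dicts mapping a parameter name to a flat
    list of floats (a stand-in for real tensors; this layout is an assumption).
    """
    if theta_old.keys() != theta_new.keys():
        raise ValueError("both checkpoints must share the same parameter names")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    merged = {}
    for name, old_values in theta_old.items():
        new_values = theta_new[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * o + lam * n
                        for o, n in zip(old_values, new_values)]
    return merged


# Hypothetical toy checkpoints: one before and one after a continual-learning step.
old = {"w": [0.0, 2.0], "b": [1.0]}
new = {"w": [4.0, 2.0], "b": [3.0]}
midpoint = interpolate_weights(old, new, 0.5)
```

With `lam = 0` the old checkpoint is returned unchanged and with `lam = 1` the new one; intermediate values trade off retained and newly learned behaviour, which is only meaningful when the two checkpoints are linearly mode-connected.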

Abstract:
Humans are capable of inferring dynamic context from a still image and, with the provision of additional commonsense knowledge, can accurately complete visual commonsense reasoning tasks. Nevertheless, this remains a highly challenging cognitive-level task for current vision-language models. Previous work has primarily focused on utilizing models fine-tuned for specific downstream tasks and introducing external world knowledge to tackle these challenging tasks, while neglecting the importance of accurate context and the key role of commonsense knowledge in reasoning. In this paper, we propose a novel framework to enhance visual commonsense reasoning by incorporating context and commonsense knowledge. We decompose the visual commonsense reasoning problem into four distinct but interrelated sub-problems and combine visual language models with a large language model to enable zero-shot reasoning. The uniqueness of this work lies in the proposed commonsense knowledge filtering module, which filters out relevant commonsense knowledge through the causal strength of visual context. This process constructs Visual Context and Commonsense-guided Causal Chain-of-Thought ($\mathrm{VC}^3$-CoT) reasoning paths, thereby providing double robustness to visual commonsense reasoning by incorporating a weighted majority voting strategy. Extensive experiments on several downstream tasks demonstrate that the proposed method significantly improves performance compared to baseline models and the state-of-the-art method, and confirm the effectiveness of the proposed components.
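
The weighted majority voting mentioned above can be sketched as follows. The abstract does not specify how the vote weights are computed, so the weights attached to each reasoning path here are hypothetical placeholders, and `weighted_majority_vote` is an illustrative name rather than part of the described framework.

```python
from collections import defaultdict


def weighted_majority_vote(candidates):
    """Pick the answer with the largest total weight.

    `candidates` is an iterable of (answer, weight) pairs, one per reasoning
    path. Weights must be non-negative.
    """
    totals = defaultdict(float)
    for answer, weight in candidates:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        totals[answer] += weight
    if not totals:
        raise ValueError("at least one candidate is required")
    # max over dict keys, ranked by their accumulated weight
    return max(totals, key=totals.get)


# Hypothetical reasoning paths: (predicted answer, confidence weight).
paths = [("A", 0.9), ("B", 0.5), ("B", 0.3), ("C", 0.2), ("B", 0.25)]
winner = weighted_majority_vote(paths)
```

Here a single confident path for "A" (0.9) is outvoted by three weaker paths for "B" (1.05 in total), which is the behaviour that distinguishes weighted voting from simply taking the most confident path.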

Abstract:
Inherently equipped with arbitrary resolution and multi-view consistency, the Neural Radiance Field (NeRF) as an implicit scene representation has drawn extensive attention. While traditional NeRFs excel at novel view synthesis (NVS) under ideal conditions, they overlook the potential of learning consistent geometric representations across varying sight qualities. Current methods mainly focus on optimizing synthesis under clear visibility, which limits their effectiveness in downstream scene understanding tasks where robust geometry comprehension is crucial. In this paper, we propose an NVS pre-training technique named ShadowNeRF, which first synthesizes degraded views with shadowed regions to challenge the model to infer complete scene geometries. We then design a self-supervised sight recovery process with a two-stage unshadowing framework, which progressively recovers neighboring areas and reveals geometric properties of invisible regions. This pre-training strategy of degradation synthesis and recovery, when combined with task-specific fine-tuning, enhances the model's understanding of the underlying scene structure and strengthens its ability to process scenes under varying sight conditions. Through extensive experiments, we demonstrate that our pre-training and fine-tuning pipeline significantly improves the model's performance in semantic segmentation and 3D object detection, as well as the reconstruction quality of complex scenes.

Affiliations: School of Information Engineering, Chang'an University, Xi'an, China; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, and School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, and School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore; Institute of Big Data, Fudan University, Shanghai, China

Abstract:
Geometry problem solving (GPS) requires high-level symbolic and logical reasoning based on geometry theorem knowledge to arrive at the answer. Despite the remarkable advances achieved by Large Language Models (LLMs) in various problem-solving tasks, they still struggle to perform rigorous multi-step geometry reasoning, which is essential for GPS. In this paper, we propose a dynamic tree-based geometry problem solver named GeoTree, which combines a knowledgeable LLM with a rigorous symbolic solver to perform geometry reasoning cooperatively. Specifically, an iterative multi-step geometry reasoning process is performed dynamically based on a tree-like structure, thereby emulating divergent and deliberate human problem-solving thinking. Each geometry reasoning step is completed collaboratively through four components, consisting of Theorem Seeker, Symbolic Solver, Evaluator, and Controller. First, Theorem Seeker prompts LLMs to seek out candidate theorems with their inherent geometry theorem knowledge. Subsequently, Symbolic Solver applies the theorems on the known conditions to obtain new additional conditions. Then, Evaluator assesses the availability of the theorems and prompts LLMs to judge the usefulness of these new conditions for the problem target, which serves as the heuristic guidance for subsequent reasoning. Finally, Controller determines the termination state, which decides whether to continue invoking the other three components for further attempts. Extensive experiments on Geometry3K demonstrate the superiority of GeoTree in accuracy, efficiency, and explainability.

Abstract:
Existing provably secure linguistic steganographic methods typically rely on white-box extraction, which necessitates access to large language models. This requirement is impractical in environments with limited resources. To tackle this issue, we propose Disreo, a provably secure linguistic steganography method based on distribution reorganization, which extracts the messages without accessing the underlying language model. This is achieved through token position randomization and output probability reorganization for message embedding. Moreover, secret message extraction requires only the synchronization of token positions used during embedding, making it both feasible and fast for devices with constrained computational capabilities. Theoretically, the security of Disreo can be reduced to the security of the encryption algorithm we employ, and our experimental analyses confirm that Disreo maintains distribution consistency between stego and cover texts in expectation. In practice, Disreo achieves an average extraction time of 0.015 seconds for 5 bits of secret messages from 100 tokens, with 100% extraction accuracy. By transitioning from white-box extraction to more practical no-box extraction scenarios, Disreo broadens the scope of steganography applications.

Abstract:
Convolutional Neural Network (CNN)-based image super-resolution (SR) has exhibited impressive success on low-resolution (LR) images with known degradations. However, such approaches struggle to maintain their performance in practical scenarios where the degradation process (i.e., blur and downsampling) is unknown. Although existing blind SR methods address this problem through blur kernel estimation, their perceptual quality and reconstruction accuracy remain unsatisfactory. In this paper, we analyze the degradation of a high-resolution (HR) image in terms of intrinsic image components according to a degradation-based formulation model. We propose a components decomposition and co-optimization network (CDCN) for blind SR. First, CDCN decomposes the input LR image into structure and detail components in feature space. Then, the mutual collaboration block (MCB) is presented to exploit the relationship between the two components. In this way, the detail component provides informative features to enrich the structural context, and the structure component carries structural context to better reveal details, in a mutually complementary manner. After that, we present a degradation-driven learning strategy to jointly supervise the HR image detail and structure restoration process. Finally, a multi-scale fusion module followed by an upsampling layer is designed to fuse the structure and detail features and perform SR reconstruction. Empowered by such degradation-based components decomposition, collaboration, and mutual optimization, we can bridge the correlation between component learning and degradation modelling for blind SR, thereby producing SR results with more accurate textures. Extensive experiments on both synthetic SR datasets and real-world images show that the proposed method achieves state-of-the-art performance compared to existing methods.

Abstract:
Underwater images often suffer from color distortion, reduced contrast, and blurriness due to light refraction, absorption, and scattering. In this paper, we propose a coarse-to-fine deep Pyramid network for Underwater Image Enhancement (PyUIE). Specifically, PyUIE begins by decomposing the input image into high- and low-frequency components using a Laplacian pyramid. The low-frequency residual, which primarily contains lighting and color information, is processed with a lightweight deterministic color mapping network to correct global illumination and color distortions. Concurrently, the high-frequency components containing the fine details are enhanced in a coarse-to-fine manner, such that each higher scale is guided by the reconstruction from the adjacent lower scale. This hierarchical strategy effectively mitigates the risk of over-enhancement by avoiding excessive modifications to the high-frequency components. Additionally, we implement a multi-scale supervised training strategy, enabling the model to learn and reconstruct features across multiple scales, which enhances its ability to capture diverse details and improves its generalization and robustness. Extensive experiments demonstrate that our method successfully restores fine details and small structures in underwater images while producing vivid and visually appealing colors, thereby outperforming existing enhancement methods in both qualitative and quantitative evaluations.
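
The Laplacian pyramid decomposition that PyUIE starts from can be sketched on a one-dimensional signal. This is a simplified illustration under stated assumptions: pairwise averaging stands in for the usual Gaussian filter, nearest-neighbour repetition stands in for interpolation, and the function names are hypothetical. It shows only the split into high-frequency bands plus a low-frequency residual and the exact reconstruction; the enhancement networks described in the abstract are not modelled.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (box filter)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the length by repeating every sample."""
    return [value for value in signal for _ in range(2)]


def build_laplacian_pyramid(signal, levels):
    """Return (high-frequency bands from fine to coarse, low-frequency residual)."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse)
        # The band stores what the coarser level cannot represent.
        bands.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    return bands, current


def reconstruct(bands, residual):
    """Invert the decomposition, going from the coarsest level to the finest."""
    current = list(residual)
    for band in reversed(bands):
        current = [u + d for u, d in zip(upsample(current), band)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
bands, residual = build_laplacian_pyramid(signal, levels=2)
restored = reconstruct(bands, residual)
```

Because each band stores the exact difference between a level and its upsampled coarser version, reconstruction is lossless. Any enhancement applied to the residual (global colour and illumination) or to the bands (fine detail) is therefore the only source of change in the output.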

Abstract:
Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.

Abstract:
Despite significant advancements in large visual-language models (LVLMs), hallucinations remain a major bottleneck in their practical applications. One key factor contributing to hallucinations is the over-reliance on language priors during the autoregressive text generation process. Visual Contrastive Decoding (VCD), a popular technique for mitigating hallucinations, perturbs the visual input and compares the perturbed output with the original. However, it often overlooks the gradual attenuation of visual information within the decoder, limiting the model’s ability to generate text based on actual visual content. We propose a novel, training-free method—Visual-Enhanced Contrastive Decoding (VECD)—which addresses this issue by amplifying visual information within the decoder, thereby reducing hallucinations caused by excessive reliance on language priors. VECD dynamically selects later layers for visual injection, while retaining only essential visual tokens in early layers. This approach enhances the generation process by adaptively balancing visual and language priors. By comparing outputs with and without visual amplification, we derive a refined probability distribution for the next token. Moreover, we improve the beam search algorithm by introducing a visually guided token selection strategy, enabling the generation of text that aligns more closely with the image content. Our extensive experiments show that VECD significantly reduces hallucinations and improves the quality of generated text, demonstrating its effectiveness as a practical solution.
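
The comparison of outputs with and without visual amplification can be sketched with the contrastive-decoding form commonly used by VCD-style methods. The abstract does not give VECD's exact formula, so the combination rule `(1 + alpha) * enhanced - alpha * plain`, the parameter `alpha`, and the toy logits below are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token(logits_enhanced, logits_plain, alpha=1.0):
    """Next-token distribution from contrasting two forward passes.

    Tokens whose score rises under visual amplification are boosted; tokens
    favoured mainly by the language prior are suppressed.
    """
    if len(logits_enhanced) != len(logits_plain):
        raise ValueError("both logit vectors must cover the same vocabulary")
    contrasted = [(1.0 + alpha) * e - alpha * p
                  for e, p in zip(logits_enhanced, logits_plain)]
    return softmax(contrasted)


# Hypothetical three-token vocabulary and logits.
vocab = ["cat", "dog", "sofa"]
logits_plain = [2.0, 2.2, 0.5]     # without amplification: prior slightly favours "dog"
logits_enhanced = [3.0, 2.0, 0.5]  # with amplification: visual evidence favours "cat"
probs = contrastive_next_token(logits_enhanced, logits_plain, alpha=1.0)
best = vocab[probs.index(max(probs))]
```

Setting `alpha = 0` recovers ordinary decoding from the amplified pass; larger values penalise tokens that the un-amplified pass already preferred.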

Abstract:
The objective of the Point-of-Interest (POI) recommendation system is to predict potential future visits based on users' check-in histories. However, due to factors such as users' tendency to visit nearby locations, time and space mobility costs, users are more likely to visit a limited number of POIs within a confined geographical area. Consequently, POI recommendation faces a more severe data sparsity issue compared to other recommendation scenarios. Current research has shown that incorporating multimodal content information into POI recommendations can effectively alleviate the data sparsity problem. However, existing methods still have the following limitations: 1) Multimodal noise hinders the effective extraction of multimodal content. 2) Multimodal feature fusion ignores the differential impact of various modalities on user decisions. To address these issues, we propose an Interest-aware MultiModal adaptive fusion framework for POI recommendations (IMMPOI). Specifically, we propose an interest-oriented purifier to perform multimodal noise filtering based on user preference, and introduce a disentangled multimodal graph encoder to accurately capture fine-grained behavior features, multimodal features, geographical, and sequential relationships between users and POIs. Then, we develop an interest-aware multimodal fuser that learns comprehensive multimodal representations of users and POIs by adaptively integrating multimodal content features and context features based on a self-supervised strategy. Extensive experiments on four real-world datasets demonstrate that IMMPOI achieves a 6% to 10% performance improvement compared to state-of-the-art methods.

Abstract:
Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.

Affiliations: Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Zhejiang, China; School of Software Technology, Zhejiang University, Hangzhou, China; FinVolution Group, Shanghai, China; Department of Computer Science and Engineering, University of Nevada, Reno, NV, USA; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Abstract:
Contemporary weakly-supervised object localization (WSOL) methods have primarily focused on addressing the challenge of localizing the most discriminative region while largely overlooking the relatively less explored issue of biased activation—incorrectly spotlighting co-occurring background with the foreground feature. In this paper, we conduct a thorough causal analysis to investigate the origins of biased activation. Based on our analysis, we attribute this phenomenon to the presence of co-occurring background confounders. Building upon this profound insight, we introduce a pioneering paradigm known as Counterfactual Co-occurring Learning (CCL), meticulously engendering counterfactual representations by adeptly disentangling the foreground from the co-occurring background elements. Furthermore, we propose an innovative network architecture known as Counterfactual-CAM. This architecture seamlessly incorporates a perturbation mechanism for counterfactual representations into the vanilla CAM-based model. By training the WSOL model with these perturbed representations, we guide the model to prioritize the consistent foreground content while concurrently reducing the influence of distracting co-occurring backgrounds. To the best of our knowledge, this study represents the initial exploration of this research direction. Our extensive experiments conducted across multiple benchmarks validate the effectiveness of the proposed Counterfactual-CAM in mitigating biased activation.

Abstract:
Visual dialog aims to facilitate the answering of multi-round questions by effectively integrating dialog history and the relevant content of images. Existing methods in visual dialog predominantly concentrate on devising multi-modal data interaction architectures to augment multi-modal fusion performance, but they often disregard inherent dataset selection biases. This oversight can lead to imbalanced feature learning and compromise the robustness of the model. In this paper, we propose a Debiased Visual Dialog model (DVD) to mitigate the influence of biases. Specifically, we concretize these biases as spurious relationships between foreground and background knowledge in both image and dialog history modalities and design a dual-encoding workflow to disentangle them effectively. Additionally, we introduce a knowledge bias indicator for each sample, enabling us to assess and quantify the impact of biases on the learning process. By employing a generalized cross-entropy loss, we enhance the distinction of knowledge biases, which significantly improves the efficiency of feature disentanglement. Extensive comparative experiments against state-of-the-art methods, along with ablation studies, validate the effectiveness of our DVD model. These results also substantiate the promising potential of debiasing efforts in advancing the field of visual dialog and vision-language research.
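
For reference, the generalized cross-entropy (GCE) loss in its widely used form is L_q(p) = (1 - p^q) / q, where p is the predicted probability of the true class and q lies in (0, 1]. The abstract does not state which variant or value of q DVD uses, so the default `q=0.7` and the function name below are assumptions; the sketch only shows how the loss interpolates between cross-entropy and a bounded, MAE-like loss.

```python
import math


def generalized_cross_entropy(prob_true_class, q=0.7):
    """GCE loss (1 - p**q) / q for a single sample.

    As q approaches 0 the loss approaches -log(p) (standard cross-entropy);
    at q = 1 it equals 1 - p, which is bounded and less sensitive to
    hard or mislabeled samples.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if not 0.0 <= prob_true_class <= 1.0:
        raise ValueError("prob_true_class must be a probability")
    return (1.0 - prob_true_class ** q) / q


loss_confident = generalized_cross_entropy(0.9)
loss_uncertain = generalized_cross_entropy(0.1)
loss_near_ce = generalized_cross_entropy(0.5, q=1e-6)
```

Because the gradient of this loss places more weight on samples the model already predicts confidently than cross-entropy does, it is often used to make a model latch onto easy, potentially biased cues, which can then be separated from the remaining signal.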

Abstract:
Deep multi-view clustering (MVC) has gained widespread attention as it can effectively mine consistent information from multiple views and improve clustering performance. However, view bias often exists between views (i.e., the quality differences between views). Treating all views equally inevitably destroys structural information when simply concatenating or summing the embedded representation of multiple views. To alleviate this issue, we propose a deep multi-view clustering with intra-view similarity and cross-view correlation learning (MISCC), facilitating the intra-view discriminability and inter-view complementarity. Specifically, we utilize the intra-view inherent structure information to dynamically identify semantically similar samples within each view. By aggregating their embedding representations, fine-grained structures are enhanced to boost intra-cluster compactness and inter-cluster separation. Then, we construct a cross-view correlation learning module to align semantically related views while preserving the distinctive features of irrelevant views. Based on them, a centralized clustering alignment strategy is proposed to align the similarity distribution and clustering structure between each view and the unified view, balancing the diverse information among multiple views. By jointly training these modules, the unified representation is optimized to capture more discriminative information from multiple views. Extensive experiments conducted on eleven multi-view datasets demonstrate that MISCC outperforms the state-of-the-art clustering methods.

Abstract:
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities (text, audio, video, and motion) within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
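
The R@1 and R@10 figures quoted above are recall-at-K scores. As a minimal sketch of how such a score is commonly computed, assuming a square query-by-gallery similarity matrix in which the correct match for query i is gallery item i, one could write the following. The function name and the toy matrix are illustrative assumptions and do not reproduce the paper's evaluation protocol.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Gallery indices sorted from most to least similar to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical similarities: row = query (e.g. text), column = gallery item (e.g. motion).
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(similarity, 1), recall_at_k(similarity, 2))
```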

Abstract:
Despite significant advancements in multimodal pre-training, effectively integrating and using latent semantic information across multiple modalities remains a challenge. In this paper, we introduce TextBridge, a text-centered framework that uses the text modality as a semantic anchor to guide cross-modal integration and alignment. TextBridge employs frozen encoders from state-of-the-art pre-trained models and introduces an innovative modality bridge module that enhances semantic alignment and reduces redundancy among different modal features. The framework also incorporates a multi-projection text feature fusion method, enhancing the alignment and integration of text features from diverse modalities into a cohesive semantic representation. To optimize the integration of multimodal information, we make the text encoder trainable and use a text-centered contrastive loss function to enhance the model’s ability to capture complementary information across modalities. Extensive experiments on the M5Product dataset demonstrate that TextBridge significantly outperforms the SCALE model in mean average precision (mAP) and precision (Prec), underscoring its effectiveness in multimodal retrieval tasks.

Abstract:
Remote sensing (RS) images are prone to various degradations, which poses challenges to downstream tasks. Although existing single-task remote sensing image restoration methods are effective, they lack generalizability across tasks. All-in-one methods can handle multiple degradation tasks, but they usually focus on spatial information, ignoring the physical properties of the degradation information. To address the above limitations, we propose a Multiscale Spatial-Frequency Degradation Decoupling framework for All-in-One remote sensing image restoration (SFD^2IR), which decouples degradation features across different tasks to guide the model in performing task-specific image restoration. Specifically, a task-specific instruction generator (TIG) is proposed first to transform degradation features into task-specific prompts. Then, a multi-scale multi-frequency enhancement (MME) module is designed to decouple degradation effects from both spatial and frequency perspectives, thus enhancing the model’s adaptability to various degradation types. Finally, a prompt feature refinement (PFR) module is developed to further refine the model’s response to degraded tasks. Extensive experiments demonstrate that the proposed method achieves excellent performance on different RSIR tasks, including cloud removal, deblurring, dehazing, and super-resolution.

Abstract:
Earth Mover’s Distance (EMD) is an important similarity measure between two distributions, commonly used in computer vision and many other application domains. However, its exact calculation is computationally and memory intensive, which hinders its scalability and applicability for large-scale problems. Various approximate EMD algorithms have been proposed to reduce computational costs, but they suffer from lower accuracy and may require additional memory usage or manual parameter tuning compared to the exact calculation. In this paper, we present a novel approach, NNS-EMD, to approximate EMD using Nearest Neighbor Search (NNS), achieving high accuracy, low time complexity, and high memory efficiency. The NNS operation reduces the number of data points processed in each NNS iteration and offers opportunities for parallel processing. We further accelerate NNS-EMD via vectorization on GPU, which is especially beneficial for large datasets. We compare NNS-EMD with both the exact EMD and state-of-the-art approximate EMD algorithms in image and document classification and image retrieval tasks. We also apply NNS-EMD to calculate transport mapping and realize color transfer between images. NNS-EMD is 44× to 135× faster than the exact EMD implementation and offers superior accuracy, speedup, and memory efficiency compared to existing approximate EMD methods.
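
The abstract above does not spell out the NNS-EMD procedure, so the following is only an illustrative sketch of the general idea of building a transport plan from repeated nearest-neighbour queries. It is not the published algorithm. In each round every remaining consumer point picks its nearest remaining supplier, and each supplier serves its requesters closest first. Because the resulting plan is feasible, its cost is an upper bound on the exact transport cost. All names and the Euclidean ground distance are assumptions.

```python
import math


def greedy_nn_transport_cost(suppliers, supply, consumers, demand, tol=1e-12):
    """Cost of a greedy nearest-neighbour transport plan between two weighted point sets."""
    supply = list(supply)
    demand = list(demand)
    if abs(sum(supply) - sum(demand)) > 1e-9:
        raise ValueError("total supply and total demand must match")
    cost = 0.0
    while True:
        active_suppliers = [i for i, s in enumerate(supply) if s > tol]
        active_consumers = [j for j, d in enumerate(demand) if d > tol]
        if not active_suppliers or not active_consumers:
            break
        # Each remaining consumer requests mass from its nearest remaining supplier.
        requests = {}
        for j in active_consumers:
            nearest = min(active_suppliers,
                          key=lambda i: math.dist(suppliers[i], consumers[j]))
            requests.setdefault(nearest, []).append(j)
        # Each supplier serves its requesters, closest first, until it runs out.
        for i, requesters in requests.items():
            requesters.sort(key=lambda j: math.dist(suppliers[i], consumers[j]))
            for j in requesters:
                flow = min(supply[i], demand[j])
                cost += flow * math.dist(suppliers[i], consumers[j])
                supply[i] -= flow
                demand[j] -= flow
                if supply[i] <= tol:
                    break
    return cost


# Hypothetical 1-D example: two unit masses, each one unit away from its nearest target.
example_cost = greedy_nn_transport_cost([(0.0,), (10.0,)], [1.0, 1.0],
                                        [(1.0,), (9.0,)], [1.0, 1.0])
print(example_cost)
```

Every round either exhausts a supplier or fully satisfies all consumers that requested it, so the set of active points shrinks from round to round, which matches the abstract's remark that fewer data points are processed in each NNS iteration.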

Abstract:
Frequency domain-based methods have demonstrated promising performance in Camouflaged Object Detection (COD) tasks because of their enhanced power for distinguishing between objects and the background in the frequency domain. However, these methods often overlook the interference caused by task-irrelevant cues such as background textures. These extraneous factors are learned alongside task-relevant features by the employed network, increasing the number of false positives. Therefore, we propose a camouflaged object detection method based on the Information Bottleneck (IB) theory. The aim is to obtain a robust representation that retains the essential features needed for prediction while minimizing the redundant information derived from both the RGB and frequency domains. Specifically, we propose a Feature Selection Information Bottleneck Module (FSIBM). By explicit supervision, this module minimizes the mutual information between the fused feature from two domains and the predictive features, thereby weakening task-irrelevant information. Simultaneously, the FSIBM maximizes the mutual information between the predictive features and the ground truth (i.e., emphasizing task-related elements). Additionally, we introduce a Cross-Domain Awareness Interaction Module (CDAIM), which establishes self-reinforcement for the object attributes within each domain and facilitates cross-domain complementarity. This enables the capture of sufficient discriminative features from both domains. To verify the generalization ability of the proposed method, we applied it to three benchmark datasets, on which our method outperformed the corresponding state-of-the-art methods.

Abstract:
Scene text reading is a crucial task for scene understanding. Text detection, as a fundamental task in scene text reading, has recently garnered significant attention. Among various approaches, segmentation-based methods stand out for their flexible pixel-level prediction capabilities. However, two main issues remain. 1) These methods treat all text instances as a pixel set during training, causing the features of large-scale instances to dominate the model optimization process. As a result, the optimization deviates from the instance-level objectives. 2) Segmentation methods filter candidates based on pixel-level class scores, whereas what is needed is an evaluation of whether an instance is text, which also deviates from the original goals. To address these issues, we propose an Instance-Equal Feature Guide Module (IEFGM), a Cross-Level Feature Interaction Module (CLIFM), and a Pixel-Instance Fusion Discriminator (PIFD) to balance optimization strategies with practical goals. The IEFGM introduces instance-level features and positional information, guiding the model to treat instances of different scales equally at the feature level. The CLIFM encourages feature interaction across different levels, enabling the model to recognize text from various perspectives. Unlike existing methods that filter candidates using pixel-level results, the PIFD integrates both instance-level and pixel-level information to identify candidate regions, aligning with the original goals of text detection. A series of ablation studies demonstrates the effectiveness of the proposed modules. Extensive experiments across six datasets from different scenes demonstrate that our method outperforms existing state-of-the-art approaches.

Abstract:
With the rapid rise of short video social platforms, the spread of fake news videos has become a global challenge. Short videos, which integrate multiple modalities such as text, images, and audio, have a powerful visual and auditory impact, making fake news more prone to widespread dissemination and causing serious societal consequences. However, the complex fusion of multimodal information in fake news videos, coupled with editing artifacts that often blur the distinction between real and fake content, presents considerable challenges to traditional detection methods. To address these challenges, this paper proposes a fake news video detection method based on the Knowledge-Enhanced Dynamic Scene Graph Attention Network (KDSGAT). This method captures temporal correlations and local semantic differences in visual scenes by leveraging dynamic scene graph networks, while enhancing semantic understanding through knowledge distillation from external knowledge graphs. Specifically, we first use pre-trained models such as BERT, HuBERT, and Swin Transformer to extract text semantic features, audio emotion features, and visual features, respectively. Next, we apply an unbiased scene graph generation approach to convert keyframes from the video into scene graphs, which are then processed by the dynamic scene graph attention network to capture temporal correlations and local semantic variations within the scene graph sequences. Finally, co-attention is used to interactively fuse multimodal features, enabling precise detection of fake news in videos. We conduct extensive experiments on two real-world datasets from short video social platforms, FakeSV and FakeTT. The results show that our method outperforms state-of-the-art baselines, improving accuracy by 1.86% and 2.68% on the two datasets, respectively.

Abstract:
Multi-view representation learning is recognized for its effectiveness in multi-source data analysis, yet it faces significant challenges: 1) Deep model structures remain opaque, lacking interpretability; 2) Research on compatibility models toward multi-feature and multi-relation data is insufficient. In this paper, we introduce an interpretable multi-view representation learning framework specifically designed for the complex multi-view scene. The barrier to achieving compatibility stems from the need to simultaneously process the homogeneous information inherent in multiple features and the heterogeneous characteristic of multi-relation data. To address this, we design an objective function solved by iterative methods to learn comprehensive relations and consistent representation. The introduction of comprehensive relations aims to mitigate mutual interference among different data types while combining information abstracted from original features and relations into a unified representation. We then convert iterative solutions into feed-forward network layers with embedded learnable modules, resulting in a deep network architecture that is interpretable at the design level. Extensive experimental results demonstrate the superior performance of the proposed method over state-of-the-art approaches.

Abstract:
Existing deep-learning-based robust watermarking models generally apply a discriminator to form a generative adversarial network (GAN) that improves the quality of encoded images, and adopt a single encoder to embed the watermark. However, GAN training is unstable, and a single encoder cannot fully adjust the watermarking distribution, which degrades watermarking performance. To address these limitations, this paper presents a multi-encoder approach based on the conditional diffusion model (CDM) for robust image watermarking, namely DiffW. To enhance stability, the CDM-based multi-encoder structure replaces the GAN and optimizes the watermarking distribution iteratively. Specifically, the operation at each timestep in the forward and reverse diffusion processes of the CDM is regarded as an encoder, overcoming the shortcomings of the single-encoder structure. At the training stage, under the guidance of the conditional noisy image, the forward process trains each encoder to fuse the image and watermark to generate high-quality encoded images. During the testing stage, only a small number of trained encoders of the forward process are used, so as to reduce the time complexity. Furthermore, to improve watermarking robustness, the channel attention module (CAM) is designed to extract the main watermark features by mining channel correlations for multi-layer fusion, so that the watermark can be embedded into imperceptible and texture areas. The experimental results show that, compared with existing watermarking models, the proposed DiffW achieves better watermarking invisibility and robustness.

Abstract:
Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.

Abstract:
The key to fine-grained video action recognition is identifying subtle differences between action categories. Relying solely on visual features supervised by action labels makes it challenging to characterize robust and discriminative action dynamics from videos. With significant advancements in human pose estimation and the powerful capabilities of Vision-Language Models (VLMs), obtaining reliable and cost-free human pose data and textual semantics has become increasingly feasible, enabling their effective use in fine-grained action recognition. However, the inherent disparities in feature representations across different modalities necessitate a robust alignment strategy to achieve optimal fusion. To address this, we propose a universal cross-modality knowledge alignment framework, namely UniAlign, to transfer the knowledge from such pre-trained multi-modal models into action recognition models. Specifically, UniAlign introduces two additional branches to extract pose features and textual semantics with the pre-trained pose encoder and VLM. To align the action-relevant cues among video features, pose features, and textual semantics, we propose a Cross-Modality Similarity Aggregation module (CMSA) that utilizes the importance of different modal cues while aggregating cross-modal similarities. Additionally, we adopt a fine-tuning mechanism similar to Exponential Moving Average (EMA) to refine the textual semantics, ensuring that the semantic representations encoded by VLMs are preserved while being optimized towards the specific task preferences. Extensive experiments on widely used fine-grained action recognition benchmarks (e.g., FineGym, NTURGB-D, Diving48) and the coarse-grained K400 dataset demonstrate the effectiveness of the proposed UniAlign method.
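
The abstract above refers to a fine-tuning mechanism "similar to" Exponential Moving Average (EMA) without describing it further. As a minimal sketch of a plain EMA update only, assuming embeddings stored as flat lists of floats, the step below keeps most of the preserved vector and mixes in a small share of the task-tuned one. The function name, the momentum value, and the toy vectors are illustrative assumptions, not UniAlign's actual mechanism.

```python
def ema_update(preserved, tuned, momentum=0.99):
    """Return momentum * preserved + (1 - momentum) * tuned, element-wise."""
    if len(preserved) != len(tuned):
        raise ValueError("vectors must have the same length")
    return [momentum * p + (1.0 - momentum) * t for p, t in zip(preserved, tuned)]


# Hypothetical text embedding from a frozen VLM and its task-tuned counterpart.
vlm_embedding = [1.0, 0.0]
task_embedding = [0.0, 1.0]
refined = ema_update(vlm_embedding, task_embedding, momentum=0.9)
print(refined)
```

A momentum close to 1 keeps the refined vector near the original VLM embedding, which is one way to read the abstract's goal of preserving the encoded semantics while adapting to the task.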

Abstract:
The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation by a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frame, we first integrate it to guide learning more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise generating visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.

Abstract:
Recently, the efficient deployment and acceleration of transformer-based pre-trained models (TPMs) on resource-constrained edge devices for multimedia services have gained significant interest. Although early exiting is a feasible solution, it may lead to extra computational cost and substantial performance degradation compared to the original models. To tackle these issues, we propose a framework termed EEformer, which incorporates global-local heads (GLHs) into intermediate layers to construct the early exiting dynamic neural network (EDNN). The GLH can efficiently extract global and local information from hidden states produced by the backbone layer, thereby achieving a better performance-efficiency trade-off for the EDNN. Moreover, we propose a novel progressive fine-tuning strategy to steadily improve the efficiency of the EDNN while maintaining its performance comparable to the original model through three fine-tuning stages. We conduct extensive experiments on image classification and natural language processing tasks, demonstrating the superiority of the proposed framework. In particular, the proposed framework achieves 1.87× speed-up while maintaining 99.0% performance on the CIFAR-100 dataset, and 3.05× speed-up while maintaining 98.5% performance on the SST-2 dataset.
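
Early exiting, as referred to above, attaches a prediction head to intermediate layers and stops computation once a head is confident enough. The abstract does not state EEformer's exit criterion, so the sketch below assumes a simple rule: exit when the largest class probability reaches a threshold. The toy layers and heads stand in for backbone layers and global-local heads and are purely hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(hidden, layers, heads, threshold=0.9):
    """Run layers in order and stop at the first head whose top probability >= threshold.

    Returns (predicted_class, number_of_layers_executed). If no head is
    confident enough, the prediction of the last head is returned.
    """
    if not layers or len(layers) != len(heads):
        raise ValueError("need one head per layer and at least one layer")
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        hidden = layer(hidden)
        probs = head(hidden)
        if max(probs) >= threshold:
            break
    return probs.index(max(probs)), depth


# Toy stand-ins: each layer adds 1 to a scalar state; each head is a two-class softmax.
layers = [lambda h: h + 1.0] * 3
heads = [lambda h: softmax([h, 0.0])] * 3
print(early_exit_predict(0.0, layers, heads, threshold=0.8))
```

With these toy values the head probabilities for class 0 grow with depth, so a lower threshold exits earlier and a higher one runs more layers, which is the performance-efficiency trade-off the abstract describes.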

Abstract:
Person re-identification (Re-ID) aims to accurately match pairs of person images across different cameras. Existing Re-ID methods primarily focus on associations within single-type camera networks (e.g., ground-ground or sky-sky matching), which are ineffective in addressing the significant viewpoint discrepancies in multi-type camera networks. One key reason for this is the absence of suitable large-scale datasets for algorithm evaluation, which limits the applicability of Re-ID across more diverse scenarios, despite its critical importance. To expand the scope of visual coverage and facilitate search operations in special locations, we construct a novel benchmark: the Multi-Source Sky-Land person Re-ID dataset (MSSL), including 66,928 images from 2,099 volunteers in nearly 20 unique scenes. Additionally, we observe that existing Re-ID systems struggle with drastic viewpoint variations in sky-ground Re-ID, especially on MSSL. To address these issues, we propose Multi-Source Prompts (MSP), separately learning finer cross-modal features of pedestrians from both sky and ground perspectives. These refined features better represent the true appearance of pedestrians from different viewpoints. Subsequently, we employ a Multi-Source Alignment Loss to mitigate the impact of drastic viewpoint changes. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on our MSSL, as well as on other benchmarks such as the AG-ReID dataset. Our MSSL dataset and the code will be available at https://github.com/sysuchx/SkyGroundReID.

Abstract:
Despite demonstrating superior Rate-Distortion (RD) performance, Learning-based Image Compression (LIC) algorithms have been found to be vulnerable to malicious perturbations in recent studies. However, the adversarial attacks considered in existing literature remain divergent from real-world scenarios, both in terms of the attack direction and bitrate. Additionally, existing methods focus solely on empirical observations of the model vulnerability, neglecting to identify the origin of it. These limitations hinder the comprehensive investigation and in-depth understanding of the adversarial robustness of LIC algorithms. To address the aforementioned issues, this paper considers the arbitrary nature of the attack direction and the uncontrollable compression ratio faced by adversaries, and presents two practical rate-distortion attack paradigms, i.e., Specific-ratio Rate-Distortion Attack (SRDA) and Agnostic-ratio Rate-Distortion Attack (ARDA). To the best of our knowledge, we are the first to conduct joint rate-distortion attacks on LIC algorithms. Using the performance variations as indicators, we evaluate the adversarial robustness of eight predominant LIC algorithms against diverse attacks. Furthermore, we propose two novel analytical tools for in-depth analysis, i.e., Entropy Causal Intervention and Layer-wise Distance Magnify Ratio, and reveal that hyperprior significantly increases the bitrate and Inverse Generalized Divisive Normalization (IGDN) significantly amplifies input perturbations when under attack. Lastly, we examine the efficacy of adversarial training and introduce the use of online updating for defense. By comparing their advantages and disadvantages, we provide a reference for constructing more robust LIC algorithms against the rate-distortion attacks.

Abstract:
Multimodal sentiment analysis (MSA) with missing modalities involves understanding the person’s sentiment using multimodal data where some modalities are missing. Most existing methods focus on reconstructing the missing modalities using the available modalities from each sample, relying on modality-common information. However, these methods overlook the modality-specific information that other samples can provide. Additionally, these approaches often require the guidance of full modality representations during the reconstruction process, which is impractical in resource-constrained real-world scenarios. To address these challenges, we propose the Intra-sample and Intra-modal Enhancement (IIE) framework. The IIE framework enhances both sample-level and modality-level representations to capture additional modality-common and modality-specific information from existing modalities, without requiring full modalities. Specifically, IIE first learns sample-level representations by distilling modality-common information from the available modalities into learnable latent units. Then, it enhances modality-level representations by leveraging modality-specific information from other samples with the same modality, which is crucial for improving robustness in the presence of missing modalities. Finally, IIE ensures consistency between the enhanced modality-level and sample-level representations, combining the enhanced and initial representations to make predictions. Extensive experiments on three datasets demonstrate that the IIE framework significantly outperforms existing methods in terms of both effectiveness and robustness in handling MSA with missing modalities.

Abstract:
In recent years, novel view synthesis from a monocular image has become a research hot-spot that attracts significant attention. Some recent work identifies latent vectors for high-quality view generation via iterative optimisation, which is a time-consuming process. In contrast, other work utilises an encoder that learns a mapping function to approximately estimate optimal latent codes, which significantly reduces processing time but sacrifices reconstruction quality. Consequently, balancing synthesis quality and generation efficiency remains challenging. In this paper, we propose a residual-based encoder combined with a 3D Generative Adversarial Network (GAN), named ReE3D, for novel view synthesis. It applies an iterative prediction of latent codes to ensure much higher quality of novel view synthesis with an insignificant increase in processing time compared to existing encoder-based 3D GAN inversion methods. Additionally, we enforce a novel geometric loss constraint on the encoder to predict view-invariant latent codes, thus effectively mitigating the trade-off between geometric and texture quality in 3D GAN inversion. Extensive experimental results demonstrate that our extended encoder-based method achieves the best trade-off between novel view synthesis quality and execution time. Our method attains synthesis quality comparable to iterative optimisation methods with exponentially decreased processing time, while significantly improving on the synthesis performance of encoder-based methods.

Abstract:
The rapid emergence of the Metaverse requires higher network throughput and lower latency to deliver immersive and responsive virtual experiences. Traditional centralized data processing approaches are constrained by limited computational and bandwidth resources when handling large-scale user data. A Cloud-Edge-End transmission architecture is proposed in this study, tailored for Metaverse scenarios to optimize resource allocation, minimize latency, and enhance rendering efficiency. A real-time trajectory segment prediction scheme (FDK) is developed, which combines FastDTW with K-means by leveraging user behavior trajectories to determine subscene popularity and store popular subscenes on GPU servers, thereby reducing user wait time. A two-tier cache optimization scheme (MAE2C) is also proposed, incorporating GCN for subscene feature identification. GPU servers employ the MADDPG strategy to cache popular subscenes, while edge servers utilize DDPG to cache missed scenes. This approach effectively reduces cloud access and cache replacement frequency. Simulation results demonstrate that the subscene cache hit rate of the MAE2C scheme significantly outperforms existing methods across various cache capacities, with a 6.9% reduction in cache replacement frequency. This research provides effective technical support for Metaverse scene rendering and offers insights into the development of generative Metaverse systems.
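
The FDK scheme above pairs FastDTW with K-means to group user trajectories. FastDTW is an approximation of dynamic time warping (DTW), so the distance it estimates can be illustrated with the classic quadratic-time DTW recurrence. The sketch below shows that exact recurrence only, not FastDTW and not the FDK scheme, and it assumes trajectories reduced to 1-D numeric sequences with an absolute-difference step cost.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # table[i][j] = cheapest cost of aligning the first i items of a with the first j of b.
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # repeat b[j-1]
                                     table[i][j - 1],      # repeat a[i-1]
                                     table[i - 1][j - 1])  # advance both
    return table[n][m]


# Hypothetical trajectories: the second pauses for one extra step at value 2.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because warping absorbs differences in pacing, two users who follow the same path at different speeds get a small distance, which is the property that makes a DTW-style measure a natural input for clustering trajectories.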

Abstract:
Adversarial examples are well known to pose a security risk when attacking deep learning models. However, most existing adversarial attacks are designed to attack a single deep learning-based task, such as image classification. In practical scenarios, it is more important to study adversarial examples that transfer across different vision tasks. It is challenging, though, to create cross-task adversarial examples that can destroy multiple vision tasks at once, because the various task-specific models and loss functions are unavailable to attackers. To deal with this problem, we propose a Dual Attention-Guided Method (DAGM) for crafting cross-task adversarial examples by designing a spatial attention module and a channel attention module to capture overlapping discriminative regions and features that contribute to various tasks. We then craft cross-task adversarial examples by reducing the dispersion (i.e., standard deviation) of feature maps re-weighted by both attention modules, which can destroy the overlapping discriminative regions and features for various tasks. Furthermore, to provide a theoretical explanation, we systematically analyze our method and rigorously prove that both attention modules improve the effectiveness of our adversarial examples compared with existing cross-task adversarial attacks. Extensive experiments on two datasets demonstrate that our method can significantly degrade the performance of various tasks, even online CV APIs, and consistently outperforms state-of-the-art methods by a large margin.
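
The attack objective described above is the dispersion (standard deviation) of attention-re-weighted feature maps, which the attacker tries to reduce. As a minimal sketch of that quantity only, assuming a feature map stored as nested lists of shape C x H x W and a purely multiplicative re-weighting by one channel weight and one spatial weight, the value can be computed as follows. How DAGM actually combines its two attention modules is not given in the abstract, so the re-weighting rule and all names here are assumptions.

```python
import math


def reweighted_dispersion(feature_map, spatial_attention, channel_attention):
    """Standard deviation of a C x H x W feature map after attention re-weighting."""
    values = []
    for c, channel in enumerate(feature_map):
        for y, row in enumerate(channel):
            for x, value in enumerate(row):
                values.append(value * channel_attention[c] * spatial_attention[y][x])
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Hypothetical 1 x 2 x 2 feature maps with uniform attention weights.
uniform_spatial = [[1.0, 1.0], [1.0, 1.0]]
textured = [[[1.0, 3.0], [1.0, 3.0]]]
flattened = [[[2.0, 2.0], [2.0, 2.0]]]
print(reweighted_dispersion(textured, uniform_spatial, [1.0]),
      reweighted_dispersion(flattened, uniform_spatial, [1.0]))
```

A perturbation that pushes this value towards zero makes the weighted activations nearly uniform, removing the contrast that downstream tasks rely on.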

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University, Beijing, China; School of Mathematical Sciences, Peking University, Beijing, China; Department of Computer Science, University of California, Los Angeles, CA, USA; Terminus Group, Beijing, China; School of Information Technology and Management, University of International Business and Economics, Beijing, China

Abstract:
Graph neural networks (GNNs) have emerged as powerful tools for graph classification tasks. However, contemporary graph classification methods are predominantly studied in fully supervised scenarios, while there could be label ambiguity and noise in real-world applications. In this work, we explore the weakly supervised problem of partial label learning on graphs, where each graph sample is assigned a collection of candidate labels. A novel method called Distribution Divergence-based Graph Contrast (DEER) is proposed to address this issue. The core of DEER is to measure the divergence among the underlying semantic distributions in the hidden space; this metric enables the identification of accurate positive graph pairs for effective graph contrastive learning. Specifically, we generate graph representations of augmented graph views that retain semantics and can be regarded as samples from the underlying semantic distributions. We employ a non-parametric metric to measure distribution divergence, which is then combined with pseudo-labeling to generate unbiased and target-oriented graph pairs. Furthermore, we introduce a label-correction method to eliminate noisy candidate labels, updating target labels using posterior distributions in a soft manner. Comprehensive experiments on various benchmarks demonstrate the superiority of our DEER in different settings compared to a range of state-of-the-art baselines.

Abstract:
We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring their effects on people, such as the emotions they evoke within a context. To fill this gap, we propose the affective soundscape captioning (ASSC) task, which enables automated soundscape analysis, thus avoiding labour-intensive subjective ratings and surveys in conventional methods. With soundscape captioning, context-aware descriptions are generated for soundscape by capturing the acoustic scenes (ASs), audio events (AEs) information, and the corresponding human affective qualities (AQs). To this end, we propose an automatic soundscape captioner (SoundSCaper) system composed of an acoustic model, i.e. SoundAQnet, and a large language model (LLM). SoundAQnet simultaneously models multi-scale information about ASs, AEs, and perceived AQs, while the LLM describes the soundscape with captions by parsing the information captured with SoundAQnet. SoundSCaper is assessed by two juries of 32 people. In expert evaluation, the average score of SoundSCaper-generated captions is slightly lower than that of two soundscape experts on the evaluation set D1 and the external mixed dataset D2, but the difference is not statistically significant. In layperson evaluation, SoundSCaper outperforms soundscape experts in several metrics on datasets D1 and D2. In addition to human evaluation, compared to other automated audio captioning (AAC) systems with and without LLM, SoundSCaper performs better on the ASSC task in several natural language processing (NLP) based metrics. Overall, SoundSCaper performs well in human subjective evaluation and various objective captioning metrics, and the generated captions are comparable to those annotated by soundscape experts.
The model, source code, LLM scripts, human assessment data, instructions, and evaluation statistics are all publicly available.

Abstract:
The development of Deep Neural Networks (DNNs) has enabled AI-driven models to excel in recognizing a limited set of classes within static environments. As AI systems progress, few-shot class-incremental learning (FSCIL) aims to expand their understanding of novel classes from minimal samples while retaining knowledge of previously encountered ones. However, most existing FSCIL models face significant challenges, including inadequate adaptability and catastrophic forgetting, which hinder their ability to maintain robust forward and backward learning capabilities. To address these issues, this paper proposes a novel Forward-Backward Knowledge Transfer (FBKT) paradigm, which strategically integrates forward distribution adaptation (FDA) and backward semantic alignment (BSA) mechanisms to achieve bidirectional adaptability in knowledge transfer. The FDA mechanism enhances forward adaptability by expanding and reserving the embedding space for new classes using semantic-irrelevant masked images as virtual negative classes, thereby mitigating data overfitting. It also employs self-supervised representation learning to utilize semantic-relevant local embeddings as additional positive samples, fostering class separation and generalization. Meanwhile, the BSA mechanism ensures the semantic consistency of previously learned classes across sessions during class-incremental learning, promoting smoother backward adaptability and reducing model degradation. Extensive experiments conducted on multiple benchmark datasets consistently highlight the superior performance and effectiveness of our FBKT compared to state-of-the-art methods.

Affiliations: School of Software, Nanchang University, Nanchang, China; Institute of Information Science, Beijing Jiaotong University, Beijing, China; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China

Abstract:
Reversible data hiding (RDH) for JPEG images, particularly those focusing on DCT coefficient modification, has garnered significant attention in recent years. Existing methods primarily select coefficients valued ±1 for expansion embedding to avoid significant file size increases caused by modifying zero-valued DCT coefficients. However, zero-valued coefficients, which constitute the majority of DCT coefficients, are more suitable for data embedding to reduce the shift distortion. To efficiently utilize zero-valued coefficients for high-capacity embedding while controlling the file size increment, this paper introduces a novel JPEG RDH method based on ternary matrix embedding, where ternary syndrome trellis codes (STC) are employed on selected zero-valued coefficients to minimize the expansion embedding distortion, and other non-zero-valued coefficients are shifted for reversibility. Furthermore, a novel DCT coefficient measurement strategy is proposed for coefficient selection to further reduce the shift distortion. Extensive experimental validations demonstrate the superiority of the proposed method in various evaluation criteria. Notably, the proposed method achieves more than twice the embedding capacity of some state-of-the-art methods at the same PSNR while maintaining file size increment within acceptable bounds.

Abstract:
Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model’s effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.

Abstract:
Recent methods to enhance Vision-Language Models (VLMs) for Visual Question Answering (VQA) have focused on strengthening their inference capabilities, enabling them to tackle VQA tasks independently rather than merely as aids to Large Language Models (LLMs). However, these approaches often ignore the rich commonsense knowledge inside the given VQA image sampled from the real world, limiting the full potential of VLMs. Inspired by the human top-down reasoning process, i.e., systematically exploring relevant issues to derive a comprehensive answer, this work introduces a novel, explainable multi-agent collaboration framework by leveraging the expansive knowledge of LLMs to enhance the capabilities of VLMs themselves. Our framework comprises three agents, i.e., Responder, Seeker, and Integrator, to collaboratively answer the given VQA question by seeking its relevant issues and generating the final answer in such a top-down reasoning process. The VLM-based Responder agent generates the answer candidates for the question and responds to other relevant issues. The Seeker agent, primarily based on an LLM, identifies relevant issues related to the question to inform the Responder agent and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the built-in world knowledge of the LLM. The Integrator agent combines knowledge from the Seeker agent and the Responder agent to produce the final VQA answer. Extensive and comprehensive evaluations on diverse VQA datasets with a variety of VLMs demonstrate the superior performance and interpretability of our framework over the baseline method, e.g., 5.7% improvement on VQA-RAD and 5.2% on Winoground in the zero-shot setting without extra training cost.

Abstract:
In the realm of emotion recognition concerning continuous temporal sequence data, scholars have delved into various effective integration strategies from multiple perspectives, yielding commendable results. The majority of these studies have comfortably relied on Long Short-Term Memory (LSTM) networks to extract features from both video and audio, often overlooking the thorough extraction of underlying features prior to integration. We have entirely eschewed convolutional and recurrent architectures, opting instead to design a simple, stackable Temporal Slicing Encoder (TSE) to distill temporal characteristics. Empirical evidence from two sentiment analysis datasets demonstrates that the TSE module excels in the extraction of emotional features. Building upon this foundation, we have further explored modality interaction, addressing cross-modal data activation and synergy optimization between different features, devising the Deep Bimodal Information Transfer Module (DBIT) and the Dynamic Synergy Optimization Network (DSON), which, in conjunction with the TSE module, form our TASE-Net (Temporal Attention Synergy Emotion Network). The DBIT module establishes a cross-attention mechanism guided by mutual information to facilitate text-guided cross-modal data activation, while the DSON module achieves adaptive emotional feature confidence allocation and knowledge transfer between trimodal and unimodal features through an Emotional Weight Adjuster (EWA) and an Asymmetric Bidirectional Distillator (ABD). Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets substantiate the efficacy and advancement of our TASE-Net and TSE encoder.

Abstract:
In interactive point cloud segmentation, users can achieve higher accuracy object masks than in instance segmentation by performing limited positive and/or negative clicks on the objects of interest in the scene. Existing methods often employ sparse click representations, leading the model to focus more on local detail features around the click points and failing to fully exploit the guidance information provided by each click, thus impacting the click effectiveness. We utilize a dense representation that reflects spatial distance relationships, known as the distance map, as the click channel to tackle the sparsity problem of click representation in current approaches. Based on the distance map, we introduce ClickEnhance, which is designed to maximize the guiding impact of each click. The proposed method encompasses the design of a click-specific encoder and the utilization of contrastive learning. The Click-Specific Encoder ensures that the network can adequately consider the influence of individual clicks during the feature encoding phase. Contrastive learning, on the other hand, reduces the feature distance between the click points and the target object, thus simplifying the subsequent segmentation process. Experimental results demonstrate that the ClickEnhance method markedly improves segmentation performance across multiple datasets, exhibiting superior generalization capabilities on challenging datasets compared to the state-of-the-art methods. This allows for the generation of high-precision object-level masks with fewer interactions, indicating great potential for practical applications.
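As a rough illustration of the dense click representation described above, the following Python sketch computes a distance map over a point cloud: each point receives the Euclidean distance to its nearest click. The function name `click_distance_map` and the optional truncation radius `max_dist` are assumptions made for this example; the abstract does not specify how ClickEnhance builds or normalizes its distance map.

```python
import numpy as np

def click_distance_map(points, clicks, max_dist=None):
    """Return, for every point, the distance to its nearest click.

    points:   (N, 3) array of point coordinates.
    clicks:   (K, 3) array of click coordinates, K >= 1.
    max_dist: hypothetical truncation radius; distances are clipped to it.
    """
    points = np.asarray(points, dtype=float)
    clicks = np.asarray(clicks, dtype=float)
    # Pairwise differences via broadcasting, shape (N, K, 3).
    diff = points[:, None, :] - clicks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # shape (N, K)
    nearest = dist.min(axis=1)             # shape (N,)
    if max_dist is not None:
        nearest = np.minimum(nearest, max_dist)
    return nearest

# Toy scene: three points and a single positive click at the origin.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 4.0]])
positive_clicks = np.array([[0.0, 0.0, 0.0]])
pos_channel = click_distance_map(points, positive_clicks)
clipped_channel = click_distance_map(points, positive_clicks, max_dist=2.0)
```

Unlike a sparse one-hot click mask, every point in `pos_channel` carries a value, so points far from a click still receive guidance. A separate channel could be computed in the same way for negative clicks.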

Abstract:
Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks.

Affiliations: Harbin Institute of Technology Zhengzhou Research Institute, Zhengzhou, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China; Department of Computer Science, City University of Hong Kong, Hong Kong; School of Cyber Science and Technology, Sun Yat-Sen University, Guangdong, China; Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan; VI Department, Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore

Abstract:
Recent research employs Mamba for image restoration, yet the intrinsic coupling between deraining characteristics and Mamba architectures remains underexplored. We propose VDMamba, a vector decomposition-based vision Mamba approach that leverages 1D sequential representations to characterize direction-aware rain distributions in the frequency embedding space. The core innovation is the Mamba-based Vector Decomposition and Synthesis Module (VDSM). VDSM derives vertical and horizontal 1D vectors from frequency components and utilizes single-direction Mamba scanning to eliminate direction-specific perturbations. This enables the exploration of global relationships for accurate learning without complex scanning designs. Additionally, these components are encoded via bidirectional coupling for refinement. Experiments on various tasks, including deraining, dehazing, and low-light enhancement, demonstrate VDMamba's competitive performance. Specifically, it achieves a 0.58 dB PSNR improvement in deraining compared to the NeRD method, while reducing model parameters by 94.3%, computational cost by 88.3%, and inference time by 77.5%.

Abstract:
Federated learning (FL) is well-suited for multimodal tasks due to its ability to protect privacy and support local training. However, the complexity of real-world sensor environments causes modality heterogeneity across clients. Some modalities may be missing altogether, making it difficult to construct a generalized global model. Existing multimodal federated learning methods often address modality-missing scenarios under simplified assumptions of modality heterogeneity, typically focusing on unimodal clients and modality-complete multimodal clients. Moreover, to mitigate performance degradation caused by missing modalities, some approaches assume the availability of auxiliary information at the server, which may be impractical in real-world scenarios. Therefore, we propose a novel heterogeneous multimodal Federated Learning with Mask-Restoration and Self-Guidance (FL-MRSG). The Mask-Restoration employs a masking strategy to simulate missing data during feature extraction, enabling the network to learn semantic features of missing modality. Furthermore, we introduce an innovative self-guidance mechanism that leverages the restored data as guidance information, enabling the network to distinguish between complete and missing data representations. In addition, we propose a personalized decoupled aggregation strategy to facilitate the collaborative training of a global model across heterogeneous modality clients. We extend the multimodal test set to arbitrary modality combinations to evaluate the robustness of the global model. Extensive experiments on MOSI and SIMS datasets demonstrate the effectiveness of the proposed FL-MRSG for arbitrary missing modalities.

Abstract:
Multiple approaches aim to enhance user experience in the delivery of immersive video content. The popularisation of VR, combined with recent advances in mulsemedia technology, has improved access to immersive visual and olfactory stimuli. Synchronising multiple scent dispensers positioned around the user when watching 360° videos can more accurately indicate the location of scent sources, guiding users to turn their heads toward the indicated directions. However, the manual annotation process required to add mulsemedia effects is labour-intensive, limiting the availability of content with sensory enhancements, particularly when using multiple scent dispensers from various directions. Addressing this issue, this paper introduces OmniScent-CNN, an innovative solution to automate the diffusion of scents from different directions in a VR environment using Convolutional Neural Networks (CNNs) for scene recognition. Multiple instances of the solution were tested, employing a number of CNN architectures. The results demonstrated that olfaction accuracy can reach up to 71.28% with the ResNet-18 model. Furthermore, user perceptual tests revealed excellent results, with 87.5% of participants agreeing or strongly agreeing that the scents enhanced their enjoyment of the experience. This indicates the feasibility of automating the process of synchronising omnidirectional scents based on 360° scene recognition.

Abstract:
Robust reversible watermarking (RRW) techniques have been proposed in the literature to protect the copyrights of high-fidelity digital images while achieving robustness, reversibility, invisibility, and large capacity simultaneously. Most studies on RRW have been designed to resist common signal processing (CSP) attacks, but only a few can withstand both CSP and geometric deformation (GD) attacks. To address this problem, this study proposes a novel RRW method using a fractional-order polar complex exponential transform (FrPCET) and optimized quantization index modulation (QIM). Specifically, the optimal fractional parameter of the FrPCET is determined through numerical simulation experiments using the criterion of minimum image reconstruction errors. The stability of FrPCET moments against CSP is evaluated by performing attack simulation tests on 500 images, revealing that differences between specific pairs of FrPCET moments exhibit similar variation patterns under attacks, thus making them suitable for use as embedding carriers. Then, the watermark is embedded by optimizing a conventional QIM, which improves the robustness of the watermark under the same image quality conditions. The distortions caused by watermark embedding and the hash sequences used for integrity authentication are subsequently taken as the auxiliary information and are reversibly embedded via the prediction error expansion-histogram shift method. After receiving the watermarked image, the receiver performs an inverse operation to recover both the watermark and the original image in the absence of attacks; otherwise, it only extracts the watermark. Extensive simulation experiments demonstrate that the proposed method has greater robustness against various CSP and GD attacks than do state-of-the-art methods under the same embedding capacity and invisibility. This indicates the feasibility and effectiveness of the proposed scheme.
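For readers unfamiliar with QIM, the following Python sketch shows the conventional scalar form that the abstract says is being optimized. It is not the optimized variant proposed in the paper, and it operates on a generic scalar carrier rather than on differences of FrPCET moments. The function names and the step size `delta` are assumptions made for this example.

```python
import numpy as np

def qim_embed(x, bit, delta):
    """Quantize x onto the lattice associated with `bit`.

    Bit 0 uses multiples of delta; bit 1 uses the same lattice
    shifted by delta / 2.
    """
    offset = bit * delta / 2.0
    return np.round((x - offset) / delta) * delta + offset

def qim_extract(y, delta):
    """Decode by choosing the lattice whose nearest point is closest to y."""
    d0 = abs(y - qim_embed(y, 0, delta))
    d1 = abs(y - qim_embed(y, 1, delta))
    return 0 if d0 <= d1 else 1

delta = 2.0
carrier = 3.7
marked0 = qim_embed(carrier, 0, delta)   # nearest multiple of delta
marked1 = qim_embed(carrier, 1, delta)   # nearest point on the shifted lattice
# Perturbations smaller than delta / 4 do not change the decoded bit.
recovered = [qim_extract(marked1 + noise, delta) for noise in (-0.4, 0.0, 0.4)]
```

A larger `delta` tolerates stronger attacks at the cost of larger embedding distortion, which is the robustness versus image quality trade-off the abstract refers to.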

Abstract:
Fusing visible (RGB) and thermal (T) images for RGBT tracking has received growing interest in the field of computer vision. However, how to improve the robustness of the tracker to target scale variation, effectively apply visual prompts to multimodal tracking tasks, and enhance the multimodal fusion effectiveness are still urgent challenges in the field of RGBT tracking. To this end, this work proposes an RGBT tracking framework integrating scale-aware dilation attention, multimodal prompt interaction learning, and a cross-fusion adapter, named MPANet. Firstly, a scale-aware dilation attention (SADA) module is put forward to enhance the flexibility of the tracker in the presence of target scale variations by embedding convolutions with different dilation rates into the self-attention. Subsequently, a multimodal prompt interaction learning (MPIL) module is constructed, which combines global token adaptive attention and spatial attention to efficiently learn visual prompts from different modalities and achieve intermodal prompt interactions. Finally, a cross-fusion adapter (CFA) is developed to facilitate the adaptability of the network to different modalities in the process of multimodal information fusion through the adapter mechanism. Extensive experiments on public RGBT benchmark tracking datasets such as GTOT, RGBT234, LasHeR and VTUAV demonstrate that the proposed method outperforms existing advanced trackers and achieves state-of-the-art performance.

Abstract:
The presence of noise in acquired data invariably leads to performance degradation in cross-modal matching. Unfortunately, obtaining precise annotations in the multimodal field is expensive, which has prompted some methods to tackle the mismatched data pair issue in cross-modal matching contexts, termed noisy correspondence. However, most of these existing noisy correspondence methods exhibit the following limitations: a) the problem of self-reinforcing error accumulation, and b) improper handling of noisy data pairs. To tackle the two problems, we propose a generalized framework termed Rank corrElation and noisy Pair hAlf-replacing wIth memoRy (REPAIR), which benefits from maintaining a memory bank for features of matched pairs. Specifically, we calculate the distances between the features in the memory bank and those of the target pair for each respective modality, and use the rank correlation of these two sets of distances to estimate the soft correspondence label of the target pair. Estimating soft correspondence based on memory bank features rather than using a similarity network can avoid the accumulation of errors due to incorrect network identifications. For pairs that are completely mismatched, REPAIR searches the memory bank for the most matching feature to replace one feature of one modality, instead of using the original pair directly or merely discarding the mismatched pair. We conduct experiments on three cross-modal datasets, i.e., Flickr30K, MS-COCO, and CC152K, proving the effectiveness and robustness of our REPAIR on synthetic and real-world noise.
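The soft-label estimation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Spearman rank correlation (the abstract says only "rank correlation"), Euclidean distance, and clips negative correlations to zero to obtain a label in [0, 1]. The names `rank_of` and `soft_correspondence` are hypothetical.

```python
import numpy as np

def rank_of(values):
    """Ranks 0..n-1 by ascending value (ties broken by position)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def soft_correspondence(img_feat, txt_feat, bank_img, bank_txt):
    """Estimate a soft correspondence label for one (image, text) pair.

    bank_img[i] and bank_txt[i] are features of the i-th matched pair
    stored in the memory bank.
    """
    # Distances from the target to every bank entry, per modality.
    d_img = np.linalg.norm(bank_img - img_feat, axis=1)
    d_txt = np.linalg.norm(bank_txt - txt_feat, axis=1)
    # Spearman correlation = Pearson correlation of the ranks.
    rho = np.corrcoef(rank_of(d_img), rank_of(d_txt))[0, 1]
    # Assumption: negative correlation is treated as "no correspondence".
    return max(float(rho), 0.0)

# Toy memory bank: text features are a scaled copy of image features.
bank_img = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
bank_txt = 2.0 * bank_img

matched = soft_correspondence(np.array([0.4, 0.0]), np.array([0.8, 0.0]),
                              bank_img, bank_txt)
mismatched = soft_correspondence(np.array([0.4, 0.0]), np.array([7.2, 0.0]),
                                 bank_img, bank_txt)
```

In the toy data, a well-matched pair sees the bank entries in the same order from both modalities, so its label is close to 1, while a mismatched pair sees them in opposite orders and is clipped to 0. No learned similarity network is involved, which is the property the abstract credits with avoiding error accumulation.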

Abstract:
Current gait recognition methods heavily rely on various gait representations (e.g., silhouette sequences) generated by task-specific, supervised upstream processes, which inevitably incur high annotation costs and the risk of cumulative errors. Recently, generic knowledge from task-agnostic large visual models (LVMs) has been successfully applied to gait recognition, freeing the field from such dependencies. However, this approach does not address challenges posed by traditional cameras in handling scenarios with low latency, high speed, and high dynamic range. In this paper, we introduce EdinoGait, a novel and effective gait recognition framework that leverages event-based LVMs to overcome the scarcity of large-scale event-based datasets. Specifically, due to the distinct modality gap between image and event data and the lack of large-scale datasets, transferring LVMs to event-based vision is non-trivial. To address this, we introduce a novel event encoder that mitigates the modality gap through event prompts and a CLS patch contrastive loss. Subsequently, we design an autoencoder-based dual-alignment module to eliminate background noise brought by LVMs while preserving the motion details provided by event data. Additionally, to promote the application of event cameras in gait recognition, we collect the first semi-indoor, multi-view gait dataset captured by the DAVIS346 event camera. This dataset comprises 6,150 sequences (two modalities: grayscale images and event streams) of 41 subjects captured under two lighting conditions and five view angles (0°, 45°, 90°, 135°, and 180°). Specifically, for each lighting condition and viewing angle, there are six sequences representing normal walking (NM), three representing walking with a backpack (BG), three with a portable bag (PT), and three with a coat (CL).
Comprehensive experiments conducted on our event-based gait dataset and EV-CASIA-B demonstrate that EdinoGait significantly outperforms frame-based LVMs. Notably, under low-light conditions, the recognition accuracy of frame-based LVMs declines sharply, while EdinoGait exhibits robust performance.
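The stated dataset size can be cross-checked against the per-condition breakdown given in the abstract. The short Python sketch below multiplies out the counts; the variable names are illustrative, and it assumes each recording is counted once regardless of modality.

```python
# Counts taken from the abstract's description of the dataset.
subjects = 41
lighting_conditions = 2
view_angles = 5  # 0, 45, 90, 135, and 180 degrees

# Sequences per subject for one lighting condition and one view angle.
sequences_per_setting = {"NM": 6, "BG": 3, "PT": 3, "CL": 3}
per_setting_total = sum(sequences_per_setting.values())

total_sequences = subjects * lighting_conditions * view_angles * per_setting_total
print(total_sequences)
```

The product, 41 × 2 × 5 × 15, agrees with the 6,150 sequences reported.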

Abstract:
Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.

Abstract:
Recent advancements in 3D reconstruction coupled with neural rendering techniques have greatly improved the creation of photo-realistic 3D scenes, influencing both academic research and industry applications. The technique of 3D Gaussian Splatting (3DGS) and its variants incorporate the strengths of both primitive-based and volumetric representations, achieving superior rendering quality. While 3DGS and its variants have advanced the field of 3D representation, they fall short in capturing the stochastic properties of non-local structural information during the training process. Additionally, the initialisation of spherical functions in 3DGS-based methods often fails to engage higher-order terms in early training rounds, leading to unnecessary computational overhead as training progresses. Furthermore, current 3DGS-based approaches require training on higher resolution images to render higher resolution outputs, significantly increasing memory demands and prolonging training durations. We introduce StructGS, a framework that enhances 3DGS for improved novel-view synthesis in 3D reconstruction. StructGS innovatively incorporates a patch-based SSIM loss, dynamic spherical harmonics initialisation and a Multi-scale Residual Network (MSRN) to address the above-mentioned limitations, respectively. Our framework significantly reduces computational redundancy, enhances detail capture and supports high-resolution rendering from low-resolution inputs. Experimentally, StructGS demonstrates superior performance over state-of-the-art (SOTA) models, achieving higher quality and more detailed renderings with fewer artifacts.

Abstract:
Scene Text Recognition (STR) is challenging in extracting effective character representations from visual data when text is unreadable. Permutation language modeling (PLM) is introduced to refine character predictions by jointly capturing contextual and visual information. However, in PLM, the use of random permutations causes training fit oscillation, and the iterative refinement (IR) operation also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance position-context-image interaction capability, improving autoregressive LM generalization. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies, enhancing the correlation between visual information and context. Adaptive correlation representation helps the model avoid training fit oscillation. Second, the Cross-modal Hierarchical Attention mechanism (CHA) is introduced to capture the dependencies among position queries, contextual semantics and visual information. CHA enables position tokens to aggregate global semantic information, avoiding the need for IR. Extensive experimental results show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.

Abstract:
Deep neural networks are highly effective at transforming sparse and unstructured data into dense and semantic representations, demonstrating strong capabilities in object detection tasks. However, their performance often diminishes when detecting small-sized objects due to the loss or corruption of critical information during feature extraction. To address this challenge, DPAKS is introduced, a reliable DETR-like detector for small objects enhanced with directional prior auxiliary knowledge to guide the model’s focus on small objects. In the decoder of DPAKS, a small denoising training strategy is employed that reduces the interference of noisy queries generated from real small objects. This approach effectively learns the features of small objects during the denoising process, sharpening the model’s attention to small-sized objects. Additionally, to enhance the reliability of DPAKS’s backbone, an auxiliary branch is introduced that provides supervision via shorter paths, improving the optimization of low-level feature parameters. This branch facilitates the transmission of gradient information suited for small objects without interfering with the detection of other sized objects. Furthermore, a new supervision head is proposed and added to the detection head of DPAKS, which categorizes object sizes based on artificial prior knowledge. This guides the model to effectively learn size categories and become more sensitive to small objects. Remarkably, DPAKS achieves competitive performance in small object detection without imposing additional computational burdens at the inference stage.

Abstract:
Confusable structure segmentation (CSS) is a type of semantic segmentation applied in remote sensing sea fog detection, medical image segmentation, camouflaged object detection, etc. Structural similarity and visual ambiguity are two critical issues in CSS that pose difficulties in distinguishing foreground objects from the background. Current methods focus primarily on enhancing visual representations and do not often incorporate multimodal information, which leads to performance bottlenecks. Inspired by recent achievements in vision-language models, we propose Vision-Language Mutual Prompting (VLMP), a novel and unified language-guided framework that leverages text prompts to enhance CSS. Specifically, VLMP consists of vision-to-language prompting and language-to-vision prompting, which bidirectionally model the interactions between visual and linguistic features, thereby facilitating cross-modal complementary information flow. To prevent the predominance of one modality over another, we design a feature integration modulator that modulates and balances feature weights for adaptive multimodal fusion. Our framework is designed to be modular and flexible, allowing for integration with any backbone, including CNNs and transformers. We evaluate VLMP with three diverse datasets: SFDD-H8, QaTa-COV19, and CAMO-COD10K. Extensive experiments demonstrate the effectiveness and superiority of the proposed framework over those of state-of-the-art methods across these datasets. This shift from basic sight to deeper insight in CSS through vision-language integration represents a significant advancement in the field.

Abstract:
Many complex systems prevalent in nature and society, from particle physics systems to social networks and team sports, can be viewed as dynamical interacting systems. Understanding the underlying interactions of agents in the system is the key task for predicting future behaviors of agents, which can be applied in various applications, e.g., autonomous vehicles and smart video surveillance. Since the interaction patterns between agents in the system can be dynamic and heterogeneous rather than fixed and homogeneous, it is very challenging to model interacting systems. In this paper, we design a novel graph structure called Signed Relation Graph (SRG) to model dynamical interacting systems. Since collective behaviors are very common in real-world scenes, our method is a group-based model that takes heterogeneous relationships between agents into consideration and jointly models inter-group and intra-group interactions. To assign signs on SRG, an unsupervised method called Relationship Reasoning Network is proposed. The relationship categories are reasoned explicitly, which makes it possible to handle multi-agent systems with multiple and dynamic interactions. Further, Group Interaction Attention Graph Neural Network is proposed to aggregate information on SRG, which not only reasons about the intensity of different interaction patterns but also models the trade-off between inter-group interactions and intra-group interactions. Our interacting systems modeling method can be used to predict multi-agent future trajectories in a variety of scenes with hard scenarios, including dense and drastic scenarios. Experimental results on three widely used human trajectory prediction datasets, including ETH and UCY in traffic scenes and NBA SportVU in sports scenes, demonstrate the effectiveness of our proposed model.

Abstract:
With the development of deep learning in recent years, the performance of object detection under conventional cameras has been significantly improved. Nevertheless, due to the distortion caused by fisheye cameras, detecting objects in this scenario remains a significant challenge. The dominant approaches focus on modifying the shape of the bounding box to better align the boundaries of the distorted object. However, these methods neglect the learning of spatial distortion information, which prevents them from achieving satisfactory results. In this paper, we propose a novel fisheye camera detection network, dubbed SDANet, to better learn distortion features. SDANet is composed of a series of SDABlocks, which are designed to learn spatial distortion features. Each SDABlock consists of multiple convolution kernels of different sizes, and it can generate the most suitable kernel based on the current input's distortion characteristics. Moreover, to address the performance limitations caused by the scarcity and uneven spatial distribution of fisheye image datasets, we propose a dedicated data augmentation strategy called Prominent Fisheye Distortion Augmentation (PFDAug). PFDAug can further introduce distortions to fisheye images, effectively alleviating these problems. Experimental results on the CEPDOF, MW-R, HABBOF, LOAF, and FishEye8k fisheye image datasets demonstrate that our method achieves state-of-the-art performance.

Abstract:
Natural language-guided drone geo-localization (DGL) provides an intuitive and scalable mode of human-drone interaction for tasks such as search, rescue, and surveillance. Recent Vision-Language Models (VLMs) can learn semantic correspondences between text and images during fine-tuning. However, their performance in DGL tasks remains constrained, as complex instructions and cluttered scenes often cause semantic dilution and granularity mismatch, leading to weak cross-modal alignment. Consequently, the models struggle with ambiguous targets and suffer from reduced localization accuracy. To address these challenges, we propose SAA-DGL, a framework for interpretable language-guided Drone Geo-Localization that enriches Semantic Attribute Alignment (SAA) with large language models (LLMs). It introduces two parameter-free cross-modal fusion modules: (1) the LLM-driven Cross-modal Semantic Attribute Enrichment (LCSAE) module, which extracts fine-grained attributes (e.g., color, shape, position) from text and embeds them into visual features as explicit semantic anchors, producing semantically enriched cross-modal representations; and (2) the Bidirectional Feature Alignment (BFA) module, which builds fusion relationships between visual and textual features via similarity-driven mechanisms, enabling effective integration of enriched visual and textual information. This design improves cross-modal consistency and interpretability while preserving pretrained alignment priors and enhancing training stability. Experiments on the GeoText-1652 benchmark show that SAA-DGL achieves state-of-the-art performance and strong robustness under complex visual and linguistic disturbances, validating its effectiveness for challenging geo-localization scenarios.

Abstract:
How to identify endangered bird species in complex outdoor environments has attracted significant attention in the fields of computer vision and machine learning. Previous studies on fine-grained bird image classification (FBIC) face numerous challenges, such as environmental occlusions and arbitrary postures, which limit the accuracy and robustness of existing methods. To address these challenges and enable more reliable bird species identification in extreme outdoor conditions, we propose a novel skeletal cues-aware bone point relationship learning for efficient FBIC via Transformers (SkeFormer). To the best of our knowledge, this is the first time skeletal relationships have been introduced to the FBIC task. Our model introduces three key modules: the skeletal relationship mining (SRM) module, the multilevel feature generation (MFG) module, and the key feature selection (KFS) module. Specifically, in SRM, the model mines the skeletal relationships among different bird species. In MFG, multiscale information is aggregated by connecting features across multiple layers. The KFS module selects key immutable regions of birds based on the learned skeletal relationships. Extensive experiments on two benchmark datasets, CUB-200-2011 and NABirds, show that SkeFormer outperforms existing state-of-the-art models.

Abstract:
Source-free universal domain adaptation (SF-UniDA) aims to correctly classify known samples from shared categories while distinguishing them from target-private unknown data. However, existing approaches predominantly focus on processing target domain data, overlooking the rich knowledge embedded in the pre-trained source model. This limitation often hampers the ability of the model to accurately identify shared categories. To overcome this limitation, we introduce a novel approach called Exploring Generic knowledge and Reactivating Source (EGRS) model. EGRS leverages the knowledge encoded in a pre-trained source model to mitigate the impact of class space discrepancies between source and target domains. Specifically, we adversarially perturb target samples to align their embeddings with source class prototypes in the embedding space of the pre-trained source model. Using these perturbed samples, we estimate the embedding shift from the source model to the target model and dynamically refine the prototypes. Furthermore, we design a novel pseudo-label clustering algorithm and propose a new strategy to update target-domain-specific parameters, instead of simply freezing all classifier parameters as in prior methods. Extensive experiments across universal adaptation scenarios demonstrate that EGRS significantly enhances classification accuracy and consistently outperforms existing state-of-the-art approaches.

Abstract:
Clothes-changing person re-identification (CC Re-ID) focuses on recognizing pedestrians over the long term, across changes in clothes. Prior work extracts clothes-irrelevant features by introducing either an extra modality or clothing labels, each of which has its own limitations. Instead, we seek to extract clothes-irrelevant features without additional input. We first analyze and find that one impediment to extracting clothes-irrelevant features is the co-occurrence of samples with the same clothes and the same identity. Inspired by this observation, we propose a novel CC Re-ID approach using no additional input. We introduce the Slice-and-Align Framework (SA), which employs a straightforward and intuitive prior: the upper and lower clothes of a person are usually different. SA is a dual-stream framework that slices the original image into upper and lower halves, and then aligns them to extract clothes-irrelevant features. On image CC Re-ID datasets, SA outperforms methods without additional input by a large margin and is comparable to or even better than methods with additional input. Besides, SA also outperforms the state of the art on the video CC Re-ID task.

Abstract:
Low-light video enhancement places high demands on maintaining spatiotemporal color consistency. Improving the accuracy of color mapping while keeping latency low is therefore challenging. To this end, we propose incorporating a wavelet prior into the 4D lookup table (WaveLUT), which effectively enhances the color coherence between video frames and the accuracy of color mapping while maintaining low latency. Specifically, we use the wavelet low-frequency domain to construct an optimized lookup prior and achieve an adaptive enhancement effect through a designed wavelet-prior 4D lookup table. To effectively compensate for the loss of the prior in low-light regions, we further explore a dynamic fusion strategy that adaptively determines the spatial weights on the basis of the correlation between the wavelet lighting prior and the target intensity structure. In addition, during the training phase, we devise a Fourier-text driven appearance reconstruction method that dynamically balances brightness and content through multimodal semantics-driven Fourier spectra. Extensive experiments on a wide range of benchmark datasets show that this method effectively enhances the previous method's ability to perceive the color space and achieves metric-favourable and perceptually oriented real-time enhancement while maintaining high efficiency.

Abstract:
The rapid advancement of multi-modality image fusion technology enables researchers to simultaneously acquire information from different modalities within a single fused image. Among existing methods, some general approaches can implement both infrared and visible image fusion (IVIF) and medical image fusion (MIF) in the same framework. Nevertheless, these methods often ignore the learning of specific features in different modalities, resulting in unsatisfactory fused results. To overcome this issue, we propose a multi-scale joint framework with self-supervision for general multi-modality image fusion, abbreviated as SCSFusion. It enables more targeted and robust implementation of IVIF and MIF. Specifically, in the fusion network, a joint attention module is employed to capture self-attention features in the spatial and channel domains in parallel, which keeps fused results accurate in visual representation. Meanwhile, we utilize source images of different modalities to generate visual-focused maps as pseudo labels for self-supervised training of the fusion results. This effectively protects the salient details in each fused image from being disrupted by other extracted information. Moreover, a medical dataset with segmentation labels, termed M2DF, is reorganized for fusion and down-stream tasks in MIF. With the help of M2DF, a pre-trained segmentation model can be cascaded with the fusion network, aiming to obtain high-level semantic features from inputs and enhance the data generalization in our general framework. We have conducted extensive experiments and analyses on SCSFusion on the M^3FD, FMB, and M2DF datasets. The results indicate that the fused images generated by SCSFusion not only achieve visually appealing results and superior performance metrics in MIF and IVIF, but also exhibit satisfactory performance in down-stream tasks.

Abstract:
While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.

Abstract:
Existing weakly supervised hashing often suffers from the imprecision of user-provided tags and over-reliance on textual knowledge from pre-trained word embeddings, neglecting crucial visual knowledge associated with image labels. This leads to unsatisfactory performance in closed-vocabulary tasks and limited generalization in open-vocabulary scenarios. To address this issue, we propose Multi-modal Knowledge Distillation Hashing (MKDH), a novel method that leverages a visual and language pre-training (VLP) model such as CLIP to learn robust hash codes. Our method designs a dual-layer attention adapter to generate joint representations by capturing fine-grained visual and textual knowledge from the CLIP teacher network. Additionally, we introduce a knowledge extraction contrastive loss to enhance the robustness of joint representations and a knowledge distillation contrastive loss to transfer the extracted multi-modal knowledge to the hash codes. To further mitigate the negative impact of false negative pairs in these contrastive losses, we introduce a false negative weighting strategy that reduces the weights assigned to such pairs. Extensive experiments on three widely used datasets demonstrate that our method achieves robust retrieval performance with significant improvements in both closed- and open-vocabulary settings.

Abstract:
3D scene graph generation (3DSGG), which involves classifying objects and predicates, is an emerging topic in 3D scene understanding. Recent studies leveraging graph neural networks (GNNs) have introduced sophisticated architectures that enhance classification performance. However, since GNNs serve as the core and constitute the majority of parameters in 3DSGG models, their computational demands substantially increase overall complexity, which makes it difficult to determine the optimal model capacity. In this paper, we propose the first compression framework for lightweight 3DSGG models, based on pruning-as-search and knowledge distillation. This framework integrates multiple strategies and modules. In phase 1, the framework identifies the optimal compression ratio through pruning-as-search. In phase 2, to mitigate the accuracy loss incurred during compression, we employ structured pruning and a novel knowledge distillation strategy that effectively transfers precise information from the teacher to the compressed model. Experimental results show that our approach reduces model size by more than half while improving classification accuracy. Code is available at https://github.com/hojunking/3DSGG-compression.

Abstract:
The removal of rain streaks and raindrops is crucial for enhancing the image visibility and mitigating the weather degradations. However, most existing approaches rely on the paired rainy and clean images, which are challenging to obtain in real-world scenarios. To this end, we propose a novel structure-preserving frequency-regularized text-guided optimal transport (SFTOT) framework, which formulates the unpaired rain streaks and raindrops removal as an optimal transport problem. Specifically, we introduce a structure-preserving transport cost, incorporating the structural similarity constraint to minimize the duality gap between the primal and dual formulations, while preserving the structural details of reconstructed images. Furthermore, by embedding the inherent frequency sparsity of rain streaks and raindrops into the transport cost, we derive a frequency-regularized optimal transport objective, ensuring consistency in frequency distributions between the generated and clean images. Additionally, we employ a pre-trained one-step stable diffusion model as the restoration network, which is fine-tuned using the low-rank adaptation (LoRA) adapters and zero convolutional layers, while integrating the domain-specific text prompts for both degraded and clean images to guide the generation process. Extensive experiments demonstrate that our method surpasses the existing well-performing unpaired learning approaches, achieving notable improvements in both the fidelity and photo-realism.

Abstract:
In this work, we aim to reconstruct the 3D shape of an indoor scene from a single view, which includes multiple objects and the background. This task is challenging for existing methods since those instances of indoor scenes regularly occlude each other and contain diverse topologies. To address this, we propose a novel framework, ISDNet, to adaptively separate mixed instances and perform topology-aware reconstruction. Specifically, ISDNet consists of two cascaded subnetworks: an instance separation module (ISM) and an instance deformation module (IDM). The ISM learns to separate occluded objects through stepwise sampling, inferring clean features for each instance. On the basis of these features, IDM generates an instance-topology-aware template and deforms it with learned offsets to reconstruct detailed geometry. Quantitative and qualitative experiments on the SUNRGB-D and 3D-FRONT datasets demonstrate that ISDNet outperforms the state-of-the-art methods in terms of local details and overall shapes.

Abstract:
Vision-language pre-trained models (VLMs) have shown impressive cross-modal understanding, yet their “compositional understanding” ability remains under investigation. We introduce CompoVis, a framework for visually probing cross-modal gaps in VLMs. CompoVis optimizes the grid layout to highlight alignment clusters and boundaries, visually interprets multi-head attention and semantic drift, and enables interactive fine-tuning unconstrained by closed datasets or offline models. Quantitative experiments and case studies yield key insights: VLMs rely on entity shortcuts rather than being comprehension-driven; stubborn global modality isolation and suboptimal fine-grained alignment remain; and fine-tuning with negative samples does not fundamentally alleviate the gaps. Approximately 89% of participants (n=27) found that, compared to methods relying solely on data metrics, CompoVis offers a more innovative and effective approach for investigating modality gaps in VLMs.

Abstract:
Scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, yet existing solutions remain prone to segmentation deficiency, interference from dynamic objects, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding that tightly integrates the Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. First, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features by incorporating the pose, depth, and optical flow modalities, which automatically adjusts feature maps for different decoding stages. A post-matching strategy is also integrated to improve identity tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and the completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR benefits from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.

Abstract:
Underwater imaging is often plagued by significant degradation in visual quality, primarily due to the effects of light absorption and scattering in water. Although recent underwater image enhancement (UIE) methods rely on the current advances in deep neural network architecture designs, there is still considerable room for improvement in cross-scene robustness and computational efficiency. Diffusion models have shown great success in image generation, prompting us to explore their application to UIE tasks. However, directly applying them to UIE tasks poses two challenges, i.e., a high computational budget and color-unbalanced perturbations. To tackle these issues, we propose DiffColor, a distribution-aware diffusion and cross-spectral refinement model for efficient UIE. Unlike single-noise image restoration tasks, underwater imaging exhibits unbalanced channel distributions due to the selective absorption of light by water. To address this, we design the Global Color Correction to balance the diverse color shifts, thereby avoiding potential global degradation disturbances during the denoising process. Instead of diffusing in the raw pixel space, we transform the image into the wavelet domain to obtain low-frequency and high-frequency spectra. For the image details sacrificed by underwater scattering, we further present the Cross-Spectral Detail Refinement to enhance the high-frequency details, which are then integrated with the low-frequency signal as a dual condition for guiding the diffusion. This strategy ensures the high fidelity of sampled content and compensates for the sacrificed details. Extensive experiments demonstrate the superior performance of DiffColor over state-of-the-art methods in both quantitative and qualitative evaluations. The code is available at: https://github.com/LaibinChang/DiffColor.
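
The split of an image into low-frequency and high-frequency spectra mentioned in the abstract above can be illustrated with a minimal one-level 2D wavelet decomposition. This is only a sketch under assumptions: the abstract does not name the wavelet, so the Haar wavelet is used here for simplicity, and the function name haar_dwt2 and the band names are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (illustrative choice of wavelet)."""
    x = np.asarray(x, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0      # low-frequency band (scaled block average)
    d_vert = (a + b - c - d) / 2.0   # top minus bottom: change across rows
    d_horiz = (a - b + c - d) / 2.0  # left minus right: change across columns
    d_diag = (a - b - c + d) / 2.0   # diagonal detail
    return low, (d_vert, d_horiz, d_diag)

x = np.arange(16, dtype=np.float64).reshape(4, 4)
low, (d_vert, d_horiz, d_diag) = haar_dwt2(x)
```

Each band has half the height and width of the input, so any processing restricted to the low-frequency band touches a quarter of the original pixels, which is one common motivation for working in the wavelet domain.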

Abstract:
Just Recognizable Difference (JRD) represents the minimum visual difference that is detectable by machine vision, which can be exploited to promote machine vision-oriented visual signal processing. In this paper, we propose a Deep Transformer-based JRD (DT-JRD) prediction model for Video Coding for Machines (VCM), where the accurately predicted JRD can be used to reduce the coding bit rate while maintaining the accuracy of machine tasks. Firstly, we model the JRD prediction as a multi-class classification and propose a DT-JRD prediction model that integrates an improved embedding, a content and distortion feature extraction, a multi-class classification, and a novel learning strategy. Secondly, inspired by the perception property that machine vision exhibits a similar response to distortions near JRD, we propose an asymptotic JRD loss by using Gaussian Distribution-based Soft Labels (GDSL), which significantly extends the number of training labels and relaxes classification boundaries. Finally, we propose a DT-JRD-based VCM to reduce the coding bits while maintaining the accuracy of object detection. Extensive experimental results demonstrate that the mean absolute error of the predicted JRD by the DT-JRD is 5.574, outperforming the state-of-the-art JRD prediction model by 13.1%. Coding experiments show that compared with the VVC, the DT-JRD-based VCM achieves an average of 29.58% bit rate reduction while maintaining the object detection accuracy.
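
The idea of Gaussian Distribution-based Soft Labels described in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the number of classes (64), and the width sigma are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def gaussian_soft_labels(true_class, num_classes, sigma=2.0):
    """Soft label vector shaped like a Gaussian centred on true_class (sigma is assumed)."""
    idx = np.arange(num_classes)
    weights = np.exp(-0.5 * ((idx - true_class) / sigma) ** 2)
    return weights / weights.sum()  # normalise so the label sums to 1

def soft_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft label and the softmax of the logits."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(soft_label * log_probs).sum())

# Hypothetical setting: 64 JRD classes, ground-truth class 10.
label = gaussian_soft_labels(true_class=10, num_classes=64, sigma=2.0)

near_logits = np.zeros(64)
near_logits[11] = 5.0  # prediction one class away from the truth
far_logits = np.zeros(64)
far_logits[40] = 5.0   # prediction far from the truth
loss_near = soft_cross_entropy(near_logits, label)
loss_far = soft_cross_entropy(far_logits, label)
```

With a one-hot label both predictions would be penalised identically; with the soft label, the near miss receives a smaller loss than the distant one, which is the sense in which classification boundaries are relaxed.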

Affiliations: College of Artificial Intelligence, Southwest University, Chongqing, China; School of Big Data and Computer Science, Guizhou Normal University, Guizhou, China; Data Recovery Key Laboratory of Sichuan Province, School of Mathematics and Information Sciences, Neijiang Normal University, Neijiang, China; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China; College of Computing and Data Science, Nanyang Technological University (NTU), Singapore

Abstract:
Images taken in low-light conditions are frequently affected by limited visibility, diminished contrast and severe noise, adversely impacting the performance of various computer vision tasks. Most variational-based Retinex decomposition methods mainly depend on integer norms to constrain the illumination and reflectance components. However, this strategy may fail to achieve the ideal Retinex decomposition. In this paper, we propose a Retinex-based variational model that incorporates flexible constraints for both illumination and reflectance. Specifically, we impose the L_p norm constraints with varying values of p to ensure the piece-wise smoothness of the illumination and promote the presence of abundant textures in the reflectance. Moreover, we develop two effective pixel-wise weight matrices that consider variance and gradients of the input image respectively, with the objective of preserving the structural edges of the illumination and retaining more details in the reflectance. In addition, we use an L_2 norm to estimate the overall noise level and avoid noise amplification. Through incorporating these constraints, our proposed variational model can obtain a structure-aware illumination and a detail-revealed reflectance. Qualitative and quantitative comparisons on real-world and synthetic datasets indicate that our approach yields results with superior visual quality and outperforms several state-of-the-art algorithms on objective metrics. Besides, our algorithm can also address similar low-level computer vision challenges, such as image dehazing and underwater image enhancement.
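
To make the structure of such a model concrete, one generic form that a weighted L_p Retinex objective could take is sketched below. This is an assumption for illustration only and is not the objective from the paper: the abstract states only that L_p norms with different p constrain the illumination and reflectance, that two pixel-wise weight matrices are used, and that an L_2 norm handles noise. Here S is the observed image, I the illumination, R the reflectance, N the noise, W_I and W_R the weight matrices, and alpha, beta, gamma, p_1, p_2 are hypothetical parameters.

```latex
\min_{I,\,R,\,N}\;
\| I \circ R + N - S \|_2^2
+ \alpha \, \| W_I \circ \nabla I \|_{p_1}^{p_1}
+ \beta \, \| W_R \circ \nabla R \|_{p_2}^{p_2}
+ \gamma \, \| N \|_2^2
```

In a form like this, the choice of p_1 and p_2 controls how strongly gradients are penalised in each component, so different values can favour piece-wise smooth illumination on one side and texture-rich reflectance on the other.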

Abstract:
Image/video coding has been a remarkable research area for both academia and industry for many years. Testing datasets, especially high-quality image/video datasets, are desirable for the justified evaluation of coding-related research, practical applications, and standardization activities. We put forward a test dataset, namely USTC-TD, which has been successfully adopted in the practical end-to-end image/video coding challenge of IEEE International Conference on Visual Communications and Image Processing (VCIP) in 2022 and 2023. USTC-TD contains 40 images at 4K spatial resolution and 10 video sequences at 1080p spatial resolution, featuring various content due to the diverse environmental factors (e.g., scene type, texture, motion, view) and the designed imaging factors (e.g., illumination, lens, shadow). We quantitatively evaluate USTC-TD on different image/video features (spatial, temporal, color, lightness), and compare it with the previous image/video test datasets, which verifies its excellent compensation for the shortcomings of existing datasets. We also evaluate both classic standardized and recently learned image/video coding schemes on USTC-TD using objective quality metrics (PSNR, MS-SSIM, VMAF) and subjective quality metric (MOS), providing an extensive benchmark for these evaluated schemes. Based on the characteristics and specific design of the proposed test dataset, we analyze the benchmark performance and shed light on the future research and development of image/video coding.
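
Of the objective quality metrics listed above, PSNR is the simplest to state. The following is a minimal sketch of the standard definition for a single image pair, assuming 8-bit content (peak value 255); the function name psnr is illustrative, and the benchmark's exact evaluation protocol (color space, per-frame averaging) is not specified in the abstract.

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    # Convert to float first so that uint8 subtraction cannot wrap around.
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dist) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

ref = np.full((4, 4), 100, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 110  # a single pixel is off by 10
value = psnr(ref, noisy)
```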

Abstract:
In complex 360° scenes, depth estimation is challenging for small objects and at object boundaries, and existing works do not solve these problems effectively. Moreover, owing to limitations of the available datasets, 360° depth estimation cannot produce uniform depth estimates across indoor and outdoor settings. In this paper, Useg-PanoDepth and the PanoDepth dataset are proposed to address these problems. The Diagonal-aware Attention Module (DAM) effectively estimates the depth of small objects in complex scenes. The Enhanced Boundary Module (EBM), which enhances boundary information, also helps unify depth estimation across indoor and outdoor scenes. In extensive experiments on our constructed PanoDepth dataset, Useg-PanoDepth achieves SOTA results, with the relative accuracy (delta < 1.25) reaching 87.82%.
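
The relative accuracy (delta < 1.25) reported above is conventionally the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth. The sketch below assumes that common definition and ignores any panorama-specific weighting; the function name delta_accuracy and the toy values are illustrative only.

```python
import numpy as np

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)  # skip pixels without a usable depth value
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < threshold).mean())

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 6.0, 8.5])
score = delta_accuracy(pred, gt)  # ratios: 1.1, 1.0, 1.5, 1.0625
```

In this toy example three of the four ratios fall below 1.25, so the score is 0.75.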

Abstract:
Arbitrary-oriented object detection remains a pivotal research focus due to its practical significance and inherent challenges. Existing methods often extend frameworks and sampling strategies designed for horizontal object detectors, which struggle to handle the arbitrary orientations, high aspect ratios, and diverse scales of oriented objects. To overcome these limitations, we propose a novel and efficient method for arbitrary-oriented object detection. This approach dynamically assigns prediction layers by object pixel area, then leverages wavelet transform-based energy weighting for bottom-up sample reassignment, optimizing feature representation for oriented targets. In addition, a robust framework integrates heatmap keypoint prediction on feature maps of a quarter-sized image, along with sparse predictions on other scales. By querying small-object regions within deep feature maps, a progressive top-down feature fusion strategy further enhances the perception of fine-grained details. Extensive evaluations on four benchmark datasets demonstrate the method’s substantial improvements in detection performance, establishing its potential for broader applications in oriented object detection.

Abstract:
Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.

Abstract:
Most LiDAR odometry and SLAM systems construct maps as point clouds, which are discrete and sparse when zoomed in, making them not directly suitable for navigation. Mesh maps are a dense and continuous map format with low memory consumption that can approximate complex structures with simple elements, and they have attracted significant attention from researchers in recent years. However, most existing methods operate under a static environment assumption. In practice, moving objects cause ghosting, degrading the quality of meshing. To address these issues, we propose a plug-and-play meshing module that adapts to dynamic environments and can easily integrate with various LiDAR odometry systems to improve their pose estimation accuracy. In our meshing module, a novel two-stage coarse-to-fine dynamic removal method is designed to effectively filter dynamic objects, generating consistent, accurate, and dense mesh maps. To the best of our knowledge, this is the first mesh construction method with explicit dynamic removal. Additionally, sliding window-based keyframe aggregation and adaptive downsampling strategies are used to ensure the uniformity of the point cloud, which benefits the Gaussian process used in mesh construction. We evaluate the localization and mapping accuracy on six publicly available datasets. Extensive experiments demonstrate the superiority of our method compared with the state-of-the-art algorithms. The code and introduction video are publicly available at https://yaepiii.github.io/CAD-Mesher/.

Abstract:
The eye-tracking video saliency prediction (VSP) task and video salient object detection (VSOD) task both focus on the most attractive objects in video and show the result in the form of predictive heatmaps and pixel-level saliency masks, respectively. In practical applications, eye tracker annotations are more readily obtainable and align closely with the authentic visual patterns of human eyes. Therefore, this paper aims to introduce fixation information to assist the detection of video salient objects under weak supervision. On the one hand, we investigate how to better explore and utilize the information provided by fixation, and propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. On the other hand, we achieve spatiotemporal feature modeling under weak supervision from the aspects of feature selection and feature contrast. A Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints is designed to effectively select the most matching and accurate object query for spatiotemporal modeling. In addition, an Intra-Inter Mixed Contrastive (IIMC) model improves the spatiotemporal modeling capabilities under weak supervision by forming an intra-video and inter-video contrastive learning paradigm. Experimental results on five popular VSOD benchmarks indicate that our model outperforms other competitors on various evaluation metrics.

Affiliations: Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China; School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan, China; Cyberspace Institute of Advanced Technology, Guangdong Key Laboratory of Industrial Control System Security, Huangpu Research School of Guangzhou University, Guangzhou University, Guangzhou, China

Abstract:
While face recognition (FR) models have brought remarkable convenience in face verification and identification, they also pose substantial privacy risks to the public. Existing facial privacy protection schemes usually adopt adversarial examples to disrupt face verification of FR models. However, these schemes often suffer from weak transferability against black-box FR models and permanently damage the identifiable information, so they cannot fulfill the requirements of authorized operations such as forensics and authentication. To address these limitations, we propose ErasableMask, a robust and erasable privacy protection scheme against black-box FR models. Specifically, via rethinking the inherent relationship between surrogate FR models, ErasableMask introduces a novel meta-auxiliary attack, which boosts black-box transferability by learning more general features in a stable and balancing optimization strategy. It also offers a perturbation erasion mechanism that supports the erasion of semantic perturbations in protected faces without degrading image quality. To further improve performance, ErasableMask employs a curriculum learning strategy to mitigate optimization conflicts between adversarial attack and perturbation erasion. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that ErasableMask achieves state-of-the-art performance in transferability, achieving over 72% mean confidence in commercial FR systems. Moreover, ErasableMask also exhibits outstanding perturbation erasion performance, achieving over 90% erasion success rate.

Abstract:
Multimodal data holds great potential in enhancing the accuracy of sentiment analysis. However, the multimodal sentiment analysis (MSA) task faces two practical challenges: i) multimodal heterogeneity significantly increases the difficulty of integrating multimodal information, resulting in insufficient multimodal learning; and ii) multimodal incompleteness easily triggers cross-modal inference biases, seriously hindering multimodal understanding. In this paper, we propose a novel multimodal framework named cross-modal explicit invariance coordinated dual constraint reconstruction (Clive-Dcr) to mitigate MSA for heterogeneous incomplete multimodal data. Specifically, we present a language-oriented cross-modal invariance reasoning (LCIR) approach that can alleviate the inter-modal representation gaps under the supervision of fine-grained sentiment information. More importantly, LCIR can explicitly quantify inter-modal consistency and break the vagueness of heterogeneous modal information, thereby fully exploring the complementary advantages between invariant features and specific features. In the prediction head, we develop a multimodal purification learning mechanism to minimize the intervention of sentiment-irrelevant information on MSA. To enhance the robustness under incomplete modalities, we design a dual constraint reconstruction (DCR) strategy. During the reconstruction of missing features, DCR not only performs high-level feature distribution alignment but also extends the cross-modal invariance reasoning from the complete to the incomplete multimodal scenarios. Comprehensive experiments on three multimodal datasets verify that our Clive-Dcr achieves remarkable performance under complete and incomplete multimodal patterns.

Abstract:
Knowledge-based Visual Question Answering (KBVQA) aims to utilize external knowledge to answer image-related questions. Current KBVQA frameworks prompt large language models (LLMs) with answer heuristics, narrowing the focus to more relevant answers. However, these answer heuristic based frameworks exhibit a bias towards overemphasizing the highest-scoring answers, neglecting potential answers with lower scores. This bias arises from deficiencies in in-context example construction and the underperformance of the multi-query ensemble strategy. In this paper, we propose MinBias, an approach designed to mitigate the inherent bias of answer heuristic based VQA frameworks. Firstly, to help LLMs learn dialectically, we propose the Dialectical Learning Space Construction strategy (DLSC). This module mitigates bias by guaranteeing diversity in the selected examples. Secondly, to strengthen the connection between image captions and questions, we propose the Large Language Model guided Visual Masking (LVM) algorithm. This module mitigates bias by enhancing visual clues in example content. Finally, we propose the Visual Entailment based Answer Rerank module (VEAR). This module mitigates the bias arising from the multi-query strategy by obtaining unbiased evaluations. Extensive experiments demonstrate that our method mitigates the bias of the answer heuristic based VQA framework and enhances the model performance.

Abstract:
Point Cloud Quality Assessment (PCQA) aims to accurately predict the visual quality of a point cloud, which is essential in optimizing and evaluating point cloud compression, transmission and rendering. In this paper, we propose a deep learning based full reference PCQA using 3D-to-2D Regularized Representation (RegR-PCQA), where point clouds are projected to regularized 2D image representations and then measured with deep neural networks. Firstly, we propose a regularized representation module to project unstructured point clouds to 2D Regularized Geometry Images (RGIs) and Regularized Attribute Images (RAIs), which enhance the local adjacency and uniform distribution of points. An anchor matching is developed to build the correspondence of regularized images between the distorted and reference point clouds. Secondly, to exploit the visual features of the RGIs and RAIs, we propose a deep learning based two-branch PCQA network, in which a vision transformer based Geometry Feature Extractor (GFE) extracts global structural features from the RGIs and a Convolutional Neural Network (CNN) based Attribute Feature Extractor (AFE) extracts local semantic features from the RAIs. Finally, based on the geometry and attribute features, the point cloud quality is predicted by the proposed quality regression module, where a spatial attention mechanism is exploited to assign different importance weights to the feature maps. Experimental results show that the Pearson Linear Correlation Coefficients (PLCC) achieved by the proposed RegR-PCQA are 0.8430, 0.9575, 0.7853 and 0.8576, respectively, on the SIAT-PCQD, SJTU-PCQA, WPC and WPC2.0 datasets, which are superior to the state-of-the-art PCQAs. Also, extensive experimental results on distortion types, sampling strategy and training rate show that the proposed RegR-PCQA achieves excellent generalization.
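
The PLCC figures reported above measure how linearly the predicted quality scores track the subjective scores. As a minimal sketch of the metric only (the function name `plcc` and the toy score lists are illustrative assumptions, and any nonlinear score mapping that an evaluation protocol may apply before computing PLCC is omitted), it can be computed as follows:

```python
import math


def plcc(predicted, subjective):
    """Pearson linear correlation coefficient of two equal-length score lists."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0 or var_s == 0:
        raise ValueError("PLCC is undefined for constant input")
    return cov / math.sqrt(var_p * var_s)


# Toy usage with made-up scores (not data from the paper).
print(plcc([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.2, 3.8]))
```

A value close to 1 indicates that predictions rise and fall in step with the subjective ratings.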

Abstract:
3D hand pose estimation is crucial for many human-computer interaction applications. However, existing deep neural networks (DNNs) for 3D hand pose estimation suffer from poor generalizability due to data scarcity and a lack of domain-specific knowledge. In contrast, humans remain far better than DNNs at learning; humans require fewer samples for learning new concepts under the guidance of their prior knowledge. Inspired by this, we propose a graph-enhanced CLIP to deliver visual-semantic priors to DNNs, and provide refined domain-specific knowledge for better 3D hand pose estimation. Specifically, we first introduce a pre-trained CLIP to guide the hand estimation model in learning the semantic-aware visual features, and text-free contrastive learning is proposed to effectively transfer high-level visual-semantic priors from the pre-trained large multimodal models. Notably, our strategy is data-agnostic and avoids designing hand-crafted text prompts for various visual inputs. Second, we introduce novel graph Transformers to refine the domain-specific knowledge by fully exploiting the local adjacent relations of hand joints and capturing the global structure representations of hand poses. The introduced graph Transformers are intended to further refine the generalized CLIP feature for the downstream task (i.e., hand pose estimation) with better performance. Experiments show that our proposed graph-enhanced CLIP achieves state-of-the-art performance on benchmark datasets, demonstrating its effectiveness for 3D hand pose estimation.

Abstract:
Video captioning is a prominent and challenging research area. Previous studies have focused predominantly on describing entire video segments, often overlooking the more significant status changes within these segments. We propose a novel decoupling and integration network for general event boundary captioning (DIN-GEBC), which is applied to the Kinetics-GEBC dataset, a video dataset with fine-grained status descriptions. DIN-GEBC focuses on generating three types of captions for each video segment boundary: a caption for the dominant subject (subject caption), a caption describing the event prior to the status moment (status-before), and a caption describing the event after the status moment (status-after). To address different descriptive focuses and characteristics, DIN-GEBC is proposed for decoupling and integrating both tasks and features. For task decoupling, DIN-GEBC is designed with a dual branch structure, in which the generation of the subject caption is addressed by the dominant subject branch with a Video Q-former encoder and the generation of the status-before and status-after captions is addressed by the event branch with a Reinventing RNNs for the Transformer Era (RWKV) encoder. For task integration, DIN-GEBC enables the dominant subject branch to guide the event branch in producing detailed information regarding the subject experiencing the change. Feature disentanglement, in which the common features are used by the dominant subject branch to capture the unchanging information, is also performed, and the difference features are applied to the event branch to capture the changing information. The experimental results show that our model outperforms existing models on the Kinetics-GEBC dataset even with fewer parameters.

Affiliations: Key Laboratory of Computer Vision and System, Tianjin University of Technology, Ministry of Education, Tianjin, China; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China; School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, China

Abstract:
Temporal sentence grounding in videos (TSGV) is a challenging task that aims to match text queries with semantically relevant segments in untrimmed videos. However, existing methods face limitations in modeling modality features, which constrains the expressive power of candidate moment features. To address this challenge, we propose a novel Enhanced Feature Interaction Network (EFIN) that effectively captures semantic information within each modality and aligns relationships between modalities. Additionally, EFIN enhances the fusion of information between candidate moments and modality features. Specifically, our model begins by extracting modality features to generate candidate moments as priors. Building upon these modality features, we introduce an enhanced feature encoder to extract semantic information within each modality, thereby improving intra-modality feature representation. Simultaneously, the encoder captures alignment relationships between modalities to optimize cross-modality feature representation, enhancing the overall modeling capacity of modality features. Moreover, we design an information fusion module to enrich the comprehension of modality information for candidate moments. Extensive experiments on four benchmark datasets demonstrate the superiority of our proposed EFIN model. Notably, EFIN achieves a maximum performance improvement of approximately 1.67% and 1.91% across different evaluation metrics on TACoS dataset.

Abstract:
Efficiency and security issues are significant considerations in the transmission of information. The joint compression and encryption method is an effective way to improve both issues. In this paper, a novel chaos system named logistic-coupled sine and exponential function map (LSEM) and a joint JPEG compression and encryption scheme are proposed. Unlike existing JPEG image schemes, a scanning permutation before the differential pulse code modulation (DPCM) is proposed, containing two scanning modes, which can achieve good scrambling performance while reserving file space for subsequent encryption. In addition, an inter-group cross-permutation on random groups of DC coefficients is designed to permute the DC coefficients. The DC coefficients are grouped according to random length and position, and subsequently subjected to inter-group cross-permutation based on random indexes generated with the proposed chaotic system. For AC coefficients, an inter-block permutation method is proposed to change ZRV (zero-run length, value of a non-zero quantized AC coefficient) pairs quantities, which can effectively alter the corresponding block features and the histogram distribution. Experimental results demonstrate that the proposed scheme is reliable in protecting JPEG images while suppressing file size growth and maintaining format compatibility. Notably, the scanning permutation can reserve an average of 0.055% and 0.162% file size for the test images under scanning modes 1 and 2, respectively. The NPCR and UACI results for sensitivity analysis are close to the ideal values. Besides, the change rate of block features is no less than 90% under different quality factors.
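
The idea of permuting coefficients with indexes drawn from a chaotic system can be sketched as follows. This is only an illustration under stated assumptions: it uses the classic logistic map rather than the proposed LSEM (whose definition is not given in the abstract), it permutes a flat list instead of performing the inter-group cross-permutation, and all names (`logistic_sequence`, `chaotic_permutation`, `permute`, `unpermute`) and parameter values are hypothetical.

```python
def logistic_sequence(x0, n, r=3.99, burn_in=100):
    """Generate n values of the classic logistic map x <- r * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):  # discard the transient so outputs depend strongly on x0
        x = r * x * (1.0 - x)
    seq = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq


def chaotic_permutation(key, n):
    """Derive a permutation of range(n) by ranking a key-dependent chaotic sequence."""
    seq = logistic_sequence(key, n)
    return sorted(range(n), key=lambda i: seq[i])


def permute(values, perm):
    """Scramble: output position dst takes the value at position perm[dst]."""
    return [values[i] for i in perm]


def unpermute(values, perm):
    """Invert permute() given the same permutation."""
    out = [None] * len(values)
    for dst, src in enumerate(perm):
        out[src] = values[dst]
    return out


# Toy usage: made-up DC coefficients and an assumed key in (0, 1).
dc = [12, -3, 7, 0, 45, -18, 9, 2]
perm = chaotic_permutation(0.3141, len(dc))
scrambled = permute(dc, perm)
assert unpermute(scrambled, perm) == dc
```

Because the permutation is derived only from the key, a receiver holding the same key can regenerate it and restore the original order.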

Abstract:
To ensure the authenticity and validity of Secret Image Sharing (SIS) schemes, Verifiable Secret Image Sharing (VSIS) methods have been proposed. However, existing VSIS schemes are vulnerable to attacks from Dishonest Participants (DPs). The dishonest participants may still access k-1 valid shares when they are identified, enabling them to recover the secret image. To address this issue, a VSIS based on polynomial interpolation for resisting Dishonest Participant Attacks (VSIS-RDPA) is proposed. Unlike traditional SIS schemes, where secret pixels are used as polynomial coefficients, our scheme treats secret pixel values as function values of the polynomial. Specifically, the secret key and secret pixel values serve as inputs to reconstruct a k-2-degree polynomial using Lagrange interpolation, enhancing security against malicious participants. Authentication parameters generated by a hash function are combined with the k-2-degree polynomial to form a complete k-1-degree polynomial, which is subsequently utilized to generate shares. On the receiver end, the k-1-degree polynomial is first reconstructed, and the authentication parameters generated by the hash function are compared with those obtained from the reconstructed polynomial. If the two sets of values match, the authentication is successful, allowing for the recovery of the secret image. In addition, a modified authentication phase with ECC is also proposed to enhance the robustness of authentication. Experimental results and analysis demonstrate that the proposed schemes can resist DP attacks and ensure efficiency, verifiability, and security.
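
The building block named above, Lagrange interpolation of a polynomial from k points, can be sketched over a prime field as follows. This is not the VSIS-RDPA construction itself: the modulus `P = 257`, the choice of storing the secret as the function value at x = 0, and the function names are assumptions made for illustration, and the hash-based authentication step is omitted. The modular inverse via `pow(den, -1, p)` requires Python 3.8 or later.

```python
P = 257  # assumed prime modulus, chosen so that pixel values 0..255 fit


def eval_poly(coeffs, x, p=P):
    """Evaluate the polynomial with coefficients [a0, a1, ...] at x modulo p (Horner)."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def lagrange_at(points, x, p=P):
    """Value at x of the unique polynomial of degree < len(points) through the points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        # Multiply by the modular inverse of the denominator.
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


# Toy usage with k = 3: a degree-2 polynomial whose value at x = 0 is the secret pixel.
coeffs = [123, 45, 67]                      # made-up values; f(0) = 123
shares = [(x, eval_poly(coeffs, x)) for x in range(1, 6)]
print(lagrange_at(shares[:3], 0))           # any 3 shares determine f(0)
```

Any k distinct shares determine the degree-(k-1) polynomial, so every reserved function value can be recovered from them.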

Abstract:
Livestreaming platforms attract countless daily active users, making online content regulation imperative. The complex and diverse multimodal content elements in dynamic livestreaming scenes pose a great challenge to video content understanding. Building on the success of contrastive language-image pre-training (CLIP) for dynamic scene classification, which is one of the basic tasks of video content understanding, we propose a heterogeneous multimodal state space network (HMS2Net) for dynamic scene classification in livestreaming via CLIP. (1) To fully and efficiently mine the dynamic scene elements in livestreaming, we design a heterogeneous teacher-student Transformer (HT-SFormer) with CLIP to extract multimodal features in an energy-efficient unified pipeline; (2) To cope with the possible information conflicts in heterogeneous feature fusion, we introduce a cross-modal adaptive feature filter and fusion (CMAF) module to generate more complete information complementarity by adjusting multimodal feature composition; (3) For temporal context-awareness of dynamic scenes, we establish a dynamic state space memory (DSSM) structure for capturing the correlation of multimodal data between neighboring video frames. A series of comparative experiments are conducted on the publicly available datasets DAVIS, Mini-kinetics, HMDB51, and the self-built BJUT-LCD. Our HMS2Net produces competitive results of 71.09%, 95.40%, 53.64%, and 82.36%, respectively, demonstrating its effectiveness and superiority for dynamic scene classification in livestreaming.

Abstract:
This study aims to extend the applicability of stylized motion generation methods to be robust for large and diverse motions akin to those found in real-world data. Specifically, we introduce metadata-independent learning alongside style-focused learning, thereby enabling training from motions absent in motion-style datasets. In addition, we construct a novel motion dataset containing both various motions and stylized motions by unifying the multiple datasets to effectively train the model. Our novel learning method and dataset enable stylized motion generation methods to learn from both various motion knowledge and motion-style relations and improve their generalized performance. In downstream tasks, we address motion style transfer and text-to-stylized-motion, validating the enhancement of generalization abilities for each task. Compared to conventional methods, the proposed method demonstrates superior performance in generating and reflecting style, particularly under conditions featuring larger and more diverse motions.

Abstract:
This paper proposes a new chaotic system, and experimental analysis shows that its Lyapunov exponent can reach up to 6.85, demonstrating excellent chaotic performance. It is highly suitable as a pseudorandom number generator for use in secure communications. Based on this, we designed a medical image encryption algorithm. The average values of the Number of Pixels Change Rate (NPCR) and the Unified Average Changing Intensity (UACI) for encrypted images can reach 99.6200% and 33.4699%, respectively. The pixel value distribution is uniform, and the correlation between adjacent elements is between -0.01 and 0.01, demonstrating excellent security. Additionally, we adopted a distributed cloud storage strategy for efficient and secure image sharing, and we used Cyclic Redundancy Check (CRC) to ensure end-to-end data correctness. With these technologies at its core, we designed a medical image sharing system to securely and efficiently enable cross-domain sharing of medical images. A series of simulation experiments demonstrated that the system offers excellent security and stability.
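
For reference, NPCR and UACI compare two cipher images obtained from plain images that differ slightly. The sketch below follows the standard definitions of the two metrics; the function name `npcr_uaci` and the flat-list image representation are assumptions for illustration, and it is not the authors' evaluation code. For 8-bit images, the commonly cited ideal values are about 99.6094% for NPCR and 33.4635% for UACI.

```python
def npcr_uaci(cipher_a, cipher_b, max_value=255):
    """Return (NPCR, UACI) in percent for two equal-size images given as flat pixel lists."""
    if len(cipher_a) != len(cipher_b) or not cipher_a:
        raise ValueError("images must be non-empty and of equal size")
    n = len(cipher_a)
    # NPCR: share of pixel positions whose values differ.
    changed = sum(1 for a, b in zip(cipher_a, cipher_b) if a != b)
    # UACI: mean absolute difference, normalized by the maximum pixel value.
    intensity = sum(abs(a - b) for a, b in zip(cipher_a, cipher_b))
    npcr = 100.0 * changed / n
    uaci = 100.0 * intensity / (max_value * n)
    return npcr, uaci


# Toy usage with made-up 2x2 "images" flattened to lists.
print(npcr_uaci([0, 0, 0, 0], [0, 255, 0, 255]))
```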

Abstract:
Digital photography image fusion aims to integrate essential information from multiple source images. However, the current methods predominantly rely on a single-channel fusion strategy, failing to fully leverage color pixel information. Furthermore, the significant variations in fusion mechanisms across tasks limit the performance of generic models. These factors can lead to color shifts and the loss of fine details. To address these issues, we propose a novel approach named MRQE-Net for general digital photography image fusion. The proposed network treats RGB pixels as a unified entity by encapsulating them into reduced biquaternion (RQ), and achieves task-specific fusion through mixture of RQ experts (MoRQE) module within a unified model. Specifically, we first develop the RQ spatial attention (RQSA) and RQ multi-scale pixel attention (RQMSPA) modules to enhance salient features. Then, we propose the deep feature invertible module (DFIM) to efficiently extract the complementary deep information. Finally, to enable customized fusion and feature enhancement for each task, we propose the adaptive feature synthesis amplification module (AFSAM). To the best of our knowledge, this is the first attempt to perform digital photography image fusion in RQ neural networks. Extensive experiments demonstrate that the MRQE-Net outperforms state-of-the-art methods across multiple metrics for each subtask.

Abstract:
The Ball-Pivoting Algorithm (BPA) is a crucial technique for 3D surface reconstruction from point clouds, which relies heavily on selecting an appropriate ball radius. The effectiveness of BPA is significantly influenced by the point sampling density. In areas of low sampling density, a small ball radius can create gaps, while a large radius in high-density regions may oversimplify the surface and miss finer details. This paper addresses this challenge by introducing Auto-DBPA, an Automatic radius selection with Density-aware BPA. Auto-DBPA adapts dynamically by adjusting the ball radius according to the local sampling density. Our approach offers a scalable solution for reconstructing complex scenes and objects with varying levels of detail. Unlike conventional methods that require partitioning the point cloud into clusters and merging the reconstructed parts, which can cause visible seams and discontinuities, our unified reconstruction pipeline dynamically adjusts radii across the entire point cloud. To achieve this, we use hierarchical clustering and compute Fast Point Feature Histograms (FPFH) for each density cluster, capturing local geometric properties. We then leverage these geometric features to predict the optimal radius values for each cluster, adequately adapting the ball radius to local density variations. We also address the non-differentiability of BPA, which arises from its geometric computations and lack of gradient information, by introducing an innovative solution based on contextual bandits. Our approach employs the contextual bandits framework to effectively select the optimal ball radius based on local geometric features, significantly enhancing reconstruction quality. Our method is scalable and particularly effective in scenarios with varying density levels where a single-radius solution is inadequate. Results on the ABC, FAUST, SceneNN and ScanNet datasets demonstrate that our method successfully handles 3D reconstruction from varying point cloud densities, outperforming manual tuning, classic methods, and learning-based approaches.

Abstract:
The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.

Abstract:
Although text information has aided existing models to achieve promising results in open vocabulary object detection (OVD), the lack of semantic information has led to difficulty in small object detection (SOD). Moreover, such a semantic gap also causes failures when matching texts and image features, resulting in false negative instances being detected. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce the missing and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects, even potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This can progressively enhance the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieved 17.95% $AP_{\mathrm{all}}$ and 14.6% $AP_{\mathrm{s}}$, outperforming RegionCLIP by 3.5% $AP_{\mathrm{all}}$ and 3.0% $AP_{\mathrm{s}}$, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% $AP_{\mathrm{all}}$ and 17.26% $AP_{\mathrm{s}}$, surpassing RegionCLIP by 6.6% $AP_{\mathrm{all}}$ and 6.1% $AP_{\mathrm{s}}$, respectively.

Abstract:
Tensor wheel (TW) decomposition has been successfully applied to low-rank tensor completion (LRTC) to recover missing pixels in hyperspectral images (HSIs). However, existing TW models ignore the low-rank information in the factor space and the nonlinear geometric structure of HSIs, resulting in unsatisfactory recovery accuracy. To address these limitations, we propose a tensor wheel completion with low-rank factor prior and adaptive graph regularizer (TW-FPAG) for HSI recovery. First, we discover and prove the rank relationship between the tensor unfoldings and TW ring factors. Leveraging this relationship, we impose the matrix nuclear norm on factors, thereby enhancing the ability of TW decomposition to describe the low-rankness of HSIs. Second, we preserve the geometric proximity information of HSIs by constructing the nearest neighbor graph for each ring factor and integrating this information into the TW decomposition using the graph Laplacian. Finally, we extensively evaluate the proposed TW-FPAG on the hyperspectral image, multispectral image (MSI), and hyperspectral video (HSV). The experimental results demonstrate the superiority of our TW-FPAG method, with significant performance improvements of 2.81 to 5.07 dB for HSIs under a 10% sampling ratio compared to the best-compared results.

Abstract:
Stereo image Super-Resolution (SR) aims to enhance image resolution by leveraging complementary information in stereo pairs. Convolutional Neural Networks (CNNs), widely used in stereo image SR for their strong local pattern extraction capabilities, often fail to capture long-range dependencies critical for stereo correspondence. On the other hand, Swin Transformers have demonstrated superior performance in modeling long-range dependencies for stereo image SR tasks. However, their computational complexity scales quadratically with the window size, leading to a trade-off between global receptive fields and computational efficiency. To tackle these challenges, we propose StereoMamba+, a novel stereo image SR method designed to adaptively capture both local and global dependencies in stereo pairs. Leveraging the Mamba architecture as its backbone, StereoMamba+ integrates an Adaptive State Space Module (ASSM) that efficiently extracts and fuses global and local features, maintaining linear computational complexity. Additionally, a Gated Enhanced Feed-Forward Network (GEFN) selectively amplifies essential features and depth cues, and a Residual Frequency Block (RFB) is employed to capture global features in the frequency domain. To further enhance stereo correspondence, we introduce a Stereo Bi-Directional Cross-Attention Module (SBCAM), aligning unique features along both horizontal and vertical epipolar lines to improve stereo consistency. Extensive experiments demonstrate that our proposed StereoMamba+ method achieves state-of-the-art performance on 2× and 4× stereo image SR tasks, delivering PSNR improvements of up to 0.45 dB, while maintaining competitive parameter efficiency compared to existing methods.
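
The PSNR gain quoted above is expressed in decibels. As a reminder of how the metric is defined, here is a minimal sketch using the standard formula 10 * log10(MAX^2 / MSE); the function name `psnr` and the flat-list image representation are assumptions for illustration, and details of the paper's evaluation protocol (for example, which channels or image regions are scored) are not covered.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size images (flat pixel lists)."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy usage with made-up pixel values.
print(psnr([100, 120, 140, 160], [101, 119, 142, 158]))
```

Because the scale is logarithmic, a fixed dB gain corresponds to a fixed relative reduction in mean squared error.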

Abstract:
The capability to jointly process multi-modal information is becoming essential. However, the development of multi-modal learning is hindered by the substantial computational requirements and the limited availability of paired multi-modal data. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a simple yet efficient and effective approach, treating speech and image modalities as discrete text modality and approaching multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, resulting in a significant reduction in computational cost. Furthermore, by incorporating back translation into multi-modal translation, unpaired data can also be utilized for training. TMT can perform six modality translation tasks and consistently outperforms its single-model counterparts. TMT significantly reduces the required data size (in bits) for training, to approximately 0.2% for speech data and 0.04% for image data, respectively.

Abstract:
Object instance segmentation is a key prerequisite for service robots to perform daily chores in unstructured environments. Traditional supervised learning-based segmentation solutions rely on massive annotated datasets, which are impractical for the wide variety of objects in real-world scenarios. To this end, we propose a novel zero-shot instance segmentation approach (TransZSIS) that addresses the unseen object instance segmentation (UOIS) problem, enabling precise instance segmentation without relying on external semantic embeddings or auxiliary information. First, the RGB and depth images are segmented into irregular patches by a super-pixel segmentation algorithm to generate a unified segmentation map, and the comprehensive feature vectors of each patch are then extracted and paired. Further, a Transformer-based architecture is introduced to capture the correlations between different patch-pairs and the intrinsic characteristics of each patch-pair. To predict patch-pair relationships, TransZSIS uses a four-layer fully connected neural network (FCNN) to classify the Transformer-encoded features and then refines the results with a graph-based processing step to achieve object instance segmentation. Extensive evaluations on both synthetic and real datasets demonstrate that TransZSIS achieves superior performance compared with state-of-the-art baseline methods. We also conduct real-world experiments to verify that our solution enables robot grasping by segmenting unseen objects.
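
One way to picture how pairwise patch predictions can yield instances is to treat patches as graph nodes, connect pairs predicted to belong to the same object, and take connected components. The sketch below does this with union-find. The abstract does not specify the graph-based step, so this is an assumption rather than the authors' method, and the name `group_patches` and its inputs are hypothetical.

```python
def group_patches(num_patches, same_object_pairs):
    """Assign an instance id to each patch from predicted 'same object' pairs.

    Hypothetical helper: patches are nodes, each pair is an edge, and every
    connected component becomes one instance.
    """
    parent = list(range(num_patches))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for a, b in same_object_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    # Relabel component roots as consecutive ids in order of first appearance.
    labels = {}
    instance_ids = []
    for i in range(num_patches):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels)
        instance_ids.append(labels[root])
    return instance_ids


# Patches 0-2 form one object, 3-4 another, and patch 5 stands alone.
instance_ids = group_patches(6, [(0, 1), (1, 2), (3, 4)])
```

A single wrong "same object" edge merges two instances under this scheme, which is one reason a refinement step on the graph is useful in practice.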

Abstract:
Monitoring wildlife behavior and population changes is critical for conservation efforts. However, specialized analysis of large volumes of wildlife images is extremely challenging, necessitating the use of artificial intelligence techniques to automatically detect, segment, and classify species captured by camera traps. Despite the increasing use of AI in wildlife monitoring, challenges with data quality and availability persist. The Snapshot Serengeti (SS) dataset has only image-level labels and very few bounding box labels, and no dataset with pixel-level labels exists because of the significant annotation costs. To this end, we create and release the large-scale Semantic Segmentation for Snapshots of the Serengeti (S^4) dataset, consisting of 24K high-quality images across 47 species with precise masks for both common and rare species. This dataset serves as a resource for developing semantic segmentation algorithms in wildlife studies. Additionally, we introduce HitBack, a novel method leveraging Hierarchical-Semantic Cross Attention (HCA) and Background Contrast (BC) for weakly supervised semantic segmentation (WSSS). The HCA module captures both the shared and distinct features across species, and the BC module enhances foreground activation by ensuring consistency in the backgrounds. Extensive experiments on the newly proposed S^4 benchmark show that HitBack achieves competitive performance compared with state-of-the-art models. The mIoU of HitBack is 10.4%, 14.7%, and 18.4% higher than that of ToCo, SIPE, and MCTformer, respectively. In addition, HitBack even surpasses fully-supervised and semi-supervised methods when annotation data is limited. Code and datasets will be available on GitHub.
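
For readers unfamiliar with the mIoU metric reported above, the following is a minimal sketch of how mean intersection-over-union can be computed from predicted and ground-truth labels. It is a simplified illustration, not the evaluation code used for the S^4 benchmark: the function name `mean_iou` is hypothetical, inputs are flat label lists rather than images, and skipping classes absent from both inputs is one convention among several.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists.

    Classes that appear in neither pred nor target are skipped (an assumed
    convention; benchmarks often accumulate counts over a whole dataset first).
    """
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six pixels, two classes (0 = background, 1 = animal).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has two correctly labeled pixels out of four in the union, so both per-class IoU values and their mean are 0.5.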

Abstract:
Recently, deep unfolding networks (DUNs) have emerged as a promising technique for image Compressive Sensing (CS) reconstruction by unfolding optimization algorithms, where each stage of a DUN corresponds to one iteration of the optimization algorithm. DUNs can be divided into convex optimization based methods and non-convex optimization based methods. On the one hand, DUNs based on convex optimization algorithms cannot handle non-convex optimization problems, which limits their use when the prior term is a non-convex function. On the other hand, although DUNs based on non-convex optimization algorithms can handle more complex prior terms so that the globally optimal solution is closer to the ground truth, there is a high probability that they converge only to a local optimum. Therefore, in practical applications, it is necessary to consider the characteristics of the problem comprehensively, then design appropriate prior terms and choose between convex and non-convex optimization in the DUN. This paper proposes the ViP-DUN method, which learns deep prior terms and variable metrics in a data-driven manner to adaptively use convex or non-convex optimization. Moreover, at the network structure level, we design a lightweight multi-scale information fusion module in ViP-DUN to further enhance the network's processing capability. Experiments demonstrate that the proposed method improves image reconstruction quality at multiple compression rates through the adaptive capabilities of the network.
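
The correspondence between network stages and optimization iterations can be made concrete with a minimal sketch of ISTA, a classical convex algorithm for sparse recovery, in which each loop pass plays the role of one stage. This is not ViP-DUN: the L1 prior, the fixed step size, and all names below are assumptions for illustration. In a deep unfolding network, quantities such as the step size and the thresholding operator would typically be replaced by learned components.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def soft_threshold(v, tau):
    """Proximal operator of the L1 norm: shrink each entry toward zero by tau."""
    return [max(abs(vi) - tau, 0.0) * (1 if vi > 0 else -1) for vi in v]


def ista(A, y, step, lam, num_stages):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 with num_stages iterations."""
    At = [list(col) for col in zip(*A)]  # transpose of the measurement matrix
    x = [0.0] * len(A[0])
    for _ in range(num_stages):          # one pass corresponds to one unfolded stage
        residual = [r - yi for r, yi in zip(matvec(A, x), y)]
        grad = matvec(At, residual)      # gradient of the data-fidelity term
        z = [xi - step * gi for xi, gi in zip(x, grad)]
        x = soft_threshold(z, step * lam)  # apply the prior through its proximal step
    return x


# Two measurements of a three-entry sparse signal x_true = [0, 0, 1].
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 1.0]
x_hat = ista(A, y, step=0.3, lam=0.1, num_stages=200)
```

With these settings the estimate concentrates on the third entry, slightly shrunk below 1 by the L1 penalty, while the first two entries are driven to zero.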

Abstract:
In smoke interference scenarios, visible image degradation significantly limits the availability of detailed information for fusion. Moreover, a single fusion network often struggles to effectively extract complementary and key information. To overcome these challenges, we propose an infrared and visible image fusion method tailored for smoke interference, termed the Repeated Key Feature Embedding Fusion Network (REFusion). To fully leverage low-frequency and high-frequency information from the source images, we design the network with a pre-fusion stage and a fine-fusion stage. To address the heterogeneity between infrared and visible images, we introduce a Cross-Source Context Association Network (CSCAN). By integrating the self-attention mechanism and a cascading pyramid pooling module, we compute attention interaction weights across modalities to capture the most relevant context for fusion. To prevent the fusion from favoring one source disproportionately, we enhance the fine-fusion stage with a Differential Information Reinforcement Module (DIRM) and a three-branch network. This structure establishes long-term dependencies between complementary information and excels at capturing high-frequency details. We further design a tailored loss function to enhance the quality of the fused images. Extensive experiments demonstrate that REFusion surpasses nine state-of-the-art methods in both visual quality and quantitative metrics, particularly in reproducing scenes obscured by smoke.

Abstract:
Defect detection in multimedia data plays a pivotal role in industrial manufacturing. However, existing methods are primarily designed for closed-world scenarios and can only identify defect classes present in the training data, which limits their ability to detect unknown-class defects that arise during production. To address this critical limitation, we introduce industrial defect open set recognition (IDOSR), which targets the challenge of recognizing unknown defect classes. Furthermore, to tackle the issues of limited training samples and subtle inter-class differences in IDOSR, we present a high-frequency feature enhancement open set recognition (HFFE-OSR) method. Specifically, HFFE-OSR employs a high-frequency structural feature fusion enhancement strategy to extract and fuse defect-related high-frequency structural features. This enables the network to comprehensively learn defect target representations even with limited training samples, yielding robust feature extraction for known classes and thereby improving the discriminability between known and unknown classes. Additionally, a class mutual information constraint strategy is introduced to measure and reduce the mutual information among defect features from different classes. This ensures the independence of defect features across known classes, further enhancing their discriminability and significantly improving recognition performance for known classes. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art OSR methods on the ID-OSD and MVTec datasets, achieving improvements of at least 7% in accuracy (ACC), 22% in F1 score, and 10% in AUROC, highlighting the effectiveness of our approach in industrial defect detection.